The SSE-ops lack a huge amount of very usefull functions. Basically every single mathematical relevant function is absent, worst, trying to re-create them and reach the same precision as the FPU-routines results in magnitude slower code. Sometimes one can find a usefull short-cut, but not very often.
Here I tried to make a cbrt(x), the cubic root. It’s basically a power-function, which is equivalent to the following euler-form:
double pow(double n, double r) {
if (n > 0.0)
return exp(r / log ( n));
else
return -exp(r / log (-n));
}
The problem is, exp(x) and log(x) aren’t available either. Basically any function can be approximated either by Rational-Polynoms, Taylor-Series or some special Newtonian way. All of them are infinitely iterative, but mostly you won’t need infinite precision. So pow(x,y), exp(x) and log(x) can all be approximated by more or less equally complex polynoms - so it’s a bad deal to replace one approximation by two. We’ll stay with directly approximating cbrt(n).
Here two Rational-Polynom approximations:
// order 3/4
double cbrt(double x) {
return
99.942970125 *x
+ 4591.64560671 * x*x
+ 14742.828103353 * x*x*x
/* ------------------------------ */ /
1
+ 491.698363982 * x
+ 9105.81746731 * x*x*x
+ 11071.95822942296354 * x*x*x
- 1236.0573805268561 * x*x*x*x
;
}
// order 4/4
double cbrt(double x) {
return
173.791188868 * x
+ 20254.442314157 * x*x
+ 212906.097079561 * x*x*x
+ 195085.639644873 * x*x*x*x
/* ------------------------------ */ /
1
+ 1147.163861142 * x
+ 57143.213576431 * x*x
+ 278351.638227136 * x*x*x
+ 91776.954562751 * x*x*x*x
;
}
Problem is, they are too inexact. As we do have sqrt(x) in SSE we can try the much faster (quadratically) converging Newtonian iterative approach (it’s Halley’s method):
2x^3 + 4r
x' = ----------- x
4x^3 + 2r
2(1x^3 + 2r)
x' = -------------- x
2(2x^3 + 1r)
x^3 + 2r
x' = ----------- x
2x^3 + r
Resulting in the following C-implementation:
double cbrt(double n) {
const long int t = 2; // two
const long int f = 4; // four
const double s = fsign(n); n = fabs(n);
const double X = sqrt(n);
double
c = (X * X * X),
x = ((c * t) + (n * f)) * (X)
/ ((c * f) + (n * t));
for (int i = 1; i < iterations; i++)
c = (x * x * x),
x = ((c * t) + (n * f)) * (x)
/ ((c * f) + (n * t));
return x * s;
}
Unrolled this leads to the relative good and quick SSE-variant:
movaps xmm7, xmmX /* store sign */
andps xmmX, xmmword ptr [ absVPUf ]
andps xmm7, xmmword ptr [ sgnVPUf ]
movaps xmm5, xmmX /* xmm5 = r for Nth iteration only */
movaps xmm6, xmmX /* */
addps xmm6, xmm6 /* xmm6 = 2r for Nth iteration only */
/*sqrtps xmmX, xmmX * sqrt(x) */
rsqrtps xmm4, xmmX /* xmm4 = 1/sqrt(x), first pass */
mulps xmmX, xmm4 /* sqrt(x) */
movaps xmm3, xmmX /* 1st iteration */
mulps xmm3, xmmX /* xmmX = x */
mulps xmm3, xmmX /* xmm3 = xmm4 = x * x * x */
movaps xmm4, xmm3 /* c */
addps xmm4, xmm4 /* 2c */
addps xmm3, xmm6 /* c + 2r */
addps xmm4, xmm5 /* 2c + r */
mulps xmmX, xmm3 /* ( c + 2r) / (2c + r) * x’ */
divps xmmX, xmm4 /* ( c + 2r) / (2c + r) */
/*rcpps xmm4, xmm4 * 1 / (2c + r) */
/*mulps xmmX, xmm4 * ( c + 2r) / (2c + r) */
movaps xmm3, xmmX /* 2nd iteration */
mulps xmm3, xmmX /* xmmX = x */
mulps xmm3, xmmX /* xmm3 = xmm4 = x * x * x */
movaps xmm4, xmm3 /* c */
addps xmm4, xmm4 /* 2c */
addps xmm3, xmm6 /* c + 2r */
addps xmm4, xmm5 /* 2c + r */
mulps xmmX, xmm3 /* ( c + 2r) / (2c + r) * x’ */
divps xmmX, xmm4 /* ( c + 2r) / (2c + r) */
/*rcpps xmm4, xmm4 * 1 / (2c + r) */
/*mulps xmmX, xmm4 * ( c + 2r) / (2c + r) */
movaps xmm3, xmmX /* 3rd iteration */
mulps xmm3, xmmX /* xmmX = x */
mulps xmm3, xmmX /* xmm3 = xmm4 = x * x * x */
movaps xmm4, xmm3 /* c */
addps xmm4, xmm4 /* 2c */
addps xmm3, xmm6 /* c + 2r */
addps xmm4, xmm5 /* 2c + r */
mulps xmmX, xmm3 /* ( c + 2r) / (2c + r) * x’ */
divps xmmX, xmm4 /* ( c + 2r) / (2c + r) */
/*rcpps xmm4, xmm4 * 1 / (2c + r) */
/*mulps xmmX, xmm4 * ( c + 2r) / (2c + r) */
orps xmmX, xmm7 /* restore sign */
The iteration is unrolled can be repeated as much as you want to raise precision. The only issue is with the sign-masking as this requires memory access, otherwise the function is entirely variable free. The initial approximation via sqrt(x) is not required to be exact, picture it as a seed (you can even fill in a random variable, the function still will converge) so it can utilize the very fast reciprocal square-root. The divisions can be replaced by reciprocals too, but the precision suffers so much it isn’t worth the speedup (rcp is very very imprecise, much less than the 3DNow pfrcp, and there is no refinement-function in SSE like in 3DNow) as you’d have to add additional iterations to compensate for it and the repeated block is slower than a single division.