Fröhling

Niels Fröhling about [X]HTML, JS and technical tidbits

  • Home
  • About
02. January 2010

Cubic root approximation

Ethatron in Approximations

The SSE-ops lack a huge amount of very usefull functions. Basically every single mathematical relevant function is absent, worst, trying to re-create them and reach the same precision as the FPU-routines results in magnitude slower code. Sometimes one can find a usefull short-cut, but not very often.

Here I tried to make a cbrt(x), the cubic root. It’s basically a power-function, which is equivalent to the following euler-form:

double pow(double n, double r) {
    if (n > 0.0)
      return  exp(r / log ( n));
    else
      return -exp(r / log (-n));
}

The problem is, exp(x) and log(x) aren’t available either. Basically any function can be approximated either by Rational-Polynoms, Taylor-Series or some special Newtonian way. All of them are infinitely iterative, but mostly you won’t need infinite precision. So pow(x,y), exp(x) and log(x) can all be approximated by more or less equally complex polynoms - so it’s a bad deal to replace one approximation by two. We’ll stay with directly approximating cbrt(n).

Here two Rational-Polynom approximations:

// order 3/4
double cbrt(double x) {
return

      99.942970125 *x
 +  4591.64560671 * x*x
 + 14742.828103353 * x*x*x

/* ------------------------------ */ /

       1
 +   491.698363982 * x
 +  9105.81746731 * x*x*x
 + 11071.95822942296354 * x*x*x
 -  1236.0573805268561  * x*x*x*x

;
}
// order 4/4
double cbrt(double x) {
return

     173.791188868 * x
+  20254.442314157 * x*x
+ 212906.097079561 * x*x*x
+ 195085.639644873 * x*x*x*x

/* ------------------------------ */ /

       1
+   1147.163861142 * x
+  57143.213576431 * x*x
+ 278351.638227136 * x*x*x
+  91776.954562751 * x*x*x*x

;
}

Problem is, they are too inexact. As we do have sqrt(x) in SSE we can try the much faster (quadratically) converging Newtonian iterative approach (it’s Halley’s method):

       2x^3 + 4r
 x' = ----------- x
       4x^3 + 2r

       2(1x^3 + 2r)
 x' = -------------- x
       2(2x^3 + 1r)

        x^3 + 2r
 x' = ----------- x
       2x^3 +  r

Resulting in the following C-implementation:

double cbrt(double n) {
  const long int t = 2;	// two
  const long int f = 4;	// four

  const double s = fsign(n); n = fabs(n);
  const double X = sqrt(n);

  double
  c = (X * X * X),
  x = ((c * t) + (n * f)) * (X)
    / ((c * f) + (n * t));

  for (int i = 1; i < iterations; i++)
    c = (x * x * x),
    x = ((c * t) + (n * f)) * (x)
      / ((c * f) + (n * t));

  return x * s;
}

Unrolled this leads to the relative good and quick SSE-variant:

  movaps  xmm7, xmmX  /* store sign				 */
  andps	  xmmX, xmmword ptr [ absVPUf ]
  andps	  xmm7, xmmword ptr [ sgnVPUf ]

  movaps xmm5,	xmmX  /* xmm5 =	 r	  for Nth iteration only */
  movaps  xmm6,	xmmX  /*					 */
  addps	  xmm6,	xmm6  /* xmm6 =	2r	  for Nth iteration only */

/*sqrtps  xmmX,	xmmX   * sqrt(x)				 */
  rsqrtps xmm4,	xmmX  /* xmm4 = 1/sqrt(x),	first pass	 */
  mulps	  xmmX,	xmm4  /* sqrt(x)				 */

  movaps  xmm3,	xmmX  /* 1st iteration				 */
  mulps	  xmm3,	xmmX  /* xmmX =	x				 */
  mulps	  xmm3,	xmmX  /* xmm3 =	xmm4 = x * x * x		 */
  movaps  xmm4,	xmm3  /*  c					 */
  addps	  xmm4,	xmm4  /* 2c					 */
  addps	  xmm3,	xmm6  /*  c + 2r				 */
  addps	  xmm4,	xmm5  /* 2c +  r				 */
  mulps	  xmmX,	xmm3  /* ( c + 2r) / (2c +  r) * x’		 */
  divps	  xmmX,	xmm4  /* ( c + 2r) / (2c +  r)			 */
/*rcpps	  xmm4,	xmm4   *         1 / (2c +  r)			 */
/*mulps	  xmmX,	xmm4   * ( c + 2r) / (2c +  r)			 */

  movaps  xmm3,	xmmX  /* 2nd iteration				 */
  mulps	  xmm3,	xmmX  /* xmmX =	x				 */
  mulps	  xmm3,	xmmX  /* xmm3 =	xmm4 = x * x * x		 */
  movaps  xmm4,	xmm3  /*  c					 */
  addps	  xmm4,	xmm4  /* 2c					 */
  addps	  xmm3,	xmm6  /*  c + 2r				 */
  addps	  xmm4,	xmm5  /* 2c +  r				 */
  mulps	  xmmX,	xmm3  /* ( c + 2r) / (2c +  r) * x’		 */
  divps	  xmmX,	xmm4  /* ( c + 2r) / (2c +  r)			 */
/*rcpps	  xmm4,	xmm4   *         1 / (2c +  r)			 */
/*mulps	  xmmX,	xmm4   * ( c + 2r) / (2c +  r)			 */

  movaps  xmm3,	xmmX  /* 3rd iteration				 */
  mulps	  xmm3,	xmmX  /* xmmX =	x				 */
  mulps	  xmm3,	xmmX  /* xmm3 =	xmm4 = x * x * x		 */
  movaps  xmm4,	xmm3  /*  c					 */
  addps	  xmm4,	xmm4  /* 2c					 */
  addps	  xmm3,	xmm6  /*  c + 2r				 */
  addps	  xmm4,	xmm5  /* 2c +  r				 */
  mulps	  xmmX,	xmm3  /* ( c + 2r) / (2c +  r) * x’		 */
  divps	  xmmX,	xmm4  /* ( c + 2r) / (2c +  r)			 */
/*rcpps	  xmm4,	xmm4   *         1 / (2c +  r)			 */
/*mulps	  xmmX,	xmm4   * ( c + 2r) / (2c +  r)			 */

  orps	  xmmX,	xmm7  /* restore sign				 */

The iteration is unrolled can be repeated as much as you want to raise precision. The only issue is with the sign-masking as this requires memory access, otherwise the function is entirely variable free. The initial approximation via sqrt(x) is not required to be exact, picture it as a seed (you can even fill in a random variable, the function still will converge) so it can utilize the very fast reciprocal square-root. The divisions can be replaced by reciprocals too, but the precision suffers so much it isn’t worth the speedup (rcp is very very imprecise, much less than the 3DNow pfrcp, and there is no refinement-function in SSE like in 3DNow) as you’d have to add additional iterations to compensate for it and the repeated block is slower than a single division.

Kein Kommentar

Tags

  • Algorithms Assembler bash C++ Homemade Javascript MMX Optimization Rants TYPO3

Categories

    • Applications
      • TYPO3
    • TidBits
      • Algorithms
      • Approximations
      • Equivalence
      • Fixes
      • Javascript
      • Optimizations
      • Scripts
    • Uncategorized

Meta

    • Log in
    • Entries RSS
    • Comments RSS
    • WordPress.org

Recent Posts

    • jQuery’s animate is short-thought
    • movntq alignment
    • OP-equivalent series (pavgd)
    • OP-equivalent series (psubq)
    • OP-equivalent series (paddq)

 

  • September 2010
    M T W T F S S
    « Aug    
     12345
    6789101112
    13141516171819
    20212223242526
    27282930  

Archives

    • August 2010
    • April 2010
    • February 2010
    • January 2010
    • July 2009
    • February 2009
    • December 2008
    • November 2008
    • August 2008
    • July 2008
    • May 2008
© 2010 Wired by Fröhling
Design von Dezzain Studio
Übersetzt von Htwo
Nature Pictures | Bamboo Blinds