<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Fröhling</title>
	<atom:link href="http://blog.frohling.biz/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.frohling.biz</link>
	<description>Niels Fröhling about [X]HTML, JS and technical tidbits</description>
	<pubDate>Sat, 21 Aug 2010 19:13:15 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
			<item>
		<title>jQuery&#8217;s animate is short-thought</title>
		<link>http://blog.frohling.biz/2010/08/jquerys-animate-is-short-thought/</link>
		<comments>http://blog.frohling.biz/2010/08/jquerys-animate-is-short-thought/#comments</comments>
		<pubDate>Sat, 21 Aug 2010 18:35:12 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[Rants]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=26</guid>
		<description><![CDATA[<br/>
Currently I&#8217;m experimenting with jQuery (always being told prototype.js is evil). In my project I have to build some fairly complex animations, and I tried to find out whether I can map them onto the animate() syntax. Uh, and all hell broke loose. Here is my 24h experience:

You can&#8217;t animate regular DOM properties, like image width/height, [...]]]></description>
			<content:encoded><![CDATA[
<br/><p>Currently I&#8217;m experimenting with jQuery (always being told prototype.js is evil). In my project I have to build some fairly complex animations, and I tried to find out whether I can map them onto the animate() syntax. Uh, and all hell broke loose. Here is my 24h experience:</p>
<ol>
<li>You can&#8217;t animate regular DOM properties, like image width/height, which animate much more smoothly than styles.</li>
<li><span style="text-decoration: line-through;">It does not hold custom states, nor does it allow you to animate a virtual property &#8216;counter&#8217; from which you could fork all kinds of changes in the step-function. You HAVE TO animate a real CSS style. Thus you fall back to your own timers and lose the easing-framework.</span> <a title="Fun with jQuery’s “animate()”" href="http://james.padolsey.com/javascript/fun-with-jquerys-animate/" target="_blank">Found it.</a> But not optimal &#8230;</li>
<li>You cannot hold any state: if you stop an animation you lose the information about its position within [0.0,1.0], and everything else. You have to track that via the step-function and a calculation from the &#8216;now&#8217; argument, OR you calculate the position directly from the modified CSS value.</li>
<li>Each modified style enters the effects-queue as a separate function, or put differently: each property passed to animate becomes an entirely separate animation!</li>
<li>animate&#8217;s step function is called per step per property! Together with 4), simultaneous animation of multiple values becomes almost impossible! A simple zoom function (width+height on the content, left+top on the parent relative to the child&#8217;s size change) requires you to write approx. 50 lines!</li>
<li>You can&#8217;t pause/resume animations (no states); you have to recalculate the position (see 3), recalculate how much time is left and restart the animation.</li>
<li>width/height animations force overflow hidden and display block. In 2010 I&#8217;d like to animate overflow visible and display inline-block too! Maybe even table columns if you don&#8217;t mind &#8230;</li>
<li>You can&#8217;t define proxy-animations like &#8220;font-size = height / 10&#8221;; normally I&#8217;d say it should be possible to reference any animatable attribute/style.</li>
<li>Because of that you can also forget about the &#8220;font-size += delta-height&#8221; type.</li>
<li>All calculations are done in floating point; as no browser supports <a title="CSS Sub-pixel Background Misalignments" href="http://acko.net/blog/css-sub-pixel-background-misalignments">sub-pixel accuracy</a> you&#8217;ll get horrible shifting while zooming an image, for example. The &#8220;zoom&#8221; effect should be an integral part of animate anyway. Either way, [left,left+width] are on a pixel-precision grid: if left is rounded down, left+width has to be rounded up (for zooming, that is). Of course trying to account for this leads to headbanging on the keyboard, because the animations are all separate &#8230; and the step function is called per property &#8230;</li>
</ol>
<p>Grr. Such a small function and so many issues.</p>

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/08/jquerys-animate-is-short-thought/feed/</wfw:commentRss>
		</item>
		<item>
		<title>movntq alignment</title>
		<link>http://blog.frohling.biz/2010/04/movntq-alignment/</link>
		<comments>http://blog.frohling.biz/2010/04/movntq-alignment/#comments</comments>
		<pubDate>Mon, 12 Apr 2010 10:26:53 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[Assembler]]></category>

		<category><![CDATA[Rants]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=25</guid>
		<description><![CDATA[<br/>
Uh, sometimes you don&#8217;t know what&#8217;s in the mind of the Intel guys.
They state in their documentation that the movntq-op (taken over from MMXExt) has to take place on 16-byte aligned memory. What??? Either it&#8217;s a copy &#38; paste error and it&#8217;s probably supposed to mean 8-byte aligned memory, or you got a command which can&#8217;t be [...]]]></description>
			<content:encoded><![CDATA[
<br/><p>Uh, sometimes you don&#8217;t know what&#8217;s in the mind of the Intel guys.</p>
<p>They state in their documentation that the movntq-op (taken over from MMXExt) has to take place on 16-byte aligned memory. What??? Either it&#8217;s a copy &amp; paste error and it&#8217;s probably supposed to mean 8-byte aligned memory, or you got a command which can&#8217;t be executed.</p>
<p>Besides, no op on MMX / 3DNow needs alignment, so I think it&#8217;s even entirely wrong &#8230; either way. Nice work, Intel.</p>

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/04/movntq-alignment/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OP-equivalent series (pavgd)</title>
		<link>http://blog.frohling.biz/2010/02/op-equivalent-series-pavgd/</link>
		<comments>http://blog.frohling.biz/2010/02/op-equivalent-series-pavgd/#comments</comments>
		<pubDate>Tue, 16 Feb 2010 00:53:07 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Equivalence]]></category>

		<category><![CDATA[Assembler]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=24</guid>
		<description><![CDATA[<br/>
pavgd (32bit average, [(a + b + 1) >> 1]):
/* add (a + b), and compare overflow */
movq	mm6, mmc
paddd	mmc, mmd
psubd	mm6, 0x80000000
psubd	mmc, 0x80000000
pcmpgtd	mm6, mmc
paddd	mmc, 0x80000000

/* add (ab + 1), and compare overflow */
pcmpeqd	mm5, mm5
pcmpeqd	mmd, mmd
pcmpeqd	mm5, mmc
psubd	mmc, mmd

/* shift carry in */
por	mm6, mm5
pslld	mm6, 31
psrld	mmc, 1
por	mmc, mm6


]]></description>
			<content:encoded><![CDATA[
<br/><p><strong>pavgd (32bit average, [(a + b + 1) >> 1]):</strong></p>
<pre><code style="color: white; font-family: monospace;">/* add (a + b), and compare overflow */
movq	mm6, mmc
paddd	mmc, mmd
psubd	mm6, 0x80000000
psubd	mmc, 0x80000000
pcmpgtd	mm6, mmc
paddd	mmc, 0x80000000

/* add (ab + 1), and compare overflow */
pcmpeqd	mm5, mm5
pcmpeqd	mmd, mmd
pcmpeqd	mm5, mmc
psubd	mmc, mmd

/* shift carry in */
por	mm6, mm5
pslld	mm6, 31
psrld	mmc, 1
por	mmc, mm6
</code></pre>
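<p>To check the sequence above against something readable, here is a scalar C model of one 32-bit lane. It is only a sketch: the function name is made up for this example, and it assumes unsigned inputs.</p>

```c
#include <stdint.h>

/* Scalar model of one lane: rounded-up average (a + b + 1) >> 1 of two
 * unsigned 32-bit values, computed without a wider type.
 * The function name is hypothetical and not part of the original listing. */
uint32_t avg_round_up_u32(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;                      /* paddd: wraps modulo 2^32     */
    uint32_t carry_add = (sum < a);            /* biased pcmpgtd: a > sum      */
    uint32_t carry_inc = (sum == 0xFFFFFFFFu); /* pcmpeqd: the +1 would wrap   */

    sum += 1u;                                 /* psubd with an all-ones value */

    /* por, pslld 31, psrld 1, por: shift the lost 33rd bit back in on top */
    return (sum >> 1) | ((carry_add | carry_inc) << 31);
}
```

<p>The two carries can never both be set, because a wrapped sum of two 32-bit values is at most 0xFFFFFFFE, so OR-ing them is enough.</p>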

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/02/op-equivalent-series-pavgd/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OP-equivalent series (psubq)</title>
		<link>http://blog.frohling.biz/2010/01/op-equivalent-series-psubq/</link>
		<comments>http://blog.frohling.biz/2010/01/op-equivalent-series-psubq/#comments</comments>
		<pubDate>Thu, 14 Jan 2010 10:48:55 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Equivalence]]></category>

		<category><![CDATA[Assembler]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=23</guid>
		<description><![CDATA[<br/>
psubq (64bit - 64bit = 64bit):
movq	mm7, mmc
psubd	mmc, mms
movq	mm6, mmc
psubd	mm7, 0x80000000
psubd	mm6, 0x80000000
pcmpgtd	mm6, mm7
psllq	mm6, 32
paddd	mmc, mm6


]]></description>
			<content:encoded><![CDATA[
<br/><p><strong>psubq (64bit - 64bit = 64bit):</strong></p>
<pre><code style="color: white; font-family: monospace;">movq	mm7, mmc
psubd	mmc, mms
movq	mm6, mmc
psubd	mm7, 0x80000000
psubd	mm6, 0x80000000
pcmpgtd	mm6, mm7
psllq	mm6, 32
paddd	mmc, mm6
</code></pre>
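<p>The borrow handling is easier to follow in scalar form. The sketch below models one 64-bit subtraction built from 32-bit halves; the function name is an assumption of this example. It relies on two facts: subtracting 0x80000000 is the same as flipping the sign bit, and a borrow has to take one off the high half, which for an all-ones mask means adding it.</p>

```c
#include <stdint.h>

/* Scalar model: 64-bit subtraction from 32-bit halves with borrow detection.
 * The function name is hypothetical, chosen for this sketch only. */
uint64_t sub64_from_halves(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t d_lo = a_lo - b_lo;   /* psubd, low lane  */
    uint32_t d_hi = a_hi - b_hi;   /* psubd, high lane */

    /* A borrow happened if the low difference is above a_lo as unsigned.
     * Only a signed compare is available, so both sides get the sign bit
     * flipped first (the same effect as psubd with 0x80000000). */
    int32_t biased_d = (int32_t)(d_lo ^ 0x80000000u);
    int32_t biased_a = (int32_t)(a_lo ^ 0x80000000u);
    uint32_t borrow_mask = (biased_d > biased_a) ? 0xFFFFFFFFu : 0u;

    d_hi += borrow_mask;           /* adding -1 removes the borrowed one */
    return ((uint64_t)d_hi << 32) | d_lo;
}
```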

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/01/op-equivalent-series-psubq/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OP-equivalent series (paddq)</title>
		<link>http://blog.frohling.biz/2010/01/op-equivalent-series-paddq/</link>
		<comments>http://blog.frohling.biz/2010/01/op-equivalent-series-paddq/#comments</comments>
		<pubDate>Fri, 08 Jan 2010 07:46:29 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Equivalence]]></category>

		<category><![CDATA[Assembler]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=22</guid>
		<description><![CDATA[<br/>
paddq (64bit + 64bit = 64bit):
movq	mm7, mmc
paddd	mmc, mma
psubd	mm7, 0x80000000
psubd	mmc, 0x80000000
pcmpgtd	mm7, mmc
paddd	mmc, 0x80000000
psllq	mm7, 32
psubd	mmc, mm7


]]></description>
			<content:encoded><![CDATA[
<br/><p><strong>paddq (64bit + 64bit = 64bit):</strong></p>
<pre><code style="color: white; font-family: monospace;">movq	mm7, mmc
paddd	mmc, mma
psubd	mm7, 0x80000000
psubd	mmc, 0x80000000
pcmpgtd	mm7, mmc
paddd	mmc, 0x80000000
psllq	mm7, 32
psubd	mmc, mm7
</code></pre>
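<p>As a cross-check, the same carry trick in scalar C. This is a sketch with a made-up function name; it mirrors the steps of the listing above rather than claiming to be the only way to do it.</p>

```c
#include <stdint.h>

/* Scalar model: 64-bit addition from 32-bit halves with carry detection.
 * The function name is hypothetical, chosen for this sketch only. */
uint64_t add64_from_halves(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t s_lo = a_lo + b_lo;   /* paddd, low lane  */
    uint32_t s_hi = a_hi + b_hi;   /* paddd, high lane */

    /* The low lane wrapped if a_lo is above the sum as unsigned. The compare
     * is signed, so both operands get their sign bit flipped beforehand. */
    int32_t biased_a = (int32_t)(a_lo ^ 0x80000000u);
    int32_t biased_s = (int32_t)(s_lo ^ 0x80000000u);
    uint32_t carry_mask = (biased_a > biased_s) ? 0xFFFFFFFFu : 0u;

    s_hi -= carry_mask;            /* subtracting -1 adds the carry */
    return ((uint64_t)s_hi << 32) | s_lo;
}
```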

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/01/op-equivalent-series-paddq/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OP-equivalent series (packusqd, packssqud)</title>
		<link>http://blog.frohling.biz/2010/01/op-equivalent-series-packusqd/</link>
		<comments>http://blog.frohling.biz/2010/01/op-equivalent-series-packusqd/#comments</comments>
		<pubDate>Fri, 08 Jan 2010 05:37:43 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Equivalence]]></category>

		<category><![CDATA[Assembler]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=21</guid>
		<description><![CDATA[<br/>
packusqd (saturated clamp from unsigned long long to unsigned long),
packssqud (saturated clamp from signed long long to unsigned long):
There is a condition for unsigned long long inputs, range is &#8220;only&#8221; [0x0000000000000000, 0x7FFFFFFFFFFFFFFF].
There is no condition for signed long long inputs.
// from-scratch, no helper available

    /* -1 */
    pcmpeqd	mm6, [...]]]></description>
			<content:encoded><![CDATA[
<br/><p><strong>packusqd (saturated clamp from unsigned long long to unsigned long),</strong><br />
<strong>packssqud (saturated clamp from signed long long to unsigned long):</strong></p>
<p>There is a condition for unsigned long long inputs, range is &#8220;only&#8221; [0x0000000000000000, 0x7FFFFFFFFFFFFFFF].<br />
There is no condition for signed long long inputs.</p>
<pre><code style="color: white; font-family: monospace;">// from-scratch, no helper available

    /* -1 */
    pcmpeqd	mm6, mm6

    /* x >> 32 > -1 */
    movq	mm4, mmc0
    movq	mm5, mm2

    /* no psraq/pshufd/pshufw available,
     * duplicate ((x >> 32) | x) */
    punpckhdq	mm4, mmc0
    punpckhdq	mm5, mm2

    pcmpgtd	mm4, mm6
    pcmpgtd	mm5, mm6

    pand	mmc0, mm4
    pand	mm2, mm5

    /* 0 */
    pxor	mm7, mm7

    /* x >> 32 == 0 */
    movq	mm4, mmc0
    movq	mm5, mm2

    /* no psraq/pshufd/pshufw available,
     * duplicate ((x >> 32) | x) */
    punpckhdq	mm4, mmc0
    punpckhdq	mm5, mm2

    pcmpeqd	mm4, mm7
    pcmpeqd	mm5, mm7

    pand	mmc0, mm4
    pand	mm2, mm5

    /* 0xFFFFFFFF */
    pandn	mm4, mm6
    pandn	mm5, mm6

    por		mmc0, mm4
    por		mm2, mm5

    punpckldq	mmc0, mm2
</code></pre>
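<p>For reference, the masking logic for one element looks like this in scalar C. It is a sketch only, and the function name is an assumption of this example.</p>

```c
#include <stdint.h>

/* Scalar model of one element: clamp a signed 64-bit value to [0, 0xFFFFFFFF]
 * using only compares and bit masks, as the listing above does.
 * The function name is hypothetical. */
uint32_t saturate_i64_to_u32(int64_t x)
{
    uint32_t lo = (uint32_t)(uint64_t)x;
    uint32_t hi = (uint32_t)((uint64_t)x >> 32);

    /* pcmpgtd against -1: all ones if the high half is non-negative */
    uint32_t nonneg = ((int32_t)hi > -1) ? 0xFFFFFFFFu : 0u;
    lo &= nonneg;                  /* pand: negative inputs become 0 */
    hi &= nonneg;

    /* pcmpeqd against 0: all ones if the value fits into 32 bits */
    uint32_t fits = (hi == 0u) ? 0xFFFFFFFFu : 0u;

    /* pand, pandn with -1, por: keep the value or replace it by 0xFFFFFFFF */
    return (lo & fits) | ~fits;
}
```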

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/01/op-equivalent-series-packusqd/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OP-equivalent series (packusdw, packssduw)</title>
		<link>http://blog.frohling.biz/2010/01/op-equivalent-series-packusdw/</link>
		<comments>http://blog.frohling.biz/2010/01/op-equivalent-series-packusdw/#comments</comments>
		<pubDate>Thu, 07 Jan 2010 18:52:19 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Equivalence]]></category>

		<category><![CDATA[Assembler]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=20</guid>
		<description><![CDATA[<br/>
packusdw (saturated clamp from unsigned long to unsigned short),
packssduw (saturated clamp from signed long to unsigned short):
There is a condition for unsigned long inputs, range is &#8220;only&#8221; [0x00000000, 0x7FFFFFFF].
There is a condition for signed long inputs, range is &#8220;only&#8221; [0x80008000, 0x7FFFFFFF]. You can go to full signed long if there would exist &#8220;psubsd&#8221;, which does [...]]]></description>
			<content:encoded><![CDATA[
<br/><p><strong>packusdw (saturated clamp from unsigned long to unsigned short),</strong><br />
<strong>packssduw (saturated clamp from signed long to unsigned short):</strong></p>
<p>There is a condition for unsigned long inputs, range is &#8220;only&#8221; [0x00000000, 0x7FFFFFFF].<br />
There is a condition for signed long inputs, range is &#8220;only&#8221; [0x80008000, 0x7FFFFFFF]. You can go to full signed long if there would exist &#8220;psubsd&#8221;, which does not.</p>
<pre><code style="color: white; font-family: monospace;">// via packssdw

psubd		xmmx, 0x00008000	// signed short in long
psubd		xmmy, 0x00008000
packssdw	xmmx, xmmy		// cast long to short
paddw		xmmx, 0x8000		// unsigned short
</code></pre>
<pre><code style="color: white; font-family: monospace;">// with punpck and variable

movdqa		xmm?1, 0x00008000
movdqa		xmm?2, xmm?1
punpcklwd	xmm?2, xmm?1		// 0x0000000080008000
punpckldq	xmm?2, xmm?2		// 0x8000800080008000
punpcklqdq	xmm?2, xmm?2		// same pattern in the upper qword

psubd		xmmx, xmm?1
psubd		xmmy, xmm?1
packssdw	xmmx, xmmy
paddw		xmmx, xmm?2
</code></pre>
<pre><code style="color: white; font-family: monospace;">// with pshufw and variable

movdqa		xmm?1, 0x00008000
pshuflw		xmm?2, xmm?1, 2|2|0|0	// low qword:  0x8000800080008000
pshufhw		xmm?2, xmm?2, 2|2|0|0	// high qword: 0x8000800080008000

psubd		xmmx, xmm?1
psubd		xmmy, xmm?1
packssdw	xmmx, xmmy
paddw		xmmx, xmm?2
</code></pre>
<pre><code style="color: white; font-family: monospace;">// with pshufw and variable and no memory access

pcmpeqd		xmm?1, xmm?1		// 0xFFFFFFFF
pslld		xmm?1, 31		// 0x80000000
psrld		xmm?1, 16		// 0x00008000
pshuflw		xmm?2, xmm?1, 2|2|0|0	// low qword:  0x8000800080008000
pshufhw		xmm?2, xmm?2, 2|2|0|0	// high qword: 0x8000800080008000

psubd		xmmx, xmm?1
psubd		xmmy, xmm?1
packssdw	xmmx, xmmy
paddw		xmmx, xmm?2
</code></pre>
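<p>The bias trick behind all four variants can be written down for a single element in C. This is a sketch with a made-up function name, and it assumes the input stays within the range stated above so that the first subtraction cannot wrap.</p>

```c
#include <stdint.h>

/* Scalar model of one element: unsigned 16-bit saturation built from a
 * signed 16-bit saturation plus a bias of 0x8000.
 * The function name is hypothetical.
 * Assumption: x >= INT32_MIN + 0x8000, otherwise the subtraction overflows. */
uint16_t saturate_i32_to_u16(int32_t x)
{
    int32_t shifted = x - 0x8000;  /* psubd: move [0, 65535] onto [-32768, 32767] */

    int16_t packed;                /* packssdw: signed saturation to 16 bit */
    if (shifted > 32767)
        packed = 32767;
    else if (shifted < -32768)
        packed = -32768;
    else
        packed = (int16_t)shifted;

    /* paddw: undo the bias, wrapping modulo 2^16 */
    return (uint16_t)((uint16_t)packed + 0x8000u);
}
```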

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/01/op-equivalent-series-packusdw/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Cubic root approximation</title>
		<link>http://blog.frohling.biz/2010/01/cubic-root-approximation/</link>
		<comments>http://blog.frohling.biz/2010/01/cubic-root-approximation/#comments</comments>
		<pubDate>Sat, 02 Jan 2010 22:26:59 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Approximations]]></category>

		<category><![CDATA[Assembler]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=19</guid>
		<description><![CDATA[<br/>
The SSE-ops lack a huge amount of very useful functions. Basically every single mathematically relevant function is absent; worse, trying to re-create them and reach the same precision as the FPU-routines results in code that is magnitudes slower. Sometimes one can find a useful short-cut, but not very often.
Here I tried to make a cbrt(x), the cubic [...]]]></description>
			<content:encoded><![CDATA[
<br/><p>The SSE-ops lack a huge amount of very useful functions. Basically every single mathematically relevant function is absent; worse, trying to re-create them and reach the same precision as the FPU-routines results in code that is magnitudes slower. Sometimes one can find a useful short-cut, but not very often.</p>
<p>Here I tried to make a <em>cbrt(x)</em>, the cube root. It&#8217;s basically a power function, which is equivalent to the following Euler form:</p>
<pre name="code" class="cpp">double pow(double n, double r) {
    if (n &gt; 0.0)
      return  exp(r * log ( n));
    else
      return -exp(r * log (-n));	// only meaningful for odd roots such as r = 1/3
}
</pre>
<p>The problem is, <em>exp(x)</em> and <em>log(x)</em> aren&#8217;t available either. Basically any function can be approximated by rational polynomials, Taylor series or some special Newtonian scheme. All of them are infinitely iterative, but mostly you won&#8217;t need infinite precision. <em>pow(x,y)</em>, <em>exp(x)</em> and <em>log(x)</em> can all be approximated by more or less equally complex polynomials, so it&#8217;s a bad deal to replace one approximation by two. We&#8217;ll stay with directly approximating <em>cbrt(n)</em>.</p>
<p>Here are two rational-polynomial approximations:</p>
<pre name="code" class="cpp">// order 3/4
double cbrt(double x) {
return (

      99.942970125 * x
 +  4591.64560671 * x*x
 + 14742.828103353 * x*x*x

) / /* ------------------------------ */ (

       1
 +   491.698363982 * x
 +  9105.81746731 * x*x
 + 11071.95822942296354 * x*x*x
 -  1236.0573805268561  * x*x*x*x

);
}
</pre>
<pre name="code" class="cpp">// order 4/4
double cbrt(double x) {
return (

     173.791188868 * x
+  20254.442314157 * x*x
+ 212906.097079561 * x*x*x
+ 195085.639644873 * x*x*x*x

) / /* ------------------------------ */ (

       1
+   1147.163861142 * x
+  57143.213576431 * x*x
+ 278351.638227136 * x*x*x
+  91776.954562751 * x*x*x*x

);
}
</pre>
<p>The problem is, they are too inexact. As we do have <em>sqrt(x)</em> in SSE we can try the much faster (cubically) converging <a href="http://www.mathpath.org/Algor/cuberoot/cube.root.newton.htm">Newtonian iterative</a> approach (it&#8217;s <a href="http://en.wikipedia.org/wiki/Halley%27s_method">Halley&#8217;s method</a>):</p>
<pre><code style="color: white; font-family: monospace;">       2x^3 + 4r
 x' = ----------- x
       4x^3 + 2r

       2(1x^3 + 2r)
 x' = -------------- x
       2(2x^3 + 1r)

        x^3 + 2r
 x' = ----------- x
       2x^3 +  r
</code></pre>
<p>Resulting in the following C-implementation:</p>
<pre name="code" class="cpp">#include &lt;math.h&gt;

// Halley iteration for the cube root, seeded with sqrt(|n|)
double cbrt_halley(double n, int iterations) {
  const double t = 2.0;	// two
  const double f = 4.0;	// four

  if (n == 0.0)
    return n;	// avoid 0 / 0 below

  const double s = copysign(1.0, n); n = fabs(n);
  double x = sqrt(n);	// seed

  for (int i = 0; i &lt; iterations; i++) {
    const double c = x * x * x;
    x = ((c * t) + (n * f)) * x
      / ((c * f) + (n * t));
  }

  return x * s;
}
</pre>
<p>Unrolled, this leads to a relatively good and quick SSE variant:</p>
<pre><code style="color: white; font-family: monospace;">  movaps  xmm7, xmmX  /* store sign				 */
  andps	  xmmX, xmmword ptr [ absVPUf ]
  andps	  xmm7, xmmword ptr [ sgnVPUf ]

  movaps xmm5,	xmmX  /* xmm5 =	 r	  for Nth iteration only */
  movaps  xmm6,	xmmX  /*					 */
  addps	  xmm6,	xmm6  /* xmm6 =	2r	  for Nth iteration only */

<span style="color: red;">/*sqrtps  xmmX,	xmmX   * sqrt(x)				 */</span>
<span style="color: #FF7F7F;">  rsqrtps xmm4,	xmmX  /* xmm4 = 1/sqrt(x),	first pass	 */
  mulps	  xmmX,	xmm4  /* sqrt(x)				 */</span>

<strong>  movaps  xmm3,	xmmX  /* 1st iteration				 */
  mulps	  xmm3,	xmmX  /* xmmX =	x				 */
  mulps	  xmm3,	xmmX  /* xmm3 =	xmm4 = x * x * x		 */
  movaps  xmm4,	xmm3  /*  c					 */
  addps	  xmm4,	xmm4  /* 2c					 */
  addps	  xmm3,	xmm6  /*  c + 2r				 */
  addps	  xmm4,	xmm5  /* 2c +  r				 */
  mulps	  xmmX,	xmm3  /* ( c + 2r) / (2c +  r) * x&#8217;		 */
<span style="color: #FF7F7F;">  divps	  xmmX,	xmm4  /* ( c + 2r) / (2c +  r)			 */</span>
<span style="color: red;">/*rcpps	  xmm4,	xmm4   *         1 / (2c +  r)			 */</span>
<span style="color: red;">/*mulps	  xmmX,	xmm4   * ( c + 2r) / (2c +  r)			 */</span>
</strong>
  movaps  xmm3,	xmmX  /* 2nd iteration				 */
  mulps	  xmm3,	xmmX  /* xmmX =	x				 */
  mulps	  xmm3,	xmmX  /* xmm3 =	xmm4 = x * x * x		 */
  movaps  xmm4,	xmm3  /*  c					 */
  addps	  xmm4,	xmm4  /* 2c					 */
  addps	  xmm3,	xmm6  /*  c + 2r				 */
  addps	  xmm4,	xmm5  /* 2c +  r				 */
  mulps	  xmmX,	xmm3  /* ( c + 2r) / (2c +  r) * x&#8217;		 */
<span style="color: #FF7F7F;">  divps	  xmmX,	xmm4  /* ( c + 2r) / (2c +  r)			 */</span>
<span style="color: red;">/*rcpps	  xmm4,	xmm4   *         1 / (2c +  r)			 */</span>
<span style="color: red;">/*mulps	  xmmX,	xmm4   * ( c + 2r) / (2c +  r)			 */</span>

  movaps  xmm3,	xmmX  /* 3rd iteration				 */
  mulps	  xmm3,	xmmX  /* xmmX =	x				 */
  mulps	  xmm3,	xmmX  /* xmm3 =	xmm4 = x * x * x		 */
  movaps  xmm4,	xmm3  /*  c					 */
  addps	  xmm4,	xmm4  /* 2c					 */
  addps	  xmm3,	xmm6  /*  c + 2r				 */
  addps	  xmm4,	xmm5  /* 2c +  r				 */
  mulps	  xmmX,	xmm3  /* ( c + 2r) / (2c +  r) * x&#8217;		 */
<span style="color: #FF7F7F;">  divps	  xmmX,	xmm4  /* ( c + 2r) / (2c +  r)			 */</span>
<span style="color: red;">/*rcpps	  xmm4,	xmm4   *         1 / (2c +  r)			 */</span>
<span style="color: red;">/*mulps	  xmmX,	xmm4   * ( c + 2r) / (2c +  r)			 */</span>

  orps	  xmmX,	xmm7  /* restore sign				 */
</code></pre>
<p>The unrolled iteration can be repeated as often as you want to raise the precision. The only issue is the sign-masking, as this requires memory access; otherwise the function is entirely variable-free. The initial approximation via <em>sqrt(x)</em> is not required to be exact. Picture it as a seed (you can even fill in an arbitrary value, the function will still converge), so it can use the very fast reciprocal square-root. The divisions can be replaced by reciprocals too, but the precision suffers so much that it isn&#8217;t worth the speedup: rcp is very imprecise, much less precise than the 3DNow pfrcp, and SSE has no refinement function like 3DNow does, so you&#8217;d have to add further iterations to compensate, and the repeated block is slower than a single division.</p>
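<p>The claim about the seed is easy to try out. The sketch below runs the same update from an arbitrary positive starting value; both function names are made up for this example, and it only covers positive inputs and positive seeds.</p>

```c
/* One Halley step for the cube root of r: x' = x * (x^3 + 2r) / (2x^3 + r).
 * Both function names are hypothetical, chosen for this sketch only. */
double halley_cbrt_step(double x, double r)
{
    double c = x * x * x;
    return x * (c + 2.0 * r) / (2.0 * c + r);
}

/* Run a fixed number of steps from a caller-supplied positive seed. */
double halley_cbrt_from_seed(double r, double seed, int steps)
{
    double x = seed;
    for (int i = 0; i < steps; i++)
        x = halley_cbrt_step(x, r);
    return x;
}
```

<p>A seed that is far too large is roughly halved per step until it gets close, and from there the error shrinks cubically. So a poor seed costs a few extra iterations, but not the result.</p>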

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/01/cubic-root-approximation/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OP-equivalent series (padddqd)</title>
		<link>http://blog.frohling.biz/2010/01/op-equivalent-series-padddqd/</link>
		<comments>http://blog.frohling.biz/2010/01/op-equivalent-series-padddqd/#comments</comments>
		<pubDate>Sat, 02 Jan 2010 21:49:07 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Equivalence]]></category>

		<category><![CDATA[Assembler]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=18</guid>
		<description><![CDATA[<br/>
padddq (128bit + 64bit = 128bit):
// carry emulation via pcmpgtq equivalent

movdqa		xmmx, xmma
movdqa		xmmy, xmmb
movdqa		xmmz, xmma
paddq		xmma, xmmb
pcmpgtd		xmmx, xmmb
pcmpgtd		xmmy, xmma
pcmpeqd		xmmz, xmmb
pshufd		xmmx, xmmx, 1&#124;1&#124;1&#124;1
pshufd		xmmz, xmmz, 1&#124;1&#124;1&#124;1
pand		xmmz, xmmy
por		xmmz, xmmx
punpcklqdq	xmmz, xmmz
pslldq		xmmz, 8		or pshufd
psubq		xmma, xmmz


]]></description>
			<content:encoded><![CDATA[
<br/><p><strong>padddq (128bit + 64bit = 128bit):</strong></p>
<pre><code style="color: white; font-family: monospace;">// carry emulation via pcmpgtq equivalent

movdqa		xmmx, xmma
movdqa		xmmy, xmmb
movdqa		xmmz, xmma
paddq		xmma, xmmb
pcmpgtd		xmmx, xmmb
pcmpgtd		xmmy, xmma
pcmpeqd		xmmz, xmmb
pshufd		xmmx, xmmx, 1|1|1|1
pshufd		xmmz, xmmz, 1|1|1|1
pand		xmmz, xmmy
por		xmmz, xmmx
punpcklqdq	xmmz, xmmz
pslldq		xmmz, 8		or pshufd
psubq		xmma, xmmz
</code></pre>
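<p>What the title describes, a 128-bit value plus a 64-bit value, comes down to one carry from the low into the high qword. A scalar sketch, assuming the 128-bit operand is passed as two 64-bit halves (the struct and function names are made up for this example):</p>

```c
#include <stdint.h>

/* A 128-bit value split into two 64-bit halves (hypothetical type). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* Scalar model: add a 64-bit value to a 128-bit value.
 * The function name is hypothetical. */
u128_parts add128_64(uint64_t a_lo, uint64_t a_hi, uint64_t b)
{
    u128_parts r;
    r.lo = a_lo + b;            /* wraps modulo 2^64              */
    r.hi = a_hi + (r.lo < b);   /* carry if the low half wrapped  */
    return r;
}
```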

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/01/op-equivalent-series-padddqd/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OP-equivalent series (pcmpgtq, pcmpeqq)</title>
		<link>http://blog.frohling.biz/2010/01/op-equivalent-series-pcmpgtq-pcmpeqq/</link>
		<comments>http://blog.frohling.biz/2010/01/op-equivalent-series-pcmpgtq-pcmpeqq/#comments</comments>
		<pubDate>Sat, 02 Jan 2010 21:45:31 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Equivalence]]></category>

		<category><![CDATA[Assembler]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=17</guid>
		<description><![CDATA[<br/>
When you don&#8217;t have SSE4 available you have to compensate for the lack of some essential ops. In this series I&#8217;m going to write down equivalents of either missing or (currently) non-AMD ops.
pcmpeqq (64bit integer compare):
// equality == (hi == hi) &#038;&#038; (lo == lo)

movdqa		xmm?3, xmma
pcmpeqd		xmm?3, xmmb
pshufd		xmm?1, xmm?3, 3&#124;3&#124;1&#124;1	(b == a)
pshufd		xmm?2, xmm?3, 2&#124;2&#124;0&#124;0	(B == A)
pand		xmm?1, [...]]]></description>
			<content:encoded><![CDATA[
<br/><p>When you don&#8217;t have SSE4 available you have to compensate for the lack of some essential ops. In this series I&#8217;m going to write down equivalents of either missing or (currently) non-AMD ops.</p>
<p><strong>pcmpeqq (64bit integer compare):</strong></p>
<pre><code style="color: white; font-family: monospace;">// equality == (hi == hi) &#038;&#038; (lo == lo)

movdqa		xmm?3, xmma
pcmpeqd		xmm?3, xmmb
pshufd		xmm?1, xmm?3, 3|3|1|1	(b == a)
pshufd		xmm?2, xmm?3, 2|2|0|0	(B == A)
pand		xmm?1, xmm?2		(b == a) &amp;&amp; (B == A)
</code></pre>
<p><strong>pcmpgtq (64bit integer compare):</strong></p>
<pre><code style="color: white; font-family: monospace;">// greater == (hi > hi) || ((hi == hi) &#038;&#038; (lo > lo))

movdqa		xmm?1, xmma
movdqa		xmm?2, xmma
pcmpgtd		xmm?1, xmmb
pcmpeqd		xmm?2, xmmb
pshufd		xmm?3, xmm?1, 3|3|1|1	(b &gt; a)
pshufd		xmm?4, xmm?2, 3|3|1|1	(b == a)
pshufd		xmm?2, xmm?1, 2|2|0|0	(B &gt; A)
pand		xmm?2, xmm?4		(b == a) &amp;&amp; (B &gt; A)
por		xmm?2, xmm?3		(b &gt; a) || ((b == a) &amp;&amp; (B &gt; A))

// note: pcmpgtd is a signed compare; for full-range inputs flip the
// sign bit of the low dwords of both operands before comparing
</code></pre>
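<p>Both compares can be checked against a scalar model that only uses 32-bit comparisons. This is a sketch with made-up function names; note that the high halves compare as signed while the low halves have to compare as unsigned.</p>

```c
#include <stdint.h>

/* Scalar model of pcmpeqq on one element: all ones if a == b, else zero.
 * Both function names are hypothetical, chosen for this sketch only. */
uint64_t cmpeq64_mask(int64_t a, int64_t b)
{
    uint32_t a_lo = (uint32_t)(uint64_t)a, a_hi = (uint32_t)((uint64_t)a >> 32);
    uint32_t b_lo = (uint32_t)(uint64_t)b, b_hi = (uint32_t)((uint64_t)b >> 32);

    return (a_hi == b_hi && a_lo == b_lo) ? 0xFFFFFFFFFFFFFFFFull : 0ull;
}

/* Scalar model of pcmpgtq on one element: all ones if a > b (signed). */
uint64_t cmpgt64_mask(int64_t a, int64_t b)
{
    int32_t  a_hi = (int32_t)(uint32_t)((uint64_t)a >> 32);  /* signed high half  */
    int32_t  b_hi = (int32_t)(uint32_t)((uint64_t)b >> 32);
    uint32_t a_lo = (uint32_t)(uint64_t)a;                   /* unsigned low half */
    uint32_t b_lo = (uint32_t)(uint64_t)b;

    int greater = (a_hi > b_hi) || (a_hi == b_hi && a_lo > b_lo);
    return greater ? 0xFFFFFFFFFFFFFFFFull : 0ull;
}
```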

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/01/op-equivalent-series-pcmpgtq-pcmpeqq/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
