<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Ethatron's Micro-Universe</title>
	<atom:link href="http://blog.frohling.biz/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.frohling.biz</link>
	<description>Niels Fröhling about the vast in the small, and then sometimes [X]HTML, JS and technical tidbits</description>
	<pubDate>Tue, 15 May 2012 04:53:11 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
			<item>
		<title>Me against the machine - The first contender</title>
		<link>http://blog.frohling.biz/2012/05/me-against-the-machine-the-first-contender/</link>
		<comments>http://blog.frohling.biz/2012/05/me-against-the-machine-the-first-contender/#comments</comments>
		<pubDate>Tue, 15 May 2012 04:21:03 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Algorithms]]></category>

		<category><![CDATA[Optimizations]]></category>

		<category><![CDATA[Compression]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=30</guid>
		<description><![CDATA[<br/>
I previously wrote about the setting we encounter in a regular (de)coding loop of an arithmetic coder. A pretty dense piece of code, just a few lines; how can we make something out of that? I tried to find algorithmic solutions to the biggest problems first, as I knew that simply out-doing the compiler [...]]]></description>
			<content:encoded><![CDATA[
<br/><p>I <a href="http://blog.frohling.biz/2012/05/me-against-the-machine-the-setting/">previously</a> wrote about the setting we encounter in a regular (de)coding loop of an arithmetic coder. A pretty dense piece of code, just a few lines; how can we make something out of that? I tried to find algorithmic solutions to the biggest problems first, as I knew that simply out-doing the compiler with assembler wouldn&#8217;t get me very far.</p>
<p>So the two problems were:</p>
<ul>
<li>3x (mostly high latency) divides</li>
<li>a bit-scan loop, with a virtual function call per step</li>
</ul>
<p>The first is really hard; we&#8217;ll look at it later. The second is much easier.</p>
<h3><strong>Part 2: The first contender</strong></h3>
<p>Let&#8217;s remember the code:</p>
<pre name="code" class="cpp">#define	halfRange   0x00008000U	// 15 bit fractional precision
/* input code bits until ast.range has been expanded to
 * more than QUARTER. Mimics encoder.
 */
if (cst.range &lt;= halfRange) {
  /* expand range with trailing 0s (less precise) */
  do {
    cst.range += cst.range;
    bio-&gt;BitIOLeft::absorbSample(cst.data);
  } while (cst.range &lt;= halfRange);
}</pre>
<p>I have a pretty complete bit-IO class which allows me to absorb multiple bits at once, so we just delay the function call until after the loop:</p>
<pre name="code" class="cpp">  do {
    cst.range += cst.range; num++;
  } while (cst.range &lt;= halfRange);

  bio-&gt;BitIOLeft::absorbSamples(num, cst.data);</pre>
<p>Now what&#8217;s left of the loop can also be written in another way, eliminating the operation on the range:</p>
<pre name="code" class="cpp">  do {
    num++;
  } while (cst.range &lt;= (halfRange &gt;&gt; num));

  cst.range &lt;&lt;= num;
  bio-&gt;BitIOLeft::absorbSamples(num, cst.data);</pre>
<p>As I happened to be reading the AMD SSE4.2b spec around that time, I recognized the possibility of replacing that loop with a single instruction, leading-zero-count, which can be emulated by bit-scan-reverse on older hardware (range is never 0, so there is no need for a failure check):</p>
<pre name="code" class="cpp">if ((num = zeroBits&lt;unsigned short int&gt;(cst.range)) &gt; 0) {
  cst.range &lt;&lt;= num;
  bio-&gt;BitIOLeft::absorbSamples(num, cst.data);
}</pre>
<p>In theory both operations, the shift and the function, can work with num being 0, and the conditional wouldn&#8217;t be necessary, but the no-op call would be more expensive than a branch misprediction, so we let it stay. Speaking of prediction, I asked myself how likely it really is to get a zero there or not. Let&#8217;s look at what &#8220;range&#8221; is the result of:</p>
<pre name="code" class="cpp">/* Narrow the code region to that allotted to this symbol */
cst.code += (incr = (cst.range * ivl-&gt;low) / ivl-&gt;total);
cst.range = ((cst.range * ivl-&gt;high) / ivl-&gt;total) - incr;</pre>
<p>&#8220;range&#8221; comes in with the MSB set because of the previous re-normalization, and here it shrinks by at least half (which in turn clears the MSB) for every incoming interval below half; surprise, surprise. If we have a statistical distribution of symbols in which the MPS has a probability above 0.5, coding the MPS wouldn&#8217;t produce a cleared MSB all the time, but the MPS would be coded almost all the time (in at least half of the cases), in which case we may have an unpredictable branch at probability 0.5. Such a source is rare in predictive image compression, though: with a Laplacian PDF the probabilities of MPS and SPS together barely reach 0.4, often only 0.3. In such a case not only the sparse symbols&#8217; probabilities cause cleared MSBs, but those of the MPS and SPS as well, yielding on average a very predictable 0.75 to 0.8 probability for the condition to be true (MSB cleared). The less predictable the source becomes, the more predictable the branch becomes. Likewise, the more extremely predictable the source becomes (MPS &gt; 0.5), the more predictable the branch becomes. Only for geometric probabilities does the branch&#8217;s predictability drop to 0.5.</p>
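<p>To make the MSB argument a bit more tangible, here is a minimal sketch of the narrowing step with plain numbers. The struct and function names (<code>CoderState</code>, <code>Interval</code>, <code>narrow</code>, <code>msbCleared</code>) and the 32-bit field widths are assumptions for illustration only, they are not the actual classes of the coder:</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the post's `cst` and `ivl`; names and widths are assumptions.
struct CoderState { std::uint32_t code; std::uint32_t range; };
struct Interval   { std::uint32_t low, high, total; };

// Narrow the code region to the part allotted to the symbol (same arithmetic as above).
inline CoderState narrow(CoderState cst, const Interval& ivl) {
  assert(ivl.total != 0 && ivl.low < ivl.high && ivl.high <= ivl.total);
  const std::uint32_t incr = (cst.range * ivl.low) / ivl.total;
  cst.code += incr;
  cst.range = ((cst.range * ivl.high) / ivl.total) - incr;
  return cst;
}

// True if bit 15 of the range is clear, i.e. re-normalization has work to do.
inline bool msbCleared(const CoderState& cst) {
  return (cst.range & 0x8000u) == 0;
}
```

<p>With a range of 0xC000, a symbol owning 1/4 of the total leaves a range of 0x3000 (MSB cleared), while a symbol owning 3/4 leaves 0x9000 (MSB still set), which is the asymmetry the branch predictor gets to see.</p>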
<p>Okay, the 10-15 instructions in a variable-latency loop have now become some 5 constant-latency instructions, and up to 15 erratic branches, potentially polluting the branch-prediction cache, have been replaced by a single branch that is well predictable in ordinary situations. In turn, the branch-prediction resources we freed here can be used by the probability model, which needs a lot of branching computation, so we indirectly sped up the model as well.</p>
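<p>A rough way to convince yourself that the count really replaces the loop is to run both over every possible 16-bit range. The sketch below is only that, a sketch: <code>leadingZeros16</code> is a hypothetical stand-in for my <code>zeroBits</code> template, it uses the GCC/Clang builtin <code>__builtin_clz</code> where available, and it normalizes until the top bit is set (<code>&lt; 0x8000</code>), which is what a leading-zero count yields. The loop quoted above compares with <code>&lt;=</code>, so as far as I can tell it takes one more step for exact powers of two.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Leading zeros of a non-zero 16-bit value (hypothetical stand-in for zeroBits<unsigned short>).
inline unsigned leadingZeros16(std::uint16_t v) {
  assert(v != 0);  // range is never 0
#if defined(__GNUC__) || defined(__clang__)
  // __builtin_clz counts over the full width of unsigned int; drop the extra upper bits.
  return static_cast<unsigned>(__builtin_clz(static_cast<unsigned>(v))) -
         (std::numeric_limits<unsigned>::digits - 16u);
#else
  unsigned n = 0;
  while ((v & 0x8000u) == 0) { v = static_cast<std::uint16_t>(v << 1); ++n; }
  return n;
#endif
}

// Reference form: shift one bit at a time until the top bit (0x8000) is set.
inline unsigned renormLoop(std::uint16_t& range) {
  assert(range != 0);
  unsigned num = 0;
  while ((range & 0x8000u) == 0) {
    range = static_cast<std::uint16_t>(range << 1);
    ++num;
  }
  return num;
}

// Same result with one count and one shift.
inline unsigned renormCount(std::uint16_t& range) {
  const unsigned num = leadingZeros16(range);
  range = static_cast<std::uint16_t>(range << num);
  return num;
}

// Exhaustively compares both forms over all non-zero 16-bit ranges.
inline bool renormFormsAgree() {
  for (std::uint32_t r = 1; r <= 0xFFFFu; ++r) {
    std::uint16_t a = static_cast<std::uint16_t>(r);
    std::uint16_t b = a;
    if (renormLoop(a) != renormCount(b) || a != b) return false;
  }
  return true;
}
```

<p>For a range of 0x0123 both forms report 7 bits to absorb and leave 0x9180 behind; the returned count is what would be handed to <code>absorbSamples()</code>.</p>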
<p>This alone may result in a 15% speed improvement. I know of only a single arithmetic coder, in any source, that uses this instruction sequence: the arithmetic coder from Ashford, which he used in his Dr. Dobb&#8217;s data compression competition entry &#8220;ash&#8221;.</p>
<p>Next time we&#8217;ll see how to pit smart code against a hardware circuit. :^)</p>
<h4>TOC:</h4>
<ol>
<li><a href="http://blog.frohling.biz/2012/05/me-against-the-machine-the-setting/">The setting</a></li>
<li>The first contender</li>
</ol>

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2012/05/me-against-the-machine-the-first-contender/feed/</wfw:commentRss>
		</item>
		<item>
		<title>How to speed up your AO-shader</title>
		<link>http://blog.frohling.biz/2012/05/how-to-speed-up-your-ao-shader/</link>
		<comments>http://blog.frohling.biz/2012/05/how-to-speed-up-your-ao-shader/#comments</comments>
		<pubDate>Mon, 14 May 2012 20:55:47 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Shaders]]></category>

		<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=29</guid>
		<description><![CDATA[<br/>
Ambient occlusion adds a lot of perceived realism to your game, but it is also a bandwidth- and arithmetic-heavy shader. While I was optimizing the shaders we implemented for OBGE I came up with a few tricks to make the AO shader 50% faster without sacrificing detail or precision. I&#8217;ll take HBAO as [...]]]></description>
			<content:encoded><![CDATA[
<br/><p><a href="http://en.wikipedia.org/wiki/Ambient_occlusion">Ambient occlusion</a> adds a lot of perceived realism to your game, but it is also a bandwidth- and arithmetic-heavy shader. While I was optimizing the shaders we implemented for <a href="http://obge.paradice-insight.us/wiki/Main_Page">OBGE</a> I came up with a few tricks to make the AO shader 50% faster without sacrificing detail or precision. I&#8217;ll take HBAO as an example here, but it works with any AO implementation.</p>
<h3>The initial implementation</h3>
<ul>
<li>store AO coefficients in the alpha-channel of the main rendertarget</li>
<li>branch out for sky fragments</li>
<li>calculate the contribution</li>
<li>AO coefficients were in human-interpretable form [0.0,1.0]: black represents no ambient light contribution, white full ambient light contribution</li>
<li>blur (while possibly upscaling) the coefficients with a 3&#215;3 kernel</li>
<li>merge AO and color into the final fragment</li>
</ul>
<p>Now this implementation has quite some inefficiencies:</p>
<ol>
<li>it reads/writes the full 4xFP16 several times, amounting to a huge waste of bandwidth</li>
<li>it reads and writes the same data (the color values from the renderbuffer), so you need 2x alternating renderbuffers, as you can&#8217;t read and write to the same buffer at the same time</li>
<li>older hardware and SM 3.0 have no return-from-branch instruction, which makes the shortcut for the sky little more than a cosmetic line of code</li>
<li>storing coefficients in the familiar black/white form adds unnecessary operations (namely &#8220;return 1.0 - ao&#8221;); the graphics hardware has this built in, just not as an op-code</li>
<li>blurring requires 8 texture fetches, from the full 4xFP16 renderbuffer</li>
<li>fragment merging is done in the shader, which also makes it necessary to read the full 4xFP16 renderbuffer in and then write it out</li>
</ol>
<p>The worst offender here is the huge waste of bandwidth. Interestingly, once we start fixing this aspect, a lot of other improvements align naturally with it.</p>
<h3>The fixes</h3>
<h4>Bandwidth reduction</h4>
<p>The initial bandwidth required is (I&#8217;m leaving out the z-buffer reads):</p>
<p style="padding-left: 30px;">4xFP16 read and write (shared alpha) +<br />
8 times 4xFP16 read (blur kernel) +<br />
4xFP16 read and write (merge) =<br />
10 times 4xFP16 read plus 2 times 4xFP16 write =<br />
<strong>120 MB/f read and 24 MB/f write (@1400&#215;1050)</strong></p>
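<p>For reference, the numbers can be reproduced with a few lines of arithmetic. This is only a sketch; the helper names are made up for illustration, and an FP16 channel is taken as 2 bytes. One 4xFP16 surface at 1400&#215;1050 is about 11.76 MB, which I round to 12 MB per pass above:</p>

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved by one full-screen pass over an FP16 surface (2 bytes per channel).
inline std::uint64_t surfaceBytes(std::uint32_t width, std::uint32_t height,
                                  std::uint32_t channels) {
  assert(channels >= 1 && channels <= 4);
  return static_cast<std::uint64_t>(width) * height * channels * 2u;
}

struct Traffic { std::uint64_t readBytes; std::uint64_t writeBytes; };

// Initial implementation: 10 full 4xFP16 reads and 2 full 4xFP16 writes per frame.
inline Traffic initialTraffic(std::uint32_t width, std::uint32_t height) {
  const std::uint64_t pass = surfaceBytes(width, height, 4u);
  return Traffic{10u * pass, 2u * pass};
}
```

<p>Calling <code>initialTraffic(1400, 1050)</code> gives 117,600,000 bytes read and 23,520,000 bytes written per frame, i.e. the rounded 120 MB/f and 24 MB/f. A 1xFP16 surface comes out at a quarter of a 4xFP16 one, roughly 3 MB per pass.</p>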
<p>The first thing we can do is to store the AO coefficients in another rendertarget. This not only allows us to use a precision for them that is distinct from the main renderbuffer&#8217;s, it also only needs to be 1 channel. Of course this adds memory cost, but if you have other post-processing shaders running, you may already have that buffer, or you can use the additional renderbuffer in other shaders as well. Additionally it eliminates the alternating renderbuffers for the main color data, if they aren&#8217;t necessary for something else.</p>
<p>Now the bandwidth utilized is this (I&#8217;m leaving out the z-buffer reads):</p>
<p style="padding-left: 30px;">1xFP16 write +<br />
9 times 1xFP16 read (blur kernel) +<br />
4xFP16 read and write (merge) =<br />
<strong>39 MB/f read and 15MB/f write (@1400&#215;1050)</strong></p>
<h4>Getting the branch done</h4>
<p>Now that we have the AO coefficients separated, we don&#8217;t have any incoming data for the gather-pass, or to be more precise: we have constant incoming data, like all white or all black. This is the matte-&#8220;color&#8221; of the AO-coefficient buffer, which we can set via ClearBuffer(). Modern hardware goes so far as not to do any real memory operations when a renderbuffer is cleared, but to mark the resource as such instead and then delay the operation to the fragment-merger in the ROP-units.</p>
<p>Having a fallback or matte-&#8220;color&#8221; allows us to use a real shader-abortion instruction: clip(). We couldn&#8217;t use it before because we had to copy the color values over from the old renderbuffer into the temporary one. Normally the clip-instruction has its caveats: if used in a pixel-shader it forces the z-test to be delayed till after the pixel-shader, and disables all z-buffer optimizations like hierarchical-z. In this case, however, we have neither the z-test nor the z-write enabled, so it doesn&#8217;t come with a penalty.</p>
<p>As fragments are processed in groups, so-called wave-fronts, the time to process all fragments in a group is a constant &#8220;maximum(time-taken)&#8221; over its fragments. If only an occasional &#8220;clip()&#8221; were possible, it wouldn&#8217;t present any advantage; the wave-front time would still be that of the non-clipped fragments. The branch tests for sky-geometry, though, which is mostly one big contiguous area on the screen. In effect, when sky is present, entire wave-fronts shortcut and execution time drops to that of the &#8220;test()+clip()&#8221;.</p>
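<p>The wave-front argument can be modelled with a toy cost function. This is a deliberately simplified sketch under my own assumptions (fixed wave-front size, only two possible fragment costs in arbitrary units, fragments laid out linearly); real hardware scheduling is far more involved:</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Cost of a pass in which every wave-front takes as long as its slowest fragment.
// clipped[i] is true for fragments that exit early via clip().
inline unsigned frameCost(const std::vector<bool>& clipped, std::size_t waveSize,
                          unsigned clipCost, unsigned fullCost) {
  assert(waveSize > 0);
  unsigned total = 0;
  for (std::size_t begin = 0; begin < clipped.size(); begin += waveSize) {
    const std::size_t end = std::min(begin + waveSize, clipped.size());
    unsigned waveCost = clipCost;
    for (std::size_t i = begin; i < end; ++i) {
      // One surviving fragment makes the whole wave-front pay the full price.
      if (!clipped[i]) waveCost = std::max(waveCost, fullCost);
    }
    total += waveCost;
  }
  return total;
}
```

<p>With 8 fragments, wave-fronts of 4, a clip cost of 1 and a full cost of 10: two scattered clipped fragments (one in each wave-front) still cost 20, while four contiguous clipped fragments cost only 11, even though more scattered clips would not have helped at all.</p>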
<p>The gain depends on the presence of sky, which is why I won&#8217;t claim a statistical X% cut of the arithmetic executed, though X is the ratio of sky vs. non-sky - in an outdoor environment it can easily be a 30-50% cut.</p>
<h4>More bandwidth reduction</h4>
<p>We still have the one read of the main renderbuffer lurking in the code, it&#8217;s required to do the manual fragment-merge &#8220;color * ao&#8221;.</p>
<p>While we see a slow migration of fixed-function units to general-purpose computation units, the ROP-units are still present and as efficient as always. The unified shaders need very broad connectivity to various parts of the chip, because they are multi-purpose; this in consequence creates a very complex network of interdependencies, raising the minimum execution time of a shader. This in itself isn&#8217;t such a bad compromise: if you expect shader-execution time to rise, the ratio of minimum vs. average becomes very small. The ROP-units, though, don&#8217;t do such complex general-purpose tasks: the data-set to operate on is perfectly regular (the rendertarget grid), they have a direct connection to the memory controllers and operate pass-through with regard to the caches (no cache-pollution), and the runtime of raster-ops is constant and nowadays below the GP-shader minimum time, giving them a clear performance edge. I believe it&#8217;s very unlikely that one would come up with a custom raster-op which truly warrants general-purpose ROP-units. As a sensible custom raster-op I can imagine doing colorspace-to-colorspace conversions transparently in the ROP-stage (blend YCC-fragments with an incoming XYZ-renderbuffer into an outgoing sRGB-backbuffer), though it sounds too exotic.</p>
<p>So, as the ROPs are still superior to shaders in what they do, we can safely replace the shader-based fragment-merge with a ROP-based one. We saw earlier that the complete merge operation is &#8220;color * ao&#8221;. The corresponding raster-op is &#8220;dest.color = (zero) + (dest.color * src.alpha)&#8221;.</p>
<p>Now the bandwidth utilized is this (I&#8217;m leaving out the z-buffer reads):</p>
<p style="padding-left: 30px;">1xFP16 write +<br />
9 times 1xFP16 read (blur kernel) +<br />
4xFP16 merge =<br />
<strong>27 MB/f read</strong><strong> and 3MB/f write </strong><strong>and 12MB/f merge</strong><strong> (@1400&#215;1050)</strong></p>
<h4>Small change, big advantage</h4>
<p>Earlier we mentioned that the occlusion coefficients are stored in inverse magnitude order, that is, the less occlusion the higher the value. This is primarily done for natural visualization. Writing out &#8220;return 1.0 - ao;&#8221; and then merging as &#8220;return color * ao;&#8221; is algorithmically equivalent to the inverse, writing out &#8220;return ao;&#8221; and then merging as &#8220;return color * (1.0 - ao);&#8221;. Now by definition of the term &#8220;amount of ambient occlusion&#8221; it is better to think that 0.0 should represent <strong>no occlusion</strong> (the magnitude of occlusion is 0). It is also important, in the context of limited precision floating-point arithmetic, to establish the direction of rounding and the bias of any error of approximation towards <strong>no occlusion</strong>. In the cascade of operations, where results are also streamed out to a limited precision renderbuffer (possibly 8bit integer, possibly <strong>filtered</strong>), the natural bias of accumulated error is towards 0.<br />
The net effect is that, in the presence of rounding/precision errors, the AO coefficients will lean towards &#8220;no occlusion&#8221;, and not &#8220;some occlusion&#8221;. Seen pessimistically, this prevents the final output from being too dark by some ulps; it also enables us to use some extremely low-precision intermediate buffers like 4bit 1 channel surfaces (or BC4): the worst that can happen is that the AO coefficient drops to zero. This is also practically possible because the intermediate results get filtered (by the 3&#215;3 blur kernel), whose error bias is also towards 0.0.</p>
<p>So we opt to write [0.0,1.0] instead of [1.0,0.0] into the AO-buffer, and delay the inverse multiply until directly before the fragment-merge operation, where we immediately recognize that we can implement it as a raster-op: &#8220;dest.color = (zero) + (dest.color * src.invalpha)&#8221;. We get a more robust intermediate buffer, and save an instruction.</p>
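<p>The two raster-op variants can be emulated on the CPU to see that they produce the same merge. This is a minimal sketch for a single color channel; the enum and function names are hypothetical and only mirror the &#8220;src.alpha&#8221; / &#8220;src.invalpha&#8221; wording used here, not any particular API:</p>

```cpp
#include <cassert>

// Destination blend factor, named after "src.alpha" / "src.invalpha" (hypothetical enum).
enum class DestFactor { SrcAlpha, InvSrcAlpha };

// Emulates "dest.color = (zero) + (dest.color * factor)" for one channel.
inline float ropMerge(float destColor, float srcAlpha, DestFactor factor) {
  const float f = (factor == DestFactor::SrcAlpha) ? srcAlpha : 1.0f - srcAlpha;
  return 0.0f + destColor * f;
}
```

<p>Storing &#8220;1.0 - ao&#8221; and merging with <code>SrcAlpha</code> gives the same color as storing &#8220;ao&#8221; and merging with <code>InvSrcAlpha</code>. In the second form an occlusion of 0 leaves the destination color untouched.</p>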
<p>We also recognize that we can eliminate the raster-op if the &#8220;src.alpha&#8221; value is 0, so we turn on alpha-test as well. This is not only the complement to our &#8220;clip()&#8221; from above for skies (sky is 0 occlusion), it also catches all cases of planar surfaces. Let&#8217;s remember that in the absence of any occluder within the sampled half-sphere aligned to the surface normal, occlusion is 0. If the planar surface is bigger than the sample-radius, its inner area&#8217;s coefficients are all 0.</p>
<p>Now, statistically, we pretty much always have a fragment merge rejection, not only from sky, but also from planar surfaces. It depends again on the complexity of the scene, but the best-case scenario has been improved even further.</p>
<h4>Even more bandwidth reduction (for AMD only on DX9)</h4>
<p>As we converted the intermediate buffer(s) to 1 channel surfaces, we can use a bandwidth-saving feature from the pre-DX10/11 days called FETCH4. ATI introduced it to the DX9 API via a driver hack, and it later became part of the &#8220;gather()&#8221;-standard in DX10/11.</p>
<p>FETCH4 allows us to sample 4 pixels of a 1 channel surface (of any precision) in the same instruction. The result is saved to the 4 respective elements of the &#8220;float4&#8221;-vector. We can use it on the z-buffer if the AO-kernel is suitable (HDAO for example employs this scheme), but we can also use it to implement any gaussian filtering kernel, like our 3&#215;3 blur.</p>
<p>The kernel that we used was like this:</p>
<pre>1.00	2.00	1.00
2.00	4.00	2.00
1.00	2.00	1.00</pre>
<p>9 texture fetches, 5 multiplications, 8 additions, 1 multiply (inverse division).</p>
<p>With FETCH4 we can reduce this to 4 fetches of 4 texels each, like this:</p>
<pre>F F -  |  * * -  |  * F F  |  * * *
F F -  |  * F F  |  * F F  |  F F *
- - -  |  - F F  |  - * *  |  F F *</pre>
<p>4 texture fetches, no multiplication, 4 additions, 1 dot (horizontal sum), 1 multiply (inverse division).</p>
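<p>Why the multiplications vanish is easy to check on the CPU: four overlapping 2&#215;2 blocks touch each corner once, each edge twice and the center four times, which is exactly the 1-2-1 / 2-4-2 / 1-2-1 weighting. The sketch below uses integers and made-up helper names; <code>gather2x2</code> merely stands in for one FETCH4 plus a horizontal add and is not a real API:</p>

```cpp
#include <array>
#include <cassert>

using Patch3x3 = std::array<std::array<int, 3>, 3>;  // [row][column], center at [1][1]

// Direct 1-2-1 / 2-4-2 / 1-2-1 weighted sum (not yet divided by 16).
inline int blurDirect(const Patch3x3& p) {
  static const int w[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};
  int sum = 0;
  for (int y = 0; y < 3; ++y)
    for (int x = 0; x < 3; ++x) sum += w[y][x] * p[y][x];
  return sum;
}

// Sum of the 2x2 block whose top-left texel is (x, y).
inline int gather2x2(const Patch3x3& p, int x, int y) {
  assert(x >= 0 && x <= 1 && y >= 0 && y <= 1);
  return p[y][x] + p[y][x + 1] + p[y + 1][x] + p[y + 1][x + 1];
}

// Four overlapping gathers, added up without any per-texel weights.
inline int blurGather(const Patch3x3& p) {
  return gather2x2(p, 0, 0) + gather2x2(p, 1, 0) + gather2x2(p, 0, 1) + gather2x2(p, 1, 1);
}
```

<p>For a patch holding the values 1 to 9 row by row both functions return 80; dividing by 16 afterwards gives the blurred value in either case.</p>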
<p>Our intention here wasn&#8217;t so much to reduce instructions as to reduce the bandwidth used, but it&#8217;s interesting to note how FETCH4 even simplifies the arithmetic. I presume FETCH4 effectively delivers 4 reads at the cost of 1 read, which can be achieved by changing the memory layout of 1 channel textures, by the lower latency of combined burst memory reads vs. spread singular ones, or by other read-combining tricks in the memory controller.</p>
<p>Now the bandwidth utilized is this (I&#8217;m leaving out the z-buffer reads):</p>
<p style="padding-left: 30px;">1xFP16 write +<br />
4 times 1xFP16 read (blur kernel) +<br />
4xFP16 merge =<br />
<strong>12 MB/f read</strong><strong> and 3MB/f write </strong><strong>and 12MB/f merge</strong><strong> (@1400&#215;1050)</strong></p>
<h4>Better kernel (just for my 5870)</h4>
<p>Because of the previous fix I just gained a few free instructions for the filtering kernel, which is still memory bandwidth limited, so the ALUs actually idle while waiting for the texture-fetches to complete.</p>
<p>I decided to give it a bit smoother/better kernel of this kind:</p>
<pre>1.41	2.00	1.41
2.00	2.34	2.00
1.41	2.00	1.41</pre>
<p>4 texture fetches, no multiplication, 4 additions, 3 dot (horizontal sum), 1 multiply (inverse division).</p>
<p>The full operation is this:</p>
<pre name="code" class="cpp">/* Point-filtered gather-square (a=[-1,-1] r=[ 0,-1] g=[-1, 0] b=[ 0, 0]) */
/* Point-filtered gather-square (a=[ 0,-1] r=[ 1,-1] g=[ 0, 0] b=[ 1, 0]) */
/* Point-filtered gather-square (a=[-1, 0] r=[ 0, 0] g=[-1, 1] b=[ 0, 1]) */
/* Point-filtered gather-square (a=[ 0, 0] r=[ 1, 0] g=[ 0, 1] b=[ 1, 1]) */
	float4 tl = FetchGS(SamplerG, coord + rcpres * float2(-1, -1)).xyzw;
	float4 tr = FetchGS(SamplerG, coord + rcpres * float2( 0, -1)).xyzw;
	float4 bl = FetchGS(SamplerG, coord + rcpres * float2(-1,  0)).xyzw;
	float4 br = FetchGS(SamplerG, coord + rcpres * float2( 0,  0)).xyzw;

	center  =       (tl.b + tr.g + bl.r + br.a);				// * 4
/* Point-filtered gather-cross  (r=[-1, 0] g=[ 1, 0] b=[ 0,-1] a=[ 0, 1]) */
	gc.xyzw = float4(tl.g + bl.a, tr.b + br.r, tl.r + tr.a, bl.b + br.g);	// * 2
/* Point-filtered gather-star   (r=[-1,-1] g=[ 1,-1] b=[-1, 1] a=[ 1, 1]) */
	gs.xyzw = float4(tl.a       , tr.r       , bl.g       , br.b       );	// * 1

	/*	1.41	2.00	1.41
	 *	2.00	2.34	2.00
	 *	1.41	2.00	1.41
	 */
	ao  = (center / 4.0) * (4.0 - 4.0 * (1.4142135623730950488016887242097 - 1.0));
	ao += dot(gc, 1.0);
	ao += dot(gs, 1.4142135623730950488016887242097);
	ao /= 16;</pre>
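<p>A quick sanity check on the constants: the weights of this kernel still add up to 16, so the final division stays the same as for the 1-2-1 kernel. The following is just a sketch that recomputes the per-texel weights from the arithmetic above; the struct and function names are made up for illustration:</p>

```cpp
#include <cassert>
#include <cmath>

struct KernelWeights { double center; double cross; double star; };

// Per-texel weights implied by the shader arithmetic above.
inline KernelWeights smoothKernelWeights() {
  const double sqrt2 = std::sqrt(2.0);
  KernelWeights w;
  w.center = 4.0 - 4.0 * (sqrt2 - 1.0);  // center is summed four times, then divided by 4
  w.cross  = 2.0;                         // each edge texel appears in two gathers
  w.star   = sqrt2;                       // each corner texel appears once, scaled by sqrt(2)
  return w;
}

// Total weight of the 3x3 kernel: 1 center + 4 edges + 4 corners.
inline double kernelWeightSum(const KernelWeights& w) {
  return w.center + 4.0 * w.cross + 4.0 * w.star;
}
```

<p>The center weight comes out at roughly 2.343 and the corner weight at roughly 1.414, matching the 2.34 and 1.41 in the table; the corner and center terms cancel so that the sum is 16.</p>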
<h3>An average-case look-back</h3>
<p>Let&#8217;s have a look at an average case (some sky, say 25% rejections, some planar geometry, say 25% rejections):</p>
<h4>Before</h4>
<p style="padding-left: 30px;">4xFP16 read and write (shared alpha) +<br />
8 times 4xFP16 read (blur kernel) +<br />
4xFP16 read and write (merge) =<br />
10 times 4xFP16 read plus 2 times 4xFP16 write =<br />
<strong>120 MB/f read and 24 MB/f write (@1400&#215;1050)</strong></p>
<h4>After</h4>
<p style="padding-left: 30px;">75% 1xFP16 write +<br />
4 times 1xFP16 read (blur kernel) +<br />
50% 4xFP16 merge =<br />
<strong>12 MB/f read</strong><strong> and 2.25MB/f write </strong><strong>and 6MB/f merge</strong><strong> (@1400&#215;1050)</strong></p>
<p>Now, I did this kind of optimization with HBAO, implemented in OBGE. HBAO is still memory bandwidth limited (spread z-buffer reads), but now only on the AO coefficient-gathering pass, not on any of the other passes. The result:</p>
<p><strong>Before:</strong> 3.9 ms/f<br />
<strong>After:</strong> 2.3 ms/f</p>
<p>Which is a very nice result. I repeated this for ATI&#8217;s HDAO implementation, which uses FETCH4 for the z-buffer reads as well, and has a very small AO coefficient-gather kernel:</p>
<p><strong>Before:</strong> 2.5 ms/f<br />
<strong>After:</strong> 1.1 ms/f</p>
<p>I haven&#8217;t reduced the precision of the AO-renderbuffer, nor did I use any down-sampled render-buffer. In either case we&#8217;d not gain that much anymore because the shaders are already dominated by the gather-kernel. But if every cycle counts you may get this down to 1.5 ms and 0.6 ms respectively.</p>

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2012/05/how-to-speed-up-your-ao-shader/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Me against the machine - The setting</title>
		<link>http://blog.frohling.biz/2012/05/me-against-the-machine-the-setting/</link>
		<comments>http://blog.frohling.biz/2012/05/me-against-the-machine-the-setting/#comments</comments>
		<pubDate>Mon, 14 May 2012 03:11:17 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Algorithms]]></category>

		<category><![CDATA[Optimizations]]></category>

		<category><![CDATA[Compression]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=28</guid>
		<description><![CDATA[<br/>
I am developing an image compressor with various backends, Huffman coding for example, and arithmetic coding. If you (de)compress big images it quickly becomes apparent that coding becomes the main bottleneck of the entire (de)compressor. It is serial in nature, we can&#8217;t vectorize it naively, and we can&#8217;t efficiently parallelize it. Algorithmically it is a [...]]]></description>
			<content:encoded><![CDATA[
<br/><p>I am developing an image compressor with various backends, Huffman coding for example, and arithmetic coding. If you (de)compress big images it quickly becomes apparent that coding becomes the main bottleneck of the entire (de)compressor. It is serial in nature, we can&#8217;t vectorize it naively, and we can&#8217;t efficiently parallelize it. Algorithmically it is a bit like <a href="http://en.wikipedia.org/wiki/Long_division">long-division</a>: each step depends on the results of the previous step in the iteration sequence.</p>
<p>If we don&#8217;t accept simplifications of the arithmetic, which in turn lower the efficiency, the only thing we can do is to optimize the hell out of the inner arithmetic loop, which has a re-normalization loop and carries two high-latency divisions (&gt; 40 cycles). This series is about my adventure of beating not only the compiler with assembler, but also the hardware division circuit of the CPU.</p>
<h3><strong>Part 1: The setting</strong></h3>
<p>The &#8220;normal&#8221; inner loop of arithmetic (de)coding a stream of incoming intervals is like this:</p>
<pre name="code" class="cpp">ivl->target =
  (prob)(((cst.code + 1) * ivl->total - 1) / cst.range);
</pre>
<p>This determines the possible location of the previously (en)coded interval. Once the model has found the one (unique) matching interval, the (de)coder continues by removing that range from the continuous fraction that the incoming data-stream represents:</p>
<pre name="code" class="cpp">register dtyp incr;

/* Narrow the code region to that allotted to this symbol */
cst.code += (incr = (cst.range * ivl->low) / ivl->total);
cst.range = ((cst.range * ivl->high) / ivl->total) - incr;

#define	halfRange   0x00008000U	// 15 bit fractional precision
/* input code bits until ast.range has been expanded to
 * more than QUARTER. Mimics encoder.
 */
if (cst.range <= halfRange) {
  /* expand range with trailing 0s (less precise) */
  do {
    cst.range += cst.range;
    bio->BitIOLeft::absorbSample(cst.data);
  } while (cst.range <= halfRange);
}
</pre>
<p>All in all, something like 20 instructions, three of them divisions - two of the divisions are likely maximum latency, as the range is always renormalized to fill at least half the datatype&#8217;s range and 0x8000/X is almost always maximum latency - and a virtual function call <strong>per bit</strong>.</p>
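<p>To get a feeling for the target computation at the top of the loop, here it is with plain numbers. This is only a sketch: the function name, the parameter names and the 32-bit widths are assumptions, and the model is reduced to the idea that it picks the interval with low &lt;= target &lt; high:</p>

```cpp
#include <cassert>
#include <cstdint>

// Decoder target as in the formula at the top of the loop (names and widths are assumptions).
inline std::uint32_t decodeTarget(std::uint32_t code, std::uint32_t range,
                                  std::uint32_t total) {
  assert(range != 0);
  return ((code + 1u) * total - 1u) / range;
}
```

<p>With a range of 0xC000 and a total of 4, every code below 0x3000 maps to target 0 and the code 0x3000 is the first one mapping to target 1, so each unit of the total owns a quarter of the range. This is the first of the three divisions counted above; the other two sit in the narrowing step.</p>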
<p>Now, there are a few things that come to mind which are worth trying. We don&#8217;t know if they will turn out well, but we&#8217;ll see the contenders in the next chapter. :^)</p>
<h4>References:</h4>
<p><a href="http://www.hpl.hp.com/techreports/2004/HPL-2004-76.pdf">Introduction to Arithmetic Coding - Theory and Practice</a></p>

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2012/05/me-against-the-machine-the-setting/feed/</wfw:commentRss>
		</item>
		<item>
		<title>jQuery&#8217;s animate is short-thought</title>
		<link>http://blog.frohling.biz/2010/08/jquerys-animate-is-short-thought/</link>
		<comments>http://blog.frohling.biz/2010/08/jquerys-animate-is-short-thought/#comments</comments>
		<pubDate>Sat, 21 Aug 2010 18:35:12 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[Rants]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=26</guid>
		<description><![CDATA[<br/>
Currently I&#8217;m experimenting with jQuery (always being told prototype.js is evil). In my project I have to make quite a few complex animations, and I tried to find out whether I can map them to the animate() syntax. Uh, and all hell broke loose. Here is my 24h experience:

You can&#8217;t animate regular DOM properties, like image width/height, [...]]]></description>
			<content:encoded><![CDATA[
<br/><p>Currently I&#8217;m experimenting with jQuery (always being told prototype.js is evil). In my project I have to make quite a few complex animations, and I tried to find out whether I can map them to the animate() syntax. Uh, and all hell broke loose. Here is my 24h experience:</p>
<ol>
<li>You can&#8217;t animate regular DOM properties, like image width/height, which is much smoother than animating styles.</li>
<li><span style="text-decoration: line-through;">It does not hold custom states, nor does it allow you to animate a virtual property &#8216;counter&#8217; out of which you could fork all kinds of changes in the step-function. You HAVE TO animate a real css-style. Thus you fall back to your own timers and lose the easing-framework.</span> <a title="Fun with jQuery’s “animate()”" href="http://james.padolsey.com/javascript/fun-with-jquerys-animate/" target="_blank">Found it.</a> But not optimal &#8230;</li>
<li>You cannot hold any states: if you stop an animation you will lose the information about the position of the animation between [0.0,1.0], and any other state. You have to track that via the step-function and a calculation based on the &#8216;now&#8217; time-stamp, OR you calculate the position directly from the modified CSS-value.</li>
<li>Each modified style enters the effects-queue as a separate function, or expressed differently: each property passed to animate becomes an entirely separate animation!</li>
<li>animate&#8217;s step function is called per step per property! Together with 4), simultaneous animation of multiple values becomes almost impossible! A simple zoom function (width+height on content, left+top on parent relative to the child&#8217;s size-change) requires you to write approx. 50 lines!</li>
<li>You can&#8217;t pause/resume animations (no states), you have to recalculate the position (see 3), recalculate how much time is missing and restart the animation.</li>
<li>width/height animations force overflow hidden and display block. In 2010 I&#8217;d like to animate overflow visible and display inline-block too! Maybe even tables-cols if you don&#8217;t mind &#8230;</li>
<li>You can&#8217;t define proxy-animations like &#8220;font-size = height / 10&#8221;; normally I&#8217;d say it should be possible to reference any animatable attribute/style.</li>
<li>Because of that you can also forget about the &#8220;font-size += delta-height&#8221; type.</li>
<li>All calculations are done in floating point; as no browser supports <a title="CSS Sub-pixel Background Misalignments" href="http://acko.net/blog/css-sub-pixel-background-misalignments">sub-pixel accuracy</a> you&#8217;ll get horrible shifting while zooming an image, for example. The &#8220;zoom&#8221; effect should be an integral part of animate anyway. Either way, [left,left+width] are on a pixel-precision grid: if left is rounded down, left+width has to be rounded up (for zooming, that is). Of course trying to account for this leads to headbanging on the keyboard, because the animations are all separate &#8230; and the step function is called per property &#8230;</li>
</ol>
<p>Grr. Such a small function and so many issues.</p>

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/08/jquerys-animate-is-short-thought/feed/</wfw:commentRss>
		</item>
		<item>
		<title>movntq alignment</title>
		<link>http://blog.frohling.biz/2010/04/movntq-alignment/</link>
		<comments>http://blog.frohling.biz/2010/04/movntq-alignment/#comments</comments>
		<pubDate>Mon, 12 Apr 2010 10:26:53 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[Assembler]]></category>

		<category><![CDATA[Rants]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=25</guid>
		<description><![CDATA[<br/>
Uh, sometimes you don&#8217;t know what&#8217;s in the mind of the Intel guys.
They state in their documentation that the movntq-op (taken over from MMXExt) has to take place on 16-byte aligned memory. What??? Either it&#8217;s a copy &#38; paste error, and it&#8217;s probably supposed to mean 8-byte aligned memory, or you got a command which can&#8217;t be [...]]]></description>
			<content:encoded><![CDATA[
<br/><p>Uh, sometimes you don&#8217;t know what&#8217;s in the mind of the Intel guys.</p>
<p>They state in their documentation that the movntq-op (taken over from MMXExt) has to take place on 16-byte aligned memory. What??? Either it&#8217;s a copy &amp; paste error, and it&#8217;s probably supposed to mean 8-byte aligned memory, or you got a command which can&#8217;t be executed.</p>
<p>Besides, no op on MMX / 3DNow needs alignment, so I think it&#8217;s even entirely wrong &#8230; either way. Nice work, Intel.</p>

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/04/movntq-alignment/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OP-equivalent series (pavgd)</title>
		<link>http://blog.frohling.biz/2010/02/op-equivalent-series-pavgd/</link>
		<comments>http://blog.frohling.biz/2010/02/op-equivalent-series-pavgd/#comments</comments>
		<pubDate>Tue, 16 Feb 2010 00:53:07 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Equivalence]]></category>

		<category><![CDATA[Assembler]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=24</guid>
		<description><![CDATA[<br/>
pavgd (32bit average [(a + b + 1) >> 1]):
/* add (a + b), and compare overflow */
movq	mm6, mmc
paddd	mmc, mmd
psubd	mm6, 0x80000000
psubd	mmc, 0x80000000
pcmpgtd	mm6, mmc
paddd	mmc, 0x80000000

/* add (ab + 1), and compare overflow */
pcmpeqd	mm5, mm5
pcmpeqd	mmd, mmd
pcmpeqd	mm5, mmc
psubd	mmc, mmd

/* shift carry in */
por	mm6, mm5
pslld	mm6, 31
psrld	mmc, 1
por	mmc, mm6


]]></description>
			<content:encoded><![CDATA[
<br/><p><strong>pavgd (32bit average [(a + b + 1) >> 1]):</strong></p>
<pre><code style="color: white; font-family: monospace;">/* add (a + b), and compare overflow */
movq	mm6, mmc
paddd	mmc, mmd
psubd	mm6, 0x80000000
psubd	mmc, 0x80000000
pcmpgtd	mm6, mmc
paddd	mmc, 0x80000000

/* add (ab + 1), and compare overflow */
pcmpeqd	mm5, mm5
pcmpeqd	mmd, mmd
pcmpeqd	mm5, mmc
psubd	mmc, mmd

/* shift carry in */
por	mm6, mm5
pslld	mm6, 31
psrld	mmc, 1
por	mmc, mm6
</code></pre>
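<p>The sequence above can be read as the following scalar sketch in C. It is only a model of one 32-bit lane, not the author&#8217;s code: the function name avg_round_u32 is made up for illustration, and it assumes that unsigned int is 32 bits wide. The two carries mirror the two overflow compares in the assembler.</p>

```c
/* Scalar model of one lane of the rounding average (a + b + 1) >> 1,
 * computed without a wider type. Hypothetical helper name.
 * Assumption: unsigned int is exactly 32 bits wide. */
unsigned int avg_round_u32(unsigned int a, unsigned int b)
{
    unsigned int sum = a + b;                    /* wraps modulo 2^32 */
    unsigned int carry1 = (a > sum);             /* 1 if a + b overflowed */
    unsigned int carry2 = (sum == 0xFFFFFFFFu);  /* 1 if the + 1 will overflow */

    sum = sum + 1u;

    /* at most one of the two carries is set; shift it in as bit 31 */
    return (sum >> 1) | ((carry1 | carry2) << 31);
}
```

<p>Both carries can never be set at once: if a + b wrapped, the wrapped sum is at most 0xFFFFFFFE, so the following + 1 cannot wrap again. That is why a plain OR of the two masks is enough.</p>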

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/02/op-equivalent-series-pavgd/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OP-equivalent series (psubq)</title>
		<link>http://blog.frohling.biz/2010/01/op-equivalent-series-psubq/</link>
		<comments>http://blog.frohling.biz/2010/01/op-equivalent-series-psubq/#comments</comments>
		<pubDate>Thu, 14 Jan 2010 10:48:55 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Equivalence]]></category>

		<category><![CDATA[Assembler]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=23</guid>
		<description><![CDATA[<br/>
psubq (64bit - 64bit = 64bit):
movq	mm7, mmc
psubd	mmc, mms
movq	mm6, mmc
psubd	mm7, 0x80000000
psubd	mm6, 0x80000000
pcmpgtd	mm6, mm7
psllq	mm6, 32
paddd	mmc, mm6


]]></description>
			<content:encoded><![CDATA[
<br/><p><strong>psubq (64bit - 64bit = 64bit):</strong></p>
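<pre><code style="color: white; font-family: monospace;">movq	mm7, mmc		// keep the minuend a
psubd	mmc, mms		// per-dword a - s
movq	mm6, mmc
psubd	mm7, 0x80000000		// bias both so the signed compare acts unsigned
psubd	mm6, 0x80000000
pcmpgtd	mm6, mm7		// (a - s) > a: the dword borrowed
psllq	mm6, 32			// move the low-dword mask up to the high dword
paddd	mmc, mm6		// adding -1 propagates the borrow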
</code></pre>
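<p>As a cross-check, here is a minimal scalar sketch in C of a 64-bit subtraction built from two 32-bit halves. The name sub64_via_32 is hypothetical, and the sketch assumes a 32-bit unsigned int and a 64-bit unsigned long long. C has an unsigned compare, so the 0x80000000 bias that the assembler needs for pcmpgtd does not appear here.</p>

```c
/* 64-bit subtraction from 32-bit halves. Hypothetical helper name.
 * Assumptions: unsigned int is 32 bits, unsigned long long is 64 bits. */
unsigned long long sub64_via_32(unsigned long long a, unsigned long long s)
{
    unsigned int a_lo = (unsigned int)a;
    unsigned int a_hi = (unsigned int)(a >> 32);
    unsigned int s_lo = (unsigned int)s;
    unsigned int s_hi = (unsigned int)(s >> 32);

    unsigned int lo = a_lo - s_lo;       /* wraps modulo 2^32 */
    unsigned int hi = a_hi - s_hi;

    /* the low half borrowed exactly when the result is above the minuend */
    unsigned int borrow = (lo > a_lo);

    hi = hi - borrow;
    return ((unsigned long long)hi << 32) | lo;
}
```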

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/01/op-equivalent-series-psubq/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OP-equivalent series (paddq)</title>
		<link>http://blog.frohling.biz/2010/01/op-equivalent-series-paddq/</link>
		<comments>http://blog.frohling.biz/2010/01/op-equivalent-series-paddq/#comments</comments>
		<pubDate>Fri, 08 Jan 2010 07:46:29 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Equivalence]]></category>

		<category><![CDATA[Assembler]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=22</guid>
		<description><![CDATA[<br/>
paddq (64bit + 64bit = 64bit):
movq	mm7, mmc
paddd	mmc, mma
psubd	mm7, 0x80000000
psubd	mmc, 0x80000000
pcmpgtd	mm7, mmc
paddd	mmc, 0x80000000
psllq	mm7, 32
psubd	mmc, mm7


]]></description>
			<content:encoded><![CDATA[
<br/><p><strong>paddq (64bit + 64bit = 64bit):</strong></p>
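<pre><code style="color: white; font-family: monospace;">movq	mm7, mmc		// keep a
paddd	mmc, mma		// per-dword a + b
psubd	mm7, 0x80000000		// bias both so the signed compare acts unsigned
psubd	mmc, 0x80000000
pcmpgtd	mm7, mmc		// a > (a + b): the dword carried
paddd	mmc, 0x80000000		// undo the bias
psllq	mm7, 32			// move the low-dword mask up to the high dword
psubd	mmc, mm7		// subtracting -1 propagates the carry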
</code></pre>
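<p>A minimal scalar sketch in C of the same carry trick, in case the mask arithmetic looks odd. The name add64_via_32 is made up for illustration, and a 32-bit unsigned int and a 64-bit unsigned long long are assumed. The carry is kept as an all-ones mask, as pcmpgtd delivers it, and is subtracted from the high half.</p>

```c
/* 64-bit addition from 32-bit halves. Hypothetical helper name.
 * Assumptions: unsigned int is 32 bits, unsigned long long is 64 bits. */
unsigned long long add64_via_32(unsigned long long a, unsigned long long b)
{
    unsigned int a_lo = (unsigned int)a;
    unsigned int a_hi = (unsigned int)(a >> 32);
    unsigned int b_lo = (unsigned int)b;
    unsigned int b_hi = (unsigned int)(b >> 32);

    unsigned int lo = a_lo + b_lo;       /* wraps modulo 2^32 */
    unsigned int hi = a_hi + b_hi;

    /* 0xFFFFFFFF if the low half wrapped, else 0 (the compare result) */
    unsigned int mask = 0u - (unsigned int)(a_lo > lo);

    hi = hi - mask;                      /* subtracting 0xFFFFFFFF adds 1 */
    return ((unsigned long long)hi << 32) | lo;
}
```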

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/01/op-equivalent-series-paddq/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OP-equivalent series (packusqd, packssqud)</title>
		<link>http://blog.frohling.biz/2010/01/op-equivalent-series-packusqd/</link>
		<comments>http://blog.frohling.biz/2010/01/op-equivalent-series-packusqd/#comments</comments>
		<pubDate>Fri, 08 Jan 2010 05:37:43 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Equivalence]]></category>

		<category><![CDATA[Assembler]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=21</guid>
		<description><![CDATA[<br/>
packusqd (saturated clamp from unsigned long long to unsigned long),
packssqud (saturated clamp from signed long long to unsigned long):
There is a condition for unsigned long long inputs, range is &#8220;only&#8221; [0x0000000000000000, 0x7FFFFFFFFFFFFFFF].
There is no condition for signed long long inputs.
// from-scratch, no helper available

    /* -1 */
    pcmpeqd	mm6, [...]]]></description>
			<content:encoded><![CDATA[
<br/><p><strong>packusqd (saturated clamp from unsigned long long to unsigned long),</strong><br />
<strong>packssqud (saturated clamp from signed long long to unsigned long):</strong></p>
<p>There is a condition for unsigned long long inputs, range is &#8220;only&#8221; [0x0000000000000000, 0x7FFFFFFFFFFFFFFF].<br />
There is no condition for signed long long inputs.</p>
<pre><code style="color: white; font-family: monospace;">// from-scratch, no helper available

    /* -1 */
    pcmpeqd	mm6, mm6

    /* x >> 32 > -1 */
    movq	mm4, mmc0
    movq	mm5, mm2

    /* no psraq/pshufd/pshufw available,
     * duplicate ((x >> 32) | x) */
    punpckhdq	mm4, mmc0
    punpckhdq	mm5, mm2

    pcmpgtd	mm4, mm6
    pcmpgtd	mm5, mm6

    pand	mmc0, mm4
    pand	mm2, mm5

    /* 0 */
    pxor	mm7, mm7

    /* x >> 32 == 0 */
    movq	mm4, mmc0
    movq	mm5, mm2

    /* no psraq/pshufd/pshufw available,
     * duplicate ((x >> 32) | x) */
    punpckhdq	mm4, mmc0
    punpckhdq	mm5, mm2

    pcmpeqd	mm4, mm7
    pcmpeqd	mm5, mm7

    pand	mmc0, mm4
    pand	mm2, mm5

    /* 0xFFFFFFFF */
    pandn	mm4, mm6
    pandn	mm5, mm6

    por		mmc0, mm4
    por		mm2, mm5

    punpckldq	mmc0, mm2
</code></pre>
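<p>The two masking stages above can be sketched for a single value in C as follows. This is a scalar model under assumptions, not the original routine: pack_sat_u32 is a hypothetical name, unsigned int is assumed to be 32 bits and unsigned long long 64 bits, and the input is passed as the raw 64-bit pattern of the signed value.</p>

```c
/* Saturating clamp of a signed 64-bit value (given as its bit pattern)
 * to an unsigned 32-bit value, using masks only. Hypothetical helper name.
 * Assumptions: unsigned int is 32 bits, unsigned long long is 64 bits. */
unsigned int pack_sat_u32(unsigned long long x)
{
    unsigned int lo = (unsigned int)x;
    unsigned int hi = (unsigned int)(x >> 32);

    /* stage 1: all-ones if the high half is non-negative, else 0 */
    unsigned int nonneg = 0u - (unsigned int)((hi & 0x80000000u) == 0u);
    lo &= nonneg;                        /* negative inputs clamp to 0 */
    hi &= nonneg;

    /* stage 2: all-ones if the value fits into 32 bits, else 0 */
    unsigned int fits = 0u - (unsigned int)(hi == 0u);
    lo &= fits;
    lo |= ~fits;                         /* too large: saturate to 0xFFFFFFFF */

    return lo;
}
```

<p>An unsigned input with bit 63 set looks negative to stage 1 and ends up as 0, which is where the range condition for unsigned inputs comes from.</p>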

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/01/op-equivalent-series-packusqd/feed/</wfw:commentRss>
		</item>
		<item>
		<title>OP-equivalent series (packusdw, packssduw)</title>
		<link>http://blog.frohling.biz/2010/01/op-equivalent-series-packusdw/</link>
		<comments>http://blog.frohling.biz/2010/01/op-equivalent-series-packusdw/#comments</comments>
		<pubDate>Thu, 07 Jan 2010 18:52:19 +0000</pubDate>
		<dc:creator>Ethatron</dc:creator>
		
		<category><![CDATA[Equivalence]]></category>

		<category><![CDATA[Assembler]]></category>

		<guid isPermaLink="false">http://blog.frohling.biz/?p=20</guid>
		<description><![CDATA[<br/>
packusdw (saturated clamp from unsigned long to unsigned short),
packssduw (saturated clamp from signed long to unsigned short):
There is a condition for unsigned long inputs, range is &#8220;only&#8221; [0x00000000, 0x7FFFFFFF].
There is a condition for signed long inputs, range is &#8220;only&#8221; [0x80008000, 0x7FFFFFFF]. You can go to full signed long if there would exist &#8220;psubsd&#8221;, which does [...]]]></description>
			<content:encoded><![CDATA[
<br/><p><strong>packusdw (saturated clamp from unsigned long to unsigned short),</strong><br />
<strong>packssduw (saturated clamp from signed long to unsigned short):</strong></p>
<p>There is a condition for unsigned long inputs, range is &#8220;only&#8221; [0x00000000, 0x7FFFFFFF].<br />
There is a condition for signed long inputs, range is &#8220;only&#8221; [0x80008000, 0x7FFFFFFF]. You could cover the full signed long range if &#8220;psubsd&#8221; existed, which it does not.</p>
<pre><code style="color: white; font-family: monospace;">// via packssdw

psubd		xmmx, 0x00008000	// signed short in long
psubd		xmmy, 0x00008000
packssdw	xmmx, xmmy		// cast long to short
paddw		xmmx, 0x8000		// unsigned short
</code></pre>
<pre><code style="color: white; font-family: monospace;">// with punpck and variable

movdqa		xmm?1, 0x00008000
movdqa		xmm?2, xmm?1
punpcklwd	xmm?2, xmm?1		// 0v0000000080008000
punpckldq	xmm?2, xmm?2		// 0v8000800080008000

psubd		xmmx, xmm?1
psubd		xmmy, xmm?1
packssdw	xmmx, xmmy
paddw		xmmx, xmm?2
</code></pre>
<pre><code style="color: white; font-family: monospace;">// with pshufw and variable

movdqa		xmm?1, 0x00008000
pshuflw		xmm?2, xmm?1, 2|2|0|0	// 0v?????????80008000
pshufhw		xmm?2, xmm?1, 2|2|0|0	// 0v8000800080008000

psubd		xmmx, xmm?1
psubd		xmmy, xmm?1
packssdw	xmmx, xmmy
paddw		xmmx, xmm?2
</code></pre>
<pre><code style="color: white; font-family: monospace;">// with pshufw and variable and no memory access

pcmpeqd		xmm?1, xmm?1		// 0xFFFFFFFF
pslld		xmm?1, 31		// 0x80000000
psrld		xmm?1, 16		// 0x00008000
pshuflw		xmm?2, xmm?1, 2|2|0|0	// 0v?????????80008000
pshufhw		xmm?2, xmm?1, 2|2|0|0	// 0v8000800080008000

psubd		xmmx, xmm?1
psubd		xmmy, xmm?1
packssdw	xmmx, xmmy
paddw		xmmx, xmm?2
</code></pre>
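<p>All four variants rely on the same bias trick, which can be sketched for one value in C. This is a hedged scalar model: packus_via_packss is a hypothetical name, int is assumed to be 32 bits, and the input has to stay inside the signed range given above so that the first subtraction cannot overflow.</p>

```c
/* Unsigned 16-bit saturation built on a signed 16-bit saturation.
 * Hypothetical helper name. Assumption: int is 32 bits wide.
 * Precondition: x is not below INT_MIN + 0x8000 (bit pattern 0x80008000). */
unsigned int packus_via_packss(int x)
{
    int biased = x - 0x8000;             /* psubd: [0, 65535] maps to [-32768, 32767] */

    if (-32768 > biased) biased = -32768;  /* packssdw: signed saturation */
    if (biased > 32767)  biased = 32767;

    /* paddw: add the bias back, wrapping at 16 bits */
    return ((unsigned int)biased + 0x8000u) & 0xFFFFu;
}
```

<p>For example, 70000 saturates to 0xFFFF and -5 to 0, while anything already inside [0, 65535] passes through unchanged.</p>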

]]></content:encoded>
			<wfw:commentRss>http://blog.frohling.biz/2010/01/op-equivalent-series-packusdw/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>

