Ambient occlusion adds a lot of perceived realism to your game, but it is also a bandwidth- and arithmetic-heavy shader. While optimizing the shaders we implemented for OBGE, I came up with a few tricks that make the AO shader about 50% faster without sacrificing detail or precision. I’ll take HBAO as an example here, but the tricks work with any AO implementation.
The initial implementation
- store AO coefficients in the alpha-channel of the main rendertarget
- branch out for sky fragments
- calculate the contribution
- AO coefficients were in human-interpretable form [0.0,1.0]: black represents no ambient light contribution, white full ambient light contribution
- blur (while possibly upscaling) the coefficients with a 3×3 kernel
- merge AO and color into the final fragment
Now this implementation has quite a few inefficiencies:
- it reads/writes the full 4xFP16 several times, amounting to a huge waste of bandwidth
- it reads and writes the same data (the color values from the renderbuffer), so you need two alternating renderbuffers, as you can’t read from and write to the same buffer at the same time
- older hardware and SM 3.0 have no return-from-branch instruction, which makes the shortcut for the sky little more than a cosmetic line of code
- storing coefficients in the familiar black/white form adds unnecessary operations (namely “return 1.0 - ao”); the graphics hardware has the inversion built in, just not as an op-code
- blurring requires 8 texture fetches, from the full 4xFP16 renderbuffer
- fragment merging is done in the shader, which also makes it necessary to read the full 4xFP16 renderbuffer in and then write it out
The worst offender here is the huge waste of bandwidth. Interestingly, once we start fixing this aspect, many of the other improvements align naturally with it.
The fixes
Bandwidth reduction
The initial bandwidth required is (I’m leaving out the z-buffer reads):
4xFP16 read and write (shared alpha) +
8 times 4xFP16 read (blur kernel) +
4xFP16 read and write (merge) =
10 times 4xFP16 read plus 2 times 4xFP16 write =
120 MB/f read and 24 MB/f write (@1400×1050)
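The figures above follow from simple per-surface arithmetic. Here is a minimal Python sketch of it, assuming 2 bytes per FP16 channel and decimal megabytes; the helper names are made up for illustration and are not part of the shaders. The totals in the text round one 4xFP16 surface up to 12 MB, so they come out slightly higher than the exact values.

```python
# Back-of-the-envelope bandwidth for the initial implementation.
# Assumptions: FP16 = 2 bytes per channel, 1 MB = 1,000,000 bytes.
WIDTH, HEIGHT = 1400, 1050
BYTES_PER_FP16 = 2
MB = 1_000_000


def surface_mb(channels):
    """Size of one full-screen FP16 surface in MB."""
    return WIDTH * HEIGHT * channels * BYTES_PER_FP16 / MB


def initial_bandwidth():
    """Return (read MB/f, write MB/f) for the initial implementation."""
    full = surface_mb(4)
    reads = (1 + 8 + 1) * full   # shared alpha + blur kernel + merge
    writes = (1 + 1) * full      # shared alpha + merge
    return reads, writes


reads, writes = initial_bandwidth()
print(f"4xFP16 surface: {surface_mb(4):.2f} MB")
print(f"read: {reads:.1f} MB/f, write: {writes:.1f} MB/f")
```

The same helper gives roughly 3 MB for a 1xFP16 surface, which is the unit used in the later tallies.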
The first thing we can do is store the AO coefficients in a separate rendertarget. This not only allows us to use a precision distinct from the main renderbuffer’s, it also only needs to be 1 channel. Of course this adds memory cost, but if you have other post-processing shaders running, you may already have that buffer, or you can use the additional renderbuffer in other shaders as well. Additionally, it eliminates the alternating renderbuffers for the main color data, if they aren’t necessary for something else.
Now the bandwidth utilized is this (I’m leaving out the z-buffer reads):
1xFP16 write +
9 times 1xFP16 read (blur kernel) +
4xFP16 read and write (merge) =
39 MB/f read and 15MB/f write (@1400×1050)
Getting the branch done
Now that we have the AO coefficients separated, we don’t have any incoming data for the gather pass, or to be more precise: we have constant incoming data, like all white or all black. This is the matte “color” of the AO-coefficient buffer, which we can set via ClearBuffer(). Modern hardware goes so far as not to do any real memory operations when a renderbuffer is cleared, but to mark the resource as cleared instead and delay the operation to the fragment merger in the ROP units.
Having a fallback or matte “color” allows us to use a real shader-abort instruction: clip(). We couldn’t use it before because we had to copy the color values over from the old renderbuffer into the temporary one. Normally the clip instruction has its caveats: if used in a pixel shader, it forces the z-test to be delayed until after the pixel shader and disables all z-buffer optimizations like hierarchical-z. In this case, however, we have neither the z-test nor the z-write enabled, so it doesn’t come with a penalty.
As fragments are processed in groups, so-called wave-fronts, the time to process all fragments in a group is the “maximum(time-taken)” over its fragments. If only an occasional fragment were clipped, “clip()” would bring no advantage; the wave-front time would still be that of the non-clipped fragments. The branch tests for sky geometry, though, which is mostly one big contiguous area on the screen. In effect, when sky is present, entire wave-fronts take the shortcut and their execution time drops to that of the “test()+clip()”.
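The wave-front argument can be illustrated with a toy cost model. This is only a sketch under loud assumptions: a wave-front size of 64, made-up cost units, and fragments laid out in one dimension instead of screen tiles. None of these numbers come from measurements.

```python
import random

# Toy model: a wave-front costs as much as its slowest fragment.
WAVE = 64         # fragments per wave-front (assumed)
FULL_COST = 100   # cost of the full AO gather, arbitrary units (assumed)
CLIP_COST = 2     # cost of "test()+clip()", arbitrary units (assumed)


def frame_cost(is_sky):
    """Sum of per-wave-front costs for a list of per-fragment sky flags."""
    total = 0
    for start in range(0, len(is_sky), WAVE):
        wave = is_sky[start:start + WAVE]
        total += max(CLIP_COST if sky else FULL_COST for sky in wave)
    return total


n = WAVE * 100
contiguous = [i < n // 2 for i in range(n)]   # sky is one big block
scattered = contiguous[:]
random.Random(0).shuffle(scattered)           # same amount of sky, spread out

print("contiguous sky:", frame_cost(contiguous))
print("scattered sky: ", frame_cost(scattered))
```

With the same number of clipped fragments, the contiguous layout costs about half as much as the scattered one in this model, because only there do whole wave-fronts consist of sky.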
The gain depends on the presence of sky, which is why I won’t claim a statistical X% cut in the arithmetic executed. X is roughly the share of the screen covered by sky; in an outdoor environment it can easily be a 30-50% cut.
More bandwidth reduction
We still have the one read of the main renderbuffer lurking in the code; it’s required for the manual fragment merge “color * ao”.
While we see a slow migration of fixed-function units to general-purpose computation units, the ROP units are still present and as efficient as ever. The unified shaders need very broad connectivity to various parts of the chip because they are multi-purpose, which in turn creates a very complex network of interdependencies and raises the minimum execution time of a shader. In itself this isn’t such a bad compromise: if you expect shader execution times to rise, the ratio of minimum to average time becomes very small. The ROP units, though, don’t do such complex general-purpose tasks. The data set they operate on is perfectly regular (the rendertarget grid), they have a direct connection to the memory controllers and bypass the caches (no cache pollution), and the runtime of raster-ops is constant and nowadays below the minimum time of a general-purpose shader, which gives them a clear performance edge. I believe it’s very unlikely that anyone would come up with a custom raster-op that truly justifies making the ROP units general-purpose. I can imagine doing colorspace-to-colorspace conversions transparently in the ROP stage (blending YCC fragments with an incoming XYZ renderbuffer into an outgoing sRGB backbuffer) as a sensible custom raster-op, but that sounds too exotic.
So, as the ROPs are still superior to shaders at what they do, we can safely replace the shader-based fragment merge with a ROP-based one. We saw earlier that the complete merge operation is “color * ao”. The corresponding raster-op is “dest.color = (zero) + (dest.color * src.alpha)”.
Now the bandwidth utilized is this (I’m leaving out the z-buffer reads):
1xFP16 write +
9 times 1xFP16 read (blur kernel) +
4xFP16 merge =
27 MB/f read and 3MB/f write and 12MB/f merge (@1400×1050)
Small change, big advantage
Earlier we mentioned that the occlusion coefficients are stored in inverse magnitude order, that is, the less occlusion the higher the value. This is primarily done for natural visualization. Whether we write out “return 1.0 - ao;” and then merge as “return color * ao;”, or do the inverse and write out “return ao;” and then merge as “return color * (1.0 - ao);”, is algorithmically equivalent. By definition of the term “amount of ambient occlusion”, though, it is better to think of 0.0 as representing no occlusion (the magnitude of occlusion is 0). It is also important, in the context of limited-precision floating-point arithmetic, to establish the direction of rounding and to bias any approximation error towards no occlusion. In a cascade of operations whose results are also streamed out to a limited-precision renderbuffer (possibly 8-bit integer, possibly filtered), the natural bias of the accumulated error is towards 0.
The net effect is that, in the presence of rounding/precision errors, the AO coefficients will tend towards “no occlusion” and not “some occlusion”. At the very least this prevents the final output from being too dark by some ulps. It also enables us to use some extremely low-precision intermediate buffers like 4-bit 1-channel surfaces (or BC4); the worst that can happen is that the AO coefficient drops to zero. This is also practically possible because the intermediate results get filtered (by the 3×3 blur kernel), whose error bias is also towards 0.0.
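The direction of the error can be checked with a small Python sketch. It assumes that the conversion to the low-precision buffer truncates (rounds towards 0), which is the assumption behind the bias argument; real hardware may round differently, and the function names are made up for illustration.

```python
import math


def quantize_floor(x, bits):
    """Store x in [0, 1] as an unsigned integer of `bits` bits, truncating."""
    levels = (1 << bits) - 1
    return math.floor(x * levels) / levels


def shade_from_occlusion(color, occlusion):
    """Buffer holds occlusion (0.0 = none); inversion happens at merge time."""
    return color * (1.0 - quantize_floor(occlusion, 4))


def shade_from_visibility(color, occlusion):
    """Buffer holds the familiar black/white form (1.0 = no occlusion)."""
    return color * quantize_floor(1.0 - occlusion, 4)


exact = 1.0 * (1.0 - 0.1)
print("exact:          ", exact)
print("occlusion form: ", shade_from_occlusion(1.0, 0.1))
print("visibility form:", shade_from_visibility(1.0, 0.1))
```

Under truncation, the occlusion form can only come out as bright as or brighter than the exact result, while the black/white form can only come out as dark or darker.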
So we opt to write [0.0,1.0] instead of [1.0,0.0] into the AO buffer, and delay the inverse multiply until directly before the fragment-merge operation, where we immediately recognize that we can implement it as a raster-op: “dest.color = (zero) + (dest.color * src.invalpha)”. We get a more robust intermediate buffer and save an instruction.
We also recognize that we can eliminate the raster-op if the “src.alpha” value is 0, so we turn on alpha-test as well. This is not only the complement to our “clip()” from above for skies (sky is 0 occlusion), it also catches all cases of planar surfaces. Let’s remember that in the absence of any occluder within the sampled half-sphere aligned to the surface normal, occlusion is 0. If the planar surface is bigger than the sample radius, its inner area’s coefficients are all 0.
Now, statistically, we pretty much always have a fragment merge rejection, not only from sky, but also from planar surfaces. It depends again on the complexity of the scene, but the best-case scenario has been improved even further.
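The merge described above can be modelled on the CPU to check that it matches the shader version. The sketch below is a plain Python model of a blend unit, not driver code. The mapping to Direct3D 9 states named in the comments (D3DBLEND_ZERO, D3DBLEND_INVSRCALPHA, an alpha test with D3DCMP_GREATER against reference 0) is my assumption of how one would set this up.

```python
def blend(src, dst, src_factor, dst_factor):
    """Generic blend equation: out = src * src_factor + dst * dst_factor."""
    return tuple(s * src_factor + d * dst_factor for s, d in zip(src, dst))


def rop_merge(dest_color, occlusion):
    """Merge via raster-op; `occlusion` arrives as src.alpha."""
    # Alpha test (assumed: pass only if alpha > 0). A rejected fragment
    # leaves the destination untouched, so nothing is read or written.
    if not occlusion > 0.0:
        return dest_color
    # Assumed states: SRCBLEND = ZERO, DESTBLEND = INVSRCALPHA.
    src_rgb = (0.0, 0.0, 0.0)   # value is irrelevant, its factor is zero
    return blend(src_rgb, dest_color, 0.0, 1.0 - occlusion)


def shader_merge(color, occlusion):
    """The original in-shader merge: color * (1.0 - ao)."""
    return tuple(c * (1.0 - occlusion) for c in color)


color = (0.8, 0.6, 0.4)
print(rop_merge(color, 0.25), shader_merge(color, 0.25))
print(rop_merge(color, 0.0))    # rejected by the alpha test, unchanged
```

Both paths give the same color, and a fragment with zero occlusion is skipped entirely, which is exactly the case the sky and the large planar surfaces fall into.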
Even more bandwidth reduction (for AMD only on DX9)
As we converted the intermediate buffer(s) to 1-channel surfaces, we can utilize a bandwidth-saving feature from the pre-DX10/11 days called FETCH4. ATI was the one that introduced it to the DX9 API through a driver hack, but later it became part of the “gather()” standard in DX10/11.
FETCH4 allows us to sample 4 pixels of a 1-channel surface (of any precision) in the same instruction. The result is saved to the 4 respective elements of the “float4” vector. We can use it on the z-buffer if the AO kernel is suitable (HDAO for example employs this scheme), and we can also use it to implement any gaussian filtering kernel, like our 3×3 blur.
The kernel that we used was like this:
1.00 2.00 1.00
2.00 4.00 2.00
1.00 2.00 1.00
9 texture fetches, 5 multiplications, 8 additions, 1 multiply (inverse division).
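With FETCH4 we can reduce this to 4 fetches of 4 texels each, like this (F marks the texels returned by the current fetch, * those already covered by an earlier fetch):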
F F - | * * - | * F F | * * *
F F - | * F F | * F F | F F *
- - - | - F F | - * * | F F *
4 texture fetches, no multiplication, 4 additions, 1 dot (horizontal sum), 1 multiply (inverse division).
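That the four overlapping 2×2 fetches reproduce the 1-2-1 kernel can be verified by counting how often each texel of the 3×3 neighbourhood is covered. A small Python sketch of that count follows; the function names are made up for illustration.

```python
def quad_weights(ox, oy):
    """3x3 coverage mask of one 2x2 fetch whose top-left texel is (ox, oy)."""
    weights = [[0] * 3 for _ in range(3)]
    for dy in (0, 1):
        for dx in (0, 1):
            weights[oy + dy][ox + dx] = 1
    return weights


def summed_kernel():
    """Add up the coverage of the four fetches (TL, TR, BL, BR)."""
    total = [[0] * 3 for _ in range(3)]
    for ox, oy in ((0, 0), (1, 0), (0, 1), (1, 1)):
        quad = quad_weights(ox, oy)
        for y in range(3):
            for x in range(3):
                total[y][x] += quad[y][x]
    return total


for row in summed_kernel():
    print(row)
```

The corners are covered once, the edges twice and the center four times, so simply adding all 16 fetched values and multiplying by 1/16 gives the original kernel without any per-texel weights.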
Our intention here wasn’t so much to reduce instructions as to reduce the bandwidth used, but it’s interesting to note how FETCH4 even simplifies the arithmetic. I presume FETCH4 effectively provides 4 reads at the cost of 1 read, which can be achieved by changing the memory layout of 1-channel textures, by reducing the memory-read latency of scattered single reads through combined burst reads, or by other read-combining tricks in the memory controller.
Now the bandwidth utilized is this (I’m leaving out the z-buffer reads):
1xFP16 write +
4 times 1xFP16 read (blur kernel) +
4xFP16 merge =
12 MB/f read and 3MB/f write and 12MB/f merge (@1400×1050)
Better kernel (just for my 5870)
Because of the previous fix I just gained a few free instructions for the filtering kernel, which is still memory bandwidth limited, so the ALUs actually idle while waiting for the texture-fetches to complete.
I decided to give it a bit smoother/better kernel of this kind:
1.41 2.00 1.41
2.00 2.34 2.00
1.41 2.00 1.41
4 texture fetches, no multiplication, 4 additions, 3 dot (horizontal sum), 1 multiply (inverse division).
The full operation is this:
/* Point-filtered gather-square (a=[-1,-1] r=[ 0,-1] g=[-1, 0] b=[ 0, 0]) */
/* Point-filtered gather-square (a=[ 0,-1] r=[ 1,-1] g=[ 0, 0] b=[ 1, 0]) */
/* Point-filtered gather-square (a=[-1, 0] r=[ 0, 0] g=[-1, 1] b=[ 0, 1]) */
/* Point-filtered gather-square (a=[ 0, 0] r=[ 1, 0] g=[ 0, 1] b=[ 1, 1]) */
float4 tl = FetchGS(SamplerG, coord + rcpres * float2(-1, -1)).xyzw;
float4 tr = FetchGS(SamplerG, coord + rcpres * float2( 0, -1)).xyzw;
float4 bl = FetchGS(SamplerG, coord + rcpres * float2(-1, 0)).xyzw;
float4 br = FetchGS(SamplerG, coord + rcpres * float2( 0, 0)).xyzw;
center = (tl.b + tr.g + bl.r + br.a); // * 4
/* Point-filtered gather-cross (r=[-1, 0] g=[ 1, 0] b=[ 0,-1] a=[ 0, 1]) */
gc.xyzw = float4(tl.g + bl.a, tr.b + br.r, tl.r + tr.a, bl.b + br.g); // * 2
/* Point-filtered gather-star (r=[-1,-1] g=[ 1,-1] b=[-1, 1] a=[ 1, 1]) */
gs.xyzw = float4(tl.a , tr.r , bl.g , br.b ); // * 1
/* 1.41 2.00 1.41
* 2.00 2.34 2.00
* 1.41 2.00 1.41
*/
ao = (center / 4.0) * (4.0 - 4.0 * (1.4142135623730950488016887242097 - 1.0));
ao += dot(gc, 1.0);
ao += dot(gs, 1.4142135623730950488016887242097);
ao /= 16;
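To double-check the weights, the arithmetic of the snippet above can be replayed on the CPU. The following Python sketch is a reference model only, with a plain 3×3 list standing in for the four gather fetches (the function names are made up). It uses the [-1, 1] corner for the third star element, as the gather-star comment states.

```python
import math

SQRT2 = math.sqrt(2.0)


def blur(t):
    """Replay the shader arithmetic on a 3x3 neighbourhood t[y][x]."""
    center = 4.0 * t[1][1]                       # center arrives 4 times
    gc = (2.0 * t[1][0], 2.0 * t[1][2],          # cross: left, right,
          2.0 * t[0][1], 2.0 * t[2][1])          #        top, bottom (x2 each)
    gs = (t[0][0], t[0][2], t[2][0], t[2][2])    # star: the four corners
    ao = (center / 4.0) * (4.0 - 4.0 * (SQRT2 - 1.0))
    ao += sum(gc)                                # dot(gc, 1.0)
    ao += SQRT2 * sum(gs)                        # dot(gs, sqrt(2))
    return ao / 16.0


def effective_weights():
    """Probe each texel with a unit impulse to recover its weight (x16)."""
    weights = [[0.0] * 3 for _ in range(3)]
    for y in range(3):
        for x in range(3):
            probe = [[0.0] * 3 for _ in range(3)]
            probe[y][x] = 1.0
            weights[y][x] = 16.0 * blur(probe)
    return weights


for row in effective_weights():
    print(" ".join(f"{w:.2f}" for w in row))
```

The probe recovers the 1.41 / 2.00 / 2.34 pattern, and the weights add up to 16, so a constant input passes through the filter unchanged.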
An average-case look-back
Let’s have a look at an average case (some sky, say 25% rejections, some planar geometry, say 25% rejections):
Before
4xFP16 read and write (shared alpha) +
8 times 4xFP16 read (blur kernel) +
4xFP16 read and write (merge) =
10 times 4xFP16 read plus 2 times 4xFP16 write =
120 MB/f read and 24 MB/f write (@1400×1050)
After
75% 1xFP16 write +
4 times 1xFP16 read (blur kernel) +
50% 4xFP16 merge =
12 MB/f read and 2.25MB/f write and 6MB/f merge (@1400×1050)
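The “After” numbers can be reproduced with a few lines of Python. This sketch uses the rounded surface sizes from the text (3 MB for 1xFP16, 12 MB for 4xFP16). It assumes that sky rejections skip the gather write and that both sky and planar rejections skip the merge, which is how I read the 75% and 50% above.

```python
ONE_CHANNEL_MB = 3.0     # rounded 1xFP16 surface at 1400x1050
FOUR_CHANNEL_MB = 12.0   # rounded 4xFP16 surface at 1400x1050


def average_case(sky=0.25, planar=0.25):
    """Return (read, write, merge) in MB/f for the optimized pipeline."""
    write = (1.0 - sky) * ONE_CHANNEL_MB             # clip() skips sky
    read = 4 * ONE_CHANNEL_MB                        # FETCH4 blur kernel
    merge = (1.0 - sky - planar) * FOUR_CHANNEL_MB   # alpha-test rejections
    return read, write, merge


read, write, merge = average_case()
print(f"{read} MB/f read, {write} MB/f write, {merge} MB/f merge")
```

Changing the two fractions shows how strongly the result depends on the scene: with no sky and no planar surfaces the merge is back at the full 12 MB/f.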
Now, I did this kind of optimization with HBAO, implemented in OBGE. HBAO is still memory bandwidth limited (spread z-buffer reads), but now only on the AO coefficient-gathering pass, not on any of the other passes. The result:
Before: 3.9 ms/f
After: 2.3 ms/f
Which is a very nice result. I repeated this for ATI’s HDAO implementation, which uses FETCH4 for the z-buffer reads as well and has a very small AO coefficient-gather kernel:
Before: 2.5 ms/f
After: 1.1 ms/f
I haven’t reduced the precision of the AO-renderbuffer, nor did I use any down-sampled render-buffer. In either case we’d not gain that much anymore because the shaders are already dominated by the gather-kernel. But if every cycle counts you may get this down to 1.5 ms and 0.6 ms respectively.