Continuing from last post, DDGI requires a high amount of memory to store its depth maps. I present a compression scheme that compresses the data to 5~10% of its original size. Note this is completely orthogonal to BC texture compression or any bit-saving tricks, so this compression ratio can be improved further with those existing techniques.

Memory Constraint

For DDGI to work effectively, we need a 16x16 depth map attached to each probe, with each texel storing both depth and depth^2. If we use R16G16 to store each texel, then that would give us 1KB per probe, not counting the gutter padding.

Let’s say we have a generous budget of 1GB for light probes, and we have a 200x200x50 meter^3 world. Placing probes one meter apart, we would need 2 million probes, which will take up 1.9G. So a dumb uniform distribution of probes won’t cut it.

A smarter way to go about it is building a voxel tree around geometry, like the one talked about by Remedy (Multi-scale GI talk). This will save us a lot of memory, but turned out that isn’t enough either. First of all, dynamic objects need probes as well, so we still need to scatter a decent amount of them out in the opening. Secondly, this space optimization is pretty meaningless if your world has geometry everywhere. So not a really robust solution to our memory issue.


In Monter, I have probes placed conservatively and one meter apart. The probes eat up about 300MB. That’s pretty bad, because we can’t scale with that. Firstly, the algorithm breaks down in some places because the probes are just too damn far apart:

As you can see, either probes can’t fit in that crevasse, or it’s clipped against the wall and has nobody to interpolate with. This is pretty glaring, and the only way to solve this is placing more probes. So I bumped the probe resolution to be 0.5 meters apart:

Much better, But we just increased the memory footprint to 2.4GB … which makes it unusable. As I’ve said, the dominating factor is the radial depth map, let’s try to compress that.

”Dead” Probes

An obvious thought: probes are small distance apart from each other, so do we really need to store the true depth value? No, because during interpolation, probes must be within some distance threshold. If a probe happens to be farther from the shading point than the threshold, the shading point will query another probe that’s closer instead, because it’s guaranteed that such a probe will exist given our uniform probe density. We can just clip the depth value at max probe distance.

depth_to_store = min(true_depth, max_dist_from_probe_to_probe);

We just reduced the range of the depth value from [0, inf] to [0, N], N being the maximum distance possible from one probe to another during interpolation. Just some small value, anyway.

This makes our lives a lot easier. Immediately, a lot of probes that are farther from geometry than this maximum distance will have its radial depth data consist of only the value N. Then we can just toss all their depth maps out of the window and slap a label on them. Every time their depth gets queried, always return depth N for all directions.

This ends up working pretty well. For my case, the depth storage size has been compressed to 50% of its original size. Which gives us 1.2GB. I mean it’s not bad, but not great either. We can do better.

Dictionary Compression

Another thought: can we dictionary compress these depth map blocks? We see a clipped depth map of value N is very common, but are there any unknown configurations that are just as common, but we are not taking advantage of?

Well, there’s simply no tradeoff here. Dictionary compression is just better because what we did is simply a subset of dictionary compression. It’s only going to achieve a better compression ratio, not worse.

So here’s what we are going to do. We queue up light probe baking requests, and process them one by one. This process spits out a 16x16 depth block per request. Then we try to match this depth block with the previous depth blocks. If we find a match, we just spit out its index, otherwise we stamp the new depth block in the depth block list and spit out that new index.

The block matching criteria is peak pixel difference thresholding. As long as the biggest per-texel difference is below 0.01, I accept that block as a match. I don’t enforce an exact zero difference, and this implies lossy compression.

My look-back window size is 10k.

GPU implementation

This algorithm is slow if implemented as a single thread. Since my baking process is all on the GPU, I’ve added this in-place streaming compressor onto the GPU as well. Instead of processing one light probe, I batch a whole bunch per dispatch. This causes issues, however. Turns out the compressor works best if blocks within vicinity are available for matching, and the probe indices are ordered in a spatially coherent way. By batching probes, these close-by probes cannot cross-talk. I can run one probe per dispatch, but that will negate any latency-hiding capability of the GPU.

Instead, I swizzle the indices like so:

int SwizzleProbeIndex(int FlatIndex, int ProbeCount)
    int BlockSize = 100;
    int BlockCount = 500;
    int GroupSize = BlockSize*BlockCount;
    int GroupIndex = FlatIndex / GroupSize;
    int GroupHead = GroupIndex * GroupSize;
    int GroupTail = GroupHead + GroupSize;
    if (GroupTail >= ProbeCount) 
        return FlatIndex;
    int Offset = FlatIndex - GroupHead;
    Offset = ((BlockSize * Offset) % GroupSize) + Offset/BlockCount;
    return GroupHead + Offset;

This code iterates through the indices at large steps, which means a batch contains distant probes instead of neighbor probes. This enables neighbor block sharing without performance penalty.


The results are very nice, for the following configurations:
(dilation is applied to the probe volume to cover open ground)

density=0.5 meters, dilation=3 meters: 4.5% of its original size
density=1.0 meters, dilation=1.5 meters: 9% of its original size

If we map this level of efficiency to our original problem, then we get to store 2.4GB worth of DDGI probes in 168MB, awesome!

The compression ratio seems to improve as dilation gets bigger due to more “dead” probes, which are easy targets to get compressed. This is cool because it permits more probes in the open field.

Global illumination is one key effect I want to achieve with Monter. For a low-poly style game, GI is an important effect that gives the fidelity, which the low-poly game asset itself lacks. I’ve attempted to achieve this effect with irradiance probes, drawing ideas from both DDGI for probe interpolation and Remedy’s GI talk for optimal probe placement.

Global Illumination with Irradiance Probes

Before light probes, Monter used a constant term for indirect illumination, so that surfaces in shadow are not completely dark. It doesn’t account for indirect illumination or skylight illumination, which are key elements to producing realistic lighting. So I am aiming to finally bring global illumination into this game with the help of irradiance probes.

flat ambient term, looks awful!

this is what we want: indirect illumination!

An irradiance probe is a probe that captures incoming light from all directions towards a single point in space. I will twist this definition and say each probe encodes only the incoming indirect light + skylight, which is my specific use case. For simplicity’s sake, let’s just say we are totally spamming these probes everywhere.

How are these probes useful? Well, imagine we need indirect illumination at a surface point. We can look up the nearest eight probes, where these probes will form a box around the surface point. Then we just do a trilinear interpolation among these eight probes to approximate the incoming indirect light at the surface point. After that, we just convolve the incoming light with the surface’s BRDF, et voila, we will have something close to the indirect illumination at the surface. Of course there can be severe issues with this approach (probes stuck in walls, drawing irradiance from probes on the other side of a wall, etc), but we will talk about those later. For now, this is sort of the big picture we are working with.

Okay, cool, shall we start spamming those probes?

Irradiance Encoded as Spherical Harmonics

For each probe, we need to store the incoming light from every single direction, since we don’t know where the surface point will be facing at shading time. This poses a challenge as to how we should store them. There are potentially very many light probes, so we need the memory footprint of each probe to be low.

Luckily, people have figured this out. The incoming light can be modelled as a spherical function, and a spherical function can be decomposed into spherical harmonics.

Spherical harmonics sound scary, I know. I’ve been intimidated by them for quite a while, especially because introductory texts illustrate them as the following:

Good thing is, we don’t have to care about that. These guys are far more friendly when put in Cartesian form (at least for the first couple of bands). In short, they are just scalar functions that take in a 3d direction as input:

You can very clearly see that these guys are just simple spherical functions! In fact, after we fold the constants and place these equations into code, they just look like this:

float SH(float3 Dir, int Basis)
    float Res = 0.0;
    if (Basis == 0)
        Res = 0.28209479177387814347;
    else if (Basis == 1)
        Res = -0.48860251190291992159 * Dir.y;
    else if (Basis == 2)
        Res = 0.48860251190291992159 * Dir.z;
    else if (Basis == 3)
        Res = -0.48860251190291992159 * Dir.x;
    return Res;

Look a lot more friendly now, don’t they?

Now I’m not going to pretend I know anything about spherical harmonics, but here’s the big picture. First, we select a few spherical harmonics that will serve as the basis functions to reconstruct the original function. Then, for each of the spherical harmonics, we _project_ the original spherical functions into it to obtain a coefficient. We compute these coefficients such that the sum of all spherical harmonics, each scaled by their own coefficient, equals the original spherical function. Now, if we only select the first few spherical harmonics, we can only obtain a very rough estimate, but that estimate is already good enough for our purpose. Diffuse GI is quite low-detail and a low order encoding is sufficient.

Anyways, after that, we just end up with a handful of floats encoding the influence of each spherical harmonics if we were to reconstruct the original spherical function using them. In this case, the spherical function is just the incoming skylight+indirect light.

Spherical harmonics are scalar functions, so we need three sets of spherical harmonics coefficients, one set for each color channel. I ended up using FORMAT_R11G11B10_FLOAT, which is good enough for storing these coefficients. I’m using the first four spherical harmonics, so that means I need four coefficients each channel. That means that the irradiance of each probe can be stored at as small as 16 bytes, pretty good!

Baking the Probes

To start out, let’s lay the probes wastefully in a very dense 3D regular grid in space. This way we ensure every surface in the game world will be able to extrapolate indirect light from some probes.

To bake these probes, I could rasterize a cubemap around each probe and then turn them into spherical harmonics coefficients, but this idea just isn’t that appealing to me. First of all, the cubemap resolution will be pretty small, let’s say we are doing 64x64 texels per face. You have to realize that, for each of those faces, the _entire_ scene geometry will have to go through the raster pipe _six_ times per probe. Not to mention, all those triangles are going to end up as tiny pixels on the cubemap faces. It’s a terrible case for raster.

Instead, I chose to go with ray tracing, since we already have a ray tracing system in place. We can simply prep all the rays for some probes, and then dispatch all these rays within one single compute dispatch, fully saturating the GPU shader cores.

A diagram showing the baking pipeline

We can nicely pipeline the baking stages as a handful of compute passes, shown as above. A nice side effect is that this actually allows real-time probe update. We can have a scheme that selects a few key probes, and send them through this pipeline to recompute.

Probe Interpolation

Here comes the real problem, you see, since we laid the probes in a 3D grid, we can easily fall into situations like these:

We want to sample indirect illumination at the shading point. Unfortunately, probe 0 and 2 falls inside a wall. Clearly we can’t use data from probe 0 or 2, but a trilinear interpolation doesn’t know that. We can improve, though, by adding a geometric factor. Let’s say, on top of the trilinear weight, we only take data from probes that are above the shading surface’s tangent plane:

Using dot product of the probe’s relative positions and shading surface normal, we can tell which half-space the probes lie within w.r.t the surface’s tangent plane. We can incorporate them into our weighting function to eliminate probes from behind the surface.

However, when we put this trick in practice, we still get corner cases with light leaks:

an example of GI with trilinear + geometric interpolation

Turns out, the tangent plane test alone isn’t enough to eliminate bad interpolations alone. Here is one of many corner cases where this traditional probe interpolation breaks down:

Despite being in the positive half space of the shading surface’s tangent plane, probe 1 is placed inside the ceiling, and therefore should not be sampled from. As you can see, we just do not have enough data to reliably determine which probes to use during interpolation.

DDGI’s Contribution

Like I mentioned in the beginning of the post, a very recent technique was developed to combat this sort of artifact, namely DDGI. The big contribution from them is the extra visibility data they store for each probe. They store the spherical depth and depth^2, and use this information to reliably detect which probes should not be visible to the shading surface. Normally if you do this in a boolean fashion, you’d end up with sharp discontinuities. What’s cooler is that the DDGI paper has come up with a heuristic that would eliminate light leaks _and_ maintain smooth blending at the same time. I don’t believe I can do a better job explaining their ideas than the paper, but I do want to share a couple of important implementation details that made this technique work, which might have been overlooked during its original presentation/paper.

Storage of spherical depth

Spherical depth can be interpreted as a scalar spherical function, just like spherical light. I have tried encoding them as spherical harmonics, but the ringing artifacts caused the algorithm to break, so I didn’t go that route. I ended using an octahedral map to encode spherical depth and depth^2, just like the DDGI paper proposed.

This scheme was referred to as “octahedral encoding” during the talk, which really confused me. I think it’s much more correct to call this octahedral-mapping. All it does, really, is mapping 3d points on a unit sphere to a 2d point on a unit square. Exploiting this mapping function, we can pack a spherical function onto a unit texture. I’ve found out that this algorithm only works well enough with a 16x16 oct-map, which is also what the original paper proposed as well. Pushing the resolution any lower, and I started seeing artifacts.

Process of octahedral mapping

A 16x16 resolution oct-map can only store 256 discrete samples on the unit sphere, sampling any directions other than those discrete points will require interpolation. A nice thing about this mapping is that we can use hardware bilinear filter to do interpolation, since the neighbor oct-mapped texels are neighbors when unfolded back to the unit sphere space as well. The only thing we need to be aware of are sampling on the edges. Padding the texture borders is more complicated than just duplicating them.

Here’s how we should stitch the texels on an oct-map. If we examine the sphere unpacking process, we can see each border of the unit square are all edges of the tetrahedron that has been cut open.

Essentially, when we pad the texture borders, we need to “stitch” together these open edges, so that the bilinear filter can pick up the texels that correspond to their real neighbors on the unit sphere. Here’s roughly how the stitching will look:

oct-map stitched

And here are the texel borders actually laid out. I’ve numbered the texels so you can see the arrangement of the border:

oct-map padded with border to allow seamless bilinear filter

Implementation details of spherical depth storage

Depth map of each probe is prefiltered to allow smooth interpolation. One issue that arises from this is that big depth values can dominate the entire filtered depth function. In order to get this working, depths need to be truncated before storing them into depth maps and filtering them. This truncation is okay, knowing that no shading surface will query probes that are farther than a certain distance.

Another crucial detail is dealing with probes that fall inside geometry. Imagine a case like this:

Probe 0 and 2 are inside the wall, and therefore should be discarded during interpolation. These probes’ depth maps indicate that they should be directly visible to the shading surface however, because their visibility is unobstructed inside the wall; no triangle blocks their view of the shading surface:

The dark green arrows indicate that, from probe 0 and 2’s point of view, the shading surface should be within their reach, since these two rays are unobstructed on their paths.

Special care should be taken when a probe sees a wall that is facing outward. To fix this, I force the depth value to 0 when ray hits the backface. This will turn all probes to be completely invisible inside walls, except cases where the probe is literally on the surface itself.

rays are clamped to a depth of zero, since they’ve hit a backface triangle

Now that we have depth-based probe interpolation to prevent light leak, let’s take a look at our GI, see how it looks:

GI of light probes, incorporating spherical depths

Looks much better, but still has some artifacts. As I have alluded, this depth-based probe interpolation has one weakness. If the probe itself is placed directly on the shading surface or very close to it, then the depth data will be rendered useless. We can combat this by pulling the sampling point outwards along the shading surface’s normal and the view direction towards the camera. By adding this bias, we can finally achieve GI without much artifacts:

GI of light probes, incorporating spherical depths and sampling bias

Final thoughts

Now, we have decent GI, but how much have I paid for it? The above picture looks quite nice, but it’s because light probes are only 0.5 meters apart. My tiny world is 100x100x40 meters, to fill this space with probes, that means we will have to allocate and bake 800,000 probes.

That’s no issue if we only consider lighting data. We are storing 16 bytes per probe for irradiance since we are using spherical harmonics. The big problem is the visibility data: we need 1KB of visibility data per probe for this interpolation scheme to work! (16x16 res oct-map of depth and depth^2 value, each is stored as a 16-bit floating point, which totals to 1024 bytes). To store this many probes, we need 784MB … not good! The actual implementation of course uses a much smarter probe placement scheme, but I will save that for the next post.

Now, I’m not saying optimal probe placement completely solves the problem. We potentially need _that_ many probes once we have a lot of surfaces in the game asset. It’s really bad that each probe needs 1KB of data. What’s worse is that this algorithm really only holds up for high density probe fields. Here’s a GI picture with probes placed 1 meter apart (slightly sparser than what I used before).

GI artifacts by placing probes 1 meter apart

Now this is no light leak, but it’s not pleasant looking either. I am forced to place probes at a high density near the surfaces to remove these artifacts. After visualizing my voxelized probe placement, it really reminds me of light maps. If I were to encode static GI, I will probably end up with a similar amount of light map texels as there are probes, and I don’t need 1KB of visibility data for those lightmap texels … This is really making me wonder, is this scheme really worth it?

To give you a general idea, this is the probe resolution required for the algorithm to _just_ work …

Anyways, that’s it for obtaining GI from probe interpolation. I still haven’t said a word about my probe placement yet. The probe placement I am using is derived from Remedy’s GI talk in 2015. I might make the next blog post on that, stay tuned!
Last time I covered the design of acceleration structure (AS) for my game engine’s ray tracing system. This time, I will go into depth on how ray traversal is done with such an AS. As I go on to describe how ray traversal is done, it should become much clearer on why the AS is designed this way.

In a traditional ray tracer, all you have to do is to build an AS over some static geometry during startup, and for the rest of the program, you just keep tracing some rays through the AS. That part is well-covered in literatures such as Physically Based Rendering, but it becomes messier when we bring ray tracing to real-time games.

Rigid Body Transformation

The first obstacle is to track rigid body transformations. Building BVH over a mesh is an expensive operation, so we only do it on startup time. Transforming the mesh then would invalidate the BVH we’ve built. It is not uncommon for a mesh to be continuously transformed throughout multiple frames, and we must have a way to handle that.

The simplest solution is to transform all the vertices and refit the BVH every frame. However, this can cause problems. First of all, BVH is composed of a hierarchy of axis-aligned bounding boxes (AABB), subdividing the space that the mesh occupies as the level goes deeper. The subdivision is carefully chosen at build-time to have optimal ray traversal performance. Expanding/shrinking the AABBs to fit the transformed mesh will no longer guarantee the same traversal characteristics. Especially if the mesh is rotated, the AABBs will change drastically and lead to almost crippled BVHs in terms of performance. Another obvious thing is we need to transform all the vertices first, which of course takes time.

A much better solution is to transform the ray to object space instead of transforming mesh to world space. This will preserve the same intersection, and it removes the need to transform vertices. And of course, the biggest win here is the BVH is no longer updated; we still get the optimal subdivision that we had from the first time we built it.

This is why each BLAS in the AS stores an inverse transformation matrix. Rays will be transformed to local space using this inverse transform before the actual traversal starts.


One thing I’d like to be supported is instancing. In the game, we inevitably have multiple entities that share the same static mesh. Storing duplicate BVHs and mesh data for these entities is a waste. Therefore, the AS is pulled into two levels. Top level entities who share the same mesh will point to the same bottom level AS.

Deformable Mesh

This is the trickiest case to handle. Vertices actually change, so we need to apply skinning every frame. This also means extra storage for these deformed vertices _and_ the BVHs that are refitted specifically for these deformed vertices. Here we just have to eat the cost of refitting BVH. BVH’s quality will suffer, but refitting is quite fast using GPU compute (covered in last post).

Two-Level Ray Traversal

The traversal itself is quite simple, but the fact that AS is two-level might seem daunting at first, so I will cover that briefly.

The algorithm used to traverse a single BVH is the same as the one in Physically Based Rendering. I don’t think I can do a better job explaining it, so here’s the link.

The Top-level AS is a BVH of instances, each instance node storing pointers to bottom-level AS and storage for the inverse transform. You can also say that it’s a BVH of BVHs. Applying the typical BVH traversal to top-level AS, we eventually hit one of the leaf nodes (instance nodes in this case). Now we stop traversal, remember this location in the top-level AS, then context switch to the bottom-level AS that it’s pointing to. This is also where we apply the inverse transform to our rays (though we should still preserve rays in global space for traversal of other instances).

After we context switched to the bottom-level AS, loading up the BVH address and transforming our ray, we again apply the same BVH traversal algorithm. Except this time, the leaf nodes are triangle nodes. A hit with the leaf node will become a candidate to ray intersection.

After the bottom-level traversal, we pop back to top-level. We continue this loop until we are done with all top-level nodes.

Now we are done, that’s all there is to it. The real work here is maintaining the two-level AS. The actual ray traversal is light work in comparison.
I started doing some work to add a GPU ray tracing system to my engine. I want to experiment and see what sort of effects can be enabled on current gen GPUS, even without special hardware support. And of course, before we can shoot some rays, we need an acceleration structure (AS) for all the geometries in the scene. There are some popular ones, such as BVH (which also happens to by my choice of AS in this case). However, it remains an issue as to how to efficiently maintain AS for a highly dynamic scene like Monter’s, with animatable meshes and dynamic geometries.

Problem Statement

I want high quality BVH for fastest ray tracing possible, so I rolled with a SAH-based top-down BVH builder. Despite producing high quality BVHs, it has two problems: it is slow, and it can’t run on the GPU.

Instead of trying to come up with some fancy GPU builder that would run per frame, I’ve sidestepped the issue and came up with a scheme of maintaining AS that removes the need of any fast BVH builder.

Here’s what the scheme achieves:

1. Zero rebuilds.
2. Allows instancing (reusing same geom but with different transforms).
3. Handles skeletal animated mesh (skinned mesh).

AS Maintenance Overview

The AS is divided into two levels. The higher level ASs store the instance data (transforms, BVH pointer and geometry pointer), and the lower level ASs store the actual BVH and geometry data. This separation allows instancing and easy transformations as you will see in a bit. Borrowing the terms from DXR, let’s call the higher level “Top Level Acceleration Structure” (TLAS) and the lower level “Bottom Level Acceleration Structure” (BLAS).

At startup, the system builds a BLAS for each mesh that might be used, even deformable meshes, but without skinning applied. These BLASs are sub allocated within the same buffer, let’s call it BLAS buffer. Recall that a BLAS stores the actual BVH and geometry data, so that’s what we need to actually do ray tracing.

During each frame, the system builds a fresh instance array. During that time, a transform is assigned to each instance according to the world data. Also, a BLAS pointer is assigned to each instance. This allows multiple instances pointing at the same BLAS, realizing instancing.

To make sure this idea is concrete, let’s start with a simple example. Say we have three instances in the scene, with two instances sharing the same underlying mesh but different transforms. It would look like this:

illustration of the simple AS example

As you can see, instance A and B use the same mesh, so they share the same BLAS. This saves memory footprint for duplicate geometries. In addition, note that BLAS 2 is created even though it’s not used. This is because the system creates BLASs for all meshes, even if they are not used. This becomes significant in a second.

Handling animated meshes

Recall that the BLASs we have so far are built around static meshes. Once the mesh starts animating, BLAS is no longer accurate. To handle this, BLAS buffer is partitioned into two parts: permanent and transient. The permanent segment stores BLAS of all meshes in static form. Transient segment stores BLAS for all animated meshes.

Extending upon our last example, let’s say we add a new instance D that is associated with an animated mesh. To accommodate that, the AS system will “build” a brand new BLAS for instance D, and place it inside the transient segment. All BLASs in the transient segment are flushed and “rebuilt” every frame:

illustration of an AS example with deformable mesh

BLAS Refit (as a Compute Pass)

But hold on a second, wasn’t a fresh BVH build every frame too expensive?

The trick is to “refit” instead of “rebuild”. We can make the assumption that the structure of a mesh does not change much during animation. Recall that BLAS stores both geometry data and BVH data. All we have to do is to skin the vertices and place them in the geometry data, then “refit” the BVH to the skinned geometry. The permanent BLASs act as mold that are copied from and refitted to create BVHs for animated meshes.

A BVH refit is a relatively cheap operation, compared to a full rebuild. I don’t think BVH refit on the GPU is well covered on the internet, so I’m going to try talk about mine. It’s definitely not the best, but it gets the job done.

To achieve fast BVH refits, I store the indices to leaf nodes and a list of parent pointers for all nodes. We start off with the BVH:

Using the leaf node indices, we can launch as many GPU threads as there are leaf nodes, and immediately start chewing on the leaf nodes.

To handle leaf node refit, I just recalculate AABB of the underlying primitives that are pointed to by these leaf nodes. Pretty simple.

After finishing processing its node, each thread tries to grab the parent node. However, for every internal node, there will be two threads trying to grab it. To get around this, I have an atomic lock on each node that only allows one thread in. The thread that fails to grab the parent gets killed.

Once a thread grabs the parent, it knows for sure the subtree from which it comes up are all done processing. But it doesn’t know if the other child of the node has done processing yet. In this case, each thread must “spinlock” until both children are done processing (communicated through their locks):

Once both children are done, the thread just union both its children’s AABBs and update that as parent’s new AABB. Recurse this process until the last thread has reached root, and there you have it. Now your BVH has been refitted completely.

Don’t literally spinlock your GPU thread (like a busy wait). GPU execute groups of threads in lockstep. If you spinlock on one thread, then all the neighbor threads are locked too. If the thread you are waiting on happens to be one of them, then it will be a deadlock. The way my compute shader is set up is that each thread is executing a while loop, and at the end of while loop, there is a barrier that ensures all threads get to execute some code at each iteration, so all threads are synchronized. If for any thread whose dependent lock is not ready, I just skip the iteration for that thread. Instructions will still be executed for neighboring threads, they will just get masked off for the threads that are still waiting.

TLAS rebuild

Another important detail I glossed over is the implementation of TLAS. So far, our TLAS is just an array. If we were to do a ray traversal, then we would have to linearly burn through all the instances, which is quite inefficient, especially if the instance count is big.

An obvious solution is to build a BVH of instances. That’s exactly what a TLAS is. Since the number of instances can never get as big as the number of polygons in the scene, it’s okay to do a full rebuild on these instances every frame.


And there you have it. Following this process allows me to maintain a correct BVH for all instances in the scene, be it deformable or newly spawned. There are some complications with the actual ray traversal through a two-level BVH like this, but that will be the topic for another post.
Hey guys, it’s been a long time. This time I want to share an optimization I did with my volumetric cloud. Here is the procedural volumetric cloudscape article that contains my initial implementation. In this post, we are going to expand upon that, and introduce techniques that would make things go fast without loss of quality.

Where we left off

As you can recall, our volumetric cloud solution was quite expensive. To run at a decent quality at 1080p, it was taking about 10~30ms on a GTX 1060 card, definitely over the frame budget.

There were two dimensions to our cost. One is the number of rays we are shooting, corresponds to the size of our shader output. The other dimension is sample count, which determines the fidelity of our cloud. Both of these need to be pretty high to achieve nice quality.

In the end of the last post, I ended up doing this cloud pass on a lower res texture and upscaled it. This reduces the number of rays drastically, and hence making this a practical implementation. However I was not quite happy with the quality loss. You can really notice the bilinear upsample artifacts and it’s just not satisfying to have it at such a low quality. So I decided to do a second pass and optimize it in the other dimension instead: sample count.

The problem

I hate the low-res look on the clouds, so this time I’m going to render this at 1080p. Let’s see how fast it is on different number of samples. Let’s start small:

16 samples per pixel

Not looking good. The slices are artifacts from undersampling the volumes. Despite the low quality, we are already hitting way above the frame budget. Let’s try crank things up a bit:

32 samples per pixel

Better, but still ridden with noticeable artifacts. Let’s up it a bit more:

64 samples per pixel

Still some artifacts remain … but somewhat passable. Even this barely passable cloud shading is taking us THIRTY milliseconds. That is way above our frame budget.

Let’s jitter things up a bit

One common trick in graphics is to turn artifacts into noise. If you recall, we sample the cloud by ray marching at a constant step. This gives the “slices” artifact when the sampling rate is too low. However, we can add noise to the starting point of our sampling region.

float Delta = GetStepSize(); // this is our raymarch step size
float T0, T1 = SetSamplingRange(); // T0 is the start range, and T1 is the end range

// Our noise trick. Before sampling, we jitter our start range a bit
T0 += Random(PixelCoordinate) * Delta;

// then we raymarch …..

This gets adds a lot of randomness to our regular sampling pattern, removing the “slicing” artifacts and introducing noise instead. Even better, I use a blue noise distribution for my random numbers for an “even” randomness, it really does a trick on your eyes. Here is the result with only 16 samples:

16 samples per pixel jittered

(Make sure you open the original image to check out those blue noise patterns! When the image is downscaled it's really hard to notice them)

Wow, this is already much better. One more trick is to cycle through multiple noise textures to trick the eye to do temporal integration for us. When we do that, noise is actually substantially less noticeable … hold on, why don’t we temporally integrate it ourselves?

Temporal reprojection

Instead of throwing away the samples from our previous frame, how about we keep it and add it to current frame’s integration? This is a neat idea borrowed from the newly emerged TAA technique, which is explained in this video by playdead.
Essentially, we can take the camera transform from the previous frame and current frame, then reproject pixels from last frame to the current frame.

Illustration of temporal reprojection by Playdead

By doing this every frame, we create a feedback loop of samples and accumulate it through time. TAA uses an exponential moving average to integrate samples across time, which gives a nice weight falloff for stale samples while keeping them in the integration. It’s a nice scheme so that’s what I chose to use for my clouds.

A typical TAA implementation would also account for a velocity buffer in order to achieve more accurate reprojection for moving/deforming objects. However, our clouds are very slow moving, so it’s not necessary here.

However, since the reprojection is rarely pixel-perfect, we have to be careful with how we sample our previous frames. A bilinear filter works fine in our case. It introduces some blurring in TAA, but since we are dealing with fuzzy clouds, this much blurring is barely noticeable.

Putting it all together

With our temporal reprojection in place, we can cut our samples even more. I’ve cut it down to 4 samples per pixel. The image still contains some noise, but trust me, when this gets animated, you can’t even notice a difference:

4 samples per pixel, temporally reprojected

(Again, open the image at full res to see the quality improvements for yourself!)

And the best part is, this is only 3 milliseconds. We have achieved a 10x speedup and an even better quality. This is awesome!


Here is a video demonstrating the robustness of this technique under heavy motion and movement. I achieved pretty high quality cloudscape at 3ms (worse case). Still room for improvements, but I say this is good for now!