I started doing some work to add a GPU ray tracing system to my engine. I want to experiment and see what sort of effects can be enabled on current-gen GPUs, even without special hardware support. And of course, before we can shoot some rays, we need an acceleration structure (AS) for all the geometry in the scene. There are some popular ones, such as the BVH (which also happens to be my choice of AS in this case). However, it remains an open question how to efficiently maintain the AS for a highly dynamic scene like Monter’s, with animated meshes and dynamic geometry.
I want a high-quality BVH for the fastest ray tracing possible, so I rolled with a SAH-based top-down BVH builder. Despite producing high-quality BVHs, it has two problems: it is slow, and it can’t run on the GPU.
Instead of trying to come up with some fancy GPU builder that would run per frame, I’ve sidestepped the issue and come up with a scheme for maintaining the AS that removes the need for any fast BVH builder.
Here’s what the scheme achieves:
1. Zero rebuilds.
2. Allows instancing (reusing the same geometry with different transforms).
3. Handles skeletally animated meshes (skinned meshes).
AS Maintenance Overview
The AS is divided into two levels. The higher level ASs store the instance data (transforms, BVH pointer and geometry pointer), and the lower level ASs store the actual BVH and geometry data. This separation allows instancing and easy transformations as you will see in a bit. Borrowing the terms from DXR, let’s call the higher level “Top Level Acceleration Structure” (TLAS) and the lower level “Bottom Level Acceleration Structure” (BLAS).
At startup, the system builds a BLAS for each mesh that might be used, even deformable meshes, but without skinning applied. These BLASs are sub-allocated within a single buffer; let’s call it the BLAS buffer. Recall that a BLAS stores the actual BVH and geometry data, which is what we need to actually do ray tracing.
During each frame, the system builds a fresh instance array. During that time, a transform is assigned to each instance according to the world data. A BLAS pointer is also assigned to each instance. This allows multiple instances to point at the same BLAS, which is how instancing is realized.
To make sure this idea is concrete, let’s start with a simple example. Say we have three instances in the scene, with two instances sharing the same underlying mesh but different transforms. It would look like this:
illustration of the simple AS example
As you can see, instances A and B use the same mesh, so they share the same BLAS. This reduces the memory footprint of duplicate geometry. In addition, note that BLAS 2 is created even though it’s not used. This is because the system creates BLASs for all meshes, even if they are not used. This becomes significant in a second.
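To make the two levels concrete, here is a minimal CPU-side sketch of the example above. Everything in it is a stand-in I made up for illustration: the `Blas` and `Instance` types, the mesh names, an index in place of a GPU pointer, and a translation-only transform. It only shows how a per-frame instance array can reference a fixed set of BLASs.

```python
from dataclasses import dataclass


@dataclass
class Blas:
    """Bottom level: BVH plus geometry for one mesh (contents elided)."""
    mesh_name: str


@dataclass
class Instance:
    """Top-level entry: a transform plus a reference to a shared BLAS."""
    name: str
    transform: tuple  # assumption: translation only, to keep the sketch short
    blas_index: int   # index into blas_list, standing in for a GPU pointer


def build_instance_array(world_objects, blas_index_by_mesh):
    """Rebuilt from the world data every frame; the BLASs are not touched."""
    return [Instance(name, transform, blas_index_by_mesh[mesh])
            for name, mesh, transform in world_objects]


# Startup: one BLAS per mesh that might be used, even if nothing uses it yet.
meshes = ["crate", "barrel", "tree"]  # hypothetical mesh names
blas_list = [Blas(m) for m in meshes]
blas_index_by_mesh = {m: i for i, m in enumerate(meshes)}

# Per frame: three instances, two of which share the "crate" BLAS.
world = [("A", "crate", (0.0, 0.0, 0.0)),
         ("B", "crate", (5.0, 0.0, 0.0)),
         ("C", "barrel", (0.0, 0.0, 3.0))]
instances = build_instance_array(world, blas_index_by_mesh)
```

Instances A and B end up with the same `blas_index`, and the "tree" BLAS exists without any instance pointing at it.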
Handling animated meshes
Recall that the BLASs we have so far are built around static meshes. Once a mesh starts animating, its BLAS is no longer accurate. To handle this, the BLAS buffer is partitioned into two parts: permanent and transient. The permanent segment stores the BLASs of all meshes in static form. The transient segment stores the BLASs for all animated meshes.
Extending upon our last example, let’s say we add a new instance D that is associated with an animated mesh. To accommodate that, the AS system will “build” a brand new BLAS for instance D, and place it inside the transient segment. All BLASs in the transient segment are flushed and “rebuilt” every frame:
illustration of an AS example with deformable mesh
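One simple way to get this permanent/transient split is a bump allocator whose transient pointer is reset every frame. The sketch below is only an assumption about how such a buffer could be managed (the `BlasBuffer` class and its method names are hypothetical, and sizes are in arbitrary units); it is not taken from the engine.

```python
class BlasBuffer:
    """One buffer, two segments: permanent BLASs first, transient ones after."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.permanent_end = 0  # end of the permanent segment
        self.transient_top = 0  # bump pointer inside the transient segment

    def alloc_permanent(self, size):
        # Permanent BLASs are allocated at startup, before any transient ones.
        if self.transient_top != self.permanent_end:
            raise RuntimeError("permanent allocations must come first")
        if self.permanent_end + size > self.capacity:
            raise MemoryError("BLAS buffer is full")
        offset = self.permanent_end
        self.permanent_end += size
        self.transient_top = self.permanent_end
        return offset

    def begin_frame(self):
        # "Flush" every transient BLAS by resetting the bump pointer.
        self.transient_top = self.permanent_end

    def alloc_transient(self, size):
        if self.transient_top + size > self.capacity:
            raise MemoryError("BLAS buffer is full")
        offset = self.transient_top
        self.transient_top += size
        return offset
```

With this layout, flushing the transient segment costs nothing: the next frame simply starts writing at the end of the permanent segment again.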
BLAS Refit (as a Compute Pass)
But hold on a second, wasn’t a fresh BVH build every frame too expensive?
The trick is to “refit” instead of “rebuild”. We can make the assumption that the structure of a mesh does not change much during animation. Recall that a BLAS stores both geometry data and BVH data. All we have to do is skin the vertices and place them in the geometry data, then “refit” the BVH to the skinned geometry. The permanent BLASs act as molds that are copied from and refitted to create BVHs for animated meshes.
A BVH refit is a relatively cheap operation compared to a full rebuild. I don’t think BVH refit on the GPU is well covered on the internet, so I’m going to try to talk about mine. It’s definitely not the best, but it gets the job done.
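Here is a rough sketch of the “copy the mold, then skin” step. The dictionary layout (`nodes`, `vertices`, `influences`) is hypothetical, and the skinning is linear blend skinning cut down to translation-only bones with weights that sum to 1, purely to keep the example short. The point is that the tree topology is copied untouched and only the geometry changes, which leaves the node bounds stale until a refit runs.

```python
import copy


def skin_vertex(rest_position, influences, bone_translations):
    """Blend bone translations by weight (translation-only skinning)."""
    x, y, z = rest_position
    for bone, weight in influences:
        tx, ty, tz = bone_translations[bone]
        x += weight * tx
        y += weight * ty
        z += weight * tz
    return (x, y, z)


def instantiate_from_mold(mold, bone_translations):
    """Copy a permanent BLAS and replace its geometry with skinned vertices."""
    blas = copy.deepcopy(mold)  # BVH topology is reused as-is
    blas["vertices"] = [
        skin_vertex(v, w, bone_translations)
        for v, w in zip(mold["vertices"], mold["influences"])
    ]
    blas["bounds_stale"] = True  # node AABBs still describe the rest pose
    return blas
```

The mold itself is never modified, so it can serve as the source for the same copy again next frame.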
To achieve fast BVH refits, I store the indices to leaf nodes and a list of parent pointers for all nodes. We start off with the BVH:
Using the leaf node indices, we can launch as many GPU threads as there are leaf nodes, and immediately start chewing on the leaf nodes.
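Both tables can be derived once, at build time, from the child links. The sketch below assumes a flat node layout that I made up for illustration: `left[i]` and `right[i]` hold child indices, `-1` marks a leaf, and node 0 is the root.

```python
def build_refit_tables(left, right):
    """Return (parent, leaf_indices) for a BVH stored as flat child arrays."""
    node_count = len(left)
    parent = [-1] * node_count  # the root keeps -1
    leaf_indices = []
    for node in range(node_count):
        if left[node] == -1:
            leaf_indices.append(node)  # one GPU thread will start here
        else:
            parent[left[node]] = node
            parent[right[node]] = node
    return parent, leaf_indices
```

The length of `leaf_indices` is the number of threads to launch, and thread `i` starts at node `leaf_indices[i]`.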
To handle the leaf node refit, I just recalculate the AABB of the underlying primitives that these leaf nodes point to. Pretty simple.
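As a sketch, assuming each leaf references a contiguous range of triangles (`first_prim` and `prim_count` are hypothetical field names) and an AABB is a `(min, max)` pair of 3-tuples, the leaf refit looks like this:

```python
def triangle_aabb(tri):
    """AABB of one triangle given as three (x, y, z) vertices."""
    xs, ys, zs = zip(*tri)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def union_aabb(a, b):
    """Smallest AABB containing both a and b."""
    return (tuple(min(p, q) for p, q in zip(a[0], b[0])),
            tuple(max(p, q) for p, q in zip(a[1], b[1])))


def refit_leaf(first_prim, prim_count, triangles):
    """Recompute a leaf AABB from the (already skinned) triangles it owns."""
    box = triangle_aabb(triangles[first_prim])
    for i in range(first_prim + 1, first_prim + prim_count):
        box = union_aabb(box, triangle_aabb(triangles[i]))
    return box
```

On the GPU, this is the body each thread would run for its own leaf before moving up the tree.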
After finishing processing its node, each thread tries to grab the parent node. However, for every internal node, there will be two threads trying to grab it. To get around this, I have an atomic lock on each node that only allows one thread in. The thread that fails to grab the parent gets killed.
Once a thread grabs the parent, it knows for sure that the subtree it came up from is done processing. But it doesn’t know if the other child of the node has finished processing yet. In this case, each thread must “spinlock” until both children are done processing (communicated through their locks):
Once both children are done, the thread just takes the union of its children’s AABBs and writes that as the parent’s new AABB. Repeat this process until the last thread has reached the root, and there you have it. Now your BVH has been refitted completely.
Don’t literally spinlock your GPU thread (as in a busy wait). GPUs execute groups of threads in lockstep. If one thread spins, all of its neighbor threads are stalled too. If the thread you are waiting on happens to be one of them, you get a deadlock. The way my compute shader is set up, each thread executes a while loop, and at the end of each iteration there is a barrier that ensures all threads get to execute some code at each iteration, so all threads are synchronized. If a thread’s dependent lock is not ready, I just skip the iteration for that thread. Instructions will still be executed for neighboring threads; they will just get masked off for the threads that are still waiting.
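To show the control flow, here is a CPU simulation of the whole refit in Python. It is a sketch under my own assumptions, not the actual shader: each pass of the outer `while` loop stands for one iteration between barriers, the `locked` and `done` lists stand in for the per-node atomic lock and completion flag, and the flat `left`/`right`/`parent` arrays use `-1` for “none”.

```python
def union_aabb(a, b):
    return (tuple(min(p, q) for p, q in zip(a[0], b[0])),
            tuple(max(p, q) for p, q in zip(a[1], b[1])))


def refit(left, right, parent, leaf_indices, leaf_boxes):
    """Simulate the lockstep refit; returns (boxes, rounds)."""
    n = len(left)
    boxes, done, locked = [None] * n, [False] * n, [False] * n
    # Step 1: one "thread" per leaf writes its freshly computed leaf AABB.
    for thread, leaf in enumerate(leaf_indices):
        boxes[leaf] = leaf_boxes[thread]
        done[leaf] = True
    node = list(leaf_indices)            # last node each thread finished
    target = [None] * len(leaf_indices)  # parent each thread has locked
    alive = [True] * len(leaf_indices)
    rounds = 0
    while any(alive):
        rounds += 1                      # one loop iteration, ended by a barrier
        for t in range(len(alive)):
            if not alive[t]:
                continue                 # masked off
            if target[t] is None:
                p = parent[node[t]]
                if p == -1 or locked[p]:
                    alive[t] = False     # reached the root, or lost the race
                    continue
                locked[p] = True         # atomic on the GPU; serial here
                target[t] = p
            p = target[t]
            if done[left[p]] and done[right[p]]:
                boxes[p] = union_aabb(boxes[left[p]], boxes[right[p]])
                done[p] = True
                node[t], target[t] = p, None
            # else: the sibling is not ready, so skip this iteration
    return boxes, rounds
```

The thread that wins a parent never busy-waits inside an iteration. It just falls through and retries on the next one, which is what lets the neighboring threads keep making progress.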
Another important detail I glossed over is the implementation of TLAS. So far, our TLAS is just an array. If we were to do a ray traversal, then we would have to linearly burn through all the instances, which is quite inefficient, especially if the instance count is big.
An obvious solution is to build a BVH of instances. That’s exactly what a TLAS is. Since the number of instances can never get as big as the number of polygons in the scene, it’s okay to do a full rebuild on these instances every frame.
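For illustration, here is a small per-frame TLAS build over instance AABBs. Note that it uses a plain median split along the widest axis, which I picked for brevity and which is not necessarily what the engine does. The node layout (dicts with `box`, `left`, `right`, `instance`) is hypothetical, and each instance box is assumed to already be in world space.

```python
def union_aabb(a, b):
    return (tuple(min(p, q) for p, q in zip(a[0], b[0])),
            tuple(max(p, q) for p, q in zip(a[1], b[1])))


def build_tlas(instance_boxes):
    """Build a BVH over instance AABBs; returns a flat node list, root first."""
    nodes = []

    def build(ids):
        box = instance_boxes[ids[0]]
        for i in ids[1:]:
            box = union_aabb(box, instance_boxes[i])
        index = len(nodes)
        nodes.append({"box": box, "left": -1, "right": -1, "instance": -1})
        if len(ids) == 1:
            nodes[index]["instance"] = ids[0]  # leaf: points at one instance
            return index
        # Median split along the widest axis, ordered by box center.
        extent = [box[1][k] - box[0][k] for k in range(3)]
        axis = extent.index(max(extent))
        ids = sorted(ids, key=lambda i: instance_boxes[i][0][axis]
                     + instance_boxes[i][1][axis])
        mid = len(ids) // 2
        nodes[index]["left"] = build(ids[:mid])
        nodes[index]["right"] = build(ids[mid:])
        return index

    if instance_boxes:
        build(list(range(len(instance_boxes))))
    return nodes
```

Each TLAS leaf holds an instance index, so a traversal that reaches a leaf would look up that instance’s transform and BLAS pointer and continue into the bottom level from there.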
And there you have it. Following this process allows me to maintain a correct BVH for all instances in the scene, be it deformable or newly spawned. There are some complications with the actual ray traversal through a two-level BVH like this, but that will be the topic for another post.