Memory management of a handmade voxel editor

xcodedave

#6982

May 14, 2016

Hi all,

This is my first personal project related post - I hope to join in and contribute where I can to others projects also :)

I'm working on a handmade voxel editor (in C/C++ and OpenGL) and would appreciate some exchange of ideas and thoughts around memory management.

The Octree Data Structure

The voxels are represented as leaf-nodes in an octree data structure. Currently the octree is a contiguous block of pre-allocated memory, so there are no pointers to child nodes etc (this is in order to avoid cache misses due to jumps etc). Instead, nodes are accessed based on their x,y,z position and depth, which gives an index/address into the octree memory. This is possible because I know up front how many nodes and what depth I need.

Each voxel has a colour-index per face. Currently this is 6 bit per face, since I want to use a limited colour-indexed palette of a potential 63 colours - a colour-index of 0x00 is reserved to represent an empty voxel.

This means each leaf node takes 36 bit total to represent 6 faces with colour indexing. This encoding obviously doesn't really fit fundamental data types well, So I'm using 64-bit to store it which leaves 28-bits of padding. You can imagine a voxel as:

1	typedef uint64_t Voxel;

Question 1. Is there anything I could do to improve this representation? Alignment etc.? Maybe it'd be better to use just a
1
#define

?

Triangulation & Rendering

The way rendering works is that given a list of nodes, the voxels are iterated over and triangulated. This list of output triangles is optimized. i.e. co-planar triangles are merged/removed and nodes which are entirely enclosed can be completely ignored etc. Batches of these triangles are uploaded as VBOs (Vertex Buffer Objects) to the GPU via OpenGL (or other graphics API) ready for rendering.

Unfortunately, before iterating through these voxels and building the triangle lists I have no idea how many triangles I will need, so it feels like I'll need to dynamically allocate memory. This sounds like a bad idea to me from a performance and memory fragmentation perspective.

My first thought is that perhaps I can make my own custom memory allocator - pre-allocating a large block of contiguous memory as soon as the program loads and treating it like a stack which I push/pop from.

Question 2. Does the idea of a custom memory allocator based on a stack access pattern sound reasonable? Are there better approaches I could consider?

Question 3. Should I maintain a separate stack for different data types? This would presumably make iterating over these data-types faster due to memory access patterns being contiguous?

Phew. Many thanks for getting this far, I hope to share my work once it is slightly further along!
Any feedback and discussion would be most appreciated!

Edited by xcodedave on June 2, 2016, 8:56pm Reason: updating title to better reflect topic

Abner Coimbre

#6984

May 14, 2016

xcodedave
This is my first personal project related post - I hope to join in and contribute where I can to others projects also :)

e-mail [email protected] to discuss how to get your project on the site.

Edited by Abner Coimbre on May 14, 2016, 2:39pm Reason: Wording

Oswald Hurlem

#7037

May 23, 2016

Don't optimize prematurely.

Using an octree is optimizing prematurely. You almost certainly have a sub-optimal branching factor. If you feel you must allow for sparseness, this early on, I would follow casey's example with HMH. Make a chunk type that holds something like 64x64x64 entities, and have a hashtable of chunks.

Using 6 bytes per face is optimizing prematurely. 63 is a somewhat awkward number of colors. The alignment is probably not going to do you any favors. If you're set on limiting the color pallet then feel free to make it so, but until you have a prototype you can just enforce the pallet using ifs.

For the triangle list, start a linked list of buffer segments. Something like:

struct BufferBlock
{
    VertexType Buffer[BUFFERSIZE];
    BufferBlock* Next;
};

Day 057 of HMH describes how you can make and recycle these segments without heap allocation and freeing.

xcodedave

#7100

May 31, 2016

Thanks for the useful feedback! I know I haven't given much information yet (the project is still early in development), but I appreciate your thoughts and critique.

RE: Octree sub-optimal branching factor - At the moment I've got an indexed full-octree, so it's not sparse at all (see: Advanced Octrees). Spatial queries on the full-octree are pretty optimal though, and this is an advantage for editing and tessellation.
Recently, like you suggested, I've been considering moving towards a hash table of chunks as full-octrees really don't scale well. Testing the performance of editing, tessellation and rendering while varying the voxel count per chunk is something I'll be measuring so I can find an optimum balance.

RE: the number of colours - Sorry if I didn't make it clear - the 6 bit colour-index per face gives 64 colours in total - but 0x00 will be reserved to represent transparency/no voxel. I'd like to make the data required to store the voxels as little as possible as I'd like to be able to render and load/save quite large and dense voxel models. It is an awkward size though - storing all faces of a single cube voxel would be 36 bit per voxel which for alignment purposes is obviously not great...I'll rethink this!

RE: The triangle list - Thanks for the linked-list idea. Something I've been thinking is that since (at the moment) I don't need to hold on to the triangles as they'll be uploaded to the GPU as VBOs (OpenGL Vertex Buffer Objects), I might use a stack allocator and after uploading to the GPU just reset the stack position to zero. (The triangle list would be generated per chunk; each chunk would have a VBO on the GPU).

Once again, thanks for your thoughts! :)

xcodedave

#7133

June 2, 2016

Just a quick update to show some of the early progress.

This is a 32x32x32 volume rendered as a single VBO; the faces are colored this way to highlight the tessellation/triangulation of the voxels. The number of triangles/vertices output is optimized by finding the largest rectangular area (of contiguous color) to avoid redundant vertices/triangles.

There's still more to do on this particular algorithm (hidden surface removal should reduce the number of triangles and vertices much further) and the project still has a long, long way to go!

Now some basic rendering is up, my next task is to start measuring the performance of various spatial partitioning approaches (discussed above) and looking into current bottle-necks.

At this point, machine spec is probably relevant!

For testing/building I'm using a Macbook 2015 with the following spec:

1.2 GHz Intel Core M
8 GB 1600 MHz DDR3
Intel HD Graphics 5300 1536 MB

One thing I've noticed so far is that uploading the VBO to GPU (using glBufferData) seems to take a long time, even for something this small. This is obviously not ideal considering editing voxels will require re-uploading the VBO to GPU...

Oswald Hurlem

#7145

June 3, 2016

Maybe it depends on the size of the VBO? stb_voxel_render emphasizes a small number of bytes per voxel face, I think that affects the speed to a degree. I haven't tested this yet although I eventually will.

xcodedave

#7153

June 5, 2016

Oswald_Hurlem
Maybe it depends on the size of the VBO? stb_voxel_render emphasizes a small number of bytes per voxel face, I think that affects the speed to a degree. I haven't tested this yet although I eventually will.

Currently, my VBO format works out to 100 bytes per face, so I can definitely see room for optimization here (bit masking single floats to extract xyz perhaps?). Even if speed is not much improved, it should free up a lot of memory on the GPU. After measuring my OpenGL call performance it seems that uploading the VBO data is actually pretty quick. (and doesn't need to happen that often).

The main performance bottleneck seems to be the actual rendering. I've measured the performance of the glDrawArrays call I'm making for 32x32x32 voxels. Currently this chunk of voxels is tessellated to a single VBO of ~90,000 triangles.

This single glDrawArrays call takes an average time of ~1.4ms to render (including buffer swapping). Based on these numbers, currently it seems I am limited to ~1 million triangles, or roughly 400,000 voxels per frame.

Do these numbers seem reasonable? I was (perhaps naively) hoping for more... maybe I just need a better and less brute-force strategy.

xcodedave

#7155

June 5, 2016

xcodedave

Currently, my VBO format works out to 100 bytes per face, so I can definitely see room for optimization here (bit masking single floats to extract xyz perhaps?).

I eventually got it down to 4 bytes per vertex, so 12 bytes per triangle, quite a big improvement. I don't really see any performance difference, though at least there's less wasted memory now.

Just in case interested, this is what I used for packing/unpacking the vertex data ('w' is used as a colour index):

static __inline__ uint32_t packVertex(uint8_t x, uint8_t y, uint8_t z, uint8_t w)
{
	    // Pack into one 32 bit uint.
	    return    (((uint32_t)x)	  |
	              (((uint32_t)y)<< 8) |
	              (((uint32_t)z)<<16) |
	              (((uint32_t)w)<<24) );
}

and the GLSL for unpacking...

vec4 unpackVertex(uint value)
{
	return vec4(float(int(value)	 & 0x000000ff),
		    float(int(value>> 8) & 0x000000ff),
		    float(int(value>>16) & 0x000000ff),
		    float(int(value>>24) & 0x000000ff));
}

I've also improved the tessellation by removing hidden faces (faces enclosed/surrounded by voxels) - this has reduced the triangle count by around 80%. The reduced triangle count has reduced overdraw and improved the render time (for the single 32x32x32 chunk) to 0.78ms.

Edited by xcodedave on June 5, 2016, 6:37pm

xcodedave

#7243

June 12, 2016

Another quick update - I haven't settled on how I'll write the GUI stuff yet, though for now I've implemented some basic usage of (the excellent) dear Imgui library.

The bad news is that my previous performance measurements have shown that I will not be able to render anywhere near as many voxels as I hoped whilst maintaining 60 frames per second (given the current tessellated mesh VBO pipeline). The limit would be around 400,000 voxels and I'm aiming for around 20,000,000!

This has caused me to totally re-think my rendering strategy, and I'll be looking into ray-tracing a sparse voxel octree on the GPU. The current rendering strategy of tessellating the voxels and rendering VBO meshes is too slow and just not scale-able to the density I'm hoping for.

I'll have to re-write all of the rendering code - and learn a lot about ray-tracing on the GPU in the process. For this, I'll be reading up on the Giga-Voxel paper which looks promising.

Oswald Hurlem

#7263

June 14, 2016

Are you only rendering voxels which aren't surrounded by other voxels?

Edited by Oswald Hurlem on June 14, 2016, 3:32am

xcodedave

#7270

June 14, 2016

Oswald_Hurlem
Are you only rendering voxels which aren't surrounded by other voxels?

Yep! I'm removing all voxels surrounded by others as well as redundant faces.

I'm also optimizing the number of triangles generated by finding the largest contiguous coloured area, and outputting as a single face. i.e. a cube of 10x10x10 contiguous voxels would actually still only output 6 faces (12 triangles) if all faces had matching colours.

Oswald Hurlem

#7279

June 14, 2016

That's going to create T-junction problems... not worth it IMO.

Nathan Day

#7310

June 17, 2016

I don't know how much this would help, but I think i saw that you were having some performance issues with glBufferData. One thing I've used in the past was persistently mapped buffers. It might not work for the case you have in mind, but it definitely worked for mine. I think it's quite a modern GL feature so if you are looking for vast backwards compatibility this could be useless. Theres a talk at Steamworks https://www.youtube.com/watch?v=-bCeNzgiJ8I which is quite helpful.

Hope this helps in some way :)