I'm working on a voxel game and recently completed a big refactor of the chunk meshing. When I first added baked ambient occlusion, the frame rate became unplayable in debug builds and I had noticeable hitches in release builds. The problem was that meshing a chunk meant sampling from its 26 neighboring chunks to calculate AO and to cull hidden faces. The slowness was likely due to lots of cache misses when fetching blocks from the different chunks.

My refactor took inspiration from "dual grid tiling": instead of meshing each chunk on its own, I mesh a region that spans part of each of 8 chunks. Each mesh now reads from 8 chunks instead of 27 (the chunk plus its 26 neighbors), a reduction of over 3x. I create a bitmask of solid voxels per row. Because my chunks are 16x16x16, I can combine the bitmasks for a row of voxels from two chunks into a single u32. Since I only mesh the center 16x16x16 area, I have instant access to the neighboring voxels and can calculate face visibility and AO for the entire row with only a couple of CPU instructions.

This massively improved performance: I am now maintaining 60fps+ even in debug builds while meshing new chunks. The noise calculation is done on a separate thread, and meshing is currently done on the main thread.
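
To make the row-bitmask idea concrete, here is a rough sketch in Rust. It is not the game's actual code: the function names, the bit layout (bit i is voxel i along the row, with the first chunk in the low 16 bits), and the `CENTER` mask for the meshed middle 16 voxels are all assumptions made for illustration.

```rust
/// Assumed layout: bit i of a row mask is 1 when voxel i along the row is solid.
/// The first chunk's 16 voxels sit in bits 0..16, the second chunk's in bits 16..32.
fn combine_rows(lo: u16, hi: u16) -> u32 {
    (lo as u32) | ((hi as u32) << 16)
}

/// Bits 8..24: the middle 16 voxels of the 32-voxel row, i.e. the ones being meshed.
/// The 8 bits on either side exist only so neighbors can be looked up.
const CENTER: u32 = 0x00FF_FF00;

/// Faces pointing along the row toward higher indices are visible where a voxel
/// is solid and the next voxel along the row is empty.
fn visible_pos_faces(row: u32) -> u32 {
    row & !(row >> 1) & CENTER
}

/// Faces pointing along the row toward lower indices are visible where a voxel
/// is solid and the previous voxel along the row is empty.
fn visible_neg_faces(row: u32) -> u32 {
    row & !(row << 1) & CENTER
}

/// Faces pointing at an adjacent row (for example the row one step up) are
/// visible where this row is solid and the adjacent row is empty.
fn visible_faces_toward(row: u32, adjacent_row: u32) -> u32 {
    row & !adjacent_row & CENTER
}

/// AO needs the solid neighbors beside each voxel in an adjacent row.
/// First result: bit i is set when voxel i-1 of `adjacent_row` is solid.
/// Second result: bit i is set when voxel i+1 of `adjacent_row` is solid.
fn side_occluders(adjacent_row: u32) -> (u32, u32) {
    ((adjacent_row << 1) & CENTER, (adjacent_row >> 1) & CENTER)
}
```

Each function handles all 16 meshed voxels of a row at once with a shift, a NOT, and an AND, and `count_ones()` on a result gives the number of faces to emit for that row. Because the neighbor bits from the adjacent chunk are already in the same u32, a voxel on the chunk boundary needs no special case.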