Why Are Lots of Small Draw Calls Bad, and How Does Batching Help?

So, I have been looking into batching in OpenGL, because I constantly hear that batching draw calls is a big win. I don't quite understand how this helps, though, because I don't understand what the actual per-draw-call overhead is: ultimately the GPU is going to render the same number of vertices either way.

For example, assuming no state changes between draw calls, why would I want to spend CPU time transforming vertices and updating a dynamic VBO instead of just executing multiple draw calls? I hear that for small meshes, transforming vertices on the CPU is worth it for the savings, but what exactly produces the CPU overhead on draw calls?

I also read that the reason is that the GPU renders super fast, so if the CPU can't supply data fast enough, we are completely CPU-bottlenecked. Once again, I don't see how batching solves this issue: sure, we are supplying more data for the GPU to churn through per call, but we are also taking longer to submit it. The only thing I can assume is that some sort of driver overhead is occurring. My current train of thought is that it fills up the command push buffer faster in the user-mode application, and so has to incur context switches to copy it over to the kernel-mode command buffer multiple times per frame? I honestly have no clue, even after reading multiple online resources.

Thanks. I would really love it if someone could help elucidate this for me!

This is a complex topic, but the main point is that the GPU driver (software side) has extra work to do for each draw call. GPUs are so fast that this CPU overhead can prevent keeping the GPU 100% busy.

That said, if you are only drawing a few thousand or tens of thousands of polygons, it does not matter. Your GPU won't be busy enough for the CPU overhead to matter. Once you get into hundreds of thousands or millions of polygons per frame, then it will matter. Your CPU won't be able to process that many draw calls every frame if you don't do proper batching.

To do batching, you should not be doing vertex transformations on the CPU. That's the wrong way to approach it. There are techniques where you keep all the vertex data on the GPU and supply a separate transformation matrix per model, but still draw them all with one draw call (as long as you can use the same shader).
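For example, something like this (just a sketch with made-up names; I'm assuming instancing here, and a real renderer would keep the matrices in a UBO or SSBO since plain uniform arrays are small):

// Vertex shader: index a per-model matrix array with gl_InstanceID, so
// every instance gets its own transform from a single draw call.
const char *vsSource =
    "#version 330 core\n"
    "layout(location = 0) in vec3 position;\n"
    "uniform mat4 viewProj;\n"
    "uniform mat4 modelMatrices[32];\n"   // bounded by uniform space
    "void main() {\n"
    "    gl_Position = viewProj * modelMatrices[gl_InstanceID]\n"
    "                * vec4(position, 1.0);\n"
    "}\n";

// C side, per frame: upload all matrices once, then a single draw call
// renders instanceCount copies of the mesh, each with its own transform.
glUseProgram(program);
glUniformMatrix4fv(glGetUniformLocation(program, "modelMatrices"),
                   instanceCount, GL_FALSE, (const GLfloat *)matrices);
glBindVertexArray(meshVao);
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0,
                        instanceCount);

This draws N copies of the same mesh; for drawing different meshes in one call there is glMultiDrawElementsIndirect and friends, but the principle is the same: matrices stay per-model, the draw calls collapse into one.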
I found slides from an NVIDIA GDC talk: Batch, Batch, Batch.

It would be better to have the video alongside them, but the slides are still pretty interesting to read on their own!
If it isn't too complex, could you perhaps give me an example of the type of overhead the driver might incur per draw call? I can't imagine what that may be, since in my mental model it just has to put a draw command in the push buffer. Assuming no state/material changes, I don't see where any overhead would come in. My assumption is that it's fairly esoteric things that just make it true, like consistency checks, handling issues to do with multiple applications modifying the kernel command buffer, etc.

And I suppose one of the issues with lots of state changes in general, not even worrying about draw calls in particular, is that the driver is going to have to flush those calls out to the DMA command buffer, perhaps even multiple times per frame, so that's potentially some extra overhead too (obviously it depends on the driver implementation). I know there's no way to know precisely, but is this the sort of thing that could, at least in theory, be the kind of overhead you are referring to?

Also not something I want to get into too much, since I just don't understand enough about the problem space, but I believe the CPU transformation of vertices a la Unity 3D dynamic batching is, at least from what I've read, actually considered a marked improvement over just rendering one thing at a time, if the vertex count of the mesh you are drawing is under a certain amount. In Unity, in my understanding, it dynamically batches meshes with under 900 vertices (usually fewer, depending on the number of vertex attributes, etc.), basically trading compute time on the application side for driver time; see the sketch below for roughly what I mean. I think the method you were talking about is instancing (maybe you are suggesting something I haven't heard of yet), but I've been told that instancing leads to bad warp utilization, since if I am rendering 100 quads, each quad will need to take up a full warp. Definitely not sure, though; I could be completely wrong.
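To make that concrete, here's roughly what I picture dynamic batching doing. All names (Vertex, Mesh, etc.) are invented, positions only, and I'm assuming a suitable VAO and GL loader are already set up:

#include <GL/gl.h>   // plus your GL loader of choice

typedef struct { float x, y, z; } Vertex;
typedef struct {
    const Vertex *vertices;   // object-space vertex data
    int vertexCount;
    float model[16];          // column-major model-to-world matrix
} Mesh;

/* Transform a point by a column-major 4x4 matrix (w assumed to be 1). */
static Vertex mat4_mul_point(const float m[16], Vertex v)
{
    Vertex r = {
        m[0]*v.x + m[4]*v.y + m[8]*v.z  + m[12],
        m[1]*v.x + m[5]*v.y + m[9]*v.z  + m[13],
        m[2]*v.x + m[6]*v.y + m[10]*v.z + m[14],
    };
    return r;
}

static void draw_dynamic_batch(GLuint dynamicVbo, Vertex *staging,
                               const Mesh *meshes, int meshCount)
{
    // CPU cost: pre-transform every small mesh into world space...
    int total = 0;
    for (int i = 0; i < meshCount; i++)
        for (int v = 0; v < meshes[i].vertexCount; v++)
            staging[total++] =
                mat4_mul_point(meshes[i].model, meshes[i].vertices[v]);

    // ...driver savings: one upload and one draw call for all meshes.
    glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);
    glBufferData(GL_ARRAY_BUFFER, total * sizeof(Vertex), NULL,
                 GL_STREAM_DRAW);   // orphan last frame's storage
    glBufferSubData(GL_ARRAY_BUFFER, 0, total * sizeof(Vertex), staging);
    glDrawArrays(GL_TRIANGLES, 0, total);
}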
@nickav: I would not rely on the "Batch, Batch, Batch" presentation/talk much. It is close to two decades old, and GPU drivers in those days were quite different.
Here's a somewhat more modern talk: https://developer.nvidia.com/cont...adically-reduce-driver-overhead-0 (still over 5 years old, so...)
And obviously AZDO:
https://www.gdcvault.com/play/102...proaching-Zero-Driver-Overhead-in
http://developer.download.nvidia....016/mschott_lbishop_gl_vulkan.pdf
https://developer.nvidia.com/opengl-vulkan

@Bronxolon: What you say is generally true for Vulkan- and D3D12-style APIs. There the driver overhead is minimal, and most of the time it really does just push commands to a buffer. On the other hand, GL/D3D11-style APIs have much larger overhead due to how far the API is abstracted from the hardware. Every time the GL driver encounters a draw command, it needs to figure out how to properly build the command stream for the GPU. There is a lot of state to take into account: not only vertex attribute pointers, but blend state, rasterizer state, the shader in use (maybe it needs to be recompiled), the uniforms in use, the attributes in use, etc. There is a ton of validation work happening before anything gets to the actual command buffer. That is why modern GPU drivers have a separate thread for doing these things; the raw GL API calls mostly just hand work to this thread, where all the submission work happens.
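Purely as an illustration (this is not any real driver's code; every type and function below is invented), the work hiding behind a single GL draw call looks conceptually like this:

typedef struct Context Context;
typedef struct ShaderVariant ShaderVariant;

void validate_vertex_attribs(Context *ctx);
void validate_framebuffer(Context *ctx);
void validate_blend_and_raster_state(Context *ctx);
ShaderVariant *lookup_or_compile_variant(Context *ctx);
void make_resources_resident(Context *ctx);
void emit_state_packets(Context *ctx, ShaderVariant *sv);
void emit_draw_packet(Context *ctx, int mode, int count);

void driver_draw_elements(Context *ctx, int mode, int count)
{
    /* 1. Cross-validate the mountain of latched GL state. */
    validate_vertex_attribs(ctx);
    validate_framebuffer(ctx);
    validate_blend_and_raster_state(ctx);

    /* 2. The current state combination may require a new shader
          variant, which can mean a recompile right here, mid-frame. */
    ShaderVariant *sv = lookup_or_compile_variant(ctx);

    /* 3. Make sure referenced buffers/textures are resident on the GPU. */
    make_resources_resident(ctx);

    /* 4. Only now translate everything into hardware-specific packets. */
    emit_state_packets(ctx, sv);
    emit_draw_packet(ctx, mode, count);
}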

So correct me if I'm wrong, but basically what you are saying is that state changes and all that sort of stuff are attached to draw calls in OpenGL, e.g. "use this texture unit" or "use these uniforms". Essentially, that means the driver not only has to validate what the current state of the GPU is, but also has to generate commands that translate the OpenGL calls, which may say something as simple as "please draw this mesh", into what can be, and probably is, much more complicated on the GPU: if the VBO is not resident in VRAM, load it in; set the GPU registers that the uniform variables are going to refer to; etc. Basically, the fact that the GPU architecture is so heavily abstracted in OpenGL leads to quite a lot of CPU time being eaten up by the driver.

I guess this also explains why, when I profile state-change calls, they tend to be pretty fast: the state change only becomes an issue when the driver actually processes a draw call and has to incur the extra overhead, beyond the base draw, of validating all that state.
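In other words, something like this (my mental model now, anyway, as an illustration only):

void draw_one_mesh(GLuint tex, GLuint prog, GLsizei indexCount)
{
    glBindTexture(GL_TEXTURE_2D, tex); // cheap: latches "unit 0 = tex"
    glUseProgram(prog);                // cheap: latches "current program"
    glEnable(GL_BLEND);                // cheap: flips a latched state bit

    // Expensive: here the driver reconciles all the latched state, maybe
    // picks or compiles a shader variant, and emits hardware commands.
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);
}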
That's exactly right.