The main hurdle for people getting started with software rendering is getting out of the GPU mindset, where textures are cheap and the pipeline is limited to old API features. The real task is to approximate reality with the least effort on the given hardware. Think about how the shapes you see around you can be approximated using more than just polygons: pre-generated depth maps from different camera angles, fuzzy voxels, meta shapes, height fields, spline contours for up-scaling...
Not sure if you have any of this after a quick look, but for performance you can:
- Use unsigned integers as fixed-point values where possible. The best tricks use bitwise operations.
- Passively update shading to textures with specular light having a higher resolution and frequency while diffuse is stored further away. This allows sub-surface scattering, soft shadows, advanced blend mapping, parallax occlusion mapping, and more efficient SIMD and multi-threading by working with multiple images on separate background threads.
- You can also use deferred light.
- Bake your models to store texture data inside compressed, tessellated vertex geometry to fully eliminate random access reads in textures. Then optimize triangle rasterization for small triangles to get super fast detail tessellation on par with dedicated graphics cards.
- Use occlusion shapes to afford larger scenes by skipping triangles, tessellation patches, models and groups. Pixels are the most expensive primitives in CPU rendering, so compensate by making occlusion more accurate.
- Sort your world by depth to get analytical anti-aliasing at no significant cost. If combined with detail tessellation, you don't need anisotropic filtering, because you have more polygons than pixels on the screen at 1080p and 100 Hz.
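As a rough illustration of the fixed-point suggestion in the list above, here is a minimal C++ sketch. The 16.16 format and all names (`Fixed`, `fixedMul`, `mulU8`, `lerpU8`) are assumptions picked for the example, not taken from any particular renderer.

```cpp
#include <cstdint>

// Hypothetical 16.16 unsigned fixed-point type: upper 16 bits are the
// integer part, lower 16 bits are the fraction.
using Fixed = std::uint32_t;
constexpr int kFracBits = 16;
constexpr Fixed kOne = Fixed(1) << kFracBits;

// Integer to fixed point is a plain shift.
constexpr Fixed fixedFromInt(std::uint32_t i) { return i << kFracBits; }

// Floor is a shift, the fractional part is a mask. No branches, no floats.
constexpr std::uint32_t fixedFloor(Fixed f) { return f >> kFracBits; }
constexpr Fixed fixedFrac(Fixed f) { return f & (kOne - 1); }

// Multiply in 64 bits so the intermediate product cannot overflow,
// then shift back down to 16.16.
constexpr Fixed fixedMul(Fixed a, Fixed b) {
    return Fixed((std::uint64_t(a) * b) >> kFracBits);
}

// Product of two 8-bit channels where 255 means 1.0, rounded to nearest.
// The shifts replace a division by 255.
constexpr std::uint8_t mulU8(std::uint8_t a, std::uint8_t b) {
    std::uint32_t x = std::uint32_t(a) * b + 128u;
    return std::uint8_t((x + (x >> 8)) >> 8);
}

// Blend between two 8-bit channels with a weight in 0..256
// (0 returns a, 256 returns b). The result is truncated, not rounded.
constexpr std::uint8_t lerpU8(std::uint8_t a, std::uint8_t b, std::uint32_t t256) {
    return std::uint8_t((a * (256u - t256) + b * t256) >> 8);
}
```

The same pattern carries over to texture coordinates: keep them in fixed point, take the texel index with a shift and the interpolation weight with a mask.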
To move development forward faster, create higher-level types and functions that wrap all the repeated SSE intrinsics.
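A minimal sketch of what such a wrapper could look like, assuming SSE2 on x86 with a plain scalar fallback elsewhere. `F32x4`, `SKETCH_SSE2` and `scaleAndAdd` are hypothetical names made up for this example, and it uses unaligned loads to stay simple.

```cpp
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_SSE2 1
#endif

// Four packed floats behind one type, so calling code never
// touches the intrinsics directly.
struct F32x4 {
#ifdef SKETCH_SSE2
    __m128 v;
    explicit F32x4(__m128 raw) : v(raw) {}
    explicit F32x4(float s) : v(_mm_set1_ps(s)) {}
    static F32x4 load(const float* p) { return F32x4(_mm_loadu_ps(p)); }
    void store(float* p) const { _mm_storeu_ps(p, v); }
#else
    // Scalar reference path for targets without SSE2.
    float v[4];
    explicit F32x4(float s) : v{s, s, s, s} {}
    static F32x4 load(const float* p) {
        F32x4 r(0.0f);
        for (int i = 0; i < 4; ++i) r.v[i] = p[i];
        return r;
    }
    void store(float* p) const {
        for (int i = 0; i < 4; ++i) p[i] = v[i];
    }
#endif
};

inline F32x4 operator+(F32x4 a, F32x4 b) {
#ifdef SKETCH_SSE2
    return F32x4(_mm_add_ps(a.v, b.v));
#else
    F32x4 r(0.0f);
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
#endif
}

inline F32x4 operator*(F32x4 a, F32x4 b) {
#ifdef SKETCH_SSE2
    return F32x4(_mm_mul_ps(a.v, b.v));
#else
    F32x4 r(0.0f);
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
#endif
}

// dst[i] = src[i] * scale + bias, four elements per iteration.
// Assumes count is a multiple of 4.
inline void scaleAndAdd(float* dst, const float* src,
                        float scale, float bias, int count) {
    const F32x4 s(scale), b(bias);
    for (int i = 0; i < count; i += 4) {
        (F32x4::load(src + i) * s + b).store(dst + i);
    }
}
```

With the operators in place, the inner loop reads like ordinary math, and the wrappers should inline down to the same instructions as hand-written intrinsics in an optimized build.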
How I write deferred light for efficient processing in larger batches:
How I get clean syntax for SSE/NEON/reference without sacrificing any performance:
How I do 3D vector math with SIMD: