Public beta of 2D/3D/isometric software renderer

I have a public beta of my 2D/3D/isometric software renderer (currently for Linux). Hopefully I can get some feedback about usability (with valid arguments, not religious references), feature suggestions, and find the last bugs before making a 1.0 release, after which version compatibility becomes important. Right now I can still remove things that should not be in the core library.
https://github.com/Dawoodoz/DFPSR

Coding style
I try to keep the style minimalistic and unconventional in order to find a better way to write C++. Old habits have to die so that both performance and usability can improve. The library does not have any rendering context or global state, because each resource is reference counted independently. Most of the API uses global functions to improve testability and reduce cyclic dependencies. Classes are mostly hidden internally and used for running the GUI system.
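To illustrate the style (this is only a rough sketch with example names, using std::shared_ptr as a stand-in for the library's own reference counting, not the real API), resources are opaque reference-counted handles and the operations are free functions taking those handles as arguments:

#include <cstddef>
#include <cstdint>
#include <memory>

// Rough illustration of the handle-plus-free-function style, not the library's real API.
struct ImageData { int width = 0, height = 0; std::unique_ptr<uint8_t[]> pixels; };
using ImageRgbaU8 = std::shared_ptr<ImageData>;

// Free functions operate on the handle directly; no rendering context or global state.
ImageRgbaU8 image_create(int width, int height) {
	auto image = std::make_shared<ImageData>();
	image->width = width;
	image->height = height;
	image->pixels = std::make_unique<uint8_t[]>(size_t(width) * size_t(height) * 4);
	return image;
}

void image_fill(const ImageRgbaU8 &image, uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
	for (int64_t i = 0; i < int64_t(image->width) * image->height; i++) {
		uint8_t *pixel = &image->pixels[i * 4];
		pixel[0] = r; pixel[1] = g; pixel[2] = b; pixel[3] = a;
	}
}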

Platforms
More desktop/laptop systems can be supported through contributions, but I only have different Linux distributions on my 10 computers, because Microsoft abandoned their last stable release (Windows 7). Android and iOS will not be supported, because this project focuses on long-term stability over centuries, not months.

Zero dependency
It does not link dynamically to any third-party library. No 3D-accelerated drivers or OpenGL extensions are required to function, just the bare minimum of the operating system. Window managers are optionally linked from the outside as a single module, so that porting to platforms in a distant future stays easy. If you don't want a window manager at all, you can convert the image into ASCII art or write it to an image file.

Intel/ARM abstraction
Both SSE2 and NEON are supported through a portable hardware abstraction layer for SIMD. The infix syntax generates code for the target system at the same speed as hand-written intrinsics, and testing on one platform almost guarantees that it will work on the other platforms as well. I have profiled and inspected the generated assembly code to make sure that there is no overhead from the abstraction.
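As a simplified sketch of how such an infix abstraction works (not the library's actual types, just the general idea): a small wrapper struct picks SSE2 or NEON intrinsics at compile time, and operator overloads keep the call sites looking like scalar math while compiling straight into the intrinsics:

// Simplified sketch of a portable SIMD wrapper with infix syntax; not the library's real types.
#if defined(__SSE2__)
	#include <emmintrin.h>
	struct F32x4 {
		__m128 v;
		explicit F32x4(__m128 v) : v(v) {}
		explicit F32x4(float scalar) : v(_mm_set1_ps(scalar)) {}
	};
	inline F32x4 operator+(const F32x4 &a, const F32x4 &b) { return F32x4(_mm_add_ps(a.v, b.v)); }
	inline F32x4 operator-(const F32x4 &a, const F32x4 &b) { return F32x4(_mm_sub_ps(a.v, b.v)); }
	inline F32x4 operator*(const F32x4 &a, const F32x4 &b) { return F32x4(_mm_mul_ps(a.v, b.v)); }
#elif defined(__ARM_NEON)
	#include <arm_neon.h>
	struct F32x4 {
		float32x4_t v;
		explicit F32x4(float32x4_t v) : v(v) {}
		explicit F32x4(float scalar) : v(vdupq_n_f32(scalar)) {}
	};
	inline F32x4 operator+(const F32x4 &a, const F32x4 &b) { return F32x4(vaddq_f32(a.v, b.v)); }
	inline F32x4 operator-(const F32x4 &a, const F32x4 &b) { return F32x4(vsubq_f32(a.v, b.v)); }
	inline F32x4 operator*(const F32x4 &a, const F32x4 &b) { return F32x4(vmulq_f32(a.v, b.v)); }
#endif

// The same source expression compiles into native intrinsics on both targets.
inline F32x4 blend(const F32x4 &colorA, const F32x4 &colorB, const F32x4 &weight) {
	return colorA + (colorB - colorA) * weight;
}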

The isometric deferred lighting (the main feature) is demonstrated in the Sandbox SDK example.
* Each model is pre-rendered with millions of polygons into diffuse, normal and height images.
* A low-detail 3D model is stored with the images for casting dynamic shadows.
* Sprites that rarely move are buffered into garbage-collected background blocks, which draw the fixed sprites from an octree structure when made visible by the camera.
* Dynamic sprites call the core renderer with diffuse, normal and depth images for deferred light.
* Light sources render the depth images from the low-detail shadow models.
* Light sources draw deferred light to the screen's light image.
* The diffuse and light images are multiplied.
* Multi-threading is used to upload the result using X11 while the next frame does game logic.
The result is around 300 frames per second at 800x600 resolution on a hexa-core CPU, with real-time dynamic light and unlimited detail level for deep sprites. A simplified sketch of the light accumulation and final composite step is shown below.
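Here is that sketch for a single point light (my own simplified illustration, not the engine's actual code); the diffuse, normal and height images come from the pre-rendering step above:

#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

// Simplified illustration of deferred light for one point light, not the engine's actual code.
// diffuse, normal and worldHeight are read from the pre-rendered sprite images.
void accumulateAndComposite(int width, int height,
                            const std::vector<Vec3> &diffuse,
                            const std::vector<Vec3> &normal,
                            const std::vector<float> &worldHeight,
                            Vec3 lightPosition, Vec3 lightColor,
                            std::vector<Vec3> &result) {
	for (int y = 0; y < height; y++) {
		for (int x = 0; x < width; x++) {
			int i = y * width + x;
			// Approximate world position from the screen location and the height image.
			Vec3 position = { float(x), worldHeight[i], float(y) };
			Vec3 toLight = { lightPosition.x - position.x, lightPosition.y - position.y, lightPosition.z - position.z };
			float length = std::max(0.0001f, std::sqrt(toLight.x * toLight.x + toLight.y * toLight.y + toLight.z * toLight.z));
			Vec3 direction = { toLight.x / length, toLight.y / length, toLight.z / length };
			// Lambertian term from the pre-rendered normal image.
			float lambert = std::max(0.0f, normal[i].x * direction.x + normal[i].y * direction.y + normal[i].z * direction.z);
			// The light image accumulates this term per light; the final color is diffuse * light.
			Vec3 light = { lightColor.x * lambert, lightColor.y * lambert, lightColor.z * lambert };
			result[i] = { diffuse[i].x * light.x, diffuse[i].y * light.y, diffuse[i].z * light.z };
		}
	}
}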

Full 3D rendering
The engine also has 3D graphics with bilinear mipmap sampling and lightmaps. Depth-buffer optimizations for 3D rendering are not yet implemented, because documenting and testing the API comes before doing fun stuff that might break things. Slowly but surely, the library becomes complete.
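In case the term is unfamiliar, bilinear sampling just blends the four nearest texels by the fractional parts of the coordinate. A scalar sketch of the idea on a single channel (the real sampler is SIMD-optimized and picks a mipmap level first):

#include <cmath>
#include <cstdint>

// Scalar sketch of bilinear sampling on one 8-bit channel, clamping at the edges.
// u and v are in texel units.
float sampleBilinear(const uint8_t *texels, int width, int height, float u, float v) {
	int x0 = int(std::floor(u)), y0 = int(std::floor(v));
	float fx = u - float(x0), fy = v - float(y0); // fractional blend weights
	auto fetch = [&](int x, int y) -> float {
		if (x < 0) x = 0; if (x > width - 1) x = width - 1;
		if (y < 0) y = 0; if (y > height - 1) y = height - 1;
		return float(texels[y * width + x]);
	};
	float top    = fetch(x0, y0)     * (1.0f - fx) + fetch(x0 + 1, y0)     * fx;
	float bottom = fetch(x0, y0 + 1) * (1.0f - fx) + fetch(x0 + 1, y0 + 1) * fx;
	return top * (1.0f - fy) + bottom * fy; // blend the two rows vertically
}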

Interfaces
The GUI system currently has panels, buttons and list-boxes. Write your scalable interface in a file and load it into a window using a single line of C++ code. Have different layouts for different countries and preferences. Using both relative and absolute coordinates allows resizing the interface easily. Add three lines of code to define a lambda function to execute when a button in the interface is pressed.
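Roughly like this (the identifiers below are placeholders to show the usage pattern, check the SDK examples for the real names):

// Placeholder names to illustrate the pattern, not necessarily the real API.
dsr::Window window = dsr::window_create(U"My application", 800, 600);
dsr::window_loadInterfaceFromFile(window, U"myInterface.txt"); // one line loads the whole layout

// Three lines to run a lambda when a button is pressed.
dsr::Component saveButton = dsr::window_findComponentByName(window, U"saveButton");
dsr::component_setPressedEvent(saveButton, []() {
	// React to the button press here.
});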

Edited by Dawoodoz on
I think I'm more confused after reading the readme than I was before reading it. Some questions:
- Isn't the whole point of a software renderer that it's just filling a memory buffer, and is therefore largely platform-agnostic? Why is so much of the readme dedicated to listing which exact Linux distributions it was tested on, and what platforms you say you don't (or won't!?) support?
- Regarding the win32/win64 bullet point, what do you mean by "emulation"? 64-bit Windows doesn't emulate 32-bit applications, it can just run them directly, because backward compatibility with x86 is a feature of the x64 ISA.
- What do you mean you "will never support mobile phones"? What would stop this renderer from running on a phone? I don't understand the justification in that bullet point either. Porting phone games to PC/console or vice versa is extremely common, and that doesn't have any bearing on whether you could use the library to make a mobile game.
- I'm not sure what you meant by the next bullet point about "web frontends" (but you actually seem to be talking about some kind of browser plugin? it's hard to tell), but whatever it was, it doesn't seem like it has any relevance to the readme.
- If no modern platforms that this library would conceivably run on are big-endian, and there's no reason to expect that upcoming ones will be either, why mention it? It seems like endianness in this case could be taken as a given.
- "There's no quad-tree algorithm for quickly skipping large chunks of unseen pixels." - what does this mean? Is it a reference to hierarchical Z buffers? Same with "There's no depth sorting nor early removal of triangles yet." - does this refer to things like backface and viewport culling, or something else? In general, I think using more widely-understood terminology could make this more clear.
- Is there a reason why a whole GUI system is included with the rendering library? It seems rather out-of-scope.

The pre-baked deferred approach is cool and interesting, I don't know if I've seen something like that used anywhere before. It could be very practical for modern isometric graphics, since it works around a lot of the problems (namely lack of fixed-function hardware for texture sampling, rasterization, hierarchical depth testing, etc. as well as less-aggressive hyperthreading/latency hiding) that normally make CPU rendering particularly slow compared to GPU rendering. However, it seems like the library aspects of the library, and the documentation, could use some more thought.
I don't have any usability feedback yet, but I built and ran the "sandbox" demo and just wanted to say it's very cool. The dynamic shadows look great!

I think work that prioritizes and tries to advance software longevity is laudable. So much of what's being created these days will easily be lost to time. It's nice to see projects that try to address that problem.
notnullnotvoid
I think I'm more confused after reading the readme than I was before reading it. Some questions:
- Isn't the whole point of a software renderer that it's just filling a memory buffer, and is therefore largely platform-agnostic? Why is so much of the readme dedicated to listing which exact Linux distributions it was tested on, and what platforms you say you don't (or won't!?) support?
- Regarding the win32/win64 bullet point, what do you mean by "emulation"? 64-bit Windows doesn't emulate 32-bit applications, it can just run them directly, because backward compatibility with x86 is a feature of the x64 ISA.
- What do you mean you "will never support mobile phones"? What would stop this renderer from running on a phone? I don't understand the justification in that bullet point either. Porting phone games to PC/console or vice versa is extremely common, and that doesn't have any bearing on whether you could use the library to make a mobile game.
- I'm not sure what you meant by the next bullet point about "web frontends" (but you actually seem to be talking about some kind of browser plugin? it's hard to tell), but whatever it was, it doesn't seem like it has any relevance to the readme.
- If no modern platforms that this library would conceivably run on are big-endian, and there's no reason to expect that upcoming ones will be either, why mention it? It seems like endianness in this case could be taken as a given.
- "There's no quad-tree algorithm for quickly skipping large chunks of unseen pixels." - what does this mean? Is it a reference to hierarchical Z buffers? Same with "There's no depth sorting nor early removal of triangles yet." - does this refer to things like backface and viewport culling, or something else? In general, I think using more widely-understood terminology could make this more clear.
- Is there a reason why a whole GUI system is included with the rendering library? It seems rather out-of-scope.

The pre-baked deferred approach is cool and interesting, I don't know if I've seen something like that used anywhere before. It could be very practical for modern isometric graphics, since it works around a lot of the problems (namely lack of fixed-function hardware for texture sampling, rasterization, hierarchical depth testing, etc. as well as less-aggressive hyperthreading/latency hiding) that normally make CPU rendering particularly slow compared to GPU rendering. However, it seems like the library aspects of the library, and the documentation, could use some more thought.


Mobile versus long term stability
As a former mobile firmware developer, I'm all too familiar with the obstacles of developing native applications on an operating system that constantly changes the rules about which C++ compiler to use and only supplies a subset of the system API to C++. Google decided to kick out OpenCL when adding their own RenderScript. The introduction of SELinux in Lollipop changed the permission system. Access to signal processors (the real power in phones) has different restrictions to hack around on each phone brand. If I were to make something like this for a specific version of Android, I would rather use Kotlin and Vulkan and re-implement the rendering techniques with a different API simplified for phones. That, however, does not mix well with a C++ API for desktops.

Slow on Windows-on-Windows-64
I profiled a basic logic loop that only adds registers, with no system calls or memory reads, in a 32-bit Windows program running on 64-bit Windows 7. The slowdown was 50x compared to a native 64-bit application. Using SIMD operations always made the calculations slower than the scalar version, which is a typical sign of emulation rather than the claimed virtualization of system calls. If anyone has another theory as to why 32-bit became so slow, please share it.
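It was roughly a loop like this (a sketch from memory to show the kind of workload, not the original benchmark program):

#include <chrono>
#include <cstdint>
#include <cstdio>

// Sketch of the kind of register-only loop that was timed; not the original code.
int main() {
	volatile uint32_t sink = 0; // volatile keeps the result observable so the loop is not removed
	uint32_t a = 1, b = 2, c = 3;
	auto start = std::chrono::steady_clock::now();
	for (uint64_t i = 0; i < 1000000000ull; i++) {
		a += b; // pure register arithmetic, no memory reads or system calls
		b ^= c;
		c += a;
	}
	sink = a + b + c;
	auto stop = std::chrono::steady_clock::now();
	double seconds = std::chrono::duration<double>(stop - start).count();
	std::printf("%u in %.3f seconds\n", sink, seconds);
	return 0;
}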

GUI
In my previous engine, I refused to include a GUI system to keep it minimal while the operating system provided services for the rest. But because of the new long-term stability goal on Linux, I must change my measure of complexity from how many lines there are in my library to how many lines the final applications depend on in order to function. The focus has shifted from my own maintenance to the maintenance of the user's project. Excluding the GUI system would force developers to use third-party dynamic dependencies that significantly add complexity to the system as a whole. Any feature not used in such a dynamic dependency is essentially dead code to the program. When a project gets very old, every dynamic dependency (even closed-source OpenGL drivers) is a piece of the puzzle, and you really want to keep those pieces in the same place so that versions are synchronized before doing the maintenance work. So in the end, there's a tug-of-war between covering all the basics and deciding where not to go. I excluded sound and active mobile support so that I would have time to cover all the visual essentials for a full application.
I will take a look at your project, but first let me tell you a few things; maybe they will be useful to you.
I am interested in CPU rendering too. As part of a development of mine, I found a way to rasterize an octree in isometric perspective. I published the notes on GameDev, but my main development is in programming languages, which makes it difficult to see the algorithm that I built.
The notes are at
https://www.gamedev.net/blogs/blo...experimental-graphics-technology/
Nice to see other software rendering projects.
Dawoodoz
Google decided to kick out OpenCL when adding their own RenderScript. The introduction of SELinux in Lollipop changed the permission system. Access to signal processors (the real power in phones) has different restrictions to hack around on each phone brand.

Sure, but none of that has anything to do with code that just fills memory, hence why I asked the question. But it seems you addressed the actual reason later (that that's not really what this project is), so fair enough.

Slow on Windows-on-Windows-64
I can't claim to know why you get the result you described, unless you're running Windows on ARM64 as opposed to x64. From the microsoft documentation: "On the x64 processor, x86 instructions are executed natively by the processor. Therefore, execution speed under WOW64 on x64 is similar to its speed under 32-bit Windows. On the Intel Itanium processor and any ARM64 processors, more software is involved in the emulation, and performance suffers as a result." https://docs.microsoft.com/en-us/...erformance-and-memory-consumption

Dawoodoz
In my previous engine, I refused to include a GUI system to keep it minimal while the operating system provided services for the rest.

I think this goes to the heart of the misunderstanding. I was under the impression this was just meant to be a rendering library, because that's how the documentation describes it. If it's actually meant to be a game engine, then indeed it makes a lot of sense to include a GUI system. The fuss about platform support makes sense in light of this as well. In that case I'd only suggest updating the documentation to make that intent more clear.
That is correct. There is no emulation or virtualization involved in running 32-bit code on 64-bit Windows. Same thing on Linux. 32-bit code runs at full speed as a native process.

Show your benchmark and we'll tell what is wrong with it.
mmozeiko
That is correct. There is no emulation or virtualization involved in running 32-bit code on 64-bit Windows. Same thing on Linux. 32-bit code runs at full speed as a native process.

Show your benchmark and we'll tell what is wrong with it.


That was on my old Windows computer, which was replaced due to unstable hardware before I could finish porting to Windows.
phreda
I will take a look at your project, but first let me tell you a few things; maybe they will be useful to you.
I am interested in CPU rendering too. As part of a development of mine, I found a way to rasterize an octree in isometric perspective. I published the notes on GameDev, but my main development is in programming languages, which makes it difficult to see the algorithm that I built.
The notes are at
https://www.gamedev.net/blogs/blo...experimental-graphics-technology/
Nice to see other software rendering projects.


The difference between octree voxels, where each node ends with a material, and the technique demonstrated in my SDK example, is that mine uses the octree only as a broad phase for culling pre-rasterized sprites with fixed camera angles. Without the octree, it would have to look up each static sprite one by one and do the culling test individually, which would not scale well for large worlds.
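As a rough sketch of that broad-phase idea (simplified, not the engine's actual data structures): each node stores the bounds of everything below it, so whole branches are skipped when the camera's visible region does not touch them, and only sprites in visible branches get the individual culling test:

#include <memory>
#include <vector>

// Simplified sketch of octree broad-phase culling for static sprites.
struct Box { float min[3], max[3]; };

inline bool overlaps(const Box &a, const Box &b) {
	for (int i = 0; i < 3; i++) {
		if (a.max[i] < b.min[i] || b.max[i] < a.min[i]) return false;
	}
	return true;
}

struct Sprite { Box bound; /* pre-rendered diffuse/normal/height images would go here */ };

struct OctreeNode {
	Box bound;                               // bounds of everything stored below this node
	std::vector<const Sprite*> sprites;      // sprites stored at this node
	std::unique_ptr<OctreeNode> children[8]; // null when a branch is empty
};

// Collect only the sprites whose branches intersect the camera's visible region,
// instead of testing every static sprite one by one.
void collectVisible(const OctreeNode *node, const Box &cameraRegion,
                    std::vector<const Sprite*> &out) {
	if (node == nullptr || !overlaps(node->bound, cameraRegion)) return; // skip the whole branch
	for (const Sprite *sprite : node->sprites) {
		if (overlaps(sprite->bound, cameraRegion)) out.push_back(sprite);
	}
	for (const auto &child : node->children) collectVisible(child.get(), cameraRegion, out);
}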
notnullnotvoid
Dawoodoz
Google decided to kick out OpenCL when adding their own RenderScript. The introduction of SELinux in Lollipop changed the permission system. Access to signal processors (the real power in phones) has different restrictions to hack around on each phone brand.

Sure, but none of that has anything to do with code that just fills memory, hence why I asked the question. But it seems you addressed the actual reason later (that that's not really what this project is), so fair enough.

Slow on Windows-on-Windows-64
I can't claim to know why you get the result you described, unless you're running Windows on ARM64 as opposed to x64. From the microsoft documentation: "On the x64 processor, x86 instructions are executed natively by the processor. Therefore, execution speed under WOW64 on x64 is similar to its speed under 32-bit Windows. On the Intel Itanium processor and any ARM64 processors, more software is involved in the emulation, and performance suffers as a result." https://docs.microsoft.com/en-us/...erformance-and-memory-consumption

Dawoodoz
In my previous engine, I refused to include a GUI system to keep it minimal while the operating system provided services for the rest.

I think this goes to the heart of the misunderstanding. I was under the impression this was just meant to be a rendering library, because that's how the documentation describes it. If it's actually meant to be a game engine, then indeed it makes a lot of sense to include a GUI system. The fuss about platform support makes sense in light of this as well. In that case I'd only suggest updating the documentation to make that intent more clear.


Thanks, I will try to find another word for this crazy mix of rendering API and game engine I've made. Might have to fill the gap in the middle for it to make sense to people. :D
I managed to get a working Windows port onto GitHub before the decade-old computer I borrowed could no longer talk to the monitor. Even on 64 bits, running on Microsoft Windows was still many times slower than on Linux when comparing frame rates from the Sandbox example. Part of it is the older computer, but a CPU that is half as fast should not perform ten times slower on cache-bound operations.

I posted compilation instructions for Linux and Windows in the documentation folder, in case someone still has a working Windows computer. My new computer cannot run Windows 7 because of a new BIOS, and my USB devices won't work with Windows 10.

Edited by Dawoodoz on
I don't know how you get a 10x slowdown.

First of all, the comparison is not fair to the Windows code when blitting pixels to the window. For X11 you are pre-creating an XImage and reusing the same object, but on Windows you are creating an HBITMAP on every blit, which takes a bit of time. The correct way is to use the SetDIBitsToDevice function, which blits to an HDC directly from a raw pointer to pixels in memory (also, nobody was releasing the HBITMAP memory, so it leaked).
HDC targetContext = BeginPaint(this->hwnd, &paintStruct);
	// Describe the canvas as a 32-bit top-down DIB (negative height flips the rows).
	BITMAPINFO bmi = {};
	bmi.bmiHeader.biSize = sizeof(bmi.bmiHeader);
	bmi.bmiHeader.biWidth = width;
	bmi.bmiHeader.biHeight = -height;
	bmi.bmiHeader.biPlanes = 1;
	bmi.bmiHeader.biBitCount = 32;
	bmi.bmiHeader.biCompression = BI_RGB;
	// Blit straight from the canvas pixels, without creating an HBITMAP every frame.
	SetDIBitsToDevice(targetContext, 0, 0, width, height, 0, 0, 0, height, dsr::image_dangerous_getData(this->canvas), &bmi, DIB_RGB_COLORS);
EndPaint(this->hwnd, &paintStruct);


After this change the performance on Windows went up by ~30%, and now on my laptop it is pretty much the same between Windows and Linux.

Windows = 82fps, https://i.imgur.com/IzKoPsM.png
Linux = 87fps, https://i.imgur.com/xL0iUAa.png

The difference is minor, either because of a different scene (is it randomly generated?) or because of a different compiler version.
The laptop has an i7-6700HQ CPU.


Edited by Mārtiņš Možeiko on
Yes, the image uploading on Windows is quite bad when I don't have much access to the operating system. I wasted most of the time looking for a way to upload without BGR conversion. A pull request with your optimization for Windows would be much appreciated. I can invite you to the project. (Not sure how to set reasonable security though, it looks like all or nothing on GitHub.)

std::rand is being used to generate the map.

Edited by Dawoodoz on
What do you mean quite bad? Compared to what? I showed you that it is pretty much the same performance as on Linux.
For real uploads you should use GL/D3D with asynchronous transfers anyway. That will give a performance boost over plain GDI/X11 on both Windows and Linux. The X11 protocol is also pretty bad for constantly updating client-side images.
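The usual pattern for that with OpenGL is to stream pixels through a pixel buffer object, roughly like this (a generic sketch, assuming a context with PBO support, a texture of the right size already bound, and the buffer-object entry points loaded through your GL loader):

#include <cstring>
// Generic sketch of streaming CPU-rendered pixels through a pixel buffer object,
// so the copy into the driver can overlap with rendering the next frame.
void uploadFrame(GLuint pbo, const void *cpuPixels, int width, int height) {
	glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
	// Orphan the old storage so the driver does not stall on the previous frame's transfer.
	glBufferData(GL_PIXEL_UNPACK_BUFFER, width * height * 4, nullptr, GL_STREAM_DRAW);
	void *mapped = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
	if (mapped) {
		std::memcpy(mapped, cpuPixels, size_t(width) * size_t(height) * 4);
		glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
		// A null pointer here means "read from the bound PBO" instead of client memory.
		glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);
	}
	glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}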

Sorry, I have no interest in contributing to the code. All you need to do is put the code I posted above in place of your CreateBitmap/BitBlt, and performance will be on par with X11. You can always cross-compile with MinGW on Linux to verify that the code compiles, and even test it with Wine. It works really well nowadays.
I just wanted to clarify your misunderstanding about "slow" Windows.

Edited by Mārtiņš Možeiko on
mmozeiko
What do you mean quite bad? Compared to what? I showed you that it is pretty much the same performance as on Linux.
For real uploads you should use GL/D3D with asynchronous transfers anyway. That will give a performance boost over plain GDI/X11 on both Windows and Linux. The X11 protocol is also pretty bad for constantly updating client-side images.

Sorry, I have no interest in contributing to the code. All you need to do is put the code I posted above in place of your CreateBitmap/BitBlt, and performance will be on par with X11. You can always cross-compile with MinGW on Linux to verify that the code compiles, and even test it with Wine. It works really well nowadays.
I just wanted to clarify your misunderstanding about "slow" Windows.


Sure, thanks for your help anyway. :)