Having watched many of Casey's videos, I have a few questions about things he has mentioned:
So, writing the pixels to the backbuffer as a u32 * is more efficient than traversing it as a u8 *. Why is this? On x86, I would've thought a u64 * would be more efficient, as that is the size of the registers and the write can be done in one mov.
I understand that the GNU/Linux stack can contain different combinations of audio/visual components, e.g. Wayland/X11, PulseAudio/ALSA, etc. However, is it really that much effort to write for each API?
He also mentioned this might be because compilers have to adhere to the C++ standard. Do the standard libraries of other languages suffer from this same issue?
1.
I don't remember where this was said, but from what you wrote I assume the claim was that writing u32 is faster than u8, because it is 4x fewer instructions. Writing u64 would be a bit more difficult, because then you need to properly pack two of your pixels into one 64-bit value, and carefully handle the end of a line/buffer when there is just one pixel left, etc...
In general, writing or reading memory has nothing to do with the CPU's native register bit-size. If you really want the fastest way, you should use the widest register available, which would be 128-bit with SSE, 256-bit with AVX, and 512-bit with AVX512. Memory is typically fetched into the L1 cache at cache-line granularity (64 bytes), so once a line is in cache, writing/reading nearby regions is pretty cheap.
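To make the instruction-count point concrete, here is a minimal sketch, assuming a hypothetical 32-bit-per-pixel backbuffer (the struct and names are made up for illustration, not Casey's actual code):

```c
#include <stdint.h>
#include <stddef.h>

// Hypothetical 32-bit BGRA backbuffer, just for illustration.
typedef struct {
    uint8_t *memory;   // width * height * 4 bytes
    int      width;
    int      height;
} Backbuffer;

// One store per pixel: the whole 32-bit pixel goes out in a single mov.
static void clear_u32(Backbuffer *buf, uint32_t color)
{
    uint32_t *pixel = (uint32_t *)buf->memory;
    size_t count = (size_t)buf->width * buf->height;
    for (size_t i = 0; i < count; i++) {
        pixel[i] = color;
    }
}

// Four stores per pixel: 4x the instructions for the same bytes written.
static void clear_u8(Backbuffer *buf, uint8_t b, uint8_t g, uint8_t r, uint8_t a)
{
    uint8_t *byte = buf->memory;
    size_t count = (size_t)buf->width * buf->height;
    for (size_t i = 0; i < count; i++) {
        *byte++ = b;
        *byte++ = g;
        *byte++ = r;
        *byte++ = a;
    }
}
```

The u32 loop also gives the compiler a better chance to auto-vectorize into the wide SSE/AVX stores mentioned above, since every iteration writes the same full-width value.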
2.
It's not about the different APIs; it's that these APIs are not part of the OS. They are part of the distribution, which decides which ones will be present, how they will be configured, which versions will be used, etc... You cannot know how everything will be put together just from the name "Linux", because the kernel has nothing to do with it. All this software diversity is very volatile: it may work in one combination and not in another, and you have almost no chance to test it properly, because each user might have everything configured differently. That's one major pain point.

Another is shipping binaries for Linux. It is very, very hard, almost impossible, to do it in a way that will work everywhere, because of the C runtime, which again is part of the distro, not the kernel. Different distros have different CRTs, so if you compile your program against one, it won't work with another. And no, you cannot avoid using the CRT on Linux: if you want to use GPU drivers for OpenGL or similar, you must use the same CRT the system uses (it is the one that loads the shared libraries, Linux's equivalent of DLLs), otherwise the GPU drivers will not work, because they load the CRT that is on the system. Statically linked code on Linux is hard if you want to do a GUI with the GPU, because of binary drivers. You can do software rendering without all this mess, but meh... it's just silly not to use the free performance that's available in every system nowadays.
3.
The bloat in the C++ STL comes from overusing templates. If other languages don't have templates, then usually their standard libraries are not so bad. But they still often force their patterns onto you, memory allocation being a common one, so sometimes you cannot use them if you want to manage memory yourself with arenas or similar.
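For anyone who hasn't seen the arena pattern mentioned here: the idea is that allocations are a pointer bump out of one block you own, and lifetimes are handled in bulk rather than per object. A minimal sketch (names and layout are illustrative, not from any particular library):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

// Minimal bump-allocator arena: one big block, each allocation is a
// pointer bump, and everything is freed at once by resetting the offset.
typedef struct {
    uint8_t *base;
    size_t   size;
    size_t   used;
} Arena;

static Arena arena_create(size_t size)
{
    Arena a = { malloc(size), size, 0 };  // real code would check for NULL
    return a;
}

static void *arena_push(Arena *a, size_t size)
{
    // Align to 8 bytes so any basic type is safe to store.
    size_t aligned = (a->used + 7) & ~(size_t)7;
    if (aligned + size > a->size) return NULL;  // out of space
    void *result = a->base + aligned;
    a->used = aligned + size;
    return result;
}

static void arena_reset(Arena *a) { a->used = 0; }
```

Standard containers that insist on their own allocator model don't fit this style well, which is the point being made above.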
Thanks Mārtiņš.
Regarding point 2:
Could you not just code for the lowest possible version of, say, Xlib/Wayland and ALSA (possibly with something like pasuspender to halt a mixer like PulseAudio)? That covers visual + audio. What other issues would need to be addressed?
I'm a bit confused when you say it's almost impossible to ship a single binary for all Linux OSs because of CRT differences.
Hypothetically, if I compile my program to use the oldest version of glibc that I can, then it should work on most systems, right? For example, if I compile my program on Ubuntu 20.04 to use the glibc version present on Ubuntu 14.04, it should work? Or am I missing some high-level differences between the CRTs of Linux distros?
Could you not just code for the lowest possible version of, say, Xlib/Wayland and ALSA (possibly with something like pasuspender to halt a mixer like PulseAudio)?
Loading those dynamically and supporting only the minimal set of APIs you want to support is not a problem. SDL does that and it works fine. The problem starts when everything is configured a bit differently in ways you can't know exactly. There's too much runtime variability, and most users are not capable of diagnosing it and fixing it for their machines, so they will blame you when your software won't work.
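As a sketch of what that dynamic loading looks like, here is the dlopen approach probing for libX11 at runtime (illustrative only; link with -ldl on older glibc, and real code would fall back to Wayland or fail gracefully):

```c
#include <dlfcn.h>
#include <stdio.h>

// Forward-declare just the one Xlib entry point we probe for, so this
// compiles without X11 development headers installed.
typedef struct _XDisplay Display;
typedef Display *(*XOpenDisplayProc)(const char *);

int main(void)
{
    // Ask the dynamic linker for libX11 at runtime instead of linking it.
    void *libx11 = dlopen("libX11.so.6", RTLD_NOW);
    if (!libx11) {
        printf("no X11 available: %s\n", dlerror());
        return 1;  // here you would try Wayland instead
    }
    XOpenDisplayProc XOpenDisplay =
        (XOpenDisplayProc)dlsym(libx11, "XOpenDisplay");
    if (XOpenDisplay && XOpenDisplay(NULL)) {
        printf("connected to an X server\n");
    }
    return 0;
}
```

This keeps your binary free of a hard libX11 dependency; the variability problem described above is everything that happens after the connection succeeds.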
Most Steam games on Linux solve that by using the Steam Runtime. It ships all these supporting libraries as .so files separately (either with the game, or as part of the Steam install), which gives you a more reasonable "base level". If you're seriously considering shipping games for Linux, I strongly suggest using the Steam Runtime. People will most likely have it in a better working state than whatever they have installed otherwise.
https://github.com/ValveSoftware/steam-runtime
https://partner.steamgames.com/doc/store/application/platforms/linux
https://jorgen.tjer.no/post/2014/05/28/steam-runtime-without-steam/ (+ linked previous blog entry)
I'm a bit confused when you say it's almost impossible to ship a single binary for all Linux OSs because of CRT differences.
Hypothetically, if I compile my program to use the oldest version of glibc that I can, then it should work on most systems, right? For example, if I compile my program on Ubuntu 20.04 to use the glibc version present on Ubuntu 14.04, it should work? Or am I missing some high-level differences between the CRTs of Linux distros?
glibc is not the only C runtime. There's also musl, which is used by Alpine. If you use glibc for your runtime, then your code most likely won't work on Alpine. So again, when writing for "Linux" and shipping binaries, it's almost impossible to guarantee that it will work on any "Linux". But if you just want to target "Ubuntu", then it's perfectly doable.
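For what it's worth, the "build against the oldest glibc" idea from the question can also be approximated without an old build machine, within the glibc world only. One GCC/glibc-specific trick is pinning individual symbols to older versions; a hedged sketch (the version string below is an x86-64 assumption, check your own libc with objdump -T):

```c
#include <string.h>
#include <stdio.h>

// glibc 2.14 changed memcpy's symbol version on x86-64, so binaries built
// on a new system can refuse to load on older glibc. This GCC/glibc-specific
// directive pins the reference back to the old version. Whether it covers
// everything your program pulls in is another matter; this is per-symbol.
__asm__(".symver memcpy, memcpy@GLIBC_2.2.5");

int main(void)
{
    char dst[16];
    memcpy(dst, "hello", 6);
    puts(dst);
    return 0;
}
```

None of this helps on musl systems, which is the point above: targeting one known distro family is doable, targeting "Linux" in general is not.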
If you write something smaller than what can be sent to memory as one piece, and the compiler is not smart enough to merge multiple writes for you, the memory system might end up reading the old data into cache, potentially waiting ~2000 cycles on a cache miss, only to overwrite it with masks and send it back.
It's easy to implement for each API; the hard thing is deciding what the "correct" behavior is when you have to flex a little toward the conventions of each platform. The temp folder on Linux is cleared on reboot by being stored in memory, while Windows relies on cleaning tools. Mouse and keyboard input depend on different ways of repeating keystrokes (which have to obey system settings due to user preferences), and each platform has its own conventions for copying and pasting text (Ctrl+C on Windows, multiple buffers on Linux, a Copy button on Unix, Ctrl+Shift+C in CLIs)... If you use a media layer, it's important that the implementation is easy enough for you to repair when something does not work with your preferred way of abstracting over all these differences. Some prefer to take the minimal route (not enough features), while others add fallback solutions (might break stuff) or feature-flag tests (hard to test).
The C++ STL is very bloated, because once something is inserted into the standard it can supposedly never be removed; yet things do get deprecated and removed anyway, so you can't rely on the STL in the same way as a third-party library. It's often easier to target each operating system directly with native C APIs (one per system) than to use the STL (one version for each combination of compiler and operating system, plus a fallback version on the C API targeting each system, for compilers that didn't implement the STL feature correctly or at all).