I don't remember where this was said, but from what you wrote I assume the claim was that writing u32 is faster than writing u8, because it takes 4x fewer instructions. Writing u64 is a bit more difficult, because you need to properly pack two of your pixels into one 64-bit value and carefully handle the end of line/buffer case when there is just one pixel left, etc.
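To make the u64 case concrete, here is a minimal sketch, assuming 32-bit pixels and a solid-color fill. The function name `fill_row_u64` and its parameters are made up for illustration, not taken from any particular codebase.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill `count` 32-bit pixels starting at `row`
   with `color`, writing two pixels per 64-bit store where possible. */
static void fill_row_u64(uint32_t *row, size_t count, uint32_t color)
{
    assert(row != NULL || count == 0);

    /* Pack the same pixel into both halves of one 64-bit value.
       Both halves are equal, so byte order does not matter here. */
    uint64_t pair = ((uint64_t)color << 32) | color;

    size_t i = 0;
    for (; i + 2 <= count; i += 2) {
        /* memcpy sidesteps alignment and aliasing problems; compilers
           typically turn a fixed 8-byte memcpy into a single store. */
        memcpy(&row[i], &pair, sizeof pair);
    }

    /* End-of-line case: one pixel is left over when count is odd. */
    if (i < count) {
        row[i] = color;
    }
}
```

If the two pixels differ, the packing order depends on endianness, so that case needs a bit more care than this sketch shows.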
In general, writing to or reading from memory has nothing to do with the CPU's native register bit-size. If you really want the fastest way, then you should use the widest register available, which is 128-bit with SSE, 256-bit with AVX, and 512-bit with AVX-512. Memory is typically fetched into L1 cache at cache-line granularity (usually 64 bytes), so once a line is in cache, writing to or reading from nearby addresses is pretty cheap.
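As a rough sketch of the "widest register" idea, here is the same kind of fill using 128-bit SSE2 stores, with a scalar fallback when SSE2 is not available at compile time. The name `fill_row_simd` is hypothetical; the intrinsics `_mm_set1_epi32` and `_mm_storeu_si128` come from `<emmintrin.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper: fill `count` 32-bit pixels with `color`,
   four pixels (16 bytes) per store when SSE2 is enabled. */
static void fill_row_simd(uint32_t *row, size_t count, uint32_t color)
{
    assert(row != NULL || count == 0);

    size_t i = 0;

#if defined(__SSE2__)
    /* Broadcast the pixel into all four 32-bit lanes. */
    __m128i wide = _mm_set1_epi32((int)color);
    for (; i + 4 <= count; i += 4) {
        /* Unaligned store, so `row` needs no special alignment. */
        _mm_storeu_si128((__m128i *)&row[i], wide);
    }
#endif

    /* Scalar tail, or the whole row when SSE2 is not compiled in. */
    for (; i < count; i++) {
        row[i] = color;
    }
}
```

The AVX and AVX-512 versions would follow the same shape with wider types, but they also need a runtime CPU check, which this sketch leaves out.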
It's not about different APIs. It's that these APIs are not part of the OS; they are part of the distribution, which decides which ones are present, how they are configured, which versions are used, etc. You cannot know how everything is put together just from the name "Linux", because the kernel has nothing to do with it. All this software diversity is very volatile: it may work in one combination and not in another, and you have almost no chance to test it properly, because the user might have everything configured differently. That's one major pain point. Another is shipping binaries for Linux. It is very, very hard, almost impossible, to do it in a way that works everywhere, because of the C runtime, which again is part of the distro, not the kernel. Different distros have different CRTs, so if you compile your program against one, it won't work with another. And no, you cannot avoid using the CRT on Linux: if you want to use GPU drivers for OpenGL or similar, you must use the same CRT the system uses (it is the one that loads the "dll" files), otherwise the GPU drivers will not work, because they load the CRT that is on the system. Statically linked code on Linux is hard if you want to do GUI with the GPU, because of binary drivers. You can do software rendering without all this mess, but meh... it's just silly not to use the free performance that's available in every system nowadays.
The bloat in the C++ STL comes from overuse of templates. If these other languages don't have templates, their standard libraries are usually not so bad. But they still often force their patterns onto you, like memory allocation, so sometimes you cannot use them if you want to manage memory yourself with arenas or similar.
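For anyone unfamiliar with arenas, here is a minimal sketch of a bump allocator over a caller-provided buffer. The `Arena` type and the `arena_*` function names are made up for illustration; real arenas often add growth, scratch scopes, and so on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical arena: a bump allocator over a fixed buffer. */
typedef struct {
    unsigned char *base; /* start of the backing buffer */
    size_t capacity;     /* total bytes available */
    size_t used;         /* bytes handed out so far */
} Arena;

static void arena_init(Arena *a, void *buffer, size_t capacity)
{
    a->base = (unsigned char *)buffer;
    a->capacity = capacity;
    a->used = 0;
}

/* Returns `size` bytes aligned to `align` (a power of two),
   or NULL when the arena does not have enough room left. */
static void *arena_alloc(Arena *a, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Padding needed to round the current position up to `align`. */
    uintptr_t start = (uintptr_t)(a->base + a->used);
    size_t padding = (size_t)(-start & (uintptr_t)(align - 1));

    size_t remaining = a->capacity - a->used;
    if (padding > remaining || size > remaining - padding) {
        return NULL;
    }

    void *result = a->base + a->used + padding;
    a->used += padding + size;
    return result;
}

/* Frees everything at once; individual frees do not exist. */
static void arena_reset(Arena *a)
{
    a->used = 0;
}
```

A standard library that always calls its own allocator internally cannot be pointed at something like this, which is the limitation described above.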