I finally got around to finishing a handmade general-purpose lossless image compression format (yoga) and debugging all the edge cases. Using the images from https://qoiformat.org/benchmark/, these are the total decode times and file sizes I get compared to stb_image.h/stb_image_write.h. Times exclude file I/O, and everything runs single-threaded.