Writing code that uses instruction pipelining as a form of parallelism

Does anyone have a good example of code that exploits out-of-order (OoO) execution and instruction pipelining as a form of parallelism, in conjunction with SIMD, like Fabian described at the end of the compression session at HandmadeCon?
I think the zstd compression library does this, though without SIMD.
Its author has written several articles about its design on his blog. For example: https://fastcompression.blogspot....man-revisited-part-2-decoder.html
I found slides for a talk where they try to get maximum performance out of an OoO machine: https://deplinenoise.files.wordpr...6/03/gdc16_fredriksson_jaguar.pdf

One key point to keep in mind, though: the exact numbers (such as instructions per cycle or the latency of individual instructions) depend heavily on the machine.
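
For a concrete starting point, here is a minimal sketch of the general idea in plain C (no SIMD). It is not taken from zstd or from the slides; the function names `dot_serial` and `dot_interleaved` are made up for illustration. The first version has a single accumulator, so every add has to wait for the previous one. The second splits the work across four independent accumulators, which gives an out-of-order core several dependency chains to overlap in the pipeline. Four is an arbitrary choice here; the best count depends on the latency and throughput of the target machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Single dependency chain: each add needs the result of the previous add. */
uint64_t dot_serial(const uint32_t *a, const uint32_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (uint64_t)a[i] * b[i];
    return sum;
}

/* Four independent chains: s0..s3 never depend on each other inside the
 * loop, so the CPU is free to keep several multiplies and adds in flight
 * at once. The count of four is an illustrative assumption, not a tuned
 * value. */
uint64_t dot_interleaved(const uint32_t *a, const uint32_t *b, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += (uint64_t)a[i]     * b[i];
        s1 += (uint64_t)a[i + 1] * b[i + 1];
        s2 += (uint64_t)a[i + 2] * b[i + 2];
        s3 += (uint64_t)a[i + 3] * b[i + 3];
    }

    /* Handle the leftover elements when n is not a multiple of four. */
    for (; i < n; i++)
        s0 += (uint64_t)a[i] * b[i];

    /* Unsigned arithmetic wraps, so the combined total matches dot_serial. */
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same value for the same input, so you can swap one for the other and time them. Be aware that an optimizing compiler may already unroll or vectorize a simple integer loop like this on its own; the manual split tends to matter more for floating-point sums (where the compiler is not allowed to reorder the additions by default) or for longer chains such as table-driven decoders. Measure on your own hardware before drawing conclusions.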