I was first introduced to the concept a few years ago through this talk by a lead programmer at Naughty Dog:
https://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine
I wasn't reintroduced to it until The Machinery engine (which, as most people are aware, is now gone), which uses the same strategy. It seemed to work pretty well and has a lot of pros: it simplifies the API for the programmer, the programmer doesn't have to worry about the various low-level synchronization primitives, you don't pay the cost of unwanted context switches, and others. I hadn't really tested it extensively, though. I'm wondering if anyone has tried this approach and how it has worked out for them. I want to play around with the concept myself and have started thinking more deeply about it.
I also noticed that in the video he talks about the issue with thread-local storage optimizations and Clang: Clang doesn't have a compiler option to turn these optimizations off the way MSVC does with /GT. I tried looking through the Clang documentation, and it still doesn't look like it has support for this.
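To make the TLS concern concrete: if the compiler caches the address of a thread_local variable and the job is then suspended and resumed on a different OS thread, the cached address still points at the first thread's copy. A workaround that is commonly suggested (it is not from the talk, and it is a sketch rather than a guarantee, since link-time optimization could in principle still defeat it) is to reach TLS only through a non-inlined accessor, so the address is looked up again on every call. The names `tls_counter`, `tls_counter_address`, `bump_tls_counter`, and `FIBER_NOINLINE` below are made up for illustration.

```cpp
#include <thread>

// Hypothetical portability macro; not part of any library.
#if defined(_MSC_VER)
#define FIBER_NOINLINE __declspec(noinline)
#else
#define FIBER_NOINLINE __attribute__((noinline))
#endif

// Hypothetical per-thread state that a job might touch.
thread_local int tls_counter = 0;

// Looks up the TLS address at the point of the call. Keeping this out of
// line is meant to stop the caller from hoisting the address computation
// and reusing it after a switch to another thread.
FIBER_NOINLINE int* tls_counter_address() {
#if defined(__GNUC__) || defined(__clang__)
    // Empty asm with a memory clobber: an assumption-level trick so the
    // optimizer does not treat this function as side-effect free and
    // merge repeated calls to it.
    asm volatile("" ::: "memory");
#endif
    return &tls_counter;
}

// Every access goes through the accessor instead of naming the variable.
int bump_tls_counter() {
    return ++*tls_counter_address();
}

// Sanity check: another thread sees a different address, i.e. its own copy.
bool tls_address_differs_on_other_thread() {
    int* here = tls_counter_address();
    bool differs = false;
    std::thread other([&] { differs = (tls_counter_address() != here); });
    other.join();
    return differs;
}
```

This only uses plain threads, so it doesn't reproduce the fiber migration bug itself; it just shows the shape of the accessor that code running on fibers would call instead of touching the thread_local directly.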