First of all, you need not worry about the extra movs MSVC generates, because they go away when a function like this is inlined. And you really want this kind of small function to be inlined.
Are there any reasons MSVC generates all these mov instructions? Is it just because of bad register allocation, or is it because it knows this function will get inlined a lot, so it just doesn't care?
If you look at the generated code of lerp_fma_3, MSVC also generates other instructions like vmovss, vxorps, or vmovups. Why does it decide to generate code like this, and will all of those get optimized out when inlined?
In cases where you cannot inline, at least with MSVC you can improve the ABI by enabling the vectorcall calling convention: add __vectorcall to the function or compile the whole TU with /Gv. It passes SIMD arguments in SIMD registers, so there is no need to move them to/from the stack.
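A minimal sketch of what that could look like. The lerp_ps function and the LERP_CALL macro are hypothetical names made up for illustration, not the code from the question; the macro expands to nothing on non-MSVC compilers so the same source still builds there.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps, _mm_mul_ps, _mm_sub_ps

// __vectorcall is an MSVC extension, so only use it there.
// LERP_CALL is a hypothetical helper macro for this sketch.
#if defined(_MSC_VER)
#  define LERP_CALL __vectorcall
#else
#  define LERP_CALL
#endif

// Hypothetical 4-lane lerp: a + t * (b - a).
// With __vectorcall, MSVC passes a, b and t in XMM registers
// instead of by reference through memory.
__m128 LERP_CALL lerp_ps(__m128 a, __m128 b, __m128 t)
{
    return _mm_add_ps(a, _mm_mul_ps(t, _mm_sub_ps(b, a)));
}
```

Compiling the TU with /Gv should have the same effect without touching the declaration, since it makes vectorcall the default convention for that TU.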
Thanks, I was unsure about this but forgot to ask (because GCC always generates vinsertps for floats, which makes me want to use only __m128, but MSVC doesn't use the same register for SIMD vs. float parameters).
Also, do I need to worry about all the extra vinsertps that GCC generates? Or will those disappear when the function gets inlined?
Generally you want to avoid the /fp:fast or -ffast-math compiler options, because with them the code the compiler produces is not doing the same calculations you specified. It is free to rearrange float operations and calculate a completely different thing.
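To make "rearrange" concrete, here is a small sketch of one reordering that fast-math modes are allowed to do: treating float addition as associative. The two function names are hypothetical, and the results in the comments assume the code is compiled without any fast-math option.

```cpp
// The order as written in the source: (a + b) is evaluated first.
float sum_as_written(float a, float b, float c)
{
    return (a + b) + c;
}

// One reordering a fast-math compiler may pick instead.
// Mathematically the same, but not in floating point.
float sum_reassociated(float a, float b, float c)
{
    return a + (b + c);
}

// With a = 1e20f, b = -1e20f, c = 1.0f:
//   sum_as_written   -> 1.0f  (a + b is exactly 0, then + 1)
//   sum_reassociated -> 0.0f  (b + c rounds back to -1e20f, the 1 is lost)
```

Whether a given compiler actually reorders a particular expression under /fp:fast or -ffast-math depends on the compiler and the surrounding code; the point is only that it is permitted to.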
If you look at the beginning comment, I think that -ffast-math should be avoided (because GCC generates worse code while Clang generates less precise code), while /fp:fast isn't as aggressive and I haven't seen any weird/incorrect code from it.
Because without it, an actual call to the CRT fmaf function is not as cheap as doing a separate mul and add.
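For reference, fmaf is not just a mul followed by an add: it has to round only once, at the end, which is why a library call cannot simply do the two operations back to back. A small sketch of the difference, with hypothetical helper names:

```cpp
#include <cmath>  // std::fma, std::ldexp

// Hypothetical helper: a * b + c with a single rounding at the end.
float fused(float a, float b, float c)
{
    return std::fma(a, b, c);
}

// Hypothetical helper: the product is rounded to float first,
// then the sum is rounded again. Note that a compiler targeting
// FMA hardware may contract this into a fused operation anyway,
// depending on its floating-point flags.
float separate(float a, float b, float c)
{
    float p = a * b;
    return p + c;
}

// Example inputs: a = b = 1 + 2^-12, c = -(1 + 2^-11).
// The exact product is 1 + 2^-11 + 2^-24, so fused() returns 2^-24,
// while the rounded product in separate() drops the 2^-24 term.
```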
Interesting, what will the CRT fmaf function compile to on platforms that don't have FMA?