After reading this blog post about the Lerp function, I've experimented with the code generated by different compilers. The first section is just normal C code to see how good the compiler's optimization is, the second section uses the CRT fmaf function, and the third section uses intrinsics directly. As far as I can tell, Clang is just better at this, while GCC litters the code with vinsertps every time I use _mm_set_ss, and MSVC pollutes the code with vmovaps because of dumb XMM register allocation.
Is there any way to get around these problems for GCC and MSVC? The only workaround for GCC (that I can think of) is to use only floats, which wouldn't work because the CRT only has fmaf (not fnma or fnms) and because no one should use the CRT.
There are multiple things here to answer.
First of all - you need not worry about the extra movs MSVC generates, because those go away when a function like this is inlined. And you really do want small functions like this to be inlined.
In cases where you cannot inline, at least for MSVC you can improve the ABI by enabling the vectorcall calling convention - add __vectorcall to the function or compile the whole TU with /Gv. It will pass SIMD arguments in SIMD registers, with no need to move them to/from the stack.
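A minimal sketch of that suggestion (the VECCALL macro is mine, not from the post; on SysV x86-64 the attribute isn't needed because GCC/Clang already pass __m128 in XMM registers):

```c
#include <immintrin.h>

#if defined(_MSC_VER)
#  define VECCALL __vectorcall /* or build the whole TU with /Gv */
#else
#  define VECCALL              /* SysV x86-64 already uses XMM registers */
#endif

/* With __vectorcall, a non-inlined call passes a, b, t in XMM0-XMM2
   on MSVC instead of spilling them through memory. */
__m128 VECCALL lerp_m128(__m128 a, __m128 b, __m128 t)
{
    return _mm_add_ps(a, _mm_mul_ps(t, _mm_sub_ps(b, a)));
}
```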
Generally you want to avoid the /fp:fast or -ffast-math compiler options, because they mean the code the compiler produces is not doing the same calculations you specified. It is free to rearrange float operations and calculate something completely different. And that subtly changes when you make minor changes to your code (due to inlining heuristics, register allocation, or similar reasons). Most of the time determinism is preferred over the very minor speed-up they would give you. Having exactly the same results between optimized and debug builds, or between x86 and ARM builds, is a must-have feature imho.
As for the fma function & instruction - that is fine to use if you know you'll be shipping for a platform that always has the instruction. Without it, the actual call to the CRT fmaf function is not as cheap as doing a separate mul and add.
First of all - you need not worry about the extra movs MSVC generates, because those go away when a function like this is inlined. And you really do want small functions like this to be inlined.
Are there any reasons MSVC generates all these mov instructions? Is it just because of bad register allocation, or is it because it knows this function will get inlined a lot, so it just doesn't care?
If you look at the generated code of lerp_fma_3, MSVC also generates other instructions like vmovss, vxorps, or vmovups. Why does it decide to generate code like this, and will all of those get optimized out when inlined?
In cases where you cannot inline, at least for MSVC you can improve the ABI by enabling the vectorcall calling convention - add __vectorcall to the function or compile the whole TU with /Gv. It will pass SIMD arguments in SIMD registers, with no need to move them to/from the stack.
Thanks, I was unsure about this but forgot to ask (because GCC always generates vinsertps for floats, which makes me want to only use __m128, but MSVC doesn't use the same register for a SIMD vs a float parameter).
Also, do I need to worry about all the extra vinsertps that GCC generates? Or will those disappear when the function gets inlined?
Generally you want to avoid the /fp:fast or -ffast-math compiler options, because they mean the code the compiler produces is not doing the same calculations you specified. It is free to rearrange float operations and calculate something completely different.
As the comment at the beginning of my post says, I think that -ffast-math should be avoided (because GCC generates worse code while Clang generates less precise code), while /fp:fast isn't as aggressive and I haven't seen any weird/incorrect code from it.
Without it, the actual call to the CRT fmaf function is not as cheap as doing a separate mul and add.
Interesting - what does the CRT fmaf function compile to on platforms that don't have fma?
Some movs are because of the ABI (like the lerp_m128 case). Other movs are because the compiler is silly. But it does not really matter - if you put one instruction in a separate function, that means you don't really care about performance. A few movs or inserts around it won't change anything; the call itself will already be bad for performance.
That's why you should expect to inline these functions. Saying "or is it because it knows this function will get inlined a lot" is a bit wrong. The compiler does what you asked it to do - in this godbolt example you asked it to create functions, so it created functions. It does not inline because there's nowhere to inline into.
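The "nowhere to inline into" point can be illustrated like this (lerp_midpoints is a made-up caller for the example, not from the post): once a call site exists, the helper dissolves into it and the standalone-function overhead disappears.

```c
/* A small helper the compiler is expected to inline. */
static inline float lerp(float a, float b, float t)
{
    return a + t * (b - a);
}

/* A caller gives the inliner somewhere to put the code; after inlining,
   the per-call argument shuffling seen in the standalone codegen is gone
   and only the arithmetic remains. */
float lerp_midpoints(const float *a, const float *b, float t)
{
    return lerp(a[0], b[0], t) + lerp(a[1], b[1], t);
}
```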
The CRT's implementation of the fma function varies a lot - how it looks depends on the CRT. For example, here is what it looks like in the musl CRT library: https://git.musl-libc.org/cgit/musl/tree/src/math/fmaf.c
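This is not the actual musl code, but the core idea behind a software fmaf can be sketched like this (the real fmaf.c additionally handles the double-rounding corner case, which is why the fallback is noticeably more expensive than a plain mul + add):

```c
#include <math.h>

/* A 24-bit x 24-bit float product is exact in double (48 <= 53 significand
   bits), so only the add rounds in double precision here. The remaining
   problem is the final double->float conversion, which can double-round in
   rare cases - that is exactly the case musl's fmaf.c works around. */
static float fmaf_sketch(float x, float y, float z)
{
    return (float)((double)x * (double)y + (double)z);
}
```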
I don't understand how inlining will make this better. If I call this function and pass in different variables (not constants), all the insert and mov instructions are still there. It'll just optimize out the call instruction.
Here's an example. As you can see, GCC still generates a bunch of vinsertps, while MSVC still poops its pants with a bunch of vmovaps, vmovss, and vmovups. What I want is something like what Clang generated.
What I mean is that it does not matter much if you're doing just one operation like that. Yes, the MSVC-generated code is not pretty. It also does not optimize much over intrinsics (when input values are known) like Clang does. But it's always been like that when you use just one intrinsic in the middle of scalar code. If you want prettier code, leave everything in scalars, or calculate bigger blocks of code in SIMD registers. If this is for performance reasons, you would do 8 beziers at once anyway - with all 8 lanes of an AVX register.
Or use the fmaf() function - MSVC usually optimizes code better when it knows what a function does (like sqrtf, fmaf) than when you use intrinsics. It probably won't be the same as what you want from Clang's output, but it should be closer.
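The "calculate bigger blocks in SIMD registers" advice above, sketched 4-wide at the x86-64 SSE baseline (lerp_array is an illustrative name; swap the _mm_* calls for __m256/_mm256_* under -mavx2 or /arch:AVX2 to get the 8 lanes mentioned):

```c
#include <stddef.h>
#include <immintrin.h>

/* Lerp n pairs at once: one mul + one add (or one fmadd) now serves four
   lerps, so a stray vmovaps around the block no longer dominates the cost. */
void lerp_array(float *dst, const float *a, const float *b, float t, size_t n)
{
    __m128 vt = _mm_set1_ps(t);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i,
                      _mm_add_ps(va, _mm_mul_ps(vt, _mm_sub_ps(vb, va))));
    }
    for (; i < n; i++) /* scalar tail */
        dst[i] = a[i] + t * (b[i] - a[i]);
}
```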
after inlining the compiler will do more optimizations that will end up eliminating extraneous moves
Why do compilers decide to generate different code for the lerpf and lerp_128_cast_f functions? Aren't they doing the same thing?
Interestingly, rather than passing float, you can just pass __m128/256 directly, and all the compilers generate the correct code.
Different people wrote different compilers, meaning they implemented them differently - different optimizations, different compiler options, etc.
If compilers all generated exactly the same code, there would be no need for multiple compilers.