I'm pretty new to the Intel intrinsics, so is there any intrinsic for doing ops between lanes of a SIMD register?
Shuffle the top 64 bits down to the low 64 bits, then add the result to the original register.
So does the shuffle happen on another register, or the same register? What instruction can do that?
_mm_shuffle_epi32 intrinsic, pshufd instruction.
If you use intrinsics, you don't need to worry about exactly which registers are used for the calculation; the compiler allocates them for you.
See example code here: https://godbolt.org/z/T4YrMbGKf
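For readers who cannot open the link, here is a minimal sketch of the shuffle-then-add approach described above. It is not taken from the linked code, and the helper name sum_two_i64 is made up for illustration. It assumes an SSE2-capable x86 target.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_shuffle_epi32, _mm_add_epi64 */
#include <stdint.h>

/* Hypothetical helper: returns p[0] + p[1] using one shuffle and one add. */
static int64_t sum_two_i64(const int64_t *p)
{
    assert(p != 0);
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    /* pshufd: copy 32-bit lanes 2 and 3 (the top 64 bits) into lanes 0 and 1. */
    __m128i hi = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 2, 3, 2));
    /* paddq: 64-bit lane-wise add, so lane 0 becomes p[0] + p[1]. */
    __m128i sum = _mm_add_epi64(v, hi);
    int64_t out;
    _mm_storel_epi64((__m128i *)&out, sum);  /* store the low 64 bits */
    return out;
}
```

Note that v, hi and sum are just C variables; which xmm registers they end up in is up to the compiler.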
What is _MM_SHUFFLE doing? The Intel intrinsics guide says it is the control bits, but what does (3,2,3,2) mean?
When I asked about register usage, I meant: is there any way to use only the original register, without any additional ones (presumably with a single instruction, something like _mm_merge_epi64(myRegister))? In this case, you still need an additional shuffle whose result is stored in another register first.
_MM_SHUFFLE(a,b,c,d) is a macro that takes 4 arguments, each a 2-bit value, and calculates (a<<6)|(b<<4)|(c<<2)|d.
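To make the arithmetic concrete: for (3,2,3,2) the macro gives (3<<6)|(2<<4)|(3<<2)|2 = 0xEE. Below is a small sketch that spells this out as a function. shuffle_imm and matches_macro are hypothetical names used only for illustration, not part of any header.

```c
#include <assert.h>
#include <emmintrin.h>  /* pulls in the _MM_SHUFFLE macro */

/* Hypothetical function form of the macro's arithmetic. */
static unsigned shuffle_imm(unsigned a, unsigned b, unsigned c, unsigned d)
{
    assert(a < 4 && b < 4 && c < 4 && d < 4);  /* each selector is 2 bits */
    return (a << 6) | (b << 4) | (c << 2) | d;
}

/* Returns 1 if the function above agrees with the real macro for (3,2,3,2). */
static int matches_macro(void)
{
    return shuffle_imm(3, 2, 3, 2) == (unsigned)_MM_SHUFFLE(3, 2, 3, 2);
}
```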
What these bits mean depends on the instruction you use them with; the Intel intrinsics guide describes it for each intrinsic. For _mm_shuffle_epi32 the result is computed as follows:
dst[0] = src[d]
dst[1] = src[c]
dst[2] = src[b]
dst[3] = src[a]
Here src and dst are the 128-bit input and output registers, and the array syntax indexes their 32-bit lanes. In your case you want the top 64 bits (src[2] and src[3]) moved to the low 64 bits (dst[0] and dst[1]), so d=2 and c=3. The values of a and b don't matter for that; each can be anything from 0 to 3.
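The lane selection can be checked with a small scalar model next to the real intrinsic. This is only a sketch: shuffle_model and shuffle_high_to_low are hypothetical helpers, and the choice a=0, b=1 is arbitrary, picked to show that the upper selectors do not affect the low 64 bits.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_shuffle_epi32 */
#include <stdint.h>

/* Hypothetical scalar model: 2-bit selector i of imm picks the source lane for dst[i]. */
static void shuffle_model(const uint32_t src[4], unsigned imm, uint32_t dst[4])
{
    assert(imm < 256);  /* four 2-bit selectors */
    for (int i = 0; i < 4; ++i)
        dst[i] = src[(imm >> (2 * i)) & 3];
}

/* Same selection with the intrinsic; the control must be a compile-time constant. */
static void shuffle_high_to_low(const uint32_t src[4], uint32_t dst[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    /* d=2 and c=3 move the top 64 bits down; a=0 and b=1 are arbitrary here. */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 3, 2));
    _mm_storeu_si128((__m128i *)dst, r);
}
```

With src = {10, 11, 12, 13}, both versions put 12 and 13 into dst[0] and dst[1]; only dst[2] and dst[3] change when a and b change.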
There is no way to add the top 64 bits to the low 64 bits of an __m128i with a single instruction.