Understanding basic SIMD

cl777

#21138

June 8, 2019

Hi Folks,

I'm trying to wrap my had around basic SIMD stuff.
I wrote a simple program that adds two V3s and stores the result in a new one.
There are two data structures, V3 (no SIMD) and V3u (SIMD):

typedef struct 
{
    u32 x;
    u32 y;
    u32 z;
} V3;

typedef union 
{
    struct 
    {
        u32 x;
        u32 y;
        u32 z;
        u32 padding;    
    } components;
    __m128i sse;
} V3U;

and two functions to add vectors, AddVector and AddVector_128

V3 *AddVectors(V3 *va, V3 *vb, u32 size)
{
    V3 *result = calloc(size, sizeof(V3));

    for(u32 i = 0; i < size; i++) {
        result[i].x = va[i].x + vb[i].x;
        result[i].y = va[i].y + vb[i].y;
        result[i].z = va[i].z + vb[i].z;
    }

    return result;
}

V3U *AddVectors_128(V3U *va, V3U *vb, u32 size)
{
    V3U *result = calloc(size, sizeof(V3U));

    for(u32 i = 0; i < size; i++) {
        result[i].sse = _mm_add_epi32(va[i].sse, vb[i].sse);
    }

    return result;
}

When I look at the asm it's indeed 3 adds / 6 movs vs 1 paddd / 2 movdqu
Yet, timing the code shows that the SIMD code takes ~2ms longer for 1000000 vector additions.

Is my approach at all correct, or did I misunderstand something?
I know timing such simple computations is not really meaningful. I just wanted to see, if there's some difference.

(sorry for my english, greetings from Germany ;))

Christoph

Edited by cl777 on June 8, 2019, 12:09pm Reason: Initial post

Mārtiņš Možeiko

#21139

June 8, 2019

Few questions:
1) are you compiling with optimizations (O2/O3) when benchmarking?
2) you are actually processing more data in sse than scalar version - 15.2 MB vs 11.4 MB. I suggest changing code so you are processing same amount of data.
3) how are you measuring benchmark time? Is the resolution of measurement time small enough?
4) suggestion - try unrolling loop in case compiler did not do that for simd version. As these adds are independent and you have enough simd registers doing 4 or even 8 of them in "parallel" could increase performance (because then updating loop counter will happen less often).

Edited by Mārtiņš Možeiko on June 8, 2019, 8:02pm

cl777

#21140

June 9, 2019

Hello Mārtiņš,

1. I'm compiling with Cl -O2
2. That was the culprit. Comparing apples and oranges. When I add the fourth computation to the
scalar version, the sse one becomes between 1.0 and 1.5ms faster.

3. I'm measuring with QueryPerformanceCounter

inline void StartCounter()
{
    QueryPerformanceFrequency(&Frequency); 
    QueryPerformanceCounter(&StartingTime);
}

inline long long GetCounter()
{
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;

    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;

    return ElapsedMicroseconds.QuadPart;
}

4. I tried unrolling the loop like this

V3U *AddVectors_128(V3U *va, V3U *vb, u32 size)
{
    V3U *result = calloc(size, sizeof(V3U));

    for(u32 i = 0; i < size; i+=4) {
        result[i].sse = _mm_add_epi32(va[i].sse, vb[i].sse);
        result[i+1].sse = _mm_add_epi32(va[i+1].sse, vb[i+1].sse);
        result[i+2].sse = _mm_add_epi32(va[i+2].sse, vb[i+2].sse);
        result[i+3].sse = _mm_add_epi32(va[i+3].sse, vb[i+3].sse);
    }

    return result;
}

but the program still only uses xmm0. How do I move the data to the other registers?
Wouldn`t that be a compiler thing?

Thanks for your help.

Mārtiņš Možeiko

#21141

June 9, 2019

Try this:

V3U *AddVectors_128(V3U *va, V3U *vb, u32 size)
{
    V3U *result = calloc(size, sizeof(V3U));

    for(u32 i = 0; i < size; i+=4) {
        (*result++).sse = _mm_add_epi32((*va++).sse, (*vb++).sse);
        (*result++).sse = _mm_add_epi32((*va++).sse, (*vb++).sse);
        (*result++).sse = _mm_add_epi32((*va++).sse, (*vb++).sse);
        (*result++).sse = _mm_add_epi32((*va++).sse, (*vb++).sse);
    }

    return result;
}

Sometimes compilers like dereferencing and ++ instead of [i+n].

Edited by Mārtiņš Možeiko on June 9, 2019, 6:31pm

cl777

#21143

June 9, 2019

I tried

V3U *AddVectors_128(V3U *va, V3U *vb, u32 size)
{
    V3U *result = calloc(size, sizeof(V3U));
    
    V3U *pva = va;
    V3U *pvb = vb;
    V3U *pvc = result;

    for(u32 i = 0; i < size; i+=4) {
        (*pvc++).sse = _mm_add_epi32((*pva++).sse, (*pvb++).sse);
        (*pvc++).sse = _mm_add_epi32((*pva++).sse, (*pvb++).sse);
        (*pvc++).sse = _mm_add_epi32((*pva++).sse, (*pvb++).sse);
        (*pvc++).sse = _mm_add_epi32((*pva++).sse, (*pvb++).sse);        
    }

    return result;
}

but it still only uses one register. Here's the dumpbin

  0000000000000040: F3 0F 6F 0B        movdqu      xmm1,xmmword ptr [rbx]
  0000000000000044: 4D 8D 49 40        lea         r9,[r9+40h]
  0000000000000048: 66 0F FE 0F        paddd       xmm1,xmmword ptr [rdi]
  000000000000004C: 48 8D 5B 40        lea         rbx,[rbx+40h]
  0000000000000050: F3 41 0F 7F 49 C0  movdqu      xmmword ptr [r9-40h],xmm1
  0000000000000056: 48 8D 7F 40        lea         rdi,[rdi+40h]
  000000000000005A: F3 0F 6F 4B D0     movdqu      xmm1,xmmword ptr [rbx-30h]
  000000000000005F: 66 0F FE 4F D0     paddd       xmm1,xmmword ptr [rdi-30h]
  0000000000000064: F3 41 0F 7F 49 D0  movdqu      xmmword ptr [r9-30h],xmm1
  000000000000006A: F3 0F 6F 4B E0     movdqu      xmm1,xmmword ptr [rbx-20h]
  000000000000006F: 66 0F FE 4F E0     paddd       xmm1,xmmword ptr [rdi-20h]
  0000000000000074: F3 41 0F 7F 49 E0  movdqu      xmmword ptr [r9-20h],xmm1
  000000000000007A: F3 0F 6F 47 F0     movdqu      xmm0,xmmword ptr [rdi-10h]
  000000000000007F: F3 0F 6F 4B F0     movdqu      xmm1,xmmword ptr [rbx-10h]
  0000000000000084: 66 0F FE C8        paddd       xmm1,xmm0
  0000000000000088: F3 41 0F 7F 49 F0  movdqu      xmmword ptr [r9-10h],xmm1

But I wonder, wouldn't it be the same if it uses 4 Registers? It would still need at least 2 moves / 1 add per computation.
How would it do that in parallel?

Mārtiņš Možeiko

#21144

June 9, 2019

This is ok. Modern CPUs execute a lot of stuff ahead of "current" instruction. To do so it internally renames registers - even if next instruction uses same xmm1 register, internally it allocates different "xmm" register as long as next operations do not need previous results. So as long as there is no dependency between operations this will allow to execute multiple additions like you have in parallel. Most likely it will happen also with unrolled loop, but there there is a bit larger overhead - loop increment/comparison.

Basically if you need to calculate:
a = b + c;
d = e + f;
This can be done in parallel as long as e and f are different from a.
Even if you implement addition with same register names. Here's more information: https://en.wikipedia.org/wiki/Register_renaming

Edited by Mārtiņš Možeiko on June 9, 2019, 9:28pm

cl777

#21151

June 11, 2019

Hello Mārtiņš,

thanks for your insights and the link (in the end it`s all about knowing the right terms I guess ;)).
I tried to dig deeper into this stuff and wrote two functions which are a little more
complex (normalization, the vector components are now f32s)

V3 *NormalizeVectors(V3 *va, u32 size)
{
    V3 *result = calloc(size, sizeof(V3));

    for(u32 i = 0; i < size; i+=4) {
        f32 l = (f32)sqrt(va[i].x * va[i].x + va[i].y * va[i].y + va[i].z * va[i].z);
        result[i].x = va[i].x / l;
        result[i].y = va[i].y / l;
        result[i].z = va[i].z / l;
        result[i].padding = va[i].padding / l; // for keeping the functions comparable

        l = (f32)sqrt(va[i+1].x * va[i+1].x + va[i+1].y * va[i+1].y + va[i+1].z * va[i+1].z);
        result[i+1].x = va[i+1].x / l;
        result[i+1].y = va[i+1].y / l;
        result[i+1].z = va[i+1].z / l;
        result[i+1].padding = va[i+1].padding / l;

        l = (f32)sqrt(va[i+2].x * va[i+2].x + va[i+2].y * va[i+2].y + va[i+2].z * va[i+2].z);
        result[i+2].x = va[i+2].x / l;
        result[i+2].y = va[i+2].y / l;
        result[i+2].z = va[i+2].z / l;
        result[i+2].padding = va[i+2].padding / l;

        l = (f32)sqrt(va[i+3].x * va[i+3].x + va[i+3].y * va[i+3].y + va[i+3].z * va[i+3].z);
        result[i+3].x = va[i+3].x / l;
        result[i+3].y = va[i+3].y / l;
        result[i+3].z = va[i+3].z / l;
        result[i+3].padding = va[i+3].padding / l;
    }

    return result;
}

and

V3U *NormalizeVectors_128(V3U *va, u32 size)
{
    V3U *result = calloc(size, sizeof(V3U));
    
    __m128 lqs_128;
    __m128 lq_128;
    __m128 l1_128;
    __m128 l2_128;
    __m128 l3_128;
    __m128 l4_128;

    for(u32 i = 0; i < size; i+=4) {
        lqs_128 = _mm_set_ps(va[i].components.x * va[i].components.x + va[i].components.y * va[i].components.y + va[i].components.z * va[i].components.z,
                             va[i+1].components.x * va[i+1].components.x + va[i+1].components.y * va[i+1].components.y + va[i+1].components.z * va[i+1].components.z, 
                             va[i+2].components.x * va[i+2].components.x + va[i+2].components.y * va[i+2].components.y + va[i+2].components.z * va[i+2].components.z, 
                             va[i+3].components.x * va[i+3].components.x + va[i+3].components.y * va[i+3].components.y + va[i+3].components.z * va[i+3].components.z);

        lq_128 = _mm_sqrt_ps(lqs_128);

        l1_128 = _mm_set_ps1(lq_128.m128_f32[3]);
        l2_128 = _mm_set_ps1(lq_128.m128_f32[2]);
        l3_128 = _mm_set_ps1(lq_128.m128_f32[1]);
        l4_128 = _mm_set_ps1(lq_128.m128_f32[0]);

        result[i].sse = _mm_div_ps(va[i].sse, l1_128);
        result[i+1].sse = _mm_div_ps(va[i+1].sse, l2_128);
        result[i+2].sse = _mm_div_ps(va[i+2].sse, l3_128);
        result[i+3].sse = _mm_div_ps(va[i+3].sse, l4_128);
    }

    return result;
}

Now the usage of multiple xmm registers becomes more apperant.
The (to me) interesting thing is, that the compiler now manages to auto-vectorize the computation in NormalizeVectors().
I`m not really sure under which circumstance it would do that, respectively what structure helps it to do it.
Also my SIMD version in NormalizeVectors_128() is still faster (~1ms for 1000000 computations) so I guess relying on auto-
vectorization isn´t always the best option. I need to dig a little more I guess.

BTW, it would be very interesting to see a version of NormalizeVectors_128() by someone with a little more experience in SIMD.

Mārtiņš Možeiko

#21153

June 11, 2019

You won't see much speed gains in this case because at beginning of loop you are doing a lot of scalar operations to calculate magnitude of vector.

Typically when you do SIMD you want to rearrange how you store data. In your case instead of storing xyzxyzxyz... you should store xxx...yyy...zzz... Basically turn array of structures to structure of arrays. Then you'll be able to load 4 x coordinates with one load. Then 4 y coordinates with one load and then 4 z coordinates with one load. After than you do 3 multiply operations and 3 add operations to calculate lqs_128. Total 3+3+3=9 operations. Which will be waaaaay less than you are doing now.

If you cannot do that you can try doing shuffles, not sure if that will help though: https://godbolt.org/z/awC6Ij
Not sure if this will correctly - you might need to debug it. I have not tried running it.

Try also to not use m128_f32 member. Compiler can generate suboptimal code when you use it. I've replaced it with shuffles.

I've also replaced square root and div with reciprocal square root. Div is pretty expensive - doing it 4 times is not a good idea. SSE has 1/sqrt(x) operation as one instruction. Although it does not have best precision. To improve precision you can do Newton–Raphson method, which is pretty cheap.

Replace

1	__m128 rsqrt = _mm_rsqrt_ps(dot);

with:

1
2
3

        __m128 rsqrt = _mm_rsqrt_ps(dot);
        __m128 tmp = _mm_mul_ps(_mm_mul_ps(dot, rsqrt), rsqrt); 
        rsqrt = _mm_mul_ps(_mm_mul_ps(_mm_set_ps1(0.5f), rsqrt), _mm_sub_ps(_mm_set_ps1(3.0f), tmp));

Alternatively calculate 1/sqrt() with one sqrt_ps and one div_ps. Then use this value to do 4 mul_ps.

Edited by Mārtiņš Možeiko on June 11, 2019, 10:16pm

cl777

#21169

June 13, 2019

Hello again,

thank you very much for your explanation. You've been a great help.

Just one thing. If I understand correctly, _MM_SHUFFLE(n, n, n, n) is like a mask so
_mm_shuffle_ps() extracts the values at the nth position(s) in __m128 and distributes them accordingly.
What purpose does the second argument to _mm_shuffle() have? The description on the intrinsics guide doesn't even mention it.

Mārtiņš Možeiko

#21171

June 14, 2019

What do you mean intrinsics guide does not mention it?
It explains it here:

__m128 _mm_shuffle_ps (__m128 a, __m128 b, unsigned int imm8)
...
DEFINE SELECT4(src, control) {
	CASE(control[1:0]) OF
	0:	tmp[31:0] := src[31:0]
	1:	tmp[31:0] := src[63:32]
	2:	tmp[31:0] := src[95:64]
	3:	tmp[31:0] := src[127:96]
	ESAC
	RETURN tmp[31:0]
}
dst[31:0] := SELECT4(a[127:0], imm8[1:0])
dst[63:32] := SELECT4(a[127:0], imm8[3:2])
dst[95:64] := SELECT4(b[127:0], imm8[5:4])
dst[127:96] := SELECT4(b[127:0], imm8[7:6])

Basically this means for code like this:

1	C = _mm_shuffle_ps(a, b, _MM_SHUFFLE(x, y, z, w));

it does following:

c[0] = a[w];
c[1] = a[z];
c[2] = b[y];
c[3] = b[x];

Where x/y/z/w is in [0..3] interval. _MM_SHUFFLE is defined in one of intrinsic headers. It simply combines all two bit masks in one 8-bit mask (imm8 in intrinsics guide).

cl777

#21172

June 14, 2019

I meant the second parameter to _mm_shuffle_ps(), b and that it is not mentioned in

1 2	Description Shuffle single-precision (32-bit) floating-point elements in a using the control in imm8, and store the results in dst.

So you're basically using a function for shuffling two _m128 to shuffle one?

Mārtiņš Možeiko

#21175

June 15, 2019

I'm using it to expand float either in 1st, 2nd, 3rd or 4th position to all 4 elements.

Look more carefully in intrinsic guide, it explains in pseudocode how second argument is used:

Edited by Mārtiņš Možeiko on June 15, 2019, 10:46am