Hello Mārtiņš,
thanks for your insights and the link (in the end it's all about knowing the right terms, I guess ;)).
I tried to dig deeper into this stuff and wrote two functions which are a little more
complex (normalization, the vector components are now f32s):
V3 *NormalizeVectors(V3 *va, u32 size)
{
    // Note: the loop is unrolled by hand and assumes size is a multiple of 4.
    V3 *result = calloc(size, sizeof(V3));
    for(u32 i = 0; i < size; i+=4) {
        f32 l = (f32)sqrt(va[i].x * va[i].x + va[i].y * va[i].y + va[i].z * va[i].z);
        result[i].x = va[i].x / l;
        result[i].y = va[i].y / l;
        result[i].z = va[i].z / l;
        result[i].padding = va[i].padding / l; // for keeping the functions comparable
        l = (f32)sqrt(va[i+1].x * va[i+1].x + va[i+1].y * va[i+1].y + va[i+1].z * va[i+1].z);
        result[i+1].x = va[i+1].x / l;
        result[i+1].y = va[i+1].y / l;
        result[i+1].z = va[i+1].z / l;
        result[i+1].padding = va[i+1].padding / l;
        l = (f32)sqrt(va[i+2].x * va[i+2].x + va[i+2].y * va[i+2].y + va[i+2].z * va[i+2].z);
        result[i+2].x = va[i+2].x / l;
        result[i+2].y = va[i+2].y / l;
        result[i+2].z = va[i+2].z / l;
        result[i+2].padding = va[i+2].padding / l;
        l = (f32)sqrt(va[i+3].x * va[i+3].x + va[i+3].y * va[i+3].y + va[i+3].z * va[i+3].z);
        result[i+3].x = va[i+3].x / l;
        result[i+3].y = va[i+3].y / l;
        result[i+3].z = va[i+3].z / l;
        result[i+3].padding = va[i+3].padding / l;
    }
    return result;
}

and
V3U *NormalizeVectors_128(V3U *va, u32 size)
{
    // Note: processes four vectors per iteration and assumes size is a multiple of 4.
    V3U *result = calloc(size, sizeof(V3U));
    __m128 lqs_128;
    __m128 lq_128;
    __m128 l1_128;
    __m128 l2_128;
    __m128 l3_128;
    __m128 l4_128;
    for(u32 i = 0; i < size; i+=4) {
        // Squared lengths of four vectors. _mm_set_ps fills the highest lane first,
        // so the value for va[i] ends up in lane 3 and the one for va[i+3] in lane 0.
        lqs_128 = _mm_set_ps(va[i].components.x * va[i].components.x + va[i].components.y * va[i].components.y + va[i].components.z * va[i].components.z,
                             va[i+1].components.x * va[i+1].components.x + va[i+1].components.y * va[i+1].components.y + va[i+1].components.z * va[i+1].components.z,
                             va[i+2].components.x * va[i+2].components.x + va[i+2].components.y * va[i+2].components.y + va[i+2].components.z * va[i+2].components.z,
                             va[i+3].components.x * va[i+3].components.x + va[i+3].components.y * va[i+3].components.y + va[i+3].components.z * va[i+3].components.z);
        lq_128 = _mm_sqrt_ps(lqs_128); // four square roots at once
        // Broadcast each length to all four lanes (m128_f32 is MSVC-specific).
        l1_128 = _mm_set_ps1(lq_128.m128_f32[3]);
        l2_128 = _mm_set_ps1(lq_128.m128_f32[2]);
        l3_128 = _mm_set_ps1(lq_128.m128_f32[1]);
        l4_128 = _mm_set_ps1(lq_128.m128_f32[0]);
        result[i].sse = _mm_div_ps(va[i].sse, l1_128);
        result[i+1].sse = _mm_div_ps(va[i+1].sse, l2_128);
        result[i+2].sse = _mm_div_ps(va[i+2].sse, l3_128);
        result[i+3].sse = _mm_div_ps(va[i+3].sse, l4_128);
    }
    return result;
}
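
A note on the lane indices used above: `_mm_set_ps` takes its arguments from the highest lane down to the lowest, which is why the length belonging to va[i] is read back from index 3. The following is only a small sketch to make that visible; `lanes_of_set_ps` is a made-up helper name, and `_mm_storeu_ps` is used instead of the MSVC-only `m128_f32` member so that it should also build with GCC and Clang.

```c
#include <xmmintrin.h>

/* Hypothetical helper: packs four floats with _mm_set_ps and writes the
   register back to memory, lane 0 first. */
static void lanes_of_set_ps(float a, float b, float c, float d, float out[4])
{
    __m128 v = _mm_set_ps(a, b, c, d); /* a -> lane 3, b -> lane 2, c -> lane 1, d -> lane 0 */
    _mm_storeu_ps(out, v);             /* out[0] = lane 0 = d, out[3] = lane 3 = a */
}
```

If the reversed order is confusing, `_mm_setr_ps` takes the same four values in memory order (first argument in lane 0).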

Now the use of multiple xmm registers becomes more apparent.
The interesting thing (to me) is that the compiler now manages to auto-vectorize the computation in NormalizeVectors().
I'm not really sure under which circumstances it does that, or which code structure helps it do so.
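
One way to find out is to ask the compiler itself, since the major compilers can report which loops they vectorized. The following is a sketch under assumptions, not the code measured in this post: `Vec4` and `SquaredLengths` are placeholder names, the loop is deliberately simpler than the normalization, and the switches in the comment are quoted from memory, so they are worth checking against the compiler documentation.

```c
#include <stddef.h>

/* Placeholder type mirroring V3: three components plus padding. */
typedef struct { float x, y, z, padding; } Vec4;

/* Vectorization reports (assumed switches, verify for your compiler version):
     MSVC:  cl /O2 /Qvec-report:2 file.c
     GCC:   gcc -O3 -fopt-info-vec-optimized -fopt-info-vec-missed file.c
     Clang: clang -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize file.c
   'restrict' (C99) promises that in and out do not overlap, which removes
   one common reason for a compiler to give up on a loop. */
void SquaredLengths(const Vec4 *restrict in, float *restrict out, size_t count)
{
    /* Plain loop: one independent result per iteration, no manual unrolling. */
    for (size_t i = 0; i < count; ++i) {
        out[i] = in[i].x * in[i].x + in[i].y * in[i].y + in[i].z * in[i].z;
    }
}
```

Whether a given loop actually gets vectorized depends on the compiler and the optimization level, which is exactly what the reports are meant to show.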
Also, my SIMD version in NormalizeVectors_128() is still faster (~1ms for 1000000 computations), so I guess relying on
auto-vectorization isn't always the best option. I need to dig a little more, I guess.
BTW, it would be very interesting to see a version of NormalizeVectors_128() by someone with a little more experience in SIMD.
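
One direction such a version could take, purely as a sketch: keep the length computation inside the SSE registers by transposing four vectors, so that all x, all y and all z components each sit in one register. This has not been benchmarked against the functions above. `Normalize4_SSE` is a made-up name, and the code assumes that each vector is four packed floats (x, y, z, padding) and that count is a multiple of 4.

```c
#include <xmmintrin.h>

/* Sketch: normalizes 'count' vectors of four floats each (x, y, z, padding).
   Assumes count is a multiple of 4; unaligned loads/stores are used so the
   buffers need no special alignment. */
void Normalize4_SSE(const float *in, float *out, unsigned count)
{
    for (unsigned i = 0; i < count; i += 4) {
        const float *p = in + 4 * i;
        float *q = out + 4 * i;
        __m128 v0 = _mm_loadu_ps(p + 0);
        __m128 v1 = _mm_loadu_ps(p + 4);
        __m128 v2 = _mm_loadu_ps(p + 8);
        __m128 v3 = _mm_loadu_ps(p + 12);

        /* Transpose copies: xs = x0..x3, ys = y0..y3, zs = z0..z3. */
        __m128 xs = v0, ys = v1, zs = v2, ws = v3;
        _MM_TRANSPOSE4_PS(xs, ys, zs, ws);
        (void)ws; /* padding lanes are not part of the length */

        /* Four squared lengths and four square roots at once. */
        __m128 lensq = _mm_add_ps(_mm_add_ps(_mm_mul_ps(xs, xs), _mm_mul_ps(ys, ys)),
                                  _mm_mul_ps(zs, zs));
        __m128 len = _mm_sqrt_ps(lensq);

        /* Lane k of len belongs to vector k: broadcast it and divide. */
        _mm_storeu_ps(q + 0,  _mm_div_ps(v0, _mm_shuffle_ps(len, len, _MM_SHUFFLE(0, 0, 0, 0))));
        _mm_storeu_ps(q + 4,  _mm_div_ps(v1, _mm_shuffle_ps(len, len, _MM_SHUFFLE(1, 1, 1, 1))));
        _mm_storeu_ps(q + 8,  _mm_div_ps(v2, _mm_shuffle_ps(len, len, _MM_SHUFFLE(2, 2, 2, 2))));
        _mm_storeu_ps(q + 12, _mm_div_ps(v3, _mm_shuffle_ps(len, len, _MM_SHUFFLE(3, 3, 3, 3))));
    }
}
```

Compared to NormalizeVectors_128(), this avoids the scalar multiplications and the round trip through `_mm_set_ps`, at the cost of the shuffles inside the transpose. Whether that is a win would have to be measured.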