Hi everyone. I'm sure this isn't new, but I just discovered it and wanted to share because I think it's cool to know about. I was reading this Intel paper about optimizing copies from USWC to WB memory and decided to try the same idea on x86 with ordinary memory (WB to WB, basically).
I tested the copy using code from this thread. In my tests it noticeably speeds up the copy when you use the rep movs instructions and your buffers don't fit in the CPU cache.
Results for copying 1 GiB:
-- Intel Pentium 2030M 2.5GHz DDR3-1333
memcpy = 2.34 GiB/s
CopyWithSSE = 1.95 GiB/s
CopyWithSSESmall = 1.55 GiB/s
CopyWithSSENoCache = 2.38 GiB/s
CopyWithRepMovsb = 2.37 GiB/s
CopyWithRepMovsd = 2.62 GiB/s
CopyWithRepMovsq = 2.94 GiB/s
CopyWithRepMovsbUnaligned = 2.87 GiB/s
CopyWithThreads = 3.7 GiB/s
CopyWithBuff = 4.9 GiB/s
-- AMD 5600g 3.9GHz DDR4-3200
memcpy = 14.12 GiB/s
CopyWithSSE = 9.01 GiB/s
CopyWithSSESmall = 8.94 GiB/s
CopyWithSSENoCache = 12.42 GiB/s
CopyWithRepMovsb = 14.02 GiB/s
CopyWithRepMovsd = 14.21 GiB/s
CopyWithRepMovsq = 13.91 GiB/s
CopyWithRepMovsbUnaligned = 14.55 GiB/s
CopyWithThreads = 15.51 GiB/s
CopyWithBuff = 23.13 GiB/s
I'm not sure why t2 is taken after the memcmp, so I didn't change that in my fork. If I calculate t2 before the memcmp instead, then on my AMD CopyWithRepMovsd reaches 17 GiB/s and CopyWithBuff goes up to roughly 35-40 GiB/s.
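To make the t2 placement concrete, here is a minimal sketch of a timed copy that takes the end timestamp before the memcmp, so the verification pass is not counted in the throughput. The names now_seconds and timed_copy_gib_per_s are hypothetical and not from the benchmark in the thread. Plain memcpy stands in for the copy under test, and C11 timespec_get is assumed to be available as the clock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get (assumed available). */
static double now_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: times one copy and returns GiB/s, or -1.0 if the
   result does not match the source. memcpy stands in for the copy under test. */
static double timed_copy_gib_per_s(uint8_t *dst, const uint8_t *src, size_t size) {
    assert(dst != NULL && src != NULL && size > 0);

    double t1 = now_seconds();
    memcpy(dst, src, size);
    double t2 = now_seconds();   /* taken BEFORE memcmp: verification is not timed */

    if (memcmp(dst, src, size) != 0)
        return -1.0;

    double elapsed = t2 - t1;
    if (elapsed <= 0.0)
        elapsed = 1e-9;          /* guard against clock granularity on tiny copies */
    return ((double)size / (1024.0 * 1024.0 * 1024.0)) / elapsed;
}
```

Moving the t2 line below the memcmp would add a full read of both buffers to the measured time, which would lower the reported GiB/s.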
There is an error in the code provided: in CopyWithBuff you use buff_ptr instead of buff.
After that modification the code ran, but the numbers were too good to be true: repmovsd was about 6 GiB/s and your version was 16 GiB/s (i5 2500).
You never advance the src and dst pointers in your function, so you are copying the same 4K and writing it to the same destination on every iteration (I didn't check, but maybe the compiler detected that and optimized it out).
    // size must be a multiple of copy_buff_size
    static void CopyWithBuff(uint8_t* dst, uint8_t* src, size_t size)
    {
        uint8_t buff[copy_buff_size];
        while (size) {
            CopyWithRepMovsd(buff, src, copy_buff_size);
            CopyWithRepMovsd(dst, buff, copy_buff_size);
            size -= copy_buff_size;
            src += copy_buff_size;  // advance to the next source block
            dst += copy_buff_size;  // advance to the next destination block
        }
    }
With that modification the code is about the same speed as repmovsd.
The reason the issue was not picked up by the memcmp call is that a previous test (or the initial memcpy) had already set the destination to the correct value, so the final output was still correct. Adding memset( dst, 0, kSize ); at the start of the BENCH macro makes the issue visible.
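The effect of that memset can be sketched in isolation. This is an illustration only, not the BENCH macro from the thread: copy_buggy, copy_fixed and copy_is_correct are hypothetical names, memcpy stands in for CopyWithRepMovsd, and copy_buff_size is assumed to be 4096.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { copy_buff_size = 4096 };  /* assumed staging buffer size */

typedef void (*copy_fn)(uint8_t *dst, const uint8_t *src, size_t size);

/* Buggy variant: src and dst never advance, so only the first block is written. */
static void copy_buggy(uint8_t *dst, const uint8_t *src, size_t size) {
    uint8_t buff[copy_buff_size];
    assert(size % copy_buff_size == 0);
    while (size) {
        memcpy(buff, src, copy_buff_size);
        memcpy(dst, buff, copy_buff_size);
        size -= copy_buff_size;
    }
}

/* Fixed variant: both pointers move forward one block per iteration. */
static void copy_fixed(uint8_t *dst, const uint8_t *src, size_t size) {
    uint8_t buff[copy_buff_size];
    assert(size % copy_buff_size == 0);
    while (size) {
        memcpy(buff, src, copy_buff_size);
        memcpy(dst, buff, copy_buff_size);
        size -= copy_buff_size;
        src += copy_buff_size;
        dst += copy_buff_size;
    }
}

/* Clears dst first so data left over from an earlier test cannot hide a bad copy. */
static int copy_is_correct(copy_fn fn, uint8_t *dst, const uint8_t *src, size_t size) {
    memset(dst, 0, size);
    fn(dst, src, size);
    return memcmp(dst, src, size) == 0;
}
```

If dst already holds a correct copy from an earlier run, calling copy_buggy directly and then memcmp reports a match. Going through copy_is_correct reports a mismatch for copy_buggy whenever the source has non-zero data beyond the first block.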