Hi everyone. I'm sure this isn't anything new, but I just discovered it and decided to share because I think it's cool to know. I was reading this Intel paper about optimizing copies from USWC to WB memory and decided to try the same idea on x86 with ordinary memory (basically just WB to WB).
I tested the copy using code from this thread. In my tests, it really helps speed up the copy if you use the rep movs instruction and your buffers don't fit in the CPU cache.
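For anyone who wants to try the idea, here is a minimal sketch of how I read it: the data is staged through a small buffer that stays in cache instead of being copied directly. It assumes CopyWithBuff works roughly this way; the code in the fork may differ. The names copy_rep_movsb, copy_with_bounce and BOUNCE_SIZE are made up for this example, and the 4 KiB size is just a starting guess to tune on your CPU. The inline asm needs GCC or Clang on x86; elsewhere it falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical staging-buffer size: small enough to stay in L1/L2 cache. */
#define BOUNCE_SIZE (4 * 1024)

/* Copy n bytes with "rep movsb" on x86 (GCC/Clang inline asm).
 * On other compilers or architectures, fall back to memcpy. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    /* RDI = destination, RSI = source, RCX = byte count.
     * The ABI guarantees the direction flag is clear on function entry. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
}

/* Copy n bytes from src to dst in BOUNCE_SIZE chunks, staging each chunk
 * through a small local buffer that is reused and so stays cache-resident. */
static void copy_with_bounce(void *dst, const void *src, size_t n)
{
    _Alignas(64) unsigned char bounce[BOUNCE_SIZE];
    const unsigned char *s = src;
    unsigned char *d = dst;

    assert(n == 0 || (dst != NULL && src != NULL));

    while (n > 0) {
        size_t chunk = n < BOUNCE_SIZE ? n : BOUNCE_SIZE;
        copy_rep_movsb(bounce, s, chunk);  /* source -> cached buffer */
        copy_rep_movsb(d, bounce, chunk);  /* cached buffer -> destination */
        s += chunk;
        d += chunk;
        n -= chunk;
    }
}
```

Call it like memcpy, for example copy_with_bounce(dst, src, size). Whether it beats a plain rep movsb depends on the CPU and on the buffers being larger than the cache, so measure before relying on it.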
Results for copying 1 GiB:
-- Intel Pentium 2030M 2.5GHz DDR3-1333
memcpy = 2.34 GiB/s
CopyWithSSE = 1.95 GiB/s
CopyWithSSESmall = 1.55 GiB/s
CopyWithSSENoCache = 2.38 GiB/s
CopyWithRepMovsb = 2.37 GiB/s
CopyWithRepMovsd = 2.62 GiB/s
CopyWithRepMovsq = 2.94 GiB/s
CopyWithRepMovsbUnaligned = 2.87 GiB/s
CopyWithThreads = 3.7 GiB/s
CopyWithBuff = 4.9 GiB/s
-- AMD 5600g 3.9GHz DDR4-3200
memcpy = 14.12 GiB/s
CopyWithSSE = 9.01 GiB/s
CopyWithSSESmall = 8.94 GiB/s
CopyWithSSENoCache = 12.42 GiB/s
CopyWithRepMovsb = 14.02 GiB/s
CopyWithRepMovsd = 14.21 GiB/s
CopyWithRepMovsq = 13.91 GiB/s
CopyWithRepMovsbUnaligned = 14.55 GiB/s
CopyWithThreads = 15.51 GiB/s
CopyWithBuff = 23.13 GiB/s
I'm not sure why t2 is taken after the memcmp, so I didn't change that in my fork. But if I calculate t2 before the memcmp, then on my AMD CopyWithRepMovsd reaches 17 GiB/s and CopyWithBuff goes to roughly 35-40 GiB/s.
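To show what I mean by moving t2, here is a rough sketch of a timing helper that stops the clock before the verification step. It is not the code from the thread; measure_gib_per_s, now_seconds, copy_fn and copy_memcpy are names invented for this example. Since memcmp has to read both buffers in full, including it in the timed region presumably lowers the reported rate, which would match the difference I see.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by the copy routines under test. */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Wrapper so the standard memcpy matches copy_fn (it returns void *). */
static void copy_memcpy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Monotonic clock reading in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Time one copy of n bytes and return the rate in GiB/s.
 * Returns a negative value if dst does not match src afterwards. */
static double measure_gib_per_s(copy_fn copy, void *dst, const void *src,
                                size_t n)
{
    assert(copy != NULL);

    double t1 = now_seconds();
    copy(dst, src, n);
    double t2 = now_seconds();  /* stop the clock BEFORE verifying */

    /* Verification runs outside the timed region. */
    if (memcmp(dst, src, n) != 0)
        return -1.0;

    double elapsed = t2 - t1;
    if (elapsed <= 0.0)
        elapsed = 1e-9;  /* guard against a zero reading on tiny copies */

    return (double)n / (1024.0 * 1024.0 * 1024.0) / elapsed;
}
```

For example, measure_gib_per_s(copy_memcpy, dst, src, size) times the library memcpy. A single run is noisy, and the first write to a freshly allocated dst also pays for page faults, so it is worth touching the buffers once beforehand and repeating the measurement a few times.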