Copy memory with small temp buffer

Hi everyone. I'm sure this isn't anything new, but I just discovered it and decided to share because I think it's cool to know about. I was reading this Intel paper about optimizing copies from USWC to WB memory and decided to try the same approach on x86 with ordinary WB memory only.

I tested the copy using code from this thread. In my tests, this really helps speed up the copy if you use the rep movs instruction and your buffers don't fit in the CPU cache.

Results for a 1 GiB copy:

-- Intel Pentium 2030M 2.5GHz DDR3-1333

memcpy = 2.34 GiB/s
CopyWithSSE = 1.95 GiB/s
CopyWithSSESmall = 1.55 GiB/s
CopyWithSSENoCache = 2.38 GiB/s
CopyWithRepMovsb = 2.37 GiB/s
CopyWithRepMovsd = 2.62 GiB/s
CopyWithRepMovsq = 2.94 GiB/s
CopyWithRepMovsbUnaligned = 2.87 GiB/s
CopyWithThreads = 3.7 GiB/s
CopyWithBuff = 4.9 GiB/s

-- AMD 5600g 3.9GHz DDR4-3200

memcpy = 14.12 GiB/s
CopyWithSSE = 9.01 GiB/s
CopyWithSSESmall = 8.94 GiB/s
CopyWithSSENoCache = 12.42 GiB/s
CopyWithRepMovsb = 14.02 GiB/s
CopyWithRepMovsd = 14.21 GiB/s
CopyWithRepMovsq = 13.91 GiB/s
CopyWithRepMovsbUnaligned = 14.55 GiB/s
CopyWithThreads = 15.51 GiB/s
CopyWithBuff = 23.13 GiB/s

I'm not sure why t2 is taken after memcmp, so I didn't change it in the fork, but if I calculate t2 before memcmp, then on my AMD CopyWithRepMovsd is 17 GiB/s and CopyWithBuff goes to roughly 35-40 GiB/s.
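To illustrate the t2 point: if the second timestamp is taken after memcmp, the verification time is counted as copy time and the reported bandwidth drops. Below is a minimal sketch of a measurement that stops the clock before verifying. It is not the benchmark from the thread; bench_copy_gib_per_s, copy_memcpy and elapsed_seconds are hypothetical names, and the C11 timespec_get wall clock is an assumption standing in for whatever timer the original code uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(uint8_t *dst, const uint8_t *src, size_t size);

/* Stand-in for any of the copy routines being compared. */
static void copy_memcpy(uint8_t *dst, const uint8_t *src, size_t size)
{
    memcpy(dst, src, size);
}

/* Difference between two timestamps in seconds. */
static double elapsed_seconds(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) * 1e-9;
}

/* Returns GiB/s, 0.0 if the copy was too fast to measure,
   or -1.0 on allocation failure or a failed verification. */
static double bench_copy_gib_per_s(copy_fn fn, size_t size)
{
    uint8_t *src = malloc(size);
    uint8_t *dst = malloc(size);
    if (!src || !dst) { free(src); free(dst); return -1.0; }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 0xA5, size);
    memset(dst, 0, size);

    struct timespec t1, t2;
    timespec_get(&t1, TIME_UTC);
    fn(dst, src, size);
    timespec_get(&t2, TIME_UTC);             /* stop the clock here ...      */

    int ok = memcmp(dst, src, size) == 0;    /* ... so the check is untimed  */
    free(src);
    free(dst);
    if (!ok) return -1.0;

    double secs = elapsed_seconds(t1, t2);
    if (secs <= 0.0) return 0.0;             /* wall clock can also step back */
    return ((double)size / (1024.0 * 1024.0 * 1024.0)) / secs;
}
```

Calling bench_copy_gib_per_s(copy_memcpy, (size_t)1 << 30) would give a number comparable in spirit to the table above, though the absolute values depend on the timer and the machine.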


Edited by Alex on

There is an error in the code provided: in CopyWithBuff you use buff_ptr instead of buff.

After that modification the code ran, but the numbers were too good to be true: repmovsd was about 6 GiB/s and your version was 16 GiB/s (i5 2500).

You never move the src and dst pointers in your function, so you are copying the same 4K and writing it to the same destination on every iteration (I didn't check, but maybe the compiler detected that and optimized it out).

// size must be a multiple of copy_buff_size
static void CopyWithBuff(uint8_t* dst, uint8_t* src, size_t size)
{
	uint8_t buff[copy_buff_size];
    
	while (size)
	{
		CopyWithRepMovsd(buff, src, copy_buff_size);
		CopyWithRepMovsd(dst, buff, copy_buff_size);
        
		size -= copy_buff_size;
		src += copy_buff_size;
		dst += copy_buff_size;
	}
}

With that modification the code is about the same speed as repmovsd.
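For anyone who wants to try the fixed loop outside the benchmark, here is a minimal self-contained sketch that also handles sizes that are not a multiple of the buffer size. The names CopyWithBuffAnySize, copy_chunk and copy_with_buff_matches are hypothetical, the 4096-byte buffer is an assumption based on the "4K" mentioned above, and plain memcpy stands in for CopyWithRepMovsd so it builds without inline assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { kCopyBuffSize = 4096 };  /* assumed value; the thread only says "4K" */

/* Stand-in for CopyWithRepMovsd; plain memcpy in this sketch. */
static void copy_chunk(uint8_t *dst, const uint8_t *src, size_t size)
{
    memcpy(dst, src, size);
}

/* Copies src to dst through a small stack buffer, advancing both
   pointers each iteration. The last chunk may be shorter. */
static void CopyWithBuffAnySize(uint8_t *dst, const uint8_t *src, size_t size)
{
    uint8_t buff[kCopyBuffSize];

    while (size)
    {
        size_t n = size < (size_t)kCopyBuffSize ? size : (size_t)kCopyBuffSize;

        copy_chunk(buff, src, n);
        copy_chunk(dst, buff, n);

        size -= n;
        src += n;
        dst += n;
    }
}

/* Test helper: returns 1 if the copy is exact and does not write past
   the end, 0 otherwise, -1 on allocation failure. */
static int copy_with_buff_matches(size_t size)
{
    /* One extra byte so size 0 is a valid allocation and dst has a guard. */
    uint8_t *src = malloc(size + 1);
    uint8_t *dst = malloc(size + 1);
    if (!src || !dst) { free(src); free(dst); return -1; }

    for (size_t i = 0; i < size; i++)
        src[i] = (uint8_t)(i * 31u + (i >> 8));
    memset(dst, 0, size + 1);
    dst[size] = 0xEE;                        /* guard byte */

    CopyWithBuffAnySize(dst, src, size);

    int ok = memcmp(dst, src, size) == 0 && dst[size] == 0xEE;
    free(src);
    free(dst);
    return ok;
}
```

Since memcpy is used for the chunks, this says nothing about rep movs performance; it is only meant to show the pointer bookkeeping.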

The reason the issue was not picked up by the memcmp call is that the previous test (or the initial memcpy) had already set the destination to the correct value, so the final output was correct. Adding memset( dst, 0, kSize ); at the start of the BENCH macro makes the issue visible.
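Here is a small sketch of that effect, assuming a simplified harness rather than the real BENCH macro. copy_stuck_pointers reproduces the "pointers never advance" bug on purpose, copy_passes_check is a hypothetical checker, kChunk is an assumed 4K buffer size, and memcpy stands in for the rep movs routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { kChunk = 4096 };  /* hypothetical stand-in for copy_buff_size */

typedef void (*copy_fn)(uint8_t *dst, const uint8_t *src, size_t size);

/* Correct copy; also used to pre-fill dst like an earlier test would. */
static void copy_reference(uint8_t *dst, const uint8_t *src, size_t size)
{
    memcpy(dst, src, size);
}

/* Deliberately broken: src and dst never advance, so only the first
   chunk of dst is ever written. size must be a multiple of kChunk. */
static void copy_stuck_pointers(uint8_t *dst, const uint8_t *src, size_t size)
{
    uint8_t buff[kChunk];

    assert(size % kChunk == 0);
    while (size)
    {
        memcpy(buff, src, kChunk);
        memcpy(dst, buff, kChunk);
        size -= kChunk;
    }
}

/* Returns 1 if dst equals src after running fn, 0 if not, -1 on
   allocation failure. clear_first zeroes dst before the copy under test. */
static int copy_passes_check(copy_fn fn, int clear_first)
{
    const size_t size = 16 * (size_t)kChunk;
    uint8_t *src = malloc(size);
    uint8_t *dst = malloc(size);
    if (!src || !dst) { free(src); free(dst); return -1; }

    for (size_t i = 0; i < size; i++)
        src[i] = (uint8_t)(i * 31u + (i >> 8));

    copy_reference(dst, src, size);   /* leftover result of a previous test */
    if (clear_first)
        memset(dst, 0, size);         /* the suggested fix */

    fn(dst, src, size);

    int ok = memcmp(dst, src, size) == 0;
    free(src);
    free(dst);
    return ok;
}
```

With clear_first set to 0 the broken copy passes the check because dst still holds the earlier result; with clear_first set to 1 the same copy fails, while copy_reference passes either way.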

Oh nooo, what a shameful bug, this was really dumb. Thanks for pointing that out. I'll just delete this useless fork then.

After this change, on my computer it gets worse than plain "rep movs": it has roughly the same speed as the copy using SSE.

