msvc generates weird code for memcpy

https://godbolt.org/z/PYod6edac

Why does MSVC reserve 40 bytes on the stack for memcpy in copy2 when the src pointer is restrict-qualified? Those 40 bytes can't all be shadow space, because shadow space is 32 bytes, as in the copy3 function. Clang and GCC generate the code fine.

I think it is 40 because it needs to keep the stack pointer 16-byte aligned. Because the caller's call pushed the return address, rsp is only 8-byte aligned on entry. The function wants 32 bytes for shadow space, but subtracting 32 would leave rsp not 16-byte aligned. Thus it reserves 32+8=40 bytes.

But I have no idea why that is not optimized out, leaving just a tail call there.


Edited by Mārtiņš Možeiko on

But 40 isn't a multiple of 16, while 32 is? What did you mean by

"Because caller pushed return address, it is now 8-byte aligned"?

Why is there a push/pop rbx in the copy3 function? Is it part of the calling convention?


Replying to mmozeiko (#29531)

Before the call instruction the stack is 16-byte aligned. call pushes 8 bytes of return address, so rsp is no longer 16-byte aligned; it is 8 bytes off. Subtracting 40 makes it 16-byte aligned again, because 8+40=48, which is a multiple of 16.
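
The arithmetic can be checked with a small C sketch. It only models rsp as a plain integer; the helper names (is_16_aligned, rsp_after_call, rsp_after_sub) and the starting address used below are made up for illustration and are not anything MSVC provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only: rsp is treated as a plain integer. */

/* True when the address is a multiple of 16. */
static bool is_16_aligned(uint64_t rsp) {
    return (rsp % 16) == 0;
}

/* call pushes an 8-byte return address, so rsp drops by 8. */
static uint64_t rsp_after_call(uint64_t rsp_before_call) {
    assert(is_16_aligned(rsp_before_call)); /* required at the call site */
    return rsp_before_call - 8;
}

/* Models "sub rsp, n" in the callee's prolog. */
static uint64_t rsp_after_sub(uint64_t rsp, uint64_t n) {
    return rsp - n;
}
```

With a hypothetical rsp of 0x1000 before the call, rsp_after_call gives 0xFF8, which is not 16-byte aligned. rsp_after_sub(0xFF8, 40) gives 0xFD0, which is aligned, while rsp_after_sub(0xFF8, 32) gives 0xFD8, which is not.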

"Why is there a push/pop rbx in the copy3 function? Is it part of the calling convention?"

Yes. rbx is a nonvolatile register in the 64-bit Windows calling convention: https://learn.microsoft.com/en-us/cpp/build/x64-software-conventions?view=msvc-170#x64-register-usage


Edited by Mārtiņš Možeiko on
Replying to longtran2904 (#29536)

So each time there is a call instruction, a hidden 8 bytes gets pushed on the stack? And each time, a ret instruction pops the 8 bytes off the stack? How can ret know where the 8 bytes are on the stack? Also, if that is the case, why in the copy3 function does the caller only move 32 bytes?


Replying to mmozeiko (#29538)

The rsp register holds the address of the stack location that the call/ret/push/pop instructions read from or write to.

call address is equivalent to this piece of code:

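    sub rsp, 8          ; make room for the return address
    mov [rsp], next     ; store the address of the instruction after the call
    jmp address         ; transfer control to the callee
next: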

ret instruction is equivalent to this piece of code:

    mov TEMP_REGISTER, [rsp]   ; load the return address that call stored
    add rsp, 8                 ; pop it off the stack
    jmp TEMP_REGISTER          ; continue in the caller
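
To make the two expansions concrete, here is a toy model in C. It is only a sketch: the ToyCpu struct, the toy_* functions and the fake "addresses" are all invented for illustration. The point it shows is that ret finds the return address purely through rsp.

```c
#include <assert.h>
#include <stdint.h>

/* Toy machine: a small stack of 64-bit slots, a stack pointer and an
   instruction pointer. All names here are illustrative. */
enum { STACK_SLOTS = 16 };

typedef struct {
    uint64_t stack[STACK_SLOTS];
    uint64_t rsp;   /* byte offset into stack[], grows downward */
    uint64_t rip;   /* "address" of the current instruction */
} ToyCpu;

static void toy_init(ToyCpu *cpu, uint64_t start_rip) {
    cpu->rsp = sizeof cpu->stack;   /* empty stack: one past the end */
    cpu->rip = start_rip;
}

/* call target: sub rsp, 8 / mov [rsp], next / jmp target */
static void toy_call(ToyCpu *cpu, uint64_t target, uint64_t next) {
    assert(cpu->rsp >= 8);          /* otherwise the toy stack overflows */
    cpu->rsp -= 8;
    cpu->stack[cpu->rsp / 8] = next;
    cpu->rip = target;
}

/* ret: mov TEMP, [rsp] / add rsp, 8 / jmp TEMP */
static void toy_ret(ToyCpu *cpu) {
    assert(cpu->rsp < sizeof cpu->stack);   /* something must be on the stack */
    uint64_t temp = cpu->stack[cpu->rsp / 8];
    cpu->rsp += 8;
    cpu->rip = temp;
}
```

After toy_call(&cpu, 500, 105) the toy rsp is 8 lower and rip is 500; a following toy_ret puts rsp back and sets rip to 105. If the callee moved rsp in between and did not restore it, toy_ret would read the wrong slot.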

"Also, if that is the case, why in the copy3 function does the caller only move 32 bytes?"

In the copy3 function the first operation is push rbx, which restores the 16-byte alignment of the stack pointer. So the shadow space can be just the required 32 bytes, with no need for extra padding.
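
The same reasoning can be written as a small C helper. This is a sketch of the arithmetic only, not a description of how MSVC actually decides the frame size; frame_alloc_size and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest amount to subtract from rsp in the prolog so that rsp ends up
   16-byte aligned.
   pushed_regs: number of 8-byte registers pushed in the prolog
   needed:      bytes the function needs (shadow space, locals), multiple of 8 */
static uint64_t frame_alloc_size(uint64_t pushed_regs, uint64_t needed) {
    assert(needed % 8 == 0);
    uint64_t used = 8 + 8 * pushed_regs;        /* return address + pushes */
    uint64_t total = used + needed;
    uint64_t padding = (16 - total % 16) % 16;  /* 0 or 8 */
    return needed + padding;
}
```

For copy2 (no pushes, 32 bytes of shadow space) this gives frame_alloc_size(0, 32) == 40, and for copy3 (one push of rbx) it gives frame_alloc_size(1, 32) == 32.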


Replying to longtran2904 (#29544)

Ok, so if the callee allocates space on the stack, it must pop that space before returning to the caller (so that rsp points to the return address). Is that why I can specify the number of bytes to pop off in the ret instruction?


Edited by longtran2904 on
Replying to mmozeiko (#29545)

ret n was used by the stdcall/pascal x86 calling conventions in 16-bit and 32-bit code, because arguments there were always passed on the stack and the callee was required to remove them. Intel added the instruction to reduce the number of instructions required for that: ret n does both, popping the return address and the arguments off the stack.

Newer calling conventions do not do that anymore, so the ret n variant is kind of irrelevant.

If you moved rsp for whatever usage in your function (saving registers, allocating temporary space, "alloca") then you are supposed to move rsp back before ret. The calling convention describes this in more detail. For example, here it is for 64-bit Windows: https://learn.microsoft.com/en-us/cpp/build/prolog-and-epilog?view=msvc-170
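
As a rough illustration of the callee cleanup, here is the 32-bit stack pointer arithmetic in C. It is only a model: esp is a plain integer, every argument is assumed to be 4 bytes, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of 32-bit stdcall cleanup; esp is a plain integer. */

/* Caller side: push arg_count 4-byte arguments, then call
   (which pushes a 4-byte return address). */
static uint32_t esp_at_callee_entry(uint32_t esp, uint32_t arg_count) {
    assert(esp >= 4 * arg_count + 4);
    return esp - 4 * arg_count - 4;
}

/* Plain ret: pops only the 4-byte return address. */
static uint32_t esp_after_ret(uint32_t esp) {
    return esp + 4;
}

/* ret n: pops the return address, then discards n more bytes of arguments. */
static uint32_t esp_after_ret_n(uint32_t esp, uint16_t n) {
    return esp + 4 + n;
}
```

With two arguments and a hypothetical esp of 0x2000 in the caller, ret 8 brings esp back to 0x2000, while a plain ret leaves it at 0x1FF8 and the caller has to remove the 8 bytes of arguments itself.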


Edited by Mārtiņš Možeiko on
Replying to longtran2904 (#29546)

In the prolog example, why can't the compiler just do sub RSP, fixed-allocation-size? Why does it store the fixed-allocation-size in RAX first? Does the __chkstk function use RAX as an argument? And what happens if __chkstk fails?

Are "stack pointer" and "stack frame" the same thing? Because in the x64 ABI page, rsp is the stack pointer, while rbp "May be used as a frame pointer."


Replying to mmozeiko (#29547)

Yes, __chkstk expects the size argument in the rax register. If __chkstk fails, the OS raises an exception indicating that the stack cannot grow any further.

The stack pointer is the rsp register. The stack frame is a concept: it means all the data that is pushed on the stack inside a function, including the return address, arguments, saved registers, and temporary/local variables.


Edited by Mārtiņš Možeiko on
Replying to longtran2904 (#29548)

Can't I just access frame data from rsp? How does rbp come into play?


Replying to mmozeiko (#29549)

rbp is used when you do dynamic allocation on the stack. rsp has to move because:

"All memory beyond the current address of RSP is considered volatile: The OS, or a debugger, may overwrite this memory during a user debug session, or an interrupt handler." https://learn.microsoft.com/en-us/cpp/build/stack-usage?view=msvc-170

Since rsp will move inside the function, you can no longer use [rsp+32] to reference local variables. So another non-volatile register is used. It doesn't have to be rbp, but it usually is.

So then it can look like this:

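    ; prolog:
    push rbp
    sub rsp, 48
    mov rbp, rsp
    ; body:
    ...
    sub rsp, whatever           ; dynamic allocation
    mov qword ptr [rbp+32], 0   ; locals are addressed relative to rbp, not rsp
    ...
    mov rax, 0
    ; epilog:
    lea rsp, [rbp+48]           ; undo both the dynamic and the fixed allocation
    pop rbp
    ret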


Replying to longtran2904 (#29550)

You can, and compilers do, use just rsp to reference data from the stack frame.

See example here: https://godbolt.org/z/dYd6K8znG
After the call to foo() it uses rsp to access the e and f arguments. A similar thing happens with locals/temporaries too.

rbp is just an arbitrary register. You can use it for whatever. The Windows x64 ABI does not require it for anything special. "Frame pointer" or "stack base pointer" is the name of the rbp register because the bp and ebp registers were used in 16-bit and 32-bit code to point to the beginning of the stack frame. In 64-bit code it can be any register. In the Windows docs the prolog/epilog examples use the r13 register for that.

Sometimes you need to use a dedicated register, because you don't know how much the stack pointer will be advanced by. For example, when using a VLA or alloca.

Here's an example: https://godbolt.org/z/Te5WbxErs
Here the compiler decided to use rbp as the frame pointer, because in the loop the stack pointer is adjusted an unknown number of times. So after the call to foo() it does not know where rsp is relative to where it was when the function started, and it uses rbp to access the e and f arguments.
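
For readers who cannot open the link, here is a hypothetical C function of the same flavor (it is not the code behind the godbolt link). The VLA makes the size of the stack allocation a run-time value, so compilers such as GCC and Clang typically keep a frame pointer for a function like this. MSVC does not support VLAs; there the equivalent would be _alloca from <malloc.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical example: sum of the lengths of `count` strings, using a
   variable-length array as scratch space. */
static size_t sum_lengths(const char *const *words, size_t count) {
    assert(count > 0);          /* a zero-length VLA is not allowed */
    size_t lengths[count];      /* VLA: stack allocation sized at run time */
    for (size_t i = 0; i < count; i++) {
        lengths[i] = strlen(words[i]);
    }
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += lengths[i];
    }
    return total;
}
```

Compiling this with optimizations on a compiler explorer and looking for how the locals are addressed after the allocation is an easy way to see whether a frame pointer was set up.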


Edited by Mārtiņš Možeiko on
Replying to longtran2904 (#29550)

What does (e)bp do in 16/32-bit ABI that is different from rbp?


Replying to mmozeiko (#29552)

They were used as a pointer to the stack frame. In 64-bit code you can use any register.
I don't think the 16-bit or 32-bit ABIs are really officially documented for Windows, only what people have reverse engineered.

But if you omit the frame pointer, then a lot of functionality stops working, like unwinding the call stack. Debuggers start showing garbage call stacks, or profilers show wrong functions. Or nothing at all.


Replying to longtran2904 (#29554)