https://godbolt.org/z/PYod6edac
Why does MSVC decide to reserve 40 bytes on the stack for memcpy in copy2 when the src pointer is restricted? Those 40 bytes can't be shadow space, because shadow space is 32 bytes of stack, as in the copy3 function. Clang and GCC generate the code fine.
I think it is 40 because it needs to keep the stack pointer 16-byte aligned. Because caller pushed return address, it is now 8-byte aligned. So it wants 32 bytes for shadow space, but that alone would leave the stack pointer not 16-byte aligned. Thus it reserves 32+8=40 bytes.
But I have no idea why that is not optimized out, leaving just a tail call there.
But 40 isn't a multiple of 16, while 32 is? What did you mean by "Because caller pushed return address, it is now 8-byte aligned"?
Why is there a push/pop rbx in the copy3 function? Is it part of the calling convention?
Before the call instruction the stack is 16-byte aligned. The call pushes 8 bytes of return address, so rsp is not 16-byte aligned anymore; it is 8 bytes off. Subtracting 40 makes it 16-byte aligned again, because 8+40=48, which is a multiple of 16.
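To make the arithmetic concrete, here is a minimal Python sketch of it. The function name and its parameters are made up for illustration; it only models the numbers discussed above (8 bytes of return address, 32 bytes of shadow space, 16-byte alignment) and says nothing about how MSVC actually picks the size.

```python
SHADOW_SPACE = 32    # bytes the Windows x64 convention reserves for the callee
RETURN_ADDRESS = 8   # bytes pushed by the call instruction


def stack_allocation(pushed_bytes=0, needed=SHADOW_SPACE):
    """Smallest `sub rsp, N` with N >= needed that leaves rsp 16-byte aligned.

    pushed_bytes counts what the prolog pushed before the sub (8 per register).
    """
    used = RETURN_ADDRESS + pushed_bytes  # distance below the aligned pre-call rsp
    padding = -(used + needed) % 16       # extra bytes to reach a multiple of 16
    return needed + padding


if __name__ == "__main__":
    print(stack_allocation())                # no pushes in the prolog -> 40
    print(stack_allocation(pushed_bytes=8))  # one push (e.g. push rbx) -> 32
```

With no pushes the result is 40, and with a single 8-byte push in front of it the plain 32 bytes are already enough.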
Why is there a push/pop rbx in the copy3 function? Is it part of the calling convention?
Yes. rbx is a nonvolatile (callee-saved) register in the 64-bit Windows calling convention, so a function that uses it has to save and restore it: https://learn.microsoft.com/en-us/cpp/build/x64-software-conventions?view=msvc-170#x64-register-usage
So each time there is a call instruction, a hidden 8 bytes gets pushed on the stack? And each time a ret instruction pops the 8 bytes off the stack? How can ret know where the 8 bytes are on the stack? Also, if that is the case, why in the copy3 function does the caller only move 32 bytes?
The rsp register holds the address of the top of the stack, which is where the call/ret/push/pop instructions read and write.
call address is equivalent to this piece of code:
sub rsp, 8
mov [rsp], next
jmp address
next:
The ret instruction is equivalent to this piece of code:
mov TEMP_REGISTER, [rsp]
add rsp, 8
jmp TEMP_REGISTER
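If it helps, the same two expansions can be played through in a toy Python model. The Machine class and the addresses are invented for illustration; memory is just a dictionary of 8-byte slots, so this is a sketch of the bookkeeping, not of real hardware.

```python
class Machine:
    """Toy model: rsp, rip and a sparse memory of 8-byte stack slots."""

    def __init__(self, rsp=0x1000, rip=0):
        self.rsp = rsp
        self.rip = rip
        self.memory = {}

    def call(self, target, next_rip):
        self.rsp -= 8                     # sub rsp, 8
        self.memory[self.rsp] = next_rip  # mov [rsp], next
        self.rip = target                 # jmp address

    def ret(self):
        temp = self.memory[self.rsp]      # mov TEMP_REGISTER, [rsp]
        self.rsp += 8                     # add rsp, 8
        self.rip = temp                   # jmp TEMP_REGISTER


if __name__ == "__main__":
    m = Machine(rsp=0x1000, rip=0x400)
    m.call(target=0x900, next_rip=0x405)
    print(hex(m.rsp), hex(m.rip))  # 0xff8 0x900
    m.ret()
    print(hex(m.rsp), hex(m.rip))  # 0x1000 0x405
```

So ret does not need to search for anything: it reads whatever rsp points at, which is why rsp must be back at the return address when ret executes.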
Also, if that is the case, why in the copy3 function does the caller only move 32 bytes?
In the copy3 function the first operation is push rbx, which moves rsp by another 8 bytes and so restores 16-byte alignment. So the stack allocation can be just the required 32 bytes of shadow space, with no need for extra padding.
Ok, so if the callee allocates space on the stack, it must pop that space before returning to the caller (so that rsp points to the return address). Is that why I can specify the number of bytes to pop off in the ret instruction?
ret n was used in the stdcall/pascal x86 calling conventions in 16-bit and 32-bit code, because arguments there were always passed on the stack and the callee was required to remove them. Intel provided ret n to reduce the number of instructions required for that: it pops the return address and then the arguments off the stack.
Newer calling conventions do not do that anymore, so the ret n variant is mostly irrelevant.
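As a rough illustration of that callee-cleanup scheme, here is a small Python model of a 32-bit stdcall-style call with two arguments. The helper names, addresses and values are hypothetical; only the stack-pointer arithmetic is the point.

```python
SLOT = 4  # bytes per stack slot in 32-bit code


def push(esp, memory, value):
    """Model of push: move esp down one slot and store the value there."""
    esp -= SLOT
    memory[esp] = value
    return esp


def ret_n(esp, memory, n=0):
    """Model of `ret n`: pop the return address, then drop n more bytes.

    With n=0 this behaves like a plain ret.
    """
    return_address = memory[esp]
    esp += SLOT + n
    return esp, return_address


if __name__ == "__main__":
    memory = {}
    esp = start = 0x2000
    esp = push(esp, memory, 20)            # second argument
    esp = push(esp, memory, 10)            # first argument
    esp = push(esp, memory, 0x401005)      # what call does: push the return address
    esp, target = ret_n(esp, memory, n=8)  # callee returns with `ret 8`
    print(esp == start, hex(target))       # True 0x401005
```

After `ret 8` the stack pointer is back where it was before the caller pushed the arguments, so the caller has nothing left to clean up.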
If you moved rsp for any reason in your function (saving registers, allocating temporary space, "alloca"), then you are supposed to move rsp back before ret. The calling convention describes this in more detail. For example, here it is for 64-bit Windows: https://learn.microsoft.com/en-us/cpp/build/prolog-and-epilog?view=msvc-170
In the prolog example, why can't the compiler just do sub RSP, fixed-allocation-size? Why does it store the fixed-allocation-size in RAX first? Does the __chkstk function use RAX as an argument? And what happens if __chkstk fails?
Are "stack pointer" and "stack frame" the same thing? Because in the x64 ABI page, rsp is the stack pointer, while rbp "May be used as a frame pointer."
Yes, __chkstk expects the allocation size as its argument in the rax register. If __chkstk fails, the OS raises an exception saying that the stack cannot grow any further.
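As for why a plain sub is not always enough: as far as I understand it, the stack grows one guard page at a time, so a large allocation has to touch every page on the way down instead of jumping over them. Below is a simplified Python model of which pages such a probe would touch. The 4096-byte page size, the function name and the addresses are assumptions for illustration, and this is not the actual __chkstk implementation.

```python
PAGE_SIZE = 4096  # assumed page size


def pages_to_probe(rsp, size):
    """Base addresses of the pages below the current one that an allocation
    of `size` bytes reaches into, in the order a downward probe visits them."""
    new_rsp = rsp - size
    page = (rsp // PAGE_SIZE) * PAGE_SIZE  # page containing the current rsp
    pages = []
    while page > new_rsp:                  # the allocation extends below this page
        page -= PAGE_SIZE
        pages.append(page)
    return pages


if __name__ == "__main__":
    # A 10 KB allocation reaches into three lower pages.
    print([hex(p) for p in pages_to_probe(0x10000, 0x2800)])  # ['0xf000', '0xe000', '0xd000']
    # A small allocation that stays inside the current page needs no probe.
    print(pages_to_probe(0x10800, 0x100))                      # []
```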
The stack pointer is the rsp register. A stack frame is a concept: it means all the data that a function has on the stack, which includes the return address, arguments, saved registers, and temporary/local variables.
rbp is used when you do dynamic allocation on the stack. rsp has to move in that case because:
"All memory beyond the current address of RSP is considered volatile: The OS, or a debugger, may overwrite this memory during a user debug session, or an interrupt handler." https://learn.microsoft.com/en-us/cpp/build/stack-usage?view=msvc-170
Since rsp will move inside the function, you can no longer use a fixed offset like [rsp+32] to reference local variables. So another nonvolatile register is used. It doesn't have to be rbp, but it usually is.
So then it can look like this:
; prolog:
push rbp
sub rsp, 48
mov rbp, rsp
; body:
...
sub rsp, whatever ; dynamic allocation
mov qword ptr [rbp+32], 0 ; store to a local through the frame pointer
...
mov rax, 0
; epilog:
lea rsp, [rbp+48]
pop rbp
ret
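To check the bookkeeping in that listing, here is a small Python trace of it. The function name, the entry address and the placeholder value for the caller's rbp are invented for illustration; the steps mirror the prolog, body and epilog above.

```python
CALLER_RBP = 0xAAAA  # arbitrary placeholder for the caller's rbp value


def run_frame(entry_rsp, dynamic_sizes):
    """Trace the listing; entry_rsp points at the return address on entry."""
    memory = {}
    rsp = entry_rsp
    rbp = CALLER_RBP

    # prolog
    rsp -= 8
    memory[rsp] = rbp         # push rbp
    rsp -= 48                 # sub rsp, 48
    rbp = rsp                 # mov rbp, rsp

    # body: rsp moves with every dynamic allocation, rbp does not
    for size in dynamic_sizes:
        rsp -= size           # sub rsp, whatever
    local_address = rbp + 32  # address used by [rbp+32]

    # epilog
    rsp = rbp + 48            # lea rsp, [rbp+48]
    rbp = memory[rsp]         # pop rbp (read the saved value...)
    rsp += 8                  # ...and release its slot)
    return rsp, rbp, local_address


if __name__ == "__main__":
    for sizes in ([], [16], [64, 128, 32]):
        rsp, rbp, local = run_frame(0x1000, sizes)
        print(hex(rsp), hex(rbp), hex(local))  # 0x1000 0xaaaa 0xfe8 every time
```

However much the body allocates, the epilog brings rsp back to the return address and the local stays at the same address.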
You can use just rsp to reference data in the stack frame, and compilers do.
See the example here: https://godbolt.org/z/dYd6K8znG
After the call to foo() it uses rsp to access the e and f arguments. A similar thing will happen with locals/temporaries too.
rbp is just an arbitrary register. You can use it for whatever you want; the Windows x64 ABI does not require it for anything special. "Frame pointer" or "stack base pointer" is the name of the rbp register because the bp and ebp registers were used in 16-bit and 32-bit code to point to the beginning of the stack frame. In 64-bit code it can be any register. In the Windows docs, the prolog/epilog examples use the r13 register for that.
Sometimes you need to use a dedicated register, because you don't know how far the stack pointer will be moved. For example, when using a VLA or alloca.
Here's an example: https://godbolt.org/z/Te5WbxErs
Here the compiler decided to use rbp as the frame pointer, because in the loop the stack pointer is adjusted an unknown number of times. So after the call to foo() it does not know where rsp is relative to where it was when the function started. So it uses rbp for accessing the e and f arguments.
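A rough Python sketch of why the offset from rsp stops being a compile-time constant in that situation. The frame size of 48, the entry address and the rounding of each allocation up to 16 bytes are assumptions for illustration, not what the compiler in the link necessarily emits.

```python
def offsets_to_argument(alloca_sizes, frame_size=48):
    """Return (offset from rsp, offset from rbp) of the fifth argument
    after a number of dynamic stack allocations."""
    entry_rsp = 0x10000               # rsp on entry, pointing at the return address
    arg_address = entry_rsp + 40      # above the return address (8) and shadow space (32)
    rsp = entry_rsp - 8 - frame_size  # push rbp, then sub rsp, frame_size
    rbp = rsp                         # mov rbp, rsp
    for size in alloca_sizes:
        rsp -= (size + 15) & ~15      # each allocation rounded up to 16 bytes (assumed)
    return arg_address - rsp, arg_address - rbp


if __name__ == "__main__":
    print(offsets_to_argument([]))         # (96, 96)
    print(offsets_to_argument([10]))       # (112, 96)
    print(offsets_to_argument([10, 100]))  # (224, 96)
```

The offset from rbp is the same in every case, while the offset from rsp depends on sizes that are only known at run time.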
They were used as a pointer to the stack frame. In 64-bit code you can use any register.
I don't think the 16-bit or 32-bit ABIs are really officially documented for Windows, only what people have reverse engineered.
But if you omit the frame pointer, then a lot of functionality stops working, like unwinding the call stack. Debuggers start showing garbage call stacks, and profilers show the wrong functions or nothing at all.