Question #1. Does the processor have to do memory accesses for every instruction that looks like a memory access when the same "memory expression" is used back to back?
Specifically suppose I have some x64 like this:
mov [rsp], rcx
add [rsp], rdx
and [rsp], r8
(Note: imul has no memory-destination form, so a different read-modify-write op stands in for the third instruction.) Does the processor get to do something clever here and turn this into one memory load and one memory store, or does it have to do the full three memory stores and two memory loads implied by the code?
What about other cases, like where the "memory expression" is still identical but more complicated?
mov [rsi + r9*8], rcx
add [rsi + r9*8], rdx
and [rsi + r9*8], r8
What about when the "memory expression" is not identical but the addresses are? Does the CPU still have to do a memory load and wait the ~4 cycles for the L1 cache hit? Or can it use its bank of internal registers as an instant cache?
Question #2. A lot of intro material introduces lea as a way to do certain kinds of combined arithmetic in a single instruction. Is there really a benefit to this? Does the instruction just get decomposed into the same micro-ops as a shift and a couple of adds? More broadly, what method can I use to answer questions like this for myself?
Question #3. (ABI Minutiae) On the Windows ABI, after the first four parameters, additional parameters are found on the stack. How are you supposed to know where on the stack though?
Question #4. (ABI Minutiae) When a C function is declared with a parameter less than the size of a whole register, what can we assume about the high bits of that register inside the function's implementation? For example, suppose a 64-bit program has a function with a 32-bit unsigned integer as its first parameter. Is it possible to say that the high bits of the corresponding register will all be zero? Do we have to assume any garbage could be in those bits? Is this different for 16-bit or 8-bit parameters?
I don't know for sure, but my assumption is that it will do the memory accesses. Store-to-load forwarding means a load from a just-stored address can usually be served from the store buffer instead of waiting on L1, but the accesses still happen architecturally. Maybe that is different on newer CPUs; I haven't looked into the details of that.
It is a small optimization win: fewer instructions to generate (so better code-cache usage), more free registers (so fewer chances of register spilling), etc. lea does not decompose into individual add/shift operations because it is performed by the same piece of hardware that does address calculation for memory operands. It often executes on different units than your regular add/shift operations, so your code pipelines better with the surrounding instructions, which is a small performance win.
You can check the uops.info tables to reason about when and whether an instruction decomposes into multiple micro-ops, and check their port usage to know if it will conflict with other instructions near it. The same site hosts the uiCA simulator, which shows how instructions move through the CPU pipeline.
Not sure what you are asking; they are on top of the stack. The called function uses the rsp register to find the top of the stack: [rsp] holds the return address, above that are four 8-byte slots of shadow space (home locations for the four register parameters), then [rsp+0x28] is arg 5, [rsp+0x30] is arg 6, etc.
See the image at https://docs.microsoft.com/en-us/cpp/build/stack-usage?view=msvc-170 — the dotted line in the middle of the picture is where rsp points on entry to the call of function B.
Those bits have unspecified values; they are not guaranteed to be zero. The same goes for 16-bit and 8-bit parameters.
Here's a simple example demonstrating it: https://godbolt.org/z/s8zz8v3EE By C semantics the compiler converts y to int when passing it to x, conceptually dropping the upper 32 bits. But you can see in the assembly that the compiler simply passed the rcx register as-is, on the assumption that f will use only the lower 32 bits of it, because it operates on the int type.