Question #1. Does the processor have to do memory accesses for every instruction that looks like a memory access when the same "memory expression" is used back to back?
Specifically, suppose I have some x64 like this (note: imul has no memory-destination form, so I've used xor as the third read-modify-write instruction):

    mov [rsp], rcx
    add [rsp], rdx
    xor [rsp], r8
Does the processor get to do something clever here and turn this into one memory load and one memory store, or does it have to do the full three memory stores and two memory loads implied by the code?
What about other cases, like where the "memory expression" is still identical but more complicated?
    mov [rsi + r9*8], rcx
    add [rsi + r9*8], rdx
    xor [rsi + r9*8], r8
What about when the "memory expression" is not identical but the addresses are? Does the CPU still have to do a memory load and wait the ~4 cycles for an L1 cache hit? Or can it use its bank of internal registers as an instant cache?
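For concreteness, the "different expression, same address" case might look like this sketch, where rax is just a hypothetical alias for the same computed address:

```asm
lea rax, [rsi + r9*8]      ; rax now holds the same address
mov [rsi + r9*8], rcx      ; store via the full expression
add [rax], rdx             ; read-modify-write via the alias
```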
Question #2. A lot of intro material introduces lea as a way to do certain kinds of combined arithmetic in a single instruction. Is there really a benefit to this? Does the instruction just get decomposed into the same micro-ops as a shift and a couple of adds? More broadly, what method can I use to answer questions like this for myself?
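To make the comparison concrete, here is the lea form next to a hypothetical manual equivalent:

```asm
lea rax, [rsi + r9*8 + 16] ; shift plus two adds in one instruction

; a hypothetical manual equivalent:
mov rax, r9
shl rax, 3                 ; r9 * 8
add rax, rsi
add rax, 16                ; (note: unlike lea, shl/add also clobber flags)
```

(For the "how do I answer this myself" part, my understanding is that per-instruction uop breakdowns are published in Agner Fog's instruction tables and at uops.info, and llvm-mca can simulate a snippet's throughput, though I can't vouch for how current any of those are for a given microarchitecture.)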
Question #3. (ABI Minutiae) On the Windows ABI, after the first four parameters, additional parameters are found on the stack. How are you supposed to know where on the stack, though?
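My current guess, from reading about the 32-byte shadow space in the Windows x64 convention, is a layout like this at function entry; please correct me if the offsets are wrong:

```asm
; at entry to the callee, before any prologue runs:
; [rsp + 0x00]  return address
; [rsp + 0x08]  home ("shadow") space for the rcx parameter
; [rsp + 0x10]  home space for rdx
; [rsp + 0x18]  home space for r8
; [rsp + 0x20]  home space for r9
mov rax, [rsp + 0x28]      ; fifth parameter
mov rdx, [rsp + 0x30]      ; sixth parameter
```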
Question #4. (ABI Minutiae) When a C function is declared with a parameter smaller than a whole register, what can we assume about the high bits of that register inside the function's implementation? For example, if a 64-bit program has a function that takes a 32-bit unsigned integer as its first parameter, is it possible to say that the high bits of the corresponding register will all be zero? Do we have to assume any garbage could be in those bits? Is this different for 16-bit or 8-bit parameters?
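As a concrete version of the question, take this hypothetical widen function:

```c
#include <stdint.h>

/* The caller passes x in the low 32 bits of a register (ecx under the
 * Windows x64 convention). The question is whether the cast below ever
 * lets the compiler skip an explicit zero-extension instruction by
 * trusting the caller to have cleared the high 32 bits. */
uint64_t widen(uint32_t x)
{
    return (uint64_t)x; /* must yield x with the high 32 bits clear */
}
```

Looking at the generated assembly for a function like this (e.g. with a compiler explorer) is one way to see what assumptions the compiler actually makes.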