The CPU reads from memory at cache line granularity. Regardless of whether you are reading a uint32 or an __m128, it will fully transfer one or two L1 cache lines before loading the actual bytes into registers.
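As a quick illustration (assuming a 64-byte line size, which is typical on x86 but not guaranteed anywhere), you can count how many lines a given access touches; an access that straddles a line boundary pulls in two lines:

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64  /* assumed line size, not a universal constant */

/* How many cache lines does an access of `size` bytes at `addr` touch? */
static unsigned lines_touched(uintptr_t addr, size_t size)
{
    uintptr_t first = addr / CACHE_LINE;
    uintptr_t last  = (addr + size - 1) / CACHE_LINE;
    return (unsigned)(last - first + 1);
}

int main(void)
{
    printf("%u\n", lines_touched(0x1000, 4));   /* aligned uint32: 1 line      */
    printf("%u\n", lines_touched(0x103E, 4));   /* straddles a boundary: 2     */
    printf("%u\n", lines_touched(0x1038, 16));  /* unaligned 16-byte load: 2   */
    return 0;
}
```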
The memory bus works differently from the CPU cache. The CPU does not need to request each 64-bit-wide data transfer from memory and wait on it; it sets up the transfer in bulk and then waits until all of it completes. Sure, for a 64-byte cache line it will need multiple data-bus-wide transfers to complete, but that is not related to the L1 cache size.

I'm not exactly a hardware engineer, but I would assume CPUs have different implementations for aligned vs. unaligned ops: maybe they use fewer micro-ops/transistors for the aligned fast path (which costs die area, so it isn't available for all alignments), and for unaligned ops they fall into a slow path where each byte/dword element is loaded individually and then combined back into the register.
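If you want to poke at the aligned vs. unaligned distinction yourself, here's a rough C sketch (not a benchmark, just showing the two SSE2 load flavors the ISA exposes; the misaligned pointer offset is my own choice for illustration):

```c
#include <emmintrin.h>   /* SSE2: _mm_load_si128 / _mm_loadu_si128 */
#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 32 bytes aligned to 16, so buf + 1 is a deliberately misaligned address. */
    alignas(16) uint8_t buf[32];
    for (int i = 0; i < 32; i++) buf[i] = (uint8_t)i;

    /* Aligned load: the address must be 16-byte aligned, eligible for the fast path. */
    __m128i a = _mm_load_si128((const __m128i *)buf);

    /* Unaligned load: works at any address, but may cross a cache line boundary
     * and has historically been slower on some microarchitectures. */
    __m128i u = _mm_loadu_si128((const __m128i *)(buf + 1));

    uint8_t out[16];
    _mm_storeu_si128((__m128i *)out, a);
    printf("aligned load, first byte:   %u\n", out[0]);
    _mm_storeu_si128((__m128i *)out, u);
    printf("unaligned load, first byte: %u\n", out[0]);
    return 0;
}
```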
Here's an answer to the same question, with multiple links to related information on how memory works:
https://stackoverflow.com/a/39186080/675078