DirectX11 - Why Do Constant Buffers Need to Be 16-Byte Aligned

I did some research and found this statement on Nvidia's site:

"In order to make copying memory to constant buffers fast it makes sense to use _aligned_malloc() to allocate memory that is aligned to 16 byte boundaries, as this speeds up the necessary memcpy() operation from application memory to the memory returned by Map(). In a similar throw UpdateSubresource() will be able to perform the copy operation faster too."

2 questions:
1. Is this why DX11 mandates that constant buffers be 16-byte aligned?
2. Is the memcpy perhaps faster when the memory is aligned to 16 bytes, because memcpy has a fast path that uses SIMD?
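
For reference, the pattern the article describes is roughly this (just a sketch; I'm assuming ctx is the immediate context and cb is a constant buffer created with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE, error handling omitted, headers <d3d11.h>, <malloc.h> and <string.h>):

struct PerFrame            // example layout
{
    float mvp[16];         // 4x4 matrix
    float tint[4];
};

// CPU-side staging copy, 16-byte aligned as the article suggests
PerFrame* cpuData = (PerFrame*)_aligned_malloc(sizeof(PerFrame), 16);

// ... fill cpuData every frame, then copy it into the constant buffer:
D3D11_MAPPED_SUBRESOURCE mapped;
if (SUCCEEDED(ctx->Map(cb, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
{
    memcpy(mapped.pData, cpuData, sizeof(PerFrame)); // both source and destination are 16-byte aligned
    ctx->Unmap(cb, 0);
}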


Edited by Draos on Reason: Initial post
Yes, they want to use aligned SSE store/load operations. On older CPUs these were significantly faster than unaligned SSE store/load operations. Nowadays the difference is much smaller, a few percent in most cases.
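
For example, with SSE intrinsics the distinction is explicit: _mm_load_ps/_mm_store_ps compile to aligned loads/stores (movaps) and require a 16-byte aligned address, while _mm_loadu_ps/_mm_storeu_ps (movups) accept any address. A quick sketch:

#include <xmmintrin.h> // SSE intrinsics

__m128 add_example(void)
{
    alignas(16) float a[4] = { 1, 2, 3, 4 }; // 16-byte aligned storage
    float u[5] = { 0, 1, 2, 3, 4 };

    __m128 va = _mm_load_ps(a);      // aligned load (movaps): address must be 16-byte aligned
    __m128 vu = _mm_loadu_ps(u + 1); // unaligned load (movups): any address works, may be slower
    return _mm_add_ps(va, vu);
}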

Edited by Mārtiņš Možeiko on
sort of unrelated, but I was wondering: why do SIMD m128s need to be aligned to 16 bytes? If the data bus for x64 is 8 bytes wide, why would we need 16-byte alignment? Because if we align the m128 to, say, 8 bytes, won't it take 2 reads from memory anyway, given the width of the data bus? Is there a good performance reason why we align SIMD data to 16 bytes despite the width of the data bus?
There are several layers of caching between the data bus to RAM and the SSE registers.

These caches can have (and do have) larger alignment granularity. This means it's cheaper to load data that is aligned within a single cache line (reading just one line) than to read from 2 cache lines when the data is unaligned.
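
Concretely, whether a 16-byte load touches one cache line or two depends only on where it starts inside its 64-byte line (a sketch, assuming 64-byte cache lines):

#include <cstdint>

// How many 64-byte cache lines does a 16-byte load starting at 'addr' touch?
int cache_lines_touched(const void* addr)
{
    std::uintptr_t offset = (std::uintptr_t)addr & 63; // byte position inside its cache line
    return (offset + 16 > 64) ? 2 : 1;                 // it straddles only if it starts past byte 48
}
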
so for example, if we had 32-byte cache lines (just for the sake of example) and the m128 was only 8-byte aligned rather than 16-byte aligned, we could get:

Line 1: 00000000000000000000000011111111
Line 2: 11111111000000000000000000000000

where the 1s represent the SIMD vector, meaning that in order to fetch the data we need to look at 2 cache lines. Right?

my question is, given this situation, don't I need to do at least 2 reads from the cache regardless of whether the data is on the same line or not, since the data bus is only 8 bytes wide and I need to read 16 bytes from the cache?

is it just a matter of having to figure out which line the data is located in?
You seem to be mixing bytes and bits together here.
An m128 is 128 bits, which is 16 bytes (two 64-bit values, four 32-bit values...).
A cache line is 64 bytes (on x64), which is four m128s.
If an m128 (16 bytes) is aligned to a 16-byte address, it can't cross a cache line boundary.
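
To spell out the arithmetic (assuming 64-byte lines):

// A 16-byte aligned address can only sit at byte offset 0, 16, 32 or 48 within
// its 64-byte cache line; the worst case ends at 48 + 16 = 64, exactly at the
// line boundary, so the load never spills into the next line.
static_assert(48 + 16 <= 64, "a 16-byte aligned m128 stays within one 64-byte line");
// An address that is only 8-byte aligned can also sit at offset 56, and
// 56 + 16 = 72 > 64, which is how an m128 ends up straddling two lines.
static_assert(56 + 16 > 64, "an 8-byte aligned m128 can straddle two lines");
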
I also don't think it would even be safe, for example, to change the size of cache lines. The architecture was made to work with those sizes, and changing them would probably mean changing the architecture.
mrmixer


I also don't think it would even be safe, for example, to change the size of cache lines. The architecture was made to work with those sizes, and changing them would probably mean changing the architecture.


That's what a new CPU generation is for.
hmmm, I wasn't really conflating bits and bytes, but I guess my post was unclear. What I was trying to say is this:

the data bus on x64 is 8 bytes (64 bits) wide
an m128 is 16 bytes (128 bits)

a cache line in the L1 cache on x64 is 64 bytes, so if an m128 is 16-byte aligned it fits entirely within 1 cache line; I agree with this.

my question is: if we can only transfer 8 bytes at a time (due to the width of the data bus), and an m128 is 16 bytes (double the width of the data bus), then irrespective of whether the m128 sits entirely within 1 cache line, doesn't this mean I have to do 2 reads from the cache anyway?

for example, say the m128 straddles 2 cache lines, call them Line A and Line B, where 8 bytes of the m128 are on Line A and 8 bytes are on Line B. In this scenario we need to read 8 bytes from A and 8 bytes from B.

but if the m128 is entirely on Line A, don't we need to do 2 reads anyway, given we can only read 8 bytes at a time?

I'm almost certainly wrong here, but I just wanted to clarify why I am confused, and hopefully remedy that misunderstanding.

Edited by Draos on
The CPU reads from memory at cache line granularity. Regardless of whether you are reading a uint32 or an m128, it will fully transfer one or two L1 cache lines before loading the actual bytes into registers.

The memory bus works differently from the CPU cache. The CPU does not need to request each 64-bit-wide data transfer from memory and wait on it; it sets up the transfer in bulk and then waits until all of it completes. Sure, for a 64-byte cache line it will need multiple data-bus-wide transfers to complete, but this is not related to the L1 cache line size. I'm not exactly a hardware engineer, but I would assume they have different implementations in the CPU for aligned vs unaligned ops; maybe they use fewer micro-ops/transistors for the aligned fast path (which costs die space, so it's not available for all alignments), and for unaligned ops they go into a slow path where they need to load each byte/dword element individually and then combine them back into the register.
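
Just to illustrate the "load the pieces and combine them" idea in software: before CPUs had fast unaligned loads, older SSE optimization advice was to do two aligned loads and stitch the result together with palignr. A rough sketch with a fixed 4-byte offset (illustration only; on modern CPUs _mm_loadu_si128 is the right thing to use):

#include <tmmintrin.h> // SSSE3, for _mm_alignr_epi8

// Load the 16 bytes that start 4 bytes past a 16-byte boundary by combining
// two aligned loads. Assumes 'base' is 16-byte aligned and at least 32 bytes
// of it are readable.
static __m128i load_at_offset_4(const unsigned char* base)
{
    __m128i lo = _mm_load_si128((const __m128i*)base);        // aligned chunk, bytes 0..15
    __m128i hi = _mm_load_si128((const __m128i*)(base + 16)); // aligned chunk, bytes 16..31
    return _mm_alignr_epi8(hi, lo, 4);                        // bytes 4..19 of the concatenation
}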

Here's an answer to the same question, with multiple links to related information on how memory works: https://stackoverflow.com/a/39186080/675078

Edited by Mārtiņš Možeiko on