I did some research and found this statement on NVIDIA's site:
"In order to make copying memory to constant buffers fast it makes sense to use _aligned_malloc() to allocate memory that is aligned to 16 byte boundaries, as this speeds up the necessary memcpy() operation from application memory to the memory returned by Map(). In a similar throw UpdateSubresource() will be able to perform the copy operation faster too."
Two questions:
1. Is this why DX11 mandates that constant buffers be 16-byte aligned?
2. Is the memcpy faster when the memory is aligned to 16 bytes because memcpy has a fast path that uses SIMD?
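
For reference, here is a minimal sketch of how I read the quoted advice. It is an illustration under assumptions, not code from NVIDIA: `PerFrameConstants`, `isAligned16`, `allocAligned16` and `uploadConstants` are names I made up, `std::aligned_alloc` (C++17) stands in for `_aligned_malloc()` so the snippet is not tied to MSVC, and `mapped` stands in for the pointer returned by `Map()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical constant-buffer layout: 64 bytes, i.e. a multiple of 16.
struct alignas(16) PerFrameConstants {
    float viewProj[16];
};

// True if p sits on a 16-byte boundary.
inline bool isAligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) % 16) == 0;
}

// Portable stand-in for _aligned_malloc(size, 16).
// std::aligned_alloc expects the size to be a multiple of the alignment,
// so the request is rounded up to the next multiple of 16.
inline void* allocAligned16(std::size_t size) {
    const std::size_t rounded = (size + 15) & ~static_cast<std::size_t>(15);
    return std::aligned_alloc(16, rounded);
}

// Copy the application-side constants into 'mapped'.
// Assumption: in real code 'mapped' would be the pData pointer obtained from Map().
inline void uploadConstants(void* mapped, const PerFrameConstants* src) {
    assert(isAligned16(src));  // the source is the memory the quote says to align
    std::memcpy(mapped, src, sizeof(PerFrameConstants));
}
```

Memory from `allocAligned16` is released with `std::free`. With MSVC the equivalent pair would be `_aligned_malloc(size, 16)` and `_aligned_free`. Whether the aligned copy is actually faster is exactly what I am asking about; the sketch only shows the setup.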