Performance of Intel Intrinsics vs. ARM Intrinsics

Flyingsand

#23401

September 12, 2020

Not long ago, due mainly to curiosity, I began porting my hobby game to iOS. Until then I supported macOS and Windows. And I will say, overall the port to iOS went very smoothly, and I was pleasantly surprised at how quickly I was able to get it up and running on my iPhone (thanks mainly to a sane architecture I learned through HH and the Handmade Network, as well as the relative ease with which Swift interoperates with C). One thing I've noticed though is that performance on iOS is much less consistent than on macOS.

Part of the porting work involved writing ARM NEON equivalents for all my Intel SSE rendering code (I'm sticking with a software renderer for this project). I have used NEON before in other projects, but I'm still surprised at how it has far fewer intrinsics available than Intel does. Plus, NEON registers are only 128 bits wide, whereas on macOS & Windows I'm using AVX2, which has 256 bit registers and allows for 8-wide operations. So that's certainly a factor as far as performance goes, but I also noticed much more variance in the performance with NEON. Here is a graph comparing the frame times of a simple scene (clearing the buffer and rendering 2 entities; one plain triangle and one with a texture):

Graph

The wild variation on iOS has me a little concerned about hitting consistent frame rates. I'm not sure why there is such inconsistency on the ARM side. Has anyone encountered this before? Is it a characteristic of mobile ARM processors? Most other games do their rendering on the GPU of course, so maybe that offers much more stability.

As far as porting my Intel intrinsics code to ARM NEON, it's pretty much a straight line-for-line mapping of the Intel intrinsic function to its equivalent NEON function (when possible). As an example, here are the functions for clearing the buffer:

void
_clear_buffer_neon(fsColorf color, bbRenderContext *context, fsu32 scanline) {
    bbOffscreenBuffer *colorBuffer = context->colorBuffer;
    
    fsi32 minX = 0;
    fsi32 minY = 0;
    fsi32 maxX = colorBuffer->width;
    fsi32 maxY = colorBuffer->height;
    
    fsu32 scanlineTotal = context->scanlineTotal;
    _fix_minY(&minY, scanline, scanlineTotal);
    
    color *= 255.f;
    fsu32 clearPixel = ((fsu32)color.r << 0) | ((fsu32)color.g << 8) | ((fsu32)color.b << 16) | ((fsu32)color.a << 24);
    uint32x4_t clearColor_4 = vdupq_n_u32(clearPixel);
    
    uint32x4_t rowStartClipMask_4, rowEndClipMask_4;
    _get_clip_masks_aligned_neon(&minX, &maxX, &rowStartClipMask_4, &rowEndClipMask_4);
    
    for (fsi32 y = minY; y < maxY; y += scanlineTotal) {
        intptr_t offset = (intptr_t)(y * colorBuffer->width + minX);
        fsu32 *pixel = colorBuffer->buffer + offset;
        
        // NOTE: Write out first "column" of the current scanline using the row start clip mask.
        uint32x4_t pixel_4 = vld1q_u32(pixel);
        uint32x4_t maskedOut_4 = vorrq_u32(vandq_u32(rowStartClipMask_4, clearColor_4), vandq_u32(vmvnq_u32(rowStartClipMask_4), pixel_4));
        vst1q_u32(pixel, maskedOut_4);
        pixel += 4;
        
        for (fsi32 x = minX + 4; x < maxX - 4; x+=4) {
            vst1q_u32(pixel, clearColor_4);
            pixel += 4;
        }
        
        // NOTE: Write out last "column" of the current scanline using the row end clip mask.
        pixel_4 = vld1q_u32(pixel);
        maskedOut_4 = vorrq_u32(vandq_u32(rowEndClipMask_4, clearColor_4), vandq_u32(vmvnq_u32(rowEndClipMask_4), pixel_4));
        vst1q_u32(pixel, maskedOut_4);
    }
}


void
_clear_buffer_avx2(fsColorf color, bbRenderContext *context, fsu32 scanline) {
    bbOffscreenBuffer *colorBuffer = context->colorBuffer;
    
    fsi32 minX = 0;
    fsi32 minY = 0;
    fsi32 maxX = colorBuffer->width;
    fsi32 maxY = colorBuffer->height;
    
    fsu32 scanlineTotal = context->scanlineTotal;
    _fix_minY(&minY, scanline, scanlineTotal);
    
    color *= 255.f;
    fsu32 clearPixel = ((fsu32)color.r << 0) | ((fsu32)color.g << 8) | ((fsu32)color.b << 16) | ((fsu32)color.a << 24);
    __m256i clearColor_8 = _mm256_set1_epi32(clearPixel);
    
    __m256i rowStartClipMask_8, rowEndClipMask_8;
    _get_clip_masks_aligned_avx2(&minX, &maxX, &rowStartClipMask_8, &rowEndClipMask_8);
    
    for (fsi32 y = minY; y < maxY; y += scanlineTotal) {
        intptr_t offset = (intptr_t)(y * colorBuffer->width + minX);
        fsu32 *pixel = colorBuffer->buffer + offset;
        
        // NOTE: Write out first "column" of the current scanline using the row start clip mask.
        __m256i pixel_8 = _mm256_load_si256((__m256i *)pixel);
        __m256i maskedOut_8 = _mm256_or_si256(_mm256_and_si256(rowStartClipMask_8, clearColor_8), _mm256_andnot_si256(rowStartClipMask_8, pixel_8));
        _mm256_store_si256((__m256i *)pixel, maskedOut_8);
        pixel += 8;
        
        for (fsi32 x = minX + 8; x < maxX - 8; x+=8) {
            _mm256_store_si256((__m256i *)pixel, clearColor_8);
            pixel += 8;
        }
        
        // NOTE: Write out last "column" of the current scanline using the row end clip mask.
        pixel_8 = _mm256_load_si256((__m256i *)pixel);
        maskedOut_8 = _mm256_or_si256(_mm256_and_si256(rowEndClipMask_8, clearColor_8), _mm256_andnot_si256(rowEndClipMask_8, pixel_8));
        _mm256_store_si256((__m256i *)pixel, maskedOut_8);
    }
}

Multithreading is handled by scanlines in my renderer. And come to think of it, I wonder if threading is somehow to blame for the large variation in frame times on iOS? I'm testing this on an iPhone 8, which has 6 processors, although I use 4 threads in the thread pool for rendering.

Ultimately, I suppose I was expecting slightly better performance than what I'm seeing, considering the iPhone 8 is reasonably powerful. It's not doing 8-wide operations like on macOS & Windows, but there are also far fewer pixels to fill (667x375 on iOS vs. 1440x960 on macOS/Windows). And this test was done using a very simple scene as well: clear the screen, draw a solid triangle, texture-map a quad. I'm not sure how much more I can optimize the renderer on iOS, so I'm wondering if I've hit the performance limit, or am I missing something?

Edited by Flyingsand on September 12, 2020, 5:30pm Reason: Initial post

Mārtiņš Možeiko

#23403

September 12, 2020

How does the NEON code perform compared to plain C code? If you get at least a 2x speedup, that's good.
ARM CPUs are much lower performance than x86 ones. They are optimized for power usage, not performance, so you cannot expect the same behavior as on x86.

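For your code you could try doing four 128-bit stores per inner-loop iteration with the vst4q_u32 intrinsic. The uint32x4x4_t type basically represents four NEON registers, and there's a special store instruction that stores all four of them at once. (vst4q_u32 interleaves the four registers as it stores them, which makes no difference here because every lane holds the same clear color; vst1q_u32_x4 is the non-interleaving variant.)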

Another thing - instead of doing and(not(mask), x) you can do vbicq_u32(x, mask), which computes x & ~mask (the operand to be inverted goes second).

Also try running it single threaded to see how variation looks then.
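
The two NEON suggestions above could look roughly like the sketch below. The helper names masked_store4 and fill_u32 are made up for illustration and are not from the original renderer, and the scalar branch exists only so the snippet also builds on non-ARM machines. It assumes the fill count is a multiple of 16 pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: write `color` where the mask bits are set and keep
   the existing pixel where they are clear. */
void masked_store4(uint32_t *pixel, const uint32_t mask[4], uint32_t color)
{
#if HAVE_NEON
    uint32x4_t m = vld1q_u32(mask);
    uint32x4_t c = vdupq_n_u32(color);
    uint32x4_t p = vld1q_u32(pixel);
    /* vbicq_u32(a, b) computes a & ~b, so the value to keep goes first.
       vbslq_u32(m, c, p) would do the whole select in one intrinsic. */
    uint32x4_t blended = vorrq_u32(vandq_u32(m, c), vbicq_u32(p, m));
    vst1q_u32(pixel, blended);
#else
    for (int i = 0; i < 4; ++i) {
        pixel[i] = (mask[i] & color) | (pixel[i] & ~mask[i]);
    }
#endif
}

/* Hypothetical helper: fill `count` pixels, 16 per iteration. */
void fill_u32(uint32_t *dst, size_t count, uint32_t color)
{
    assert(count % 16 == 0);
#if HAVE_NEON
    uint32x4_t c = vdupq_n_u32(color);
    uint32x4x4_t c4;
    c4.val[0] = c;
    c4.val[1] = c;
    c4.val[2] = c;
    c4.val[3] = c;
    for (size_t i = 0; i < count; i += 16) {
        /* Interleaving store; harmless because all lanes are identical. */
        vst4q_u32(dst + i, c4);
    }
#else
    for (size_t i = 0; i < count; ++i) {
        dst[i] = color;
    }
#endif
}
```

Whether the 4-register store is actually faster on a given chip is something only a profile can answer; the sketch just shows the shape of the change.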

Edited by Mārtiņš Možeiko on September 12, 2020, 10:33pm

Dawoodoz

#23405

September 12, 2020

I usually keep one pointer iterating over rows by adding a stride after each row. Then I assign the row pointer to the pointer iterating pixels. You can also replace indices with direct pointers, but that makes it harder to debug.

Replacing all multiplications with additions instead of letting the compiler do the optimization just makes it easier to count cycles when scheduling memory loads. From the instruction loading memory to the first instruction reading the target register, you have around 20 to 200 cycles to do additional math for free.

Using stride instead of width for iterating between rows allows aligning the memory of images whose width is not evenly divisible by the vector size. This reduces the need for masking, since you can simply overwrite padding bytes in images that own them. Stride can also be used to re-use memory allocations for tile sets using sub-images that have the same stride but different dimensions. Sub-images must not overwrite padding, however, if those pixels are visible in a parent image.

And sometimes a simple memset is faster than anything SIMD can do, by using planar image formats or only resetting to byte-uniform gray-scale values.
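
A minimal sketch of the row-pointer, stride, and memset ideas above, in plain C. The Image struct and clear_image function are hypothetical names, and the sketch assumes the image owns its padding, so writing past width up to stride is safe.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical image description, not the renderer's own type. */
typedef struct {
    uint32_t *pixels; /* first pixel of the first row */
    int width;        /* visible pixels per row */
    int height;       /* number of rows */
    int stride;       /* allocated pixels per row, >= width */
} Image;

void clear_image(Image *img, uint32_t color)
{
    assert(img->stride >= img->width);

    /* If all four bytes of the color are equal, one memset covers every
       row including the padding. */
    uint8_t b = (uint8_t)color;
    if (color == b * 0x01010101u) {
        memset(img->pixels, b,
               (size_t)img->stride * img->height * sizeof(uint32_t));
        return;
    }

    /* Otherwise walk rows by adding the stride; no per-row multiply. */
    uint32_t *row = img->pixels;
    for (int y = 0; y < img->height; ++y) {
        uint32_t *pixel = row;
        uint32_t *end = row + img->stride; /* padding is overwritten too */
        while (pixel < end) {
            *pixel++ = color;
        }
        row += img->stride;
    }
}
```

Because the padding is overwritten, no start or end clip mask is needed in this variant; a sub-image that shares memory with a parent would have to stop at width instead.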

Flyingsand

#23407

September 13, 2020

mmozeiko
How does the NEON code perform compared to plain C code? If you get at least a 2x speedup, that's good.
ARM CPUs are much lower performance than x86 ones. They are optimized for power usage, not performance, so you cannot expect the same behavior as on x86.

On average, it's about a 1.75x speedup over C code. As you point out, however, maybe there are some additional opportunities for more optimization. When profiling it, it is actually the store instruction that has the most hits, so if that can be improved with the uint32x4x4 store, that will help. And yeah, that makes sense that ARM processors are optimized more for power usage over performance. Though it will be interesting to see how the ARM chip Apple will be using for macOS will end up performing on their desktop/laptop computers.

mmozeiko

Another thing - instead of doing and(not(mask), x) you can do vbicq_u32(x, mask), which computes x & ~mask (the operand to be inverted goes second).
Also try running it single threaded to see how variation looks then.

Yes. Good suggestions. I will be trying that! Thanks.

Flyingsand

#23408

September 13, 2020

Dawoodoz
I usually keep one pointer iterating over rows by adding a stride after each row. Then I assign the row pointer to the pointer iterating pixels. You can also replace indices with direct pointers, but that makes it harder to debug.

It's funny, I usually do this too, but it's been years since I wrote the original Intel intrinsics code, so I can't remember why or if there was a reason I didn't do that. Definitely worth a try to see if there is any speed gain. I'll take whatever I can get. :)

Dawoodoz

Replacing all multiplications with additions instead of letting the compiler do the optimization just makes it easier to count cycles when scheduling memory loads. From the instruction loading memory to the first instruction reading the target register, you have around 20 to 200 cycles to do additional math for free.

Using stride instead of width for iterating between rows allows aligning the memory of images whose width is not evenly divisible by the vector size. This reduces the need for masking, since you can simply overwrite padding bytes in images that own them. Stride can also be used to re-use memory allocations for tile sets using sub-images that have the same stride but different dimensions. Sub-images must not overwrite padding, however, if those pixels are visible in a parent image.

And sometimes a simple memset is faster than anything SIMD can do, by using planar image formats or only resetting to byte-uniform gray-scale values.

Some good stuff to try here, thanks. It's actually the texture mapping that is taking up the most time. If I remove the texture and just render the quad as a solid color, the frame time is reduced by roughly 3x. Fortunately, I don't use many textures in my game so far, so this isn't a huge issue. Mostly it's for rendering text right now. Although I would like to use at least some textures for light particle effects at some point.

Miles

#23416

September 13, 2020

Two things:
1. You could try rendering with a single thread to see if it cuts down on variance. If it makes a big difference, you might want to make threading more granular. If not, you at least know that's not the problem.
2. Going by the graph, it looks like the ARM code has more variance, but I'm not sure if it has that much more variance. I think it's partly just that it's stretched out a lot more on the graph. If I stretch out the x86 measurements to take up the whole height of the graph, you can see it has almost as much relative variance:

Interestingly, the ARM variance has a visibly regular pattern to it that isn't apparent on x86, which hints to me that there may be something going on there that can be fixed. Without knowing more about exactly what and how you're measuring, that's about all I could glean.
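
One possible reading of "more granular" is to hand out small chunks of rows dynamically instead of assigning each thread a fixed set of scanlines, so a thread that gets delayed does not hold up a fixed share of the frame. The sketch below uses a C11 atomic counter for that; RowQueue and its functions are hypothetical and not part of the renderer discussed here.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared work queue over the rows of one frame. */
typedef struct {
    atomic_int nextRow; /* first row that no thread has claimed yet */
    int rowCount;       /* total rows in the frame */
    int rowsPerChunk;   /* rows handed out per claim */
} RowQueue;

/* Call between frames, while no worker is claiming rows. */
void row_queue_reset(RowQueue *q, int rowCount, int rowsPerChunk)
{
    assert(rowCount >= 0 && rowsPerChunk > 0);
    atomic_store(&q->nextRow, 0);
    q->rowCount = rowCount;
    q->rowsPerChunk = rowsPerChunk;
}

/* Claims the next chunk as the half-open range [*firstRow, *endRow).
   Returns 1 on success, 0 when every row has already been claimed. */
int row_queue_claim(RowQueue *q, int *firstRow, int *endRow)
{
    int start = atomic_fetch_add(&q->nextRow, q->rowsPerChunk);
    if (start >= q->rowCount) {
        return 0;
    }
    int end = start + q->rowsPerChunk;
    *firstRow = start;
    *endRow = end < q->rowCount ? end : q->rowCount;
    return 1;
}
```

Each worker would loop on row_queue_claim and render the rows it receives until the function returns 0. The chunk size is a tuning knob: smaller chunks balance better but touch the shared counter more often.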

Flyingsand

#23418

September 13, 2020

notnullnotvoid
Two things:
1. You could try rendering with a single thread to see if it cuts down on variance. If it makes a big difference, you might want to make threading more granular. If not, you at least know that's not the problem.

Just finished comparing multithreaded vs. single threaded, and it definitely appears that multithreading is the culprit of the large variance:

notnullnotvoid

2. Going by the graph, it looks like the ARM code has more variance, but I'm not sure if it has that much more variance. I think it's partly just that it's stretched out a lot more on the graph. If I stretch out the x86 measurements to take up the whole height of the graph, you can see it has almost as much relative variance:

Yeah, it's maybe better to see them on separate graphs. (I'm actually not sure what the best way to graph the two is. I know graphs can be very misleading if the y-axis isn't properly scaled.) Here they are as separate graphs. Visually they both appear to have similar variance, but the magnitude of the variance of the Intel version is much smaller than that of the ARM one, even in relative terms. Admittedly that's me just eyeballing it :), but it seems clear enough to me that that is the case, even without comparing the standard deviation between the two.

Intel:

ARM:
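
For a comparison that does not rely on eyeballing, the spread can be computed from the recorded frame times. The sketch below is a generic calculation, not code from the project; FrameStats and frame_stats are hypothetical names. It reports the sample variance and the variance divided by the squared mean, which is unitless and so comparable between a fast and a slow platform. Taking the square root of either value (sqrt from math.h) gives the standard deviation and the coefficient of variation.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double mean;        /* average frame time, in ms */
    double variance;    /* sample variance, in ms^2 */
    double relVariance; /* variance / mean^2, unitless */
} FrameStats;

/* Computes spread statistics over `count` frame times given in ms. */
FrameStats frame_stats(const double *frameMs, size_t count)
{
    assert(count > 1);

    double sum = 0.0;
    for (size_t i = 0; i < count; ++i) {
        sum += frameMs[i];
    }
    double mean = sum / (double)count;
    assert(mean > 0.0);

    double squares = 0.0;
    for (size_t i = 0; i < count; ++i) {
        double d = frameMs[i] - mean;
        squares += d * d;
    }

    FrameStats s;
    s.mean = mean;
    s.variance = squares / (double)(count - 1);
    s.relVariance = s.variance / (mean * mean);
    return s;
}
```

Two series that differ only by a constant scale factor produce the same relVariance, which is the property needed when one platform's frame times are several times larger than the other's.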

So yeah, I think I need to look at the multithreading to see if there is anything I can do to fix the spikes I'm getting in the performance. I did recently add some code that changed how the threading worked, and it was a good optimization gain, but could it also have added some inconsistency?

Basically, prior to this work, I just used a thread pool to divide up the rendering work (by scanlines as I mentioned in my first post). At the beginning of each frame, I would clear the depth buffer and the color buffer. Clearing the depth buffer took around 1ms each frame, however, so I decided to double-buffer it to gain back that 1ms. So now I have two depth buffers and do a pointer swap between the cleared one and the "dirty" one at the beginning of each frame, and the dirty one is cleared on its own thread (a thread pool with only 1 thread) asynchronously. At a high level, the threading model looks like this:

begin_frame():
    complete_all_tasks(depthBufferThread) // ensure work is done before starting frame
    swap pointer to active depth buffer
    start task of clearing inactive depth buffer on depthBufferThread
    start clear color buffer on renderThreadPool

render_frame():
    complete_all_tasks(renderThreadPool) // ensure all work in render thread pool is done before rendering (e.g. clear color buffer)
    render opaque primitives on renderThreadPool
    complete_all_tasks(renderThreadPool)
    render translucent primitives on renderThreadPool
    complete_all_tasks(renderThreadPool)

Implementing this optimization did indeed save me at least 1ms per frame on both desktop (macOS/Windows) and iOS. This is probably not enough detail to determine whether this model introduces inconsistency in the frame time on iOS, such as what might cause a thread to stall. The thread pools are all implemented using pthreads, and on iOS the render thread pool has 4 threads while the thread pool for clearing the depth buffer has only 1.
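
One way to narrow down where a stall happens would be to time each complete_all_tasks wait separately and track the worst case per synchronization point. The sketch below is only an assumption about how that could be instrumented; now_ms, WaitStats, and wait_stats_add are hypothetical names, and clock_gettime with CLOCK_MONOTONIC is assumed to be available on the target.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <time.h>

/* Monotonic timestamp in milliseconds. */
double now_ms(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1.0e6;
}

/* Accumulated wait time for one synchronization point. */
typedef struct {
    double totalMs; /* sum of all recorded waits */
    double worstMs; /* longest single wait seen so far */
    int samples;    /* number of waits recorded */
} WaitStats;

void wait_stats_add(WaitStats *s, double startMs, double endMs)
{
    double waited = endMs - startMs;
    s->totalMs += waited;
    if (waited > s->worstMs) {
        s->worstMs = waited;
    }
    s->samples += 1;
}

/* Intended use around one of the waits in the model above:
 *
 *     double t0 = now_ms();
 *     complete_all_tasks(renderThreadPool);
 *     wait_stats_add(&opaqueWait, t0, now_ms());
 *
 * with one zero-initialized WaitStats per wait. */
```

If one wait shows a worst case far above its average, that phase is the one to look at first; if all of them spike together, the cause is more likely outside the renderer, such as thread scheduling.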

EDIT: Taking out the optimization of double-buffering the depth buffer and clearing it on its own thread did not improve the variance.

Edited by Flyingsand on September 13, 2020, 4:53pm