Not long ago, mainly out of curiosity, I began porting my hobby game to iOS. Until then I had supported macOS and Windows. I will say, overall the port to iOS went very smoothly, and I was pleasantly surprised at how quickly I got it up and running on my iPhone (thanks mainly to a sane architecture I learned through HH and the Handmade Network, as well as the relative ease with which Swift interoperates with C). One thing I've noticed, though, is that performance on iOS is much less consistent than on macOS.
Part of the porting work involved writing ARM NEON equivalents for all of my Intel SSE rendering code (I'm sticking with a software renderer for this project). I've used NEON before in other projects, but I'm still surprised by how many fewer intrinsics it offers compared to Intel. NEON registers are also only 128 bits wide, whereas on macOS and Windows I'm using AVX2, whose 256-bit registers allow 8-wide operations. So that's certainly a factor in raw performance, but I also noticed much more variance in frame times with NEON. Here is a graph comparing the frame times of a simple scene (clearing the buffer and rendering two entities: one plain triangle and one with a texture):
[Graph: frame times per frame, macOS vs. iOS]
The wild variation on iOS has me a little concerned about hitting consistent frame rates. I'm not sure why there is such inconsistency on the ARM side. Has anyone encountered this before? Is it a characteristic of mobile ARM processors? Most other games do their rendering on the GPU, of course, so maybe that offers much more stability.
As far as porting my Intel intrinsics code to ARM NEON goes, it's pretty much a line-for-line mapping from each Intel intrinsic to its NEON equivalent (when one exists). As an example, here are the functions for clearing the buffer:
void
_clear_buffer_neon(fsColorf color, bbRenderContext *context, fsu32 scanline) {
    bbOffscreenBuffer *colorBuffer = context->colorBuffer;
    fsi32 minX = 0;
    fsi32 minY = 0;
    fsi32 maxX = colorBuffer->width;
    fsi32 maxY = colorBuffer->height;
    fsu32 scanlineTotal = context->scanlineTotal;
    _fix_minY(&minY, scanline, scanlineTotal);

    color *= 255.f;
    fsu32 clearPixel = ((fsu32)color.r << 0) | ((fsu32)color.g << 8) | ((fsu32)color.b << 16) | ((fsu32)color.a << 24);
    uint32x4_t clearColor_4 = vdupq_n_u32(clearPixel);

    uint32x4_t rowStartClipMask_4, rowEndClipMask_4;
    _get_clip_masks_aligned_neon(&minX, &maxX, &rowStartClipMask_4, &rowEndClipMask_4);

    for (fsi32 y = minY; y < maxY; y += scanlineTotal) {
        intptr_t offset = (intptr_t)(y * colorBuffer->width + minX);
        fsu32 *pixel = colorBuffer->buffer + offset;

        // NOTE: Write out first "column" of the current scanline using the row start clip mask.
        uint32x4_t pixel_4 = vld1q_u32(pixel);
        uint32x4_t maskedOut_4 = vorrq_u32(vandq_u32(rowStartClipMask_4, clearColor_4), vandq_u32(vmvnq_u32(rowStartClipMask_4), pixel_4));
        vst1q_u32(pixel, maskedOut_4);
        pixel += 4;

        for (fsi32 x = minX + 4; x < maxX - 4; x += 4) {
            vst1q_u32(pixel, clearColor_4);
            pixel += 4;
        }

        // NOTE: Write out last "column" of the current scanline using the row end clip mask.
        pixel_4 = vld1q_u32(pixel);
        maskedOut_4 = vorrq_u32(vandq_u32(rowEndClipMask_4, clearColor_4), vandq_u32(vmvnq_u32(rowEndClipMask_4), pixel_4));
        vst1q_u32(pixel, maskedOut_4);
    }
}

void
_clear_buffer_avx2(fsColorf color, bbRenderContext *context, fsu32 scanline) {
    bbOffscreenBuffer *colorBuffer = context->colorBuffer;
    fsi32 minX = 0;
    fsi32 minY = 0;
    fsi32 maxX = colorBuffer->width;
    fsi32 maxY = colorBuffer->height;
    fsu32 scanlineTotal = context->scanlineTotal;
    _fix_minY(&minY, scanline, scanlineTotal);

    color *= 255.f;
    fsu32 clearPixel = ((fsu32)color.r << 0) | ((fsu32)color.g << 8) | ((fsu32)color.b << 16) | ((fsu32)color.a << 24);
    __m256i clearColor_8 = _mm256_set1_epi32(clearPixel);

    __m256i rowStartClipMask_8, rowEndClipMask_8;
    _get_clip_masks_aligned_avx2(&minX, &maxX, &rowStartClipMask_8, &rowEndClipMask_8);

    for (fsi32 y = minY; y < maxY; y += scanlineTotal) {
        intptr_t offset = (intptr_t)(y * colorBuffer->width + minX);
        fsu32 *pixel = colorBuffer->buffer + offset;

        // NOTE: Write out first "column" of the current scanline using the row start clip mask.
        __m256i pixel_8 = _mm256_load_si256((__m256i *)pixel);
        __m256i maskedOut_8 = _mm256_or_si256(_mm256_and_si256(rowStartClipMask_8, clearColor_8), _mm256_andnot_si256(rowStartClipMask_8, pixel_8));
        _mm256_store_si256((__m256i *)pixel, maskedOut_8);
        pixel += 8;

        for (fsi32 x = minX + 8; x < maxX - 8; x += 8) {
            _mm256_store_si256((__m256i *)pixel, clearColor_8);
            pixel += 8;
        }

        // NOTE: Write out last "column" of the current scanline using the row end clip mask.
        pixel_8 = _mm256_load_si256((__m256i *)pixel);
        maskedOut_8 = _mm256_or_si256(_mm256_and_si256(rowEndClipMask_8, clearColor_8), _mm256_andnot_si256(rowEndClipMask_8, pixel_8));
        _mm256_store_si256((__m256i *)pixel, maskedOut_8);
    }
}
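In case the masked first/last-column trick isn't obvious from the intrinsics, here is the same per-row logic in plain scalar C. This is a standalone sketch over raw arrays (`clear_row_masked` is a made-up name, not from my renderer): the interior pixels get plain stores, while the edge "columns" blend the clear color against the existing pixels through a clip mask, exactly like the `vandq`/`vmvnq`/`vorrq` (or `_mm256_and`/`andnot`/`or`) sequences above.

```c
#include <stdint.h>

// Clear one row of 'width' pixels (width assumed >= 8 and a multiple of 4).
// startMask/endMask hold 4 lane masks each: all-ones lanes take the clear
// color, all-zeros lanes keep the existing pixel.
static void clear_row_masked(uint32_t *row, int width,
                             const uint32_t *startMask, const uint32_t *endMask,
                             uint32_t clearColor) {
    // First "column": masked write, (mask & clear) | (~mask & pixel).
    for (int i = 0; i < 4; ++i)
        row[i] = (startMask[i] & clearColor) | (~startMask[i] & row[i]);

    // Interior: unconditional stores.
    for (int i = 4; i < width - 4; ++i)
        row[i] = clearColor;

    // Last "column": masked write with the row end mask.
    for (int i = width - 4; i < width; ++i)
        row[i] = (endMask[i - (width - 4)] & clearColor) | (~endMask[i - (width - 4)] & row[i]);
}
```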
Multithreading is handled by scanlines in my renderer. And come to think of it, I wonder if threading is somehow to blame for the large variation in frame times on iOS? I'm testing this on an iPhone 8, whose A11 has six cores (two performance plus four efficiency), although I use four threads in the thread pool for rendering.
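To make the scanline scheme concrete: thread i starts at row i and strides by the thread count, which is what the `y += scanlineTotal` loops in the clear functions above are walking. A minimal portable sketch with pthreads (the names `RowJob`, `fill_rows`, and `fill_parallel` are illustrative, not my actual thread pool):

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

// Work description for one render thread in a scanline-interleaved scheme.
typedef struct {
    uint32_t *buffer;
    int width, height;
    int scanline;       // this thread's starting row
    int scanlineTotal;  // total number of render threads
    uint32_t color;
} RowJob;

// Each thread fills only the rows scanline, scanline + scanlineTotal, ...
static void *fill_rows(void *arg) {
    RowJob *job = (RowJob *)arg;
    for (int y = job->scanline; y < job->height; y += job->scanlineTotal) {
        uint32_t *row = job->buffer + (size_t)y * job->width;
        for (int x = 0; x < job->width; ++x)
            row[x] = job->color;
    }
    return NULL;
}

// Launch one thread per scanline slot, then join them all.
static void fill_parallel(uint32_t *buffer, int width, int height,
                          int threadCount, uint32_t color) {
    pthread_t threads[8];
    RowJob jobs[8];
    if (threadCount > 8) threadCount = 8;  // fixed-size arrays in this sketch
    for (int i = 0; i < threadCount; ++i) {
        jobs[i] = (RowJob){buffer, width, height, i, threadCount, color};
        pthread_create(&threads[i], NULL, fill_rows, &jobs[i]);
    }
    for (int i = 0; i < threadCount; ++i)
        pthread_join(threads[i], NULL);
}
```

The interleaving means every thread touches the whole height of the frame, so if the OS parks one of the four threads on a slow core (or preempts it), the whole frame waits on it at the join.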
Ultimately, I suppose I was expecting somewhat better performance than what I'm seeing, considering the iPhone 8 is reasonably powerful. It's not doing 8-wide operations like on macOS and Windows, but there are also far fewer pixels to fill (667x375 on iOS vs. 1440x960 on macOS/Windows). And this test used a very simple scene: clear the screen, draw a solid triangle, texture-map a quad. I'm not sure how much more I can optimize the renderer on iOS, so I'm wondering if I've hit the performance limit... or am I missing something?