Want to learn a little bit about CPU-level optimization

jstimpfle

#22164

February 7, 2020

So I read this somewhat hilarious thread on HN a couple of days ago where someone posted a version of the wc program written in Haskell and claimed it was 4-5 times faster. (It was an apples vs oranges comparison). That got me thinking that I want to understand a little better how fast we can make a CPU go if we really, really want it.

The wc program that I want to implement just counts number of lines (as in number of 0x0a characters) and words (as in, number of non-space characters that follow space characters). It might be not the best program to try optimizations on, but it's what's grabbed my fancy.

An obvious start to me is to avoid expensive routines for reading a single character, like fgetc(). We should read in larger chunks - fread() to a buffer of 16 or 64 KB should be ok.

How to optimize further means pushing my personal limits of understanding - my vague idea is that we want to avoid stalls and conditional execution using lookup tables, and somehow want to exploit data-level and instruction-level parallelism, whatever that means exactly. I thought I'd just drop a little code that I wrote here. The optimization with the precomputed lookup[256] table below brought the user-time down from ~1.05sec to ~0.85 sec in my test. So, it's at least measurable, but not thrilling given that it adds a lot of code.

I hope that someone experienced can comment on it / show how to make it go even faster.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static const unsigned char spacemap[256] = {
        0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0,
        1,
};

static const unsigned char nlmap[256] = {
        0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1,
};

#define ISSP(x) (spacemap[x])
#define ISNL(x) (nlmap[x])

static unsigned char lookup[256];
static unsigned char buf[16*1024];

static void precompute(void)
{
        for (int i = 0; i < 256; i++) {
                lookup[i] = 0;
                int wassp = 0;
                for (int j = 0; j < 8; j++) {
                        int issp = (i >> j) & 1;
                        if (wassp && !issp)
                                lookup[i]++;
                        wassp = issp;
                }
        }
}

int main(int argc, const char **argv)
{
        precompute();
        if (argc != 2) {
                fprintf(stderr, "Usage: ./test FILE\n");
                exit(1);
        }
        const char *filepath = argv[1];
        FILE *f = fopen(filepath, "rb");
        if (f == NULL) {
                fprintf(stderr, "Failed to open file '%s'\n", filepath);
                exit(1);
        }
        uint64_t numBytes = 0;
        uint64_t numWords = 0;
        uint64_t numLines = 0;
        uint64_t wasSp = 1;
        for (;;) {
                size_t ret = fread(buf, 1, sizeof buf, f);
                if (ret == 0)
                        break;
                numBytes += ret;
                if (ret != sizeof buf) {
                        for (size_t i = 0; i < ret; i++) {
                                unsigned char c = buf[i];
                                if (wasSp)
                                        numWords += !ISSP(c);
                                numLines += ISNL(c);
                                wasSp = ISSP(c);
                        }
                        break;
                }
                for (size_t i = 0; i < sizeof buf / 8; i++) {
                        uint64_t x = ((uint64_t *) buf)[i];
#if 1
                        uint64_t a0 = (x >> 0) & 0xff;
                        uint64_t a1 = (x >> 8) & 0xff;
                        uint64_t a2 = (x >> 16) & 0xff;
                        uint64_t a3 = (x >> 24) & 0xff;
                        uint64_t a4 = (x >> 32) & 0xff;
                        uint64_t a5 = (x >> 40) & 0xff;
                        uint64_t a6 = (x >> 48) & 0xff;
                        uint64_t a7 = (x >> 56) & 0xff;
                        numLines += ISNL(a0);
                        numLines += ISNL(a1);
                        numLines += ISNL(a2);
                        numLines += ISNL(a3);
                        numLines += ISNL(a4);
                        numLines += ISNL(a5);
                        numLines += ISNL(a6);
                        numLines += ISNL(a7);
                        if (wasSp)
                                numWords += !ISSP(a0);
                        wasSp = ISSP(a7);
                        uint64_t spaces =
                                (ISSP(a0) << 0) |
                                (ISSP(a1) << 1) |
                                (ISSP(a2) << 2) |
                                (ISSP(a3) << 3) |
                                (ISSP(a4) << 4) |
                                (ISSP(a5) << 5) |
                                (ISSP(a6) << 6) |
                                (ISSP(a7) << 7);
                        numWords += lookup[spaces];
#else
                        if (wasSp)
                                numWords += !ISSP(x&0xff);
                        wasSp = ISSP((x>>56)&0xff);
                        uint64_t spaces = 0;
                        for (int j = 0; j < 8; j++) {
                                spaces |= ISSP(x & 0xff) << j;
                                numLines += ISNL(x & 0xff);
                                x >>= 8;
                        }
                        numWords += lookup[spaces];
#endif
                }
        }
        if (ferror(f)) {
                fprintf(stderr, "I/O error when reading '%s'\n", filepath);
                exit(1);
        }
        assert(feof(f));
        fprintf(stderr, "lines, words, bytes: %zu, %zu, %zu\n",
                numLines, numWords, numBytes);
        fclose(f);
        return 0;
}

Edited by jstimpfle on February 7, 2020, 3:20pm

Marc Costa

#22165

February 7, 2020

There are a bunch of SIMD based tricks you can use to speed up character matching. There are some good blog posts here: https://lemire.me/blog/

Miles

#22166

February 8, 2020

This seemed like a fun optimization challenge so I spent an afternoon on it. Here are my results:
https://gist.github.com/notnullno.../a4ebf552badf99395bcbd08e0fd2ded6

I used the complete text of the 11th edition Encyclopedia Britannica (compiled from archive.org) as a test corpus. It's around 293 megabytes.

The gist contains a file with timing code and 7 variations of the algorithm. I load the whole file ahead of time to simplify/focus the code and measurements, even though optimizing disk access would be an important part of a real program. Timings were measured as the fastest of several runs on my laptop at clang's `-O2`, and are listed at the bottom of the file. They are approximate within about 5ms. The variations are as follows:

The fallback loop in your original implementation, to establish a baseline.
Same as 1, but the incrementation of `numWords` is branchless. The branch is poorly predicted - about 25% of characters will be spaces, and they appear at random - so the speed gain is unsurprising.
The original program, rewritten a bit for stylistic reasons but otherwise unchanged. The bit-packing and lookup table for the word boundary check is actually quite clever, and I would guess it's probably a near-optimal algorithm for scalar code. I didn't experiment with tuning the table size, but that's also something that may be worth looking into.
A slightly faster version of the original, also branchless. Loading a u64 and then unpacking the bytes with shifts was slower than just loading the bytes directly, so I switched to doing that. This is as far as I went with scalar code.
My first attempt at a SIMD version. The basic idea is to do the calculation for multiple bytes in parallel in different SIMD lanes, and then sum the lanes at the very end. But the word and line counts would quickly overflow an 8-bit or 16-bit int, so I unpack the 16 bytes into 4x4 32-bit ints before doing the math on them. To compare adjacent chars for the word boundary check, I essentially compute whether each lane is a space, then shift the whole register by one byte and compare it to itself - although the logic of this is a bit obscured in this version due to the 4x4 unpacking, you can cleanly see it in the next two versions. Overall, the unpacking was expensive and only got 4x parallelism for loading 16 chars, so in the end it wasn't that much faster than the scalar version.
Rather than unpacking from 8 bits to 32 bits, I worked around the overflow problem by summing the lanes after every 256 iterations, instead of at the very end. This was by far the biggest speed gain, being almost 4x as fast as the previous SIMD version. And the code is much simpler (it only looks hairy because of the long intrinsic names).
Exactly the same logic as 6, but uses 256-bit AVX instructions. AVX shifts can only operate on 128-bit halves of the register, so we have to do an extra scalar word boundary check on bytes 7 and 8, but the increased parallelism is still a big win. I didn't try avx512 because I don't have a processor with that instruction set, but I'm guessing it would give a similar improvement.

In the end I was able to get down to 26 milliseconds for a 293 MB file. I'm sure an experienced x86 asm wrangler could squeeze it even further, but 11 GB/s is not bad for a single thread! If I multithreaded this, my laptop would run out of memory bandwidth before it ran out of cores. Let's just say I'm not worried about haskell performance overtaking C anytime soon ;-)

Edited by Miles on February 8, 2020, 6:41am Reason: more formatting woes

jstimpfle

#22168

February 8, 2020

That's impressive and very helpful. Thank you for posting it. I think I got the gist of it, and I'm going to try a few variations.

Edited by jstimpfle on February 8, 2020, 12:02pm

Miles

#22171

February 8, 2020

To more specifically answer your original question of "how fast can we make a CPU go", which I kind of glossed over before, here is an article I just remembered, which I read a while ago, that goes over some of the fundamental bottlenecks/speed limits in modern CPUs that you might run into depending on what kind of code you're writing:
https://travisdowns.github.io/blog/2019/06/11/speed-limits.html
Of course most large non-optimized programs will be limited mainly by memory/cache access patterns, suboptimal algorithms/time complexity, and just plainly unnecessary work, so that's usually what you encounter first when optimizing, and will often get you far enough. But if you really squeeze any given loop, you'll eventually run into one or more fundamental speed limits. Understanding them can help you make rough estimates of what performance is theoretically optimal for a given task.

Edited by Miles on February 8, 2020, 3:38pm Reason: more precise wording