ExpertMemer - EXPERimenTal MEMory access visualizER

A Visibility Jam 2023 experiment/proof of concept for visualizing memory access with cache line granularity.

What if we could visualize how much, where, and when our program accesses memory? Is memory access scattered or linear? Are cache lines utilized effectively?

I split this project into roughly three stages:

  1. Take a single function of a program and trace all memory read instructions.
  2. Come up with a visualization scheme and implement it.
  3. Hack it into a profiler (Spall) and inspect the result next to the corresponding flamegraph.

1. Dynamic Instrumentation

To my knowledge, there are two popular libraries for dynamic binary instrumentation: Pin and DynamoRIO. I chose Pin, but apparently both are good. Although I don't know much about dynamic instrumentation and had never used such tools before, I was quite impressed with them. They seem like very useful tools with a lot more to explore.

Anyway, to give a brief overview of this part, I collect three sets of data: memory read instructions, a callstack trace, and a record of known memory ranges/allocations.

Every time a load instruction is encountered, a set of instructions is injected that writes the accessed address and size to a buffer. This buffer can fill up and needs to be flushed to a file every once in a while. Similarly, any time a function is encountered, instructions are injected at its beginning and before every return; these call a function that records the function name and source code location into a buffer, which is also written to a file. For tracking allocations, we can look up functions like malloc or arena_create and completely replace them with our own. The injected function just calls the original function and stores the requested size and returned pointer as an event between the callstack trace events.
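To make this concrete, a minimal Pin tool for the load-tracing part might look roughly like this. It's modeled on Pin's standard memory-trace example; for brevity it writes records straight to a file instead of buffering, skips instructions with a second read operand, and leaves out the callstack and allocation tracing. RecordRead and the file name are my own.

#include "pin.H"
#include <cstdio>

static FILE *trace;

// Analysis routine: runs before every load executes.
static VOID RecordRead(VOID *addr, UINT32 size) {
    fwrite(&addr, sizeof(addr), 1, trace); // real tool: append to an in-memory
    fwrite(&size, sizeof(size), 1, trace); // buffer, flush to disk when full
}

// Instrumentation routine: called once per instruction Pin encounters.
static VOID Instruction(INS ins, VOID *v) {
    if (INS_IsMemoryRead(ins)) {
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordRead,
                                 IARG_MEMORYREAD_EA, IARG_MEMORYREAD_SIZE,
                                 IARG_END);
    }
}

static VOID Fini(INT32 code, VOID *v) { fclose(trace); }

int main(int argc, char *argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    trace = fopen("memtrace.bin", "wb");
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram(); // never returns
    return 0;
}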

2. Visualizing the Data

Now that we have a big array of memory operations (address + size + other annotations), how do we visualize it? First, let's boil it down a little. Divide the address by 64 and we get the cache line. When it comes to memory, (for this purpose) we don't care about individual addresses: when we load anything from memory, the CPU pulls in the whole cache line, and anything in it that is not accessed just sits there and takes up space. Space is plentiful, but memory access is slow if the data is not already cached. So a cache line is a reasonable granularity and should compress the space our visualization takes up quite a bit. In fact, with 32 KiB of L1 cache, we only need 512 pixels to represent all of L1. Neat!
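In code, the mapping is just a shift (assuming 64-byte cache lines; note that a single access can straddle a line boundary):

// Assuming 64-byte cache lines: which line(s) does an access touch?
uint64_t first_line = addr >> 6;             // addr / 64
uint64_t last_line = (addr + size - 1) >> 6; // may be first_line + 1 for a straddling access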

Because we want to match the visualization up to a flame graph to see where in our code things are happening, we want to plot time on the X axis and the address space on the Y axis. But we run into a problem: the address space is huge and also extremely sparsely accessed. My first idea was to record all known memory allocations and only show those ranges. But this would still be very sparse and take up way too much space. So I chose to compress it down as much as possible.

The following example code compiled with -O2

#define ARRAY_SIZE 1000
int array[ARRAY_SIZE];

for (int i = 0; i < ARRAY_SIZE; i++) {
  array[i] *= 2;
}
for (int i = ARRAY_SIZE/2 - 1; i >= 0; i--) {
  array[i] /= 2;
}

generates this image: screenshot_2.png

We maintain a "working set" that is propagated forward over the columns of the bitmap with each new instruction. When we go to the next instruction record in the sequence, we first check if the cache line is already in the working set. If so, we refresh its "age" and add the access size to its "read" counter. If it's not in the working set, we insert a new cache line at the correct position and shift the pixels below it (cache lines with higher addresses) down. When this happens, we also decrement the age of all lines by 1. If a line has reached age 0 ("age" is a bad name, it's actually the reverse age), it's removed from the set. With this scheme, we can guarantee that the trail of pixels belonging to one cache line only ever moves straight to the right, or diagonally up or down by one pixel per column.
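Here is a sketch of that update step, assuming a plain sorted array as the working set - the struct layout, the MAX_AGE value, and the fixed capacity are my own choices, not necessarily what the jam code does:

#include <stdint.h>
#include <string.h>

#define MAX_AGE 32 // how many "misses" a line survives untouched (arbitrary choice)

typedef struct {
    uint64_t line; // cache line index (address >> 6)
    int      age;  // counts down; the line is dropped when it hits 0
    uint32_t read; // total bytes accessed from this line so far
} WorkingLine;

static WorkingLine set[4096]; // sorted by line; no overflow check - see Open Questions
static int         count = 0;

// Advance the working set by one recorded load (addr, size).
void working_set_step(uint64_t addr, uint32_t size) {
    uint64_t line = addr >> 6;

    // Already present? Refresh its age and accumulate the bytes read.
    for (int i = 0; i < count; i++) {
        if (set[i].line == line) {
            set[i].age = MAX_AGE;
            set[i].read += size;
            return;
        }
    }

    // New line: every existing line ages by one...
    int n = 0;
    for (int i = 0; i < count; i++) {
        set[i].age--;
        if (set[i].age > 0) set[n++] = set[i]; // ...and dead lines are removed
    }
    count = n;

    // Insert at the sorted position, shifting higher addresses
    // (pixels below) down by one row.
    int pos = 0;
    while (pos < count && set[pos].line < line) pos++;
    memmove(&set[pos + 1], &set[pos], (size_t)(count - pos) * sizeof(WorkingLine));
    set[pos].line = line;
    set[pos].age  = MAX_AGE;
    set[pos].read = size;
    count++;
}

After each step, the current contents of the set become the next pixel column of the bitmap.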

To visualize the cache line age, we can just let the color fade out proportionally. The vague idea behind this is that the longer a cache line goes unaccessed, the more likely it is to have been evicted from cache. I will touch upon this later, but we don't actually know the contents of the cache at any time - so this is not a profiler. As a bonus, we can remove those lines and make space in the bitmap. The second vague idea is that the more bytes are accessed from a cache line, the more it was "worth it" to load it in. We could track the exact access pattern of a cache line, but for the sake of simplicity and for lack of time I just chose to add up all the accessed bytes. Not ideal, but it gets the idea across. To visualize how many bytes we've accessed, the color fades from purple (0 bytes read) to green (>= 64 bytes read).
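Mapping an entry to a pixel could then look something like this, reusing WorkingLine from the sketch above (the concrete purple/green values and the linear fade are my assumptions):

#include <stdint.h>

// Color a working-set entry: purple (0 bytes read) to green (>= 64 bytes read),
// faded toward black as the line's remaining age runs out.
void line_color(const WorkingLine *l, uint8_t rgb[3]) {
    uint32_t read = l->read > 64 ? 64 : l->read;
    float t    = (float)read / 64.0f;            // 0 = purple, 1 = green
    float fade = (float)l->age / (float)MAX_AGE; // 1 = fresh, 0 = about to be dropped
    float purple[3] = { 0.6f, 0.2f, 0.8f };
    float green[3]  = { 0.2f, 0.9f, 0.3f };
    for (int i = 0; i < 3; i++) {
        float c = purple[i] + t * (green[i] - purple[i]); // lerp purple -> green
        rgb[i] = (uint8_t)(c * fade * 255.0f);
    }
}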

3. Hack it Into a Profiler Frontend

screenshot.png

Now all that's left is to match it up with a flame graph of our callstack trace. Except instead of measuring time, the width of each call is simply determined by the number of memory read instructions inside it, so it lines up exactly with the pixel columns in our bitmap.
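One way to get that alignment is to use a running count of recorded loads as the "timestamp" for begin/end events (a sketch; emit_begin and emit_end stand in for whatever writes the trace):

#include <stdint.h>

void emit_begin(const char *name, uint64_t t); // hypothetical trace writers
void emit_end(uint64_t t);

// One flame-graph unit == one recorded load == one bitmap column.
static uint64_t read_counter = 0;

void on_load(uint64_t addr, uint32_t size) {
    read_counter++;
    working_set_step(addr, size);
}

void on_function_enter(const char *name) { emit_begin(name, read_counter); }
void on_function_exit(void)              { emit_end(read_counter); }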

With no time to spare, I had to quickly (and possibly quite badly) hack it all together into Spall, but in the end it did make for a decent prototype.

In this video you can see two different modes in action; let's just call them X and Y. In X mode, you get a highlight of the cache line that is accessed at the point in time under the mouse cursor, in addition to all the other points in time where an access to it was recorded. In Y mode, the cache line the mouse is over is highlighted. The tooltip displays the source line and can (in theory) be used to trace the location back to your source. The faint blue highlights are two arenas whose address ranges I recorded. Ideally you would see a callstack or some other annotation to know which one is which. I didn't get around to including all the individual mallocs, but you get the idea.

memer.mp4

(Ignore the displayed timestamps - no time is actually measured.)

Final Remarks

One obvious extension would be to also include write instructions. It should be simple enough, but I didn't get that far. We could also show more debug information - think overlaying source code at the location of a memory access, or overlaying struct layouts.

An important thing to remember is that this does not show what the CPU actually does. I don't think we can reliably access this information, and even if we could, it would be distorted by the instrumentation and trace recording. We could simulate what the cache might do, and there are plenty of example programs and tools out there that attempt to do exactly this, but I'm not sure it would be useful when we're only instrumenting parts of a program. The first time we see an access to a cache line, it looks to us like the line was just loaded, but it might just as well have been accessed by a previous function and still be fresh in cache. We don't know these things, so take the visualization with a grain of salt. What we do seem to be able to extract, however, is a rough idea of the access patterns into memory and how efficiently the cache is utilized.

Open Questions

  • How to deal with very large amounts of data being accessed? Just "evict" the least recently used cache line from the bitmap to keep its height bounded?