This article is about rewriting my AArch64 interpreter in raw assembly.

Introduction

As I've mentioned in the previous article, my time travel debugger is based on an interpreter for AArch64 native code. The primary motivation for this is to be able to reliably record the side effects of non-deterministic instructions and of system calls. There are other ways to do this besides using an interpreter, each with their own pros and cons. For the sake of this article though, we'll just assume that using an interpreter (accelerated with a JIT) is a good idea.

But I'll tell you what is not a good idea: Writing said interpreter in Rust. That's what I did originally, and it worked for a while, until it didn't.

Floating Point Troubles

The original idea behind writing the interpreter in a high level programming language was to be able to run it on other platforms like x86 for "cross-arch replay". But that idea didn't last long, as I soon had to implement several floating point instructions to be able to run some more complex programs.

You may think, well, what's the problem, just implement those FP instructions with f32 and f64 using their corresponding built-in operations. Rust's floating point types are defined to be IEEE 754 compliant, which is the floating point standard that modern processors implement, so using them should be perfectly portable, right? Nope. Unfortunately there are parts of the IEEE 754 standard that are "implementation defined". And if you read the Rust docs I linked carefully, you'll find that these parts are also "not guaranteed to be portable or even fully deterministic" by the Rust compiler. This is the same issue I've talked about with WebAssembly in the last post.

All that is to say: There are differences between the floating point implementations on x86 and AArch64. As a result, Rust's floating point types aren't guaranteed to be portable or deterministic. So to avoid all that trouble, I just ended up implementing the floating point operations with inline assembly. Which of course meant that my code was no longer portable.
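
Concretely, one of those inline assembly implementations boils down to something like this (a minimal sketch, not the actual interpreter code):

use std::arch::asm;

// Minimal sketch: an f32 add that is guaranteed to go through the
// hardware's `fadd` instruction, so it behaves exactly like AArch64 does.
#[cfg(target_arch = "aarch64")]
fn fadd_f32(a: f32, b: f32) -> f32 {
    let result: f32;
    unsafe {
        asm!(
            "fadd {r:s}, {a:s}, {b:s}",
            a = in(vreg) a,
            b = in(vreg) b,
            r = out(vreg) result,
        );
    }
    result
}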

Compile Time Troubles

Rust isn't exactly known to have great compile times. But I thought how bad could it be? Surely that only applies to code that's littered with traits, generics, and complex lifetimes. Turns out, that couldn't be further from the truth. Even completely braindead code, like that of an interpreter, is insanely slow to compile.

Here's what the implementation of a typical instruction looked like:

// ADD_32_addsub_imm
let sh = (instr_word >> 22) & 1;
let imm12 = (instr_word >> 10) & 4095;
let Rn = (instr_word >> 5) & 31;
let Rd = (instr_word >> 0) & 31;

let op1 = x_or_sp![Rn] as u32;
let op2 = if sh == 0 { imm12 } else { imm12 << 12 } as u32;
let result = op1.wrapping_add(op2) as u64;
x_or_sp![Rd, result];

Just a few bit operations, some logic, some macros. Nothing special. But my compile times were miserable! Every time I touched some code related to the interpreter, I got to twiddle my thumbs for 11 seconds.

At some point, I got so annoyed by it that I decided to profile the compiler to see what the heck it was doing.

instruments rustc profile

9 seconds on trait selection??

Come on, really? Having read some rustc source code in the past, I had a pretty good idea about what must have happened there.

So I quickly tweaked the code:

// ADD_32_addsub_imm

// nb: this part was generated by a script,
// so it was very easy to change.
let sh = <u32 as core::ops::BitAnd>::bitand(<u32 as core::ops::Shr>::shr(instr_word, 22), 1);
let imm12 = <u32 as core::ops::BitAnd>::bitand(<u32 as core::ops::Shr>::shr(instr_word, 10), 4095);
let Rn = <u32 as core::ops::BitAnd>::bitand(<u32 as core::ops::Shr>::shr(instr_word, 5), 31);
let Rd = <u32 as core::ops::BitAnd>::bitand(<u32 as core::ops::Shr>::shr(instr_word, 0), 31);

// this part is unchanged.
let op1 = x_or_sp![Rn] as u32;
let op2 = if sh == 0 { imm12 } else { imm12 << 12 } as u32;
let result = op1.wrapping_add(op2) as u64;
x_or_sp![Rd, result];

And voilà, my 11.5 second compile times dropped to 4.5 seconds.

For those who don't speak "Rust nonsense": Rust uses traits (aka interfaces aka protocols) for operator overloading. That crazy looking <u32 as Trait>::method syntax is a way to manually call a trait method without going through the trait resolver.

I understand that trait resolution is a complex issue. But is it really too much to ask that you hardcode the results for the operators of built-in types?

So my assumption from the opening paragraph of this section actually wasn't incorrect: Only code that's littered with traits, generics, and complex lifetimes is (super) slow to compile. But as it turns out, all Rust code is littered with traits and generics by default, and so all Rust code is super slow to compile - fun times!

Rant over.

(Addendum for the Rust nerds: I'm actually not sure why this change had such a drastic impact on compile times. The Shr and BitAnd traits are generic over the Rhs type: trait Shr<Rhs> {...}. So specifying the Self type using <u32 as Shr>::shr only tells the compiler half the information needed to resolve the trait. I assume this means some form of trait lookup still happens, but for some reason that's a lot faster?)

A New Approach

Even if we ignored the fact that I lost portability by using inline assembly and that my compile times were abysmal cause Rust, there were still two more issues: 1) I had implemented only 472 of the 1,726 instructions (25%). This meant that each time I tried running a new program, I first had to implement a batch of new instructions. And 2), the interpreter was really slow in debug builds. And release builds, well, I think you can guess what the problem with those was…

One thing I liked about the inline assembly implementations was how easy they were to write:

// FCVTPS_32S_float2int

// (decode - this part is generated.)
let Rn = (instr_word >> 5) & 31;
let Rd = (instr_word >> 0) & 31;

// read operands from registers.
let op = v![Rn].to_u32();

// the asm.
let result: u64;
unsafe { asm! {
    "fcvtps {result:w}, {op:s}",
    op = in(vreg) op,
    result = out(reg) result,
}}

// write result to registers.
x![Rd, result];

They followed a super simple pattern of reading the operands from the register file, executing the corresponding assembly instruction, and writing the results back to the registers.

I realized that if I knew which registers an instruction accessed, then I could actually generate the entire implementation for it automatically. And since that code would be dead simple, I could even generate it as assembly code directly, which would solve the compile time problem. And as a nice bonus, debug builds would have much better runtime performance as well, because the assembly code I would generate would be more or less what an optimizing compiler would spit out after thinking about it for a hot minute.

Thing is, while that sounds like an all-upside deal, it does rest on the assumption that I know the effects of all instructions (or at least of the ones needed to run 99% of programs). By "effects" I mean: which registers an instruction accesses, whether it accesses memory, and a few other AArch64 specific things.

Kind of like this:

{
 "instr": "LDR_32_ldst_immpost",
 "additional": [
  "check_sp_alignment"
 ],
 "regs": [
  {
   "bank": "x",
   "reg": "Rt",
   "write": true
  },
  {
   "bank": "x_or_sp",
   "reg": "Rn",
   "read": true,
   "write": true
  }
 ],
 "mem": {
  "access": {
   "kind": "regs",
   "is_load": true,
   "bank": "x",
   "sext": 0,
   "elem_bits": 32,
   "access_size": 4
  },
  "address_mode": "PostSImm9"
 }
}

Unfortunately, I couldn't find a repository online that had all the necessary instruction metadata. But I had another idea: surely someone else had written an AArch64 interpreter before. And if they did, maybe I could "extract" the effects of each instruction from their implementation. So I asked Gippity to look for some repos, and sure enough, it found something: an open source AArch64 implementation in pure C - perfect!

Now I just had to analyze that C code somehow. This wasn't going to be the most straightforward thing, as some optimizations like inlining and constant folding were going to be necessary for some of the information I was interested in to emerge. So I compiled the C code to LLVM bitcode and wrote a custom LLVM pipeline to reliably apply those optimizations as well as a few custom ones. Out the other end came an LLVM function for each instruction that I could then analyze with some more C++ code.

If it works, it ain't stupid.

Generating the Assembly

As expected, code generation was fairly straightforward. The current version of the Python script is just over 4 kloc long, but there isn't really anything interesting going on there (just lots of special case handling for instructions with immediates).

So instead of talking about how the code was generated, I'll just show you the interesting parts about the result. The following code snippet is a (very) abridged and annotated version of the interpreter assembly function. Even if you've never read assembly code before, I think the comments should help you to understand it. If not, you can always ask Gippity to explain any confusing parts to you.

.section __TEXT,__text

.global _interp_run
_interp_run:
    // stack setup.
    
    // register setup.
    // there are some dedicated registers for
    // hot values like the program counter,
    // the address of the register file,
    // and the 32 bit instruction word.

L_fetch:
    // all instructions are 4 bytes on aarch64.
    ldr {REG_INSTR_WORD}, [{REG_PC}]

L_decode:
    // imagine a decoder based on a
    // tree of perfect hash tables here.
    // i wrote this part in rust,
    // compiled it in release mode,
    // and copied over the disassembly.

L_exec:
    // the decoder computes the linear id
    // of the instruction and stores it in x8.
    // we now use a jump table to branch to
    // that instruction's implementation.
    adrp x10, _interp_jumptab@PAGE
    add x10, x10, _interp_jumptab@PAGEOFF
    adr x9, L_exec_instrs_begin
    // here we use the instruction id as
    // the table index.
    ldr w8, [x10, x8, lsl 2]
    add x9, x9, x8
    br x9

// the jump table contains `u32` offsets
// relative to this label.
L_exec_instrs_begin:

// the code generator dumps all the
// instruction implementations here.

// ...

// nb: this is the instruction from the
// Compile Time Troubles section.
L_exec_ADD_32_addsub_imm:
    // load register value.
    // nb: we'll talk about `ubfx` in the
    // next section.
    ubfx w8, {REG_INSTR_WORD}, 5, 5
    ldr x1, [{REG_ARCH_X}, x8, lsl 3]
    
    // decode immediate.
    ubfx w8, {REG_INSTR_WORD}, 10, 12
    ubfx w9, {REG_INSTR_WORD}, 22, 1
    lsl w10, w8, 12
    cmp w9, 0
    csel w8, w8, w10, eq
    
    // execute instruction.
    add w0, w1, w8
    
    // write back results.
    ubfx w8, {REG_INSTR_WORD}, 0, 5
    str x0, [{REG_ARCH_X}, x8, lsl 3]
    b L_retire

// ...

L_retire:
    add {REG_PC}, {REG_PC}, 4
    b L_fetch

L_exit:
    // pc writeback.
    
    // stack teardown.
    ret


.section __TEXT,__const

.align 2
.global _interp_jumptab
_interp_jumptab:
    // instruction id 0 is for invalid encodings.
    // this label exits the interpreter with an error.
    .long   (L_invalid_instruction - L_exec_instrs_begin)
    // ...
    .long   (L_exec_ADD_32_addsub_imm - L_exec_instrs_begin)
    // ...

Assembly is High Level

If you read the code snippet above, you may have noticed that the assembly implementation of the ADD_32_addsub_imm instruction isn't actually much longer than the Rust version. This is in part because the Rust code is very low level, but it's also because some things are actually easier to do in assembly.

Remember the let imm12 = (instr_word >> 10) & 4095; parts that blew up the Rust compile times when written using operator syntax? They are a fascinating case where the Rust code is actually lower level than the corresponding assembly implementation!

Here's a part of the encoding diagram of the ADD_32_addsub_imm instruction: encoding diagram add 32 imm

Bits 10-21 encode the 12 bit immediate value to be added to the register Rn. In Rust, we extract this value by shifting the instruction word down by 10 bits and then masking it with 4095, which is 0b111111111111 in binary (12 ones).

Usually, I prefer to write this mask as ((1 << 12) - 1) instead of 4095 to emphasize that I'm selecting 12 bits, but that's even more verbose, and you can probably imagine what writing it like that did to my compile times.

So how do we do this in assembly? One instruction: ubfx w8, {REG_INSTR_WORD}, 10, 12.

ubfx stands for "unsigned bit field extract". w8 is the destination register. {REG_INSTR_WORD} is the register to extract the bits from. And the 10, 12 part at the end specifies the index of the lowest bit and how many bits to extract.

Now of course, this is just a skill issue on Rust's side. We could imagine a syntax like instr_word.bits[10..22] or instr_word.bits[10..+12]. But there's absolutely no way this is getting added to Rust before the year 2035, if ever. So in the meantime, we can joke about assembly being higher level than Rust.
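
For what it's worth, you can get most of the way there today with a small helper. Here's a sketch of a hypothetical bits accessor (nothing like this exists in the interpreter or in the standard library):

use core::ops::Range;

// Hypothetical sketch: extract a bit field, ubfx-style.
// (Assumes the range is non-empty and shorter than 32 bits.)
trait Bits {
    fn bits(self, range: Range<u32>) -> u32;
}

impl Bits for u32 {
    #[inline(always)]
    fn bits(self, range: Range<u32>) -> u32 {
        let len = range.end - range.start;
        (self >> range.start) & ((1u32 << len) - 1)
    }
}

// let imm12 = instr_word.bits(10..22);

(And yes, that's another trait, so you can guess what it would have done to my compile times.)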

Bits are High Level

(ht Andrew Reece)

As you will most likely know, ifs and loops are implemented using conditional jumps in assembly. For example, we can cmp x8, 7 and then b.lo <somewhere>.

The cmp instruction compares the value of x8 to 7, which sets a bunch of status flags in the CPU. And the b.lo instruction (branch if unsigned lower) reads those flags to decide whether the condition holds, and if so, it jumps to the target instruction.

Determining whether a condition (cond) holds based on the status flags (cpsr) is defined as follows:

pub fn condition_holds(cond: u32, cpsr: u32) -> bool {
    let n = (cpsr >> 31) & 1;
    let z = (cpsr >> 30) & 1;
    let c = (cpsr >> 29) & 1;
    let v = (cpsr >> 28) & 1;

    let result = match cond >> 1 {
        0b000 => z == 1,           // eq or ne
        0b001 => c == 1,           // cs/hs or cc/lo
        0b010 => n == 1,           // mi or pl
        0b011 => v == 1,           // vs or vc
        0b100 => c == 1 && z == 0, // hi or ls
        0b101 => n == v,           // ge or lt
        0b110 => n == v && z == 0, // gt or le
        0b111 => true,             // al
        _ => unreachable!()
    };

    let negate = cond & 1 == 1 && cond != 0b1111;
    return if negate { !result } else { result };
}

This Rust function is a literal translation of the AArch64 pseudocode, and unfortunately, LLVM couldn't make much sense of it. The code it generated wasn't horrible, but I remember always seeing this function when profiling the interpreter. So for the rewrite, I decided to optimize it myself.

And by "myself" I mean that I asked Gippity if it had any suggestions for implementing this function efficiently. Its first response was less than helpful, so I asked "are we sure there's no better way to do this? maybe with a LUT?"

What it came up with was truly impressive:

// w0: cond
// w1: cpsr
_interp_condition_holds:
    adr x8, L_condition_table
    ldrh w8, [x8, w0, uxtw 1]
    lsr w1, w1, 28
    lsrv w0, w8, w1
    ret
L_condition_table:
    .hword 0xf0f0 // eq
    .hword 0x0f0f // ne
    .hword 0xcccc // cs/hs
    .hword 0x3333 // cc/lo
    .hword 0xff00 // mi
    .hword 0x00ff // pl
    .hword 0xaaaa // vs
    .hword 0x5555 // vc
    .hword 0x0c0c // hi
    .hword 0xf3f3 // ls
    .hword 0xaa55 // ge
    .hword 0x55aa // lt
    .hword 0x0a05 // gt  <--
    .hword 0xf5fa // le
    .hword 0xffff // al
    .hword 0xffff // al

It was so impressive, in fact, that I believe something like this was probably in the training data, as its overall understanding of assembly is lacking, to say the least.

So what's going on in that code, how does this work?

The code uses a lookup table, L_condition_table, like I suggested. It has sixteen u16 entries. Each entry corresponds to one condition code. The first entry corresponds to the eq condition code, the second is ne, and so on. The first two assembly instructions load the LUT entry into w8 for the given condition code (w0).

But what do the LUT entries mean? Here's the cool thing: they are also lookup tables! This time, each bit in the u16 value is an entry. And this 16 bit table is "indexed" by the 4 status bits nzcv (which are the top 4 bits of the cpsr flag register).

Here's how the entry for gt is computed:

condition `gt` holds when: n == v && z == 0

bit | n z c v | n=v z=0 | &&
----+---------+---------+----+
  0 | 0 0 0 0 |   1   1 |  1
  1 | 0 0 0 1 |   0   1 |  0
  2 | 0 0 1 0 |   1   1 |  1
  3 | 0 0 1 1 |   0   1 |  0
  4 | 0 1 0 0 |   1   0 |  0
  5 | 0 1 0 1 |   0   0 |  0
  6 | 0 1 1 0 |   1   0 |  0
  7 | 0 1 1 1 |   0   0 |  0
  8 | 1 0 0 0 |   0   1 |  0
  9 | 1 0 0 1 |   1   1 |  1
 10 | 1 0 1 0 |   0   1 |  0
 11 | 1 0 1 1 |   1   1 |  1
 12 | 1 1 0 0 |   0   0 |  0
 13 | 1 1 0 1 |   1   0 |  0
 14 | 1 1 1 0 |   0   0 |  0
 15 | 1 1 1 1 |   1   0 |  0

concatenate the last column (the && column) in reverse order: 0000_1010_0000_0101
convert to hexadecimal: 0a05

Pretty cool, right?
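
If assembly isn't your thing, the same lookup in Rust would look roughly like this (a sketch for illustration; the interpreter does this in assembly):

// The same LUT trick in Rust: index by condition code, then shift the
// 16 bit truth table right by the nzcv value and look at bit 0.
const CONDITION_TABLE: [u16; 16] = [
    0xf0f0, 0x0f0f, 0xcccc, 0x3333, 0xff00, 0x00ff, 0xaaaa, 0x5555,
    0x0c0c, 0xf3f3, 0xaa55, 0x55aa, 0x0a05, 0xf5fa, 0xffff, 0xffff,
];

pub fn condition_holds_lut(cond: u32, cpsr: u32) -> bool {
    let nzcv = cpsr >> 28; // n, z, c, v are the top 4 bits.
    (CONDITION_TABLE[(cond & 15) as usize] >> nzcv) & 1 == 1
}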

By the way, I didn't trust the AI to get the constants right (even though, surprisingly, it did), so I wrote a quick Python script to generate the table:

def gen_condition_table():
    def row(holds):
        value = 0
        for i in range(16):
            n = (i >> 3) & 1
            z = (i >> 2) & 1
            c = (i >> 1) & 1
            v = (i >> 0) & 1
            if holds(n, z, c, v):
                value |= 1 << i
        return f"    .hword 0x{value:04x}"

    rows = [
        row(lambda n, z, c, v: z == 1),
        row(lambda n, z, c, v: not (z == 1)),
        row(lambda n, z, c, v: c == 1),
        row(lambda n, z, c, v: not (c == 1)),
        row(lambda n, z, c, v: n == 1),
        row(lambda n, z, c, v: not (n == 1)),
        row(lambda n, z, c, v: v == 1),
        row(lambda n, z, c, v: not (v == 1)),
        row(lambda n, z, c, v: (c == 1 and z == 0)),
        row(lambda n, z, c, v: not (c == 1 and z == 0)),
        row(lambda n, z, c, v: n == v),
        row(lambda n, z, c, v: not (n == v)),
        row(lambda n, z, c, v: (n == v and z == 0)),
        row(lambda n, z, c, v: not (n == v and z == 0)),
        row(lambda n, z, c, v: True),
        row(lambda n, z, c, v: True),
    ]
    return "\n".join(rows)

Fuzzing

no-tests.png

Believe it or not, the code I write tends to have bugs. Which is a problem, because I want my debugger to work reliably. But I also don't like writing tests. At least not the things that people usually mean when they say "tests" - unit tests and the like.

The issue with unit tests is, I wrote the code, so if I also write the tests with the same assumptions, I'm essentially doing double entry bookkeeping, which does catch some mistakes. But it can't expose fundamental flaws in my understanding.

Enter random testing. The idea behind random testing is to continuously throw randomly generated inputs at your program to make sure it doesn't crash. On its own, this isn't a particularly interesting invariant to test. But if you litter your code with asserts, it can start to become a very powerful signal.

So let's apply this idea to the interpreter. Unit tests are obviously a non-starter. Not only because I would have to write a lot of them, but also because, like I said, that doesn't actually test my understanding. Here's a better approach: What if we just randomly picked an instruction, then executed that instruction in the interpreter and on the native hardware, and finally asserted that the results were equal? This is known as differential testing or differential fuzzing. It's a very powerful technique, but it does rely on another implementation of the same algorithm being available, which, thankfully, is the case for my interpreter.

Now, while this idea of comparing the interpreter against the native hardware sounds quite simple, there are actually a few interesting challenges when you try to implement it.

I'm stuck, step interpreter

Let's say we consult our random number generator, and it spits out add x19, x4, x12. If we're going to run that instruction in the interpreter and on the hardware in order to compare their results, we'll need to make sure all the inputs are the same. In this case, the inputs are the registers x4 and x12.

But because we don't want to trust our understanding of this instruction, we'll just generate values for the entire register file: the general purpose registers x, the stack pointer sp, the flags cpsr, the simd registers v, and the simd flag registers fpsr and fpcr. That way, if the instruction accesses some unexpected register, we'll reliably get differences between the interp and native.

For the interpreter, we can just fill this struct with random values and then tell it to run the instruction:

pub struct AArch64State {
    pub x: [u64; 31],
    pub sp: u64,
    pub cpsr: u32,
    pub v: [Vec128; 32],
    pub fpsr: u32,
    pub fpcr: u32,
}
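
Filling it is as boring as it sounds; something along these lines (a sketch assuming the rand crate and a hypothetical From<[u8; 16]> impl for Vec128):

fn random_state() -> AArch64State {
    // sketch: assumes `rand` and a `From<[u8; 16]>` impl for Vec128.
    AArch64State {
        x: core::array::from_fn(|_| rand::random()),
        sp: rand::random(),
        cpsr: rand::random(),
        v: core::array::from_fn(|_| Vec128::from(rand::random::<[u8; 16]>())),
        fpsr: rand::random(),
        fpcr: rand::random(),
    }
}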

But for native, we actually need those values to be in the physical registers:

.global _run_native
_run_native:
    // x0 holds a pointer to a context with
    // all the register values.
    
    // load the `x` registers,
    // (nb: x0 and x1 are loaded later.)
    ldp x2,  x3,  [x0, {CTX_OFF_ARCH_X} +  2*8]
    ...
    ldp x28, x29, [x0, {CTX_OFF_ARCH_X} + 28*8]
    ldr x30,      [x0, {CTX_OFF_ARCH_X} + 30*8]

    // the stack pointer,
    ldr x1, [x0, {CTX_OFF_ARCH_SP}]
    mov sp, x1

    // the status flags,
    ldr w1, [x0, {CTX_OFF_ARCH_CPSR}]
    msr nzcv, x1

    // the simd registers,
    ldp q0,  q1,  [x0, {CTX_OFF_ARCH_V} +  0*16]
    ...
    ldp q30, q31, [x0, {CTX_OFF_ARCH_V} + 30*16]

    // and the simd flags.
    ldr w1, [x0, {CTX_OFF_ARCH_FPSR}]
    msr fpsr, x1
    ldr w1, [x0, {CTX_OFF_ARCH_FPCR}]
    msr fpcr, x1

    // now that we're done with x0 and x1, load their values.
    ldp x0, x1, [x0, {CTX_OFF_ARCH_X} + 0*8]

    // finally, run the instruction.
    add x19, x4, x12

And now we're stuck…

In order to compare results, we need to write all the register values back to the context, but we no longer have a pointer to the context. We lost that when we loaded the values for x0 and x1 right before executing the instruction.

And what's worse, we can also no longer return to the Rust code that called the run_native function. That's because we overwrote the x30 register, also known as the link register lr, which held the code address we were supposed to return to.

Usually, in assembly functions we first store all the register values that we need to restore later (like the link register) on the stack before executing the rest of the function. But in this case, the stack pointer was also overwritten, because the instruction we're testing may also access that.

One idea would be to just not overwrite the stack pointer: whenever an instruction accesses the stack pointer, we simply don't test that instruction. This could work, but it would be a bad idea. The stack pointer is encoded as register number 31 on AArch64, and depending on the instruction, register 31 can instead mean the zero register xzr (which always reads as zero). Telling the two apart is exactly the kind of thing the interpreter has to get right, so we'd be missing out on testing crucial functionality.
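
To make the x31 situation concrete: the same register number means different things depending on the instruction. For example, add (immediate) treats register 31 as sp, while add (shifted register) treats it as xzr. In the interpreter, that distinction looks roughly like this (function names made up; using the AArch64State struct from above, the real code uses the x_or_sp! style macros you saw earlier):

// Register number 31 is sp for some instructions and xzr for others.
fn read_x_or_sp(state: &AArch64State, r: u32) -> u64 {
    if r == 31 { state.sp } else { state.x[r as usize] }
}

fn read_x_or_zr(state: &AArch64State, r: u32) -> u64 {
    if r == 31 { 0 } else { state.x[r as usize] }
}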

A better idea would be to use a register like x0, which isn't any different from the other x registers. Conveniently, x0 also holds the "context" where we're supposed to write the register values back to. So we could just reserve some more space there for the return address (and any other registers we need to save) and not run instructions that mutate x0.

In practice, here's what I do:

arch = random_state();
arch.x[0] = ctx_ptr;

run_interp();

if arch.x[0] == ctx_ptr {
    run_native();
    
    compare_results();
}

The idea behind checking whether x0 is still equal to ctx_ptr instead of using instruction metadata is just that it's simpler and arguably more robust. Though since the interpreter is based on the metadata, if the metadata is incorrect, and the native code actually mutates x0, that would result in undefined behavior. But the program would most likely crash, so it's fine.

Results

The primary goals of this rewrite were to improve compile times and improve instruction coverage, so let's see how we did.

For instruction coverage, the interpreter now supports 2,348/3,428 instructions (68%). Yes, the total number of instructions has increased. That's cause I "flattened" some instructions, but that doesn't really matter right now. What matters is that number went up! We went from 25% to 68% of instructions implemented.

Now, let's see what happened to the compile times: compile times improved

With the old Rust based interpreter, there was a very large difference between debug and release builds. This gap is now much smaller. And the incremental release builds are definitely bearable now at around two seconds.

It's not really worth noting that the clean debug build times increased, because that's a very fixable problem. Most of that time is spent on a 24 kloc Rust file that contains the instruction metadata. It's just a giant const. I didn't think that was gonna be a problem, but it was, cause Rust. But I can relatively easily convert that file to a binary file and include_bytes! it, which would remove that overhead.

Let's have a look at the runtime performance: runtime perf improved

This is the runtime performance of the replay system. The three benchmarks on the right are relatively short recordings, so the performance gaps are smaller there. For kibiold_life, we can see a 2x improvement, which is solid. But I don't really know what's going on with clang_onelua - I feel like the new interpreter should be doing better there. I'll have to investigate that at some point.

Quick sidenote: The replay system uses a software page table for memory. That's why the execution speeds (even with a JIT) are much lower there than you may expect. For the recorder, I've seen speeds up to 300 M/s with the new interp, which is still very slow, but it's a lot faster than before (I think it was around 120 M/s).

All in all, I'm very happy with the results! More instructions, shorter compile times, and faster execution speeds. What more could you want?

Oh I know what you could want: You could want to join the discord, where I post links to all blog posts in the Announcements channel, so you don't miss any 🌚

Alright, that's it!


This article is about the first 1.5 years of work on my time travel debugger.

Introduction

The story begins on february 23rd 2023, when d7 posted a link to the tomorrow corp tech demo in the handmade network general channel. The video was truly a mind expanding experience. They showcased a tool that allowed you to scrub through recordings of someone playing their game as if it was a video. But these "recordings" didn't just contain screen captures, no, they contained "the entire execution of the game". You could select any point in time, and their tool would reconstruct the entire game state at that point, including the function callstack, local variables, and the partial draw state of the window. You could single step through the renderer code and see game objects appear on screen like in renderdoc.

And you could do something that was impossible in every other debugger that I had seen so far: you could step backwards. I was way too familiar with the experience of carefully getting my program into the right state in a debugger, only to forget what some value was a minute ago, or to step too far by accident and have to go through the entire procedure again. Very frustrating!

But it gets worse: when your program crashes due to memory corruption or other state bugs, a regular debugger can easily show you the state of your program at the time of the crash. Unfortunately, the actual cause of the crash - the corrupting write - could have happened a long time ago. But that's no problem in the tomorrow corp's debugger! Whenever they had a crash on some bogus value, they could just tell their debugger to run the program backwards from the crash to the point in time where the corrupting write took place. I've had bugs in my code where this feature alone could have saved me several hours!

Given how excited I was about this idea, you might assume that I started working on my own time travel debugger immediately. But for some reason I didn't. Instead, I kept working on my programming language experiments for a while. By late 2023 though, I decided that I wanted to work on something "commercially viable". At first, I toyed with an idea for a web based data visualization tool, but my heart wasn't in it. I had started that project with the sole goal of making money, and it wasn't even particularly promising in that regard. So by march of 2024, that project had entered the graveyard as well.

But as I was looking around at the nearby tombstones, I had an idea: a few rows over lay a WebAssembly interpreter that I had built for an interactive programming environment type thing in summer of 2023. It was in part inspired by the tomorrow corp demo, but there was more to the idea that I don't really remember right now. I thought, hmm, what if I went all in on a time travel debugger for wasm? Strip all those other things, just something to help people debug their code. Why wasm, you may ask? Great question! To understand that, let's talk about how time travel debugging actually works.

The Fundamentals of Time Travel Debugging

We'll define a time travel debugger as a tool that can reconstruct past program states. The easiest way to implement such a tool would be to simply record the relevant program states at runtime and dump them into a giant file. If you think I just described print debugging, then I would agree. And I would argue that print debugging is the most primitive form of time travel debugging there is.

But print debugging has its limitations. The biggest issue being that you have to modify the program and rerun it to capture additional information. If your program is nondeterministic - by design or by accident - this can be a giant pain. So... What if we just recorded everything? Then we could easily use that data to generate any print log by selecting a subset of the recorded values. Or we could build that "a step debugger that can go backwards" experience I described in the intro.

Well, it turns out, "everything" is a lot of data. Modern processors execute billions of instructions per second. So even if we only captured a single 64 bit register value for every instruction executed, we'd already generate more than 8 GB/s of data (and massively slow down the program in the process).

Fortunately, we don't actually need to record "everything", as computers are (mostly) deterministic. That is, given the same inputs, they will produce the same outputs, which is kinda the whole point of computers, right. So when you have a deterministic machine like that, all you need to "record" is which program you executed and what inputs you gave to it. Then you'll be able to perfectly reproduce the execution of that program by just running it again and providing the same inputs in the same order.

But as I hinted at with the word "mostly", computers aren't actually fully deterministic. A trivial example of this is the rdrand instruction on x86, whose sole purpose is to provide an unpredictable random value. Similar issues are posed by file and network access (though arguably those could be seen as "inputs to the program"). And perhaps less obviously, multi-threaded execution is another source of non-determinism. Even in well synchronized programs, there's generally no guarantee which thread is allowed to enter a critical section first, for example. But fear not, we'll solve all these problems with my favorite problem solving strategy: just not having the problem.

Enter WebAssembly. Given that wasm was designed to be sandboxed by default, it almost can't help but be deterministic. I say almost because some things like NaN-propagation are not guaranteed to be deterministic by the spec. But in practice, if you run the same wasm program on the same machine with the same inputs, you will get the same outputs. Because everything else that can result in non-deterministic behavior has to come from outside the wasm instance and pass through a well-defined host api. The host can write values to wasm memory or wasm globals, call wasm functions, and provide host functions to be called by wasm from which it can return values. Super straightforward. If you record all of those "host -> wasm" interactions as well as the wasm module you executed, then you can perfectly replay the execution of a wasm program. It didn't take me long to get this working, nor was I the only one in the handmade network with this idea.
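
To make that a bit more concrete, the events such a recording has to capture look roughly like this (a sketch with made-up names, not my actual format):

// The host -> wasm interactions that need to be recorded for replay.
enum RecordedEvent {
    HostWritesMemory { addr: u32, bytes: Vec<u8> },
    HostWritesGlobal { index: u32, value: u64 },
    HostCallsFunction { func_index: u32, args: Vec<u64> },
    HostFunctionReturns { values: Vec<u64> },
}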

Early Prototypes

After about a month of rewriting my wasm engine and implementing a basic gui library, I had the foundation in place: I was able to record and replay programs, and I had something to draw information on the screen. Now the question was what to display.

One of my pet peeves is when people use new technologies to implement old ideas. A classic example of this is when application forms that used to be done on paper are "digitized". The most natural way to do that is to just create a web page with text input elements for the blank slots of the physical document. There's nothing intrinsically wrong with this approach, but my issue is when people stop there and don't think about the opportunities that the new medium offers.

I once reworked the application form for my former high school. Their initial version just had blank text fields for the course selection. This meant that the user had to know which combinations were valid and which courses existed in the first place. In my rework, I solved these problems by using select elements for the courses and making the form automatically update the remaining options based on what had been selected so far. The system worked great!

So when it came to the design of my debugger, I wanted to avoid the mistake of throwing the new technology of time travel at the old idea of the step debugger. Just adding the "go backwards" functionality wasn't enough for me.

But what should you do instead, how do you take full advantage of the opportunities of new technologies? Great question, I wish I had a solid answer 🌚. This is by no means a trivial problem, but a good starting point is to ask yourself: what can you do now that was impossible before?

The obvious answer is that time travel allows us to go backwards, which wasn't possible before. But this answer is actually still stuck in the old mindset. It assumes that, like a step debugger that's attached to a native process, we can only look at one state at a time. But this is no longer true! We can now look at any number of states at once and, for example, compare them: what changed about this data structure between here and there? Or like with the print debugging idea from the previous section, we could replay the program, capture data over time, and then plot that. The general principle there is that we can now map time to space. And that's the idea that I started with.

My goal was to build an interface that would show the entire execution of the program from top to bottom. Any point in time in the program would have a corresponding point in space in that interface. And you could scroll up and down to see what happened around that point in time. Here's what I came up with:

call tree thing part 1

Don't worry, it looks just as horrible and confusing to me as it does to you. It's just a prototype.

The idea here is that we start with the root function call to _start, draw its code, and whenever it calls another function (looper in this case), we allow the user to expand that call in place and draw that function's code, indented by one level. If there's a loop, the loop's body is repeated however many times the loop executed. Most of the clutter in the ui comes from the loops being repeated without any visual hints like boxes around the iterations and iteration counts. Pagination by default for loops would probably also be a good idea.

Here's a screenshot of a later iteration of this interface where I added line numbers, file names, and inline buttons for the expandable function calls. The "we have a gui library at home" vibes are strong in this one:

call tree thing part 2

Even though this interface was very primitive, I remember already using it to debug some issue with the debugger itself. Something about being able to scroll through the execution made it easier to understand what my program was doing than if I added prints or used breakpoints.

But I quickly ran into a problem: computers are fast. The interface worked well for small test programs, but for larger programs, you could easily scroll for minutes or perhaps even hours (if my implementation could handle that). There was just too much information, and you could no longer get a quick overview of what the program did or where you were in the execution.

So next up, I added a timeline:

call tree with timeline

Just, uh, ignore all the rendering glitches, will you? Thanks!

The timeline now gave you a high level overview of the program's execution. You could zoom out to see everything, or you could zoom in, find some specific function call, then click on it, and the code view below would jump to that point in time.

The project was definitely going somewhere, but it was nowhere near usable as a real debugger (a theme that would continue to this day - ahem, foreshadowing). So far, I had only focused on visualizing control flow. There was no way to get even a single value out of the program. So for the upcoming visibility jam, my goal was to explore visualizing the memory state of programs.

hexplorer

You can read more about that experiment over on the memory hexplorer project page, I also have some demo videos there. The basic idea was that I used debug info to visualize the data structures in memory. I started with globals and the local variables of functions in the callstack and recursively followed pointers from there.

Like the code tree ui thing from earlier, the memory hexplorer was a very zoomed in look at memory. I imagine it would probably be a good idea to come up with an interface analogous to the timeline to get a high level overview. Or simply draw this interface zoomed out with level of detail (like ben suggested on the jam recap stream, iirc).

Switching to Native

So what should you do when you have two working prototypes, one for visualizing control flow, and one for visualizing memory state? Exactly, you should throw everything away, abandon wasm, and embark on a yearlong journey of getting back to where you were at the end of month one, but for native code.

Look, we do these things not because they are easy, but because we thought they were going to be easy. Never could I have guessed how much more complicated recording and replaying native code was going to be compared to wasm. But I sure as heck had to give it a try. Because I actually didn't know a single person who shipped wasm professionally, and most of my own programs didn't even run on wasm. So supporting native programs was the clear winner from a business perspective. The only reason I hadn't done native from the start was because I didn't know whether it was actually possible (no, I didn't do my market research homework). But in retrospect, I'm glad that I started with wasm, because it allowed me to explore interesting ideas before diving into the deep end.

So after the visibility jam, about 5 months into the project, I started work on record/replay for native code. What platform? Macos/aarch64. Why? Because that's what I use 🌚. Someone should give me an award for making smart business decisions.

Trying to deal with the sheer size of Aarch64

Alright, so let's get started with the story of going native. One of the things that makes wasm nice to work with is that it has a very small number of instructions, around 170 for the baseline version. Adding simd support quickly makes that jump to around 440. But that still pales in comparison to aarch64, which, depending on how you count, has around 2000-4000 instructions for baseline + simd.

There are several different approaches to record/replay for native code, but for now it's enough for you to know that the approach I use requires me to implement an interpreter for all those 2000-4000 instructions. Naturally, I wasn't going to do that by hand. But that doesn't mean that the approach I went with instead was any better of an idea. See, arm specifies their instruction set in a programming language called ASL. It looks something like this:

integer d = UInt(Rd);
integer n = UInt(Rn);
integer datasize = if sf == '1' then 64 else 32;
bits(datasize) imm;

case sh of
    when '0' imm = ZeroExtend(imm12, datasize);
    when '1' imm = ZeroExtend(imm12:Zeros(12), datasize);

bits(datasize) result;
bits(datasize) operand1 = if n == 31 then SP[] else X[n];

(result, -) = AddWithCarry(operand1, imm, '0');

if d == 31 then
    SP[] = result;
else
    X[d] = result;

That's the definition of the add immediate instruction.

So what do you think someone who's worked on programming languages for a few years thinks when he sees that? "I could write a compiler for that," of course. And that's precisely what I did.

5,232 lines later and you wouldn't believe it. It actually kinda worked. I was able to execute the first two instructions of a real program. Unfortunately, most programs are more than two instructions long, and my interpreter wasn't going to do much more than that, not with this approach at least. (also, I'm kinda glad it didn't work out, cause that ASL code sorta has a proprietary license, and yeah, I wasn't planning on getting sued for this.)

But before we move on, a few more words about what the idea here was. Cause you can probably imagine that interpreting that entire code up there for a single add instruction would be very slow compared to the single cycle that the hardware can do an addition in. So my plan was to optimize that code, and I was going to optimize it so hard that it would just fold back into the original instruction. I mean, we can dream, right? (btw, this whole approach was in part inspired by deegen)

Another benefit of this approach would have been that it was cross platform, so I would have been able to replay an aarch64 recording on an x86 machine. That's an idea I finally let go of a few months ago, but at this point in the story, I was still fixated on getting cross platform replay working eventually. And because I just wanted to get something, anything running, I wrote an interpreter by hand in rust.

Here's a screenshot of that interpreter executing the first 156k instructions of a real program, making it to the first syscall (mprotect):

made it to the first syscall

Fast forward a few months and I had a working record/replay system for native code. There weren't many programs I was able to run, but hello world, lua, and even clang -c worked, as long as you didn't stray from the code paths that I had painstakingly implemented support for with my bare hands.

A Cli Debugger

Next up, I implemented a very basic cli step debugger:

cli step debugger

It could do most of the important things for a step debugger: you could seek to any instruction, you could step forwards and backwards, you could run until some address was executed (ie set a breakpoint) or until some address was written to (ie set a watchpoint).

I hadn't implemented support for debug info yet, but that didn't matter, as the cli debugger wasn't long for this world anyway. It was based on an interesting "technological" idea, but ultimately, that also ended up being its achilles heel.

I came up with a data format that allowed me to actually execute the program both forwards and backwards. And I don't mean going backwards by jumping back to the start and executing one instruction less. No, this data format encoded the "effects" of each instruction in a way where they could be applied forwards and backwards (xor, iykyk). But it ended up requiring around 10 bytes per instruction, and as we talked about a while ago, at billions of instructions per second, that puts us into the 10s of gigabytes per second range, which is just not feasible.
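
For the curious, here's my reading of the "xor, iykyk" hint as a sketch (made-up types, not the actual format): for every value an instruction overwrites, you store old ^ new, and then applying the record is literally the same operation in both directions.

// Sketch of the xor trick. `regs` stands in for whatever state the
// instruction touched (registers, memory, flags).
struct RegEffect {
    reg: usize,
    xor: u64, // old_value ^ new_value
}

fn apply(regs: &mut [u64; 32], effect: &RegEffect) {
    // forwards:  old ^ (old ^ new) = new
    // backwards: new ^ (old ^ new) = old
    regs[effect.reg] ^= effect.xor;
}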

What Functions?

No you don't know what you've got
Until it's gone

โ€“ Linkin Park

Sometimes we have to lose things before we can truly appreciate them. One such thing was knowing what code I was executing, back when I used wasm. And by that I don't mean that I now use a crap ton of libraries. What I mean is that wasm code comes bundled in wasm modules. Each module contains a list of functions. So when you run a wasm module, those functions are the only code you have to be concerned with. Like for example, when showing the user which function is currently executing - it can only be one from that list.

But wasm is just "assembly for the web", how much more difficult could this be for real native code? Turns out, a lot.

First off, native code can be loaded and unloaded at runtime. Game developers sometimes use this to tweak their code while the game is running. For a regular step debugger, this isn't really an issue. Debuggers get notified by the OS whenever the loaded modules change. Then you can just parse the current list of modules, and voila, you know which functions there are. As a time travel debugger, this gets slightly more complicated, as the answer to the question of which functions there are now depends on when that question was asked. Though this is at most a minor inconvenience. At least, until we talk about jitting.

See, native code is just data. Nothing stops you from writing a bunch of bytes to some memory buffer, marking it as executable, and telling the cpu to execute it. Of course, if you don't know what you're doing, that will most likely just crash your program. But if you do it with skill, we call it jit compilation. Unfortunately, not all jits tell the operating system about the functions they generate. And because there's generally no way to tell where one function begins and another one ends based only on the current program counter, debuggers will usually just throw their hands up when you stop in jitted code and say "well, we're at address 0x19211cb44, and these are the next instructions we'll execute. Sorry, that's all I know."

But no problem, we don't actually care about the jitted code, that's written in assembly, and who wants to read that. Let's just go up the callstack, back to the C code we're familiar with. Uh, what do you mean we can't do that? Well, sometimes it is possible, but not always. Thing is, on wasm, the callstack is a real thing which lives outside the wasm memory, somewhere in the wasm engine's internal data structures, so you can always find the callers. On native however, the callstack is nothing but a convention. Usually, a register known as the frame pointer (fp on aarch64, rbp on x86) holds the address of the base of the current stack frame, and there, a copy of the caller's frame pointer and return address (lr) is saved. This forms a linked list of active functions:

    ### main's stack frame:
              saved fp   saved lr
.-> 0x123400  0x0        0x0       // bottom of stack.
|   0x123410  variables
|
|
|   ### foo's stack frame:
'-            saved fp   saved lr
.-> 0x123420  0x123400   main+16
|   0x123430  variables
|   0x123440  more variables
|
|
|   ### bar's stack frame:
'-            saved fp   saved lr
.-> 0x123450  0x123420   foo+120
|   0x123460  stuff
|
|
|   ### registers
|   pc = bar+8
'-  fp = 0x123450
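
Walking that list to reconstruct the callstack is then just pointer chasing (a sketch; read_u64 stands in for reading 8 bytes of the target's memory):

fn walk_frame_pointers(mut fp: u64, read_u64: impl Fn(u64) -> u64) -> Vec<u64> {
    let mut return_addresses = Vec::new();
    while fp != 0 {
        let saved_fp = read_u64(fp);     // caller's frame pointer
        let saved_lr = read_u64(fp + 8); // return address into the caller
        if saved_lr == 0 { break; }      // the 0x0/0x0 frame at the bottom
        return_addresses.push(saved_lr);
        fp = saved_fp;
    }
    return_addresses
}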

But using the fp register as the head of this linked list is just a convention. apple does ask you to respect it on their platform, but the police won't come knocking down your door if you don't. My own jit that I wrote for the debugger does not follow this convention, and I haven't had to use my get out of jail free card yet. Though it does mean that debuggers like LLDB can't correctly reconstruct the callstack when the program is stopped in my jitted code. And that's a problem I'd like to avoid with my debugger.

Who needs Debug Info, anyway

At this point, it's probably no secret that I'm breaking every rule in the tech startup playbook. But so what, I'm having fun along the way, and isn't that the whole point? One of the rules is that you're supposed to have one clear avatar and one clear problem that you're solving for them. Which I did. It was system programmers on macos/aarch64, who worked on complex problems and needed better tools to understand what their code was doing.

But as I kept working on this thing, I realized that it could also be really useful for reverse engineering. As george hotz, who built his own time travel debugger based on qemu, put it: "if you're running the binary 10 times less times than everybody else, it's quite helpful."

Another advantage of dynamic tooling is that it knows what code did actually run. When a static disassembler runs into a virtual call, it generally does not know what the set of called functions can be. And when an obfuscated program uses some kind of "unpacking technique", where the real code is computed at runtime (kind of like in a jit compiler), a static tool may be missing the most important functions. Though you can probably capture that dynamic data with another tool and feed it to your favorite static analyzer.

I basically have zero experience in reverse engineering, so take that previous paragraph with a grain of salt. So far, I've managed to avoid the RE rabbit hole quite well, but thinking about the reverse engineering use case has led me to an interesting design principle: make the debugger as useful as possible before using any debug info.

Debug info tells the debugger what functions and types there are in the program, how to unwind the callstack, and much more. Of course, for some things, like function and variable names, you cannot get around using debug info. But a surprising amount of information can actually be inferred just from the machine code alone.

By inspecting the branches in the executed machine code, you can mostly figure out where the functions are. This is because code that isn't obfuscated to the nth degree tends to use bl and ret instructions to call functions and return from them (call and ret on x86). That gives you the start and end points of functions, so if you take the closure over direct branches (b label, jmp label) and conditional branches (b.cc label, jcc label) between them, that gives you a decent approximation of a function's code. This is made slightly more complicated by jump tables and tail calls, but that's a topic for another time.
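
A toy sketch of the first step, collecting bl targets as function entry points (decode_branch is a stand-in for a real decoder, not something that exists in my codebase):

use std::collections::BTreeSet;

enum BranchKind {
    Call { target: u64 }, // bl label
    Jump { target: u64 }, // b label, b.cc label
    Ret,                  // ret
    NotABranch,
}

fn collect_function_starts(
    instrs: &[(u64, u32)], // (address, instruction word) pairs that were executed
    decode_branch: impl Fn(u64, u32) -> BranchKind,
) -> BTreeSet<u64> {
    let mut starts = BTreeSet::new();
    for &(addr, word) in instrs {
        // every direct call target is (almost certainly) a function entry.
        if let BranchKind::Call { target } = decode_branch(addr, word) {
            starts.insert(target);
        }
    }
    starts
}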

If that sounds somewhat complicated, that's because it is. And given that this is a debugger for debugging programs where debug info is available, doing that kind of bottom up analysis may seem entirely unnecessary. But remember the jit problem from the previous section? We don't necessarily know where jit functions begin and end, unless the jit tells us about that. If we can infer those jit function bounds from the machine code however, we don't need it to, and we'd be able to provide a nicer experience for debugging programs with jits out of the box. So that's the reason I went down this rabbit hole.

I like Dataflow Analyses 🚅

(heads up, technical section ahead)

Okay great, so now we can infer where the functions are in the program without anyone's help. Unfortunately that's not enough. While we can use this information to identify what function we're currently in, we can't necessarily reconstruct the other functions on the callstack. We talked about how there is this convention of the linked list of frame pointers and return addresses on macos, but we also talked about how that isn't always respected.

Here's the thing though, most functions look like this (regular function):

// allocate stack space and save caller registers on the stack.
stp     x20, x19, [sp, -0x20]!  // this updates the sp
stp     fp, lr, [sp, 0x10]
add     fp, sp, #0x10           // <- frame pointer list head.

// function body.

// restore saved registers, free stack space, return.
ldp     fp, lr, [sp, 0x10]
ldp     x20, x19, [sp], 0x20    // this updates the sp
ret

Or like this (leaf function):

// no stack setup.

// function body (does not call other functions).

// return.
ret

That is, most functions follow conventional patterns, so one strategy could be to just analyze the machine code and look for one of those patterns. Of course, if a function uses a different pattern, that won't work. So let's use a slightly more advanced strategy: abstract interpretation.

We start at the entry point of the function, and we'll track a symbolic value for each register. At the start, each register will hold its initial value. For example, sp = sp_0. (the _0 suffix means "initial value")

The first instruction, stp x20, x19, [sp, -0x20]!, decrements the stack pointer by 0x20, and then stores the registers x20, x19 at that address. So the only register that changes is the stack pointer, and we record sp = sp_0 - 0x20.

For the purposes of unwinding, we also care about where the initial values of registers are stored. And because x20 and x19 weren't modified yet, we record loc[x20_0] = mem[sp_0 - 0x20] and loc[x19_0] = mem[sp_0 - 0x18].

Now we just repeat that until we reach the function exit. And that is actually more or less what LLDB does, when it doesn't have debug info.

There's just one problem: control flow.

	cbz x0, 1f       // if x0==0, jump to label 1
	add x0, x1, #2   // x0 = x1 + 2
	b 2f             // jump to label 2.
1:
	mov x0, x13      // x0 = x13
2:

Here we either assign x0 = x1 + 2, or we assign x0 = x13. So what should the value of x0 be after the branch?

The compiler nerds are screaming "φ!" right now, but we don't really need that here. In my analysis, I just set x0 = ? after the branch.

LLDB's analysis works slightly differently. iiuc, its answer is x0 = x1 + 2. No, it's not a particularly advanced analysis.

But the control flow problem actually gets even worse when there are loops:

1:
	cmp x4, 7         // compare x4 to 7
	b.hs 3f           // if x4 >= 7, jump to label 3
2:
	add x5, x5, #1    // increment x5
	ldr x4, [x7, x5]  // something else
	b 1b              // jump back to label 1 (continue looping)
3:

Here's what the control flow graph for that snippet looks like: control flow graph

Let's see what the abstract value of the register x5 is, as we explore this control flow graph.

Initially, upon entry to label 1, let's say x5 holds its initial value: x5 = x5_0.

This value is propagated to the successors 2 and 3.

Then we explore label 2. That increments x5, so now we have x5 = x5_0 + 1 at the end of label 2.

This value is propagated to the successor 1. But uh oh! We already have a value here: x5 = x5_0, and that's different, so now we have to collapse x5 = ?.

But we already told label 3 that x5 = x5_0... So we now have to go and tell it "my bad, we actually don't know what x5 holds." and label 3 would then have to propagate that to its successors, and so on.

So we actually have ourselves a nice little dataflow analysis here! (I would put a link to learn more about data flow analyses here, but I assume the wikipedia article won't be particularly helpful. So just google it on youtube or something)
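
To make the shape of that concrete, here's a toy version of the abstract values and the merge rule (just the core idea, nothing close to the real thing):

// A register either holds "initial value of register R plus a constant",
// or we don't know anything about it.
#[derive(Clone, Copy, PartialEq)]
enum AbstractValue {
    InitialPlus { reg: u8, offset: i64 }, // e.g. sp = sp_0 - 0x20
    Unknown,                              // the "?" from above
}

// Merging values flowing in from two predecessors: if they agree, keep
// the value, otherwise collapse to Unknown. If that changes a block's
// state, its successors go back on the worklist and get re-analyzed.
fn join(a: AbstractValue, b: AbstractValue) -> AbstractValue {
    if a == b { a } else { AbstractValue::Unknown }
}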

Though "little" doesn't describe my implementation particularly well. Version 1 was around 2 kloc, and it got so complicated that I needed to build a custom debugger for it:

dataflow debugger

Present Day

We've almost made it through the first 1.5 years! There are still quite a few things we haven't talked about, like the challenges of recording and replaying native code, accelerating everything with a custom jit, and the crazy ways I now cope with the 2000-4000 aarch64 instructions. But I've probably already lost > 90% of readers, so let's bring this thing in to land.

A few weeks ago, I built another prototype for the 2025 wheel reinvention jam: asm-trace. It was an attempt at building an interface like the "call tree thing" I made for wasm way at the beginning. You can check that out (or one of the many other things I've linked) if you want more right now.

Currently I'm rewriting the record/replay system for about the 3rd time. Then the plan is to implement more solid versions of those machine code analyses I described above (function & unwind inference). And after that, the focus will be to get an mvp out the door. At least, that's the current plan. My plans tend to change 🌚

To be notified about future blog posts, you can join the discord.

Alright, that's it!
