This article is about rewriting my AArch64 interpreter in raw assembly.
Introduction
As I've mentioned in the previous article, my time travel debugger is based on an interpreter for AArch64 native code. The primary motivation for this is to be able to reliably record the side effects of non-deterministic instructions and of system calls. There are other ways to do this besides using an interpreter, each with their own pros and cons. For the sake of this article though, we'll just assume that using an interpreter (accelerated with a JIT) is a good idea.
But I'll tell you what is not a good idea: Writing said interpreter in Rust. That's what I did originally, and it worked for a while, until it didn't.
Floating Point Troubles
The original idea behind writing the interpreter in a high level programming language was to be able to run it on other platforms like x86 for "cross-arch replay". But that idea didn't last long, as I soon had to implement several floating point instructions to be able to run some more complex programs.
You may think, well, what's the problem, just implement those FP instructions with f32 and f64 using their corresponding built-in operations. Rust's floating point types are defined to be IEEE 754 compliant, which is the floating point standard that modern processors implement, so using them should be perfectly portable, right? Nope. Unfortunately there are parts of the IEEE 754 standard that are "implementation defined". And if you read the Rust docs I linked carefully, you'll find that these parts are also "not guaranteed to be portable or even fully deterministic" by the Rust compiler. This is the same issue I've talked about with WebAssembly in the last post.
All that is to say: There are differences between the floating point implementations on x86 and AArch64. As a result, Rust's floating point types aren't guaranteed to be portable or deterministic. So to avoid all that trouble, I just ended up implementing the floating point operations with inline assembly. Which of course meant that my code was no longer portable.
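To give you an idea of the pattern, a 32-bit float add ended up looking roughly like this (a sketch, not the exact code from the interpreter):

use std::arch::asm;

// pin the operation to the host's AArch64 FP behavior by emitting the
// instruction directly instead of relying on Rust's `+`.
// (a sketch, not the interpreter's actual implementation.)
fn fadd_f32(a: f32, b: f32) -> f32 {
    let result: f32;
    unsafe {
        asm!(
            "fadd {r:s}, {a:s}, {b:s}",
            a = in(vreg) a,
            b = in(vreg) b,
            r = out(vreg) result,
        );
    }
    result
}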
Compile Time Troubles
Rust isn't exactly known to have great compile times. But I thought how bad could it be? Surely that only applies to code that's littered with traits, generics, and complex lifetimes. Turns out, that couldn't be further from the truth. Even completely braindead code, like that of an interpreter, is insanely slow to compile.
Here's what the implementation of a typical instruction looked like:
// ADD_32_addsub_imm
let sh = (instr_word >> 22) & 1;
let imm12 = (instr_word >> 10) & 4095;
let Rn = (instr_word >> 5) & 31;
let Rd = (instr_word >> 0) & 31;

let op1 = x_or_sp![Rn] as u32;
let op2 = if sh == 0 { imm12 } else { imm12 << 12 } as u32;
let result = op1.wrapping_add(op2) as u64;
x_or_sp![Rd, result];
Just a few bit operations, some logic, some macros. Nothing special. But my compile times were miserable! Every time I touched some code related to the interpreter, I got to twiddle my thumbs for 11 seconds.
At some point, I got so annoyed by it that I decided to profile the compiler to see what the heck it was doing.

9 seconds on trait selection??
Come on, really? Having read some rustc source code in the past, I had a pretty good idea about what must have happened there.
So I quickly tweaked the code:
// ADD_32_addsub_imm
// nb: this part was generated by a script,
// so it was very easy to change.
let sh = <u32 as core::ops::BitAnd>::bitand(<u32 as core::ops::Shr>::shr(instr_word, 22), 1);
let imm12 = <u32 as core::ops::BitAnd>::bitand(<u32 as core::ops::Shr>::shr(instr_word, 10), 4095);
let Rn = <u32 as core::ops::BitAnd>::bitand(<u32 as core::ops::Shr>::shr(instr_word, 5), 31);
let Rd = <u32 as core::ops::BitAnd>::bitand(<u32 as core::ops::Shr>::shr(instr_word, 0), 31);

// this part is unchanged.
let op1 = x_or_sp![Rn] as u32;
let op2 = if sh == 0 { imm12 } else { imm12 << 12 } as u32;
let result = op1.wrapping_add(op2) as u64;
x_or_sp![Rd, result];
And voilà, my 11.5 second compile times dropped to 4.5 seconds.
For those who don't speak "Rust nonsense": Rust uses traits (aka interfaces aka protocols) for operator overloading. That crazy looking <u32 as Trait>::method syntax is a way to manually call a trait method without going through the trait resolver.
I understand that trait resolution is a complex issue. But is it really too much to ask that you hardcode the results for the operators of built-in types?
So my assumption from the opening paragraph of this section actually wasn't incorrect: Only code that's littered with traits, generics, and complex lifetimes is (super) slow to compile. But as it turns out, all Rust code is littered with traits and generics by default, and so all Rust code is super slow to compile - fun times!
Rant over.
(Addendum for the Rust nerds: I'm actually not sure why this change had such a drastic impact on compile times. The Shr and BitAnd traits are generic over the Rhs type: trait Shr<Rhs> {...}. So specifying the Self type using <u32 as Shr>::shr only tells the compiler half the information needed to resolve the trait. I assume this means some form of trait lookup still happens, but for some reason that's a lot faster?)
A New Approach
Even if we ignore the fact that I lost portability by using inline assembly, and that my compile times were abysmal (cause Rust), there were still two more issues: 1) I had implemented only 472 of the 1,726 instructions (27%). This meant that each time I tried running a new program, I first had to implement a batch of new instructions. And 2), the interpreter was really slow in debug builds. And release builds, well, I think you can guess what the problem with those was…
One thing I liked about the inline assembly implementations was how easy they were to write:
// FCVTPS_32S_float2int

// (decode - this part is generated.)
let Rn = (instr_word >> 5) & 31;
let Rd = (instr_word >> 0) & 31;

// read operands from registers.
let op = v![Rn].to_u32();

// the asm.
let result: u64;
unsafe { asm! {
    "fcvtps {result:w}, {op:s}",
    op = in(vreg) op,
    result = out(reg) result,
}}

// write result to registers.
x![Rd, result];
They followed a super simple pattern of reading the operands from the register file, executing the corresponding assembly instruction, and writing the results back to the registers.
I realized that if I knew which registers an instruction accessed, then I could actually generate the entire implementation for it automatically. And since that code would be dead simple, I could even generate it as assembly code directly, which would solve the compile time problem. And as a nice bonus, debug builds would have much better runtime performance as well, because the assembly code I would generate would be more or less what an optimizing compiler would spit out after thinking about it for a hot minute.
Thing is, while that sounds like an all-upside deal, it does rest on the assumption that I know the effects of all instructions (or at least of the ones needed to run 99% of programs). By "effects" I mean: which registers an instruction accesses, whether it accesses memory, and a few other AArch64 specific things.
Kind of like this:
{ "instr": "LDR_32_ldst_immpost", "additional": [ "check_sp_alignment" ], "regs": [ { "bank": "x", "reg": "Rt", "write": true }, { "bank": "x_or_sp", "reg": "Rn", "read": true, "write": true } ], "mem": { "access": { "kind": "regs", "is_load": true, "bank": "x", "sext": 0, "elem_bits": 32, "access_size": 4 }, "address_mode": "PostSImm9" } }
Unfortunately, I couldn't find a repository online that had all the necessary instruction metadata. But I had another idea: surely someone else had written an AArch64 interpreter before. And if they did, maybe I could "extract" the effects of each instruction from their implementation. So I asked Gippity to look for some repos, and sure enough, it found something: an open source AArch64 implementation in pure C - perfect!
Now I just had to analyze that C code somehow. This wasn't going to be the most straightforward thing, as some optimizations like inlining and constant folding were going to be necessary for some of the information I was interested in to emerge. So I compiled the C code to LLVM bitcode and wrote a custom LLVM pipeline to reliably apply those optimizations as well as a few custom ones. Out the other end came an LLVM function for each instruction that I could then analyze with some more C++ code.
If it works, it ain't stupid.
Generating the Assembly
As expected, code generation was fairly straightforward. The current version of the Python script is just over 4 kloc long, but there isn't really anything interesting going on there (just lots of special case handling for instructions with immediates).
So instead of talking about how the code was generated, I'll just show you the interesting parts about the result. The following code snippet is a (very) abridged and annotated version of the interpreter assembly function. Even if you've never read assembly code before, I think the comments should help you to understand it. If not, you can always ask Gippity to explain any confusing parts to you.
.section __TEXT,__text
.global _interp_run
_interp_run:
// stack setup.
// register setup.
// there are some dedicated registers for
// hot values like the program counter,
// the address of the register file,
// and the 32 bit instruction word.
L_fetch:
// all instructions are 4 bytes on aarch64.
ldr {REG_INSTR_WORD}, [{REG_PC}]
L_decode:
// imagine a decoder based on a
// tree of perfect hash tables here.
// i wrote this part in rust,
// compiled it in release mode,
// and copied over the disassembly.
L_exec:
// the decoder computes the linear id
// of the instruction and stores it in x8.
// we now use a jump table to branch to
// that instruction's implementation.
adrp x10, _interp_jumptab@PAGE
add x10, x10, _interp_jumptab@PAGEOFF
adr x9, L_exec_instrs_begin
// here we use the instruction id as
// the table index.
ldr w8, [x10, x8, lsl 2]
add x9, x9, x8
br x9
// the jump table contains `u32` offsets
// relative to this label.
L_exec_instrs_begin:
// the code generator dumps all the
// instruction implementations here.
// ...
// nb: this is the instruction from the
// Compile Time Troubles section.
L_exec_ADD_32_addsub_imm:
// load register value.
// nb: we'll talk about `ubfx` in the
// next section.
ubfx w8, {REG_INSTR_WORD}, 5, 5
ldr x1, [{REG_ARCH_X}, x8, lsl 3]
// decode immediate.
ubfx w8, {REG_INSTR_WORD}, 10, 12
ubfx w9, {REG_INSTR_WORD}, 22, 1
lsl w10, w8, 12
cmp w9, 0
csel w8, w8, w10, eq
// execute instruction.
add w0, w1, w8
// write back results.
ubfx w8, {REG_INSTR_WORD}, 0, 5
str x0, [{REG_ARCH_X}, x8, lsl 3]
b L_retire
// ...
L_retire:
add {REG_PC}, {REG_PC}, 4
b L_fetch
L_exit:
// pc writeback.
// stack teardown.
ret
.section __TEXT,__const
.align 2
.global _interp_jumptab
_interp_jumptab:
// instruction id 0 is for invalid encodings.
// this label exits the interpreter with an error.
.long (L_invalid_instruction - L_exec_instrs_begin)
// ...
.long (L_exec_ADD_32_addsub_imm - L_exec_instrs_begin)
// ...
Assembly is High Level
If you read the code snippet above, you may have noticed that the assembly implementation of the ADD_32_addsub_imm instruction isn't actually much longer than the Rust version. This is in part because the Rust code is very low level, but it's also because some things are actually easier to do in assembly.
Remember the let imm12 = (instr_word >> 10) & 4095; parts that blew up the Rust compile times when written using operator syntax? They are a fascinating case where the Rust code is actually lower level than the corresponding assembly implementation!
Here's a part of the encoding diagram of the ADD_32_addsub_imm instruction:

Bits 10-21 encode the 12 bit immediate value to be added to the register Rn. In Rust, we extract this value by shifting the instruction word down by 10 bits and then masking it with 4095, which is 0b111111111111 in binary (12 ones).
Usually, I prefer to write this mask as ((1 << 12) - 1) instead of 4095 to emphasize that I'm selecting 12 bits, but that's even more verbose, and you can probably imagine what writing it like that did to my compile times.
So how do we do this in assembly? One instruction: ubfx w8, {REG_INSTR_WORD}, 10, 12.
ubfx stands for "unsigned bit field extract". w8 is the destination register. {REG_INSTR_WORD} is the register to extract the bits from. And the 10, 12 part at the end specifies the index of the lowest bit and how many bits to extract.
Now of course, this is just a skill issue on Rust's side. We could imagine a syntax like instr_word.bits[10..22] or instr_word.bits[10..+12]. But there's absolutely no way this is getting added to Rust before the year 2035, if ever. So in the meantime, we can joke about assembly being higher level than Rust.
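(The practical workaround is of course a tiny helper function. A quick sketch:)

// a minimal `ubfx`-style helper: extract `len` bits starting at bit `lo`,
// zero-extended. (assumes 0 < len < 32.)
#[inline(always)]
fn bits(word: u32, lo: u32, len: u32) -> u32 {
    (word >> lo) & ((1u32 << len) - 1)
}

// imm12 of ADD_32_addsub_imm, bits 10..22:
// let imm12 = bits(instr_word, 10, 12);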
Bits are High Level
(ht Andrew Reece)
As you will most likely know, ifs and loops are implemented using conditional jumps in assembly. For example, we can cmp x8, 7 and then b.lo <somewhere>.
The cmp instruction compares the value of x8 to 7, which sets a bunch of status flags in the CPU. And the b.lo instruction (branch if unsigned lower) reads those flags to decide whether the condition holds, and if so, it jumps to the target instruction.
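To make that concrete, here's roughly what cmp does to those flags, written out in Rust (a sketch, not the interpreter's actual code):

// `cmp a, b` is a subtraction whose result is thrown away; only the
// flags are kept. n: result negative, z: result zero,
// c: no unsigned borrow, v: signed overflow.
fn nzcv_of_cmp(a: u64, b: u64) -> u32 {
    let result = a.wrapping_sub(b);
    let n = (result >> 63) as u32;
    let z = (result == 0) as u32;
    let c = (a >= b) as u32;
    let v = (((a ^ b) & (a ^ result)) >> 63) as u32;
    (n << 31) | (z << 30) | (c << 29) | (v << 28)
}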
Determining whether a condition (cond) holds based on the status flags (cpsr) is defined as follows:
pub fn condition_holds(cond: u32, cpsr: u32) -> bool {
    let n = (cpsr >> 31) & 1;
    let z = (cpsr >> 30) & 1;
    let c = (cpsr >> 29) & 1;
    let v = (cpsr >> 28) & 1;

    let result = match cond >> 1 {
        0b000 => z == 1,            // eq or ne
        0b001 => c == 1,            // cs/hs or cc/lo
        0b010 => n == 1,            // mi or pl
        0b011 => v == 1,            // vs or vc
        0b100 => c == 1 && z == 0,  // hi or ls
        0b101 => n == v,            // ge or lt
        0b110 => n == v && z == 0,  // gt or le
        0b111 => true,              // al
        _ => unreachable!()
    };

    let negate = cond & 1 == 1 && cond != 0b1111;
    return if negate { !result } else { result };
}
This Rust function is a literal translation of the AArch64 pseudocode, and unfortunately, LLVM couldn't make much sense of it. The code it generated wasn't horrible, but I remember always seeing this function when profiling the interpreter. So for the rewrite, I decided to optimize it myself.
And by "myself" I mean that I asked Gippity if it had any suggestions for implementing this function efficiently. Its first response was less than helpful, so I asked "are we sure there's no better way to do this? maybe with a LUT?"
What it came up with was truly impressive:
// w0: cond
// w1: cpsr
_interp_condition_holds:
adr x8, L_condition_table
ldrh w8, [x8, w0, uxtw 1]
lsr w1, w1, 28
lsrv w0, w8, w1
ret
L_condition_table:
.hword 0xf0f0 // eq
.hword 0x0f0f // ne
.hword 0xcccc // cs/hs
.hword 0x3333 // cc/lo
.hword 0xff00 // mi
.hword 0x00ff // pl
.hword 0xaaaa // vs
.hword 0x5555 // vc
.hword 0x0c0c // hi
.hword 0xf3f3 // ls
.hword 0xaa55 // ge
.hword 0x55aa // lt
.hword 0x0a05 // gt <--
.hword 0xf5fa // le
.hword 0xffff // al
.hword 0xffff // al
It was so impressive, in fact, that I believe something like this was probably in the training data, as its overall understanding of assembly is lacking, to say the least.
So what's going on in that code, how does this work?
The code uses a lookup table, L_condition_table, like I suggested. It has sixteen u16 entries. Each entry corresponds to one condition code. The first entry corresponds to the eq condition code, the second is ne, and so on. The first two assembly instructions load the LUT entry into w8 for the given condition code (w0).
But what do the LUT entries mean? Here's the cool thing: they are also lookup tables! This time, each bit in the u16 value is an entry. And this 16 bit table is "indexed" by the 4 status bits nzcv (which are the top 4 bits of the cpsr flag register).
Here's how the entry for gt is computed:
condition `gt` holds when: n == v && z == 0

bit | n z c v | n=v z=0 | &&
----+---------+---------+----
  0 | 0 0 0 0 |  1   1  |  1
  1 | 0 0 0 1 |  0   1  |  0
  2 | 0 0 1 0 |  1   1  |  1
  3 | 0 0 1 1 |  0   1  |  0
  4 | 0 1 0 0 |  1   0  |  0
  5 | 0 1 0 1 |  0   0  |  0
  6 | 0 1 1 0 |  1   0  |  0
  7 | 0 1 1 1 |  0   0  |  0
  8 | 1 0 0 0 |  0   1  |  0
  9 | 1 0 0 1 |  1   1  |  1
 10 | 1 0 1 0 |  0   1  |  0
 11 | 1 0 1 1 |  1   1  |  1
 12 | 1 1 0 0 |  0   0  |  0
 13 | 1 1 0 1 |  1   0  |  0
 14 | 1 1 1 0 |  0   0  |  0
 15 | 1 1 1 1 |  1   0  |  0

concatenate the last column in reverse order:
    0000_1010_0000_0101
convert to hexadecimal:
    0a05
Pretty cool, right?
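If assembly isn't your thing, here's the same trick written out in Rust (a sketch, using the table values from above):

// the same double-LUT trick in Rust. each u16 entry is itself a 16-entry
// table of bits, indexed by the nzcv flags (the top 4 bits of cpsr).
const CONDITION_TABLE: [u16; 16] = [
    0xf0f0, 0x0f0f, // eq, ne
    0xcccc, 0x3333, // cs/hs, cc/lo
    0xff00, 0x00ff, // mi, pl
    0xaaaa, 0x5555, // vs, vc
    0x0c0c, 0xf3f3, // hi, ls
    0xaa55, 0x55aa, // ge, lt
    0x0a05, 0xf5fa, // gt, le
    0xffff, 0xffff, // al, al
];

pub fn condition_holds_lut(cond: u32, cpsr: u32) -> bool {
    let nzcv = cpsr >> 28;
    (CONDITION_TABLE[(cond & 15) as usize] >> nzcv) & 1 == 1
}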
By the way, I didn't trust the AI to get the constants right (even though, surprisingly, it did), so I wrote a quick Python script to generate the table:
def gen_condition_table():
    def row(holds):
        value = 0
        for i in range(16):
            n = (i >> 3) & 1
            z = (i >> 2) & 1
            c = (i >> 1) & 1
            v = (i >> 0) & 1
            if holds(n, z, c, v):
                value |= 1 << i
        return f"    .hword 0x{value:04x}"

    rows = [
        row(lambda n, z, c, v: z == 1),
        row(lambda n, z, c, v: not (z == 1)),
        row(lambda n, z, c, v: c == 1),
        row(lambda n, z, c, v: not (c == 1)),
        row(lambda n, z, c, v: n == 1),
        row(lambda n, z, c, v: not (n == 1)),
        row(lambda n, z, c, v: v == 1),
        row(lambda n, z, c, v: not (v == 1)),
        row(lambda n, z, c, v: (c == 1 and z == 0)),
        row(lambda n, z, c, v: not (c == 1 and z == 0)),
        row(lambda n, z, c, v: n == v),
        row(lambda n, z, c, v: not (n == v)),
        row(lambda n, z, c, v: (n == v and z == 0)),
        row(lambda n, z, c, v: not (n == v and z == 0)),
        row(lambda n, z, c, v: True),
        row(lambda n, z, c, v: True),
    ]
    return "\n".join(rows)
Fuzzing

Believe it or not, the code I write tends to have bugs. Which is a problem, because I want my debugger to work reliably. But I also don't like writing tests. At least not the things that people usually mean when they say "tests" - unit tests and the like.
The issue with unit tests is, I wrote the code, so if I also write the tests with the same assumptions, I'm essentially doing double entry bookkeeping, which does catch some mistakes. But it can't expose fundamental flaws in my understanding.
Enter random testing. The idea behind random testing is to continuously throw randomly generated inputs at your program to make sure it doesn't crash. On its own, this isn't a particularly interesting invariant to test. But if you litter your code with asserts, it can start to become a very powerful signal.
So let's apply this idea to the interpreter. Unit tests are obviously a non-starter. Not only because I would have to write a lot of them, but also because, like I said, that doesn't actually test my understanding. Here's a better approach: What if we just randomly picked an instruction, then executed that instruction in the interpreter and on the native hardware, and finally asserted that the results were equal? This is known as differential testing or differential fuzzing. It's a very powerful technique, but it does rely on another implementation of the same algorithm being available, which, thankfully, is the case for my interpreter.
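In its most basic form, the loop looks something like this (a sketch; the helper functions are stand-ins, not the real API):

// the basic differential fuzzing loop (a sketch with made-up helpers).
loop {
    let instr_word = random_instruction();
    let state = random_state();

    let interp_result = run_interp(instr_word, state.clone());
    let native_result = run_native(instr_word, state.clone());

    assert_eq!(interp_result, native_result,
        "mismatch for instruction {instr_word:08x}");
}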
Now, while this idea of comparing the interpreter against the native hardware sounds quite simple, there are actually a few interesting challenges when you try to implement it.
I'm stuck, step interpreter
Let's say we consult our random number generator, and it spits out add x19, x4, x12. If we're going to run that instruction in the interpreter and on the hardware in order to compare their results, we'll need to make sure all the inputs are the same. In this case, the inputs are the registers x4 and x12.
But because we don't want to trust our understanding of this instruction, we'll just generate values for the entire register file: the general purpose registers x, the stack pointer sp, the flags cpsr, the simd registers v, and the simd flag registers fpsr and fpcr. That way, if the instruction accesses some unexpected register, we'll get repeatable differences between the interp and native.
For the interpreter, we can just fill this struct with random values and then tell it to run the instruction:
pub struct AArch64State {
    pub x: [u64; 31],
    pub sp: u64,
    pub cpsr: u32,
    pub v: [Vec128; 32],
    pub fpsr: u32,
    pub fpcr: u32,
}
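Filling that struct with random values is the easy part. A sketch, assuming the rand crate (and a made-up Vec128::from_bits constructor):

use rand::Rng;

// a sketch of `random_state` (rand 0.8-style API; `Vec128::from_bits`
// is a made-up constructor for illustration).
fn random_state(rng: &mut impl Rng) -> AArch64State {
    AArch64State {
        x: std::array::from_fn(|_| rng.gen()),
        sp: rng.gen(),
        // keep it simple: randomize only the nzcv bits.
        cpsr: rng.gen::<u32>() & 0xf000_0000,
        v: std::array::from_fn(|_| Vec128::from_bits(rng.gen::<u128>())),
        // (the real thing probably needs to be more careful with these.)
        fpsr: rng.gen(),
        fpcr: rng.gen(),
    }
}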
But for native, we actually need those values to be in the physical registers:
.global _run_native
_run_native:
// x0 holds a pointer to a context with
// all the register values.
// load the `x` registers,
// (nb: x0 and x1 are loaded later.)
ldp x2, x3, [x0, {CTX_OFF_ARCH_X} + 2*8]
...
ldp x28, x29, [x0, {CTX_OFF_ARCH_X} + 28*8]
ldr x30, [x0, {CTX_OFF_ARCH_X} + 30*8]
// the stack pointer,
ldr x1, [x0, {CTX_OFF_ARCH_SP}]
mov sp, x1
// the status flags,
ldr w1, [x0, {CTX_OFF_ARCH_CPSR}]
msr nzcv, x1
// the simd registers,
ldp q0, q1, [x0, {CTX_OFF_ARCH_V} + 0*16]
...
ldp q30, q31, [x0, {CTX_OFF_ARCH_V} + 30*16]
// and the simd flags.
ldr w1, [x0, {CTX_OFF_ARCH_FPSR}]
msr fpsr, x1
ldr w1, [x0, {CTX_OFF_ARCH_FPCR}]
msr fpcr, x1
// now that we're done with x0 and x1, load their values.
ldp x0, x1, [x0, {CTX_OFF_ARCH_X} + 0*8]
// finally, run the instruction.
add x19, x4, x12
And now we're stuck…
In order to compare results, we need to write all the register values back to the context, but we no longer have a pointer to the context. We lost that when we loaded the values for x0 and x1 right before executing the instruction.
And what's worse, we can no longer return to the Rust code that called the run_native function either. That's because we overwrote the x30 register, also known as the link register lr, which held the code address that we were supposed to return to.
Usually, in assembly functions we first store all the register values that we need to restore later (like the link register) on the stack before executing the rest of the function. But in this case, the stack pointer was also overwritten, because the instruction we're testing may also access that.
One idea we might have would be to just not overwrite the stack pointer: whenever an instruction accesses the stack pointer, we simply don't test that instruction. This could work, but it would be a bad idea, because the stack pointer is encoded as register number 31 on AArch64, and depending on the instruction, register 31 can instead be the zero register xzr (which is always zero). For example, cmp x0, x1 is really a subs whose destination is register 31, read as xzr. So we'd be missing out on testing crucial interpreter functionality.
A better idea would be to use a register like x0, which isn't any different from the other x registers. Conveniently, x0 also holds the "context" where we're supposed to write the register values back to. So we could just reserve some more space there for the return address (and any other registers we need to save) and not run instructions that mutate x0.
In practice, here's what I do:
arch = random_state();
arch.x[0] = ctx_ptr;

run_interp();

if arch.x[0] == ctx_ptr {
    run_native();
    compare_results();
}
The idea behind checking whether x0 is still equal to ctx_ptr instead of using instruction metadata is just that it's simpler and arguably more robust. Though since the interpreter is based on the metadata, if the metadata is incorrect, and the native code actually mutates x0, that would result in undefined behavior. But the program would most likely crash, so it's fine.
Results
The primary goals of this rewrite were to improve compile times and improve instruction coverage, so let's see how we did.
For instruction coverage, the interpreter now supports 2,348/3,428 instructions (68%). Yes, the total number of instructions has increased. That's cause I "flattened" some instructions, but that doesn't really matter right now. What matters is that the number went up! We went from 27% to 68% of instructions implemented.
Now, let's see what happened to the compile times:

With the old Rust based interpreter, there was a very large difference between debug and release builds. This gap is now much smaller. And the incremental release builds are definitely bearable now at around two seconds.
The clean debug build times did increase, but that's not really worth worrying about, because it's a very fixable problem. Most of that time is spent on a 24 kloc Rust file that contains the instruction metadata. It's just a giant const. I didn't think that was gonna be a problem, but it was, cause Rust. But I can relatively easily convert that file to a binary file and include_bytes! it, which would remove that overhead.
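Something along these lines (a sketch; the file name is made up):

// embed the metadata as a binary blob instead of compiling a 24 kloc
// `const` table; decode it (or reinterpret it in place) at startup.
static INSTR_METADATA: &[u8] = include_bytes!("instr_metadata.bin");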
Let's have a look at the runtime performance:

This is the runtime performance of the replay system. The three benchmarks on the right are relatively short recordings, so the performance gaps are smaller there. For kibiold_life, we can see a 2x improvement, which is solid. But I don't really know what's going on with clang_onelua - I feel like the new interpreter should be doing better there. I'll have to investigate that at some point.
Quick sidenote: The replay system uses a software page table for memory. That's why the execution speeds (even with a JIT) are much lower there than you may expect. For the recorder, I've seen speeds up to 300 M/s with the new interp, which is still very slow, but it's a lot faster than before (I think it was around 120 M/s).
All in all, I'm very happy with the results! More instructions, shorter compile times, and faster execution speeds. What more could you want?
Oh I know what you could want: You could want to join the discord, where I post links to all blog posts in the Announcements channel, so you don't miss any!
Alright, that's it!







