Setting Up
This tutorial assumes you're running on a modern x86-64 UNIX-like operating system, with access to libc and a linker.
The assembler we will use is called nasm. On Arch Linux, you can install it with the command sudo pacman -S nasm.
Once you have installed it, you can check that it works by running the command nasm --version.
Hello, world!
Let's write a program to test our setup. Don't worry about how it works; I'll explain later.
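    [bits 64]

    [section .text]

    [global main]
    [extern printf]

    main:
        push rbx                        ; save rbx for our caller; this also keeps the stack 16-byte aligned for the call
        xor rax,rax                     ; al = 0: no vector registers are used for printf's variadic arguments
        mov rdi,hello_world_string      ; first argument: address of the string
        mov rbx,printf                  ; load the address of printf...
        call rbx                        ; ...and call it through the register
        pop rbx                         ; restore rbx
        xor rax,rax                     ; return 0 from main
        ret

    hello_world_string:
        db "Hello, world!", 10, 0       ; the text, a newline (10) and a terminating zero byte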
And here's our build script:
    # Replace assembly.s with the name of your source file.
    nasm -felf64 -Fdwarf -o assembly.o assembly.s

    # Replace gcc with the name of the linker you want to use. Make sure to link the program against libc.
    # If the linker complains about relocations (some toolchains build position-independent executables by default), try adding -no-pie.
    gcc -o assembly assembly.o

    # Run the executable.
    ./assembly
If you run that, you should see the following printed out:
Hello, world!
Concepts
As we'll be learning assembly, there are some basic concepts about the functioning of processors that we need to learn.
Instructions
Every program you compile in a high-level language is ultimately converted to a list of instructions. A processor instruction is the simplest command a programmer can ask the processor to perform. Instructions and their operands (parameters) are encoded in your program's text segment, a read-only part of your executable that is loaded into memory when your program starts. The processor then steps through the instructions from the beginning of the program, performing each one in order (or, at least, it will appear that way) until it encounters an instruction that tells it, sometimes depending on a condition, to branch to a different part of the program. In x64, each instruction performs only a simple task: incrementing a value in memory, comparing the values of two registers (you'll soon learn what those are), branching to another part of the program, and so on.
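As a rough sketch of what a conditional branch looks like in practice, the program below adds up the numbers 5 down to 1 in a loop. The label name .loop and the choice of rcx as the counter are purely illustrative, and dec and jnz are instructions this tutorial has not introduced yet. It assumes the same nasm and gcc build script used for the Hello, world! program.

```nasm
[bits 64]

[section .text]

[global main]

main:
    mov rcx,5           ; loop counter, starts at 5
    xor rax,rax         ; running total, starts at 0
.loop:
    add rax,rcx         ; total = total + counter
    dec rcx             ; counter = counter - 1 (sets the zero flag when it reaches 0)
    jnz .loop           ; branch back to .loop while the counter is not zero
    ret                 ; return the total (15) from main
```

If you build and run it, then run echo $? in your shell, you should see 15, because the value left in rax when main returns becomes the program's exit status.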
Registers
A register is simply a small storage location on the processor. Registers are similar to readable and writable memory locations, except for a few key characteristics:
- They are accessed far quicker than main memory, because they are actually stored inside the processor.
- There are far fewer registers than bytes of memory.
- Some registers have special functions or meaning to the processor. This means some registers cannot be directly modified or viewed by the executing program.
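To make the difference between registers and memory a little more concrete, here is a minimal sketch that copies a value from memory into a register, changes it there, and writes it back. The name counter and its starting value are hypothetical, and the .data section and the [rel ...] addressing form are details this tutorial has not covered yet; treat it as an illustration rather than something you need to understand fully at this point.

```nasm
[bits 64]

[section .data]

counter:
    dq 41                   ; hypothetical 64-bit value that lives in main memory

[section .text]

[global main]

main:
    mov rax,[rel counter]   ; copy the value from memory into the rax register
    add rax,1               ; do the arithmetic on the register
    mov [rel counter],rax   ; write the result back to memory
    ret                     ; rax still holds 42, which main returns
```

Built with the same script as before, running it followed by echo $? should print 42.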
Example
Here's a short program to demonstrate what we've learnt.
    mov rax,10
    mov rbx,20
    add rax,rbx
This assembly program adds together 10 and 20. Each line represents one instruction: the mnemonic at the start of the line identifies the instruction, and the text that follows describes its operands (parameters). rax and rbx are general-purpose registers, meaning they can be used to store any values the programmer wants without affecting the execution of the processor.
Let's suppose this program gets loaded into memory at address 0x100000.
This is a hex dump of that location in memory, with each instruction enclosed by [].
You may notice that the bytes at 0x100003 and 0x10000A contain the numeric values 10 and 20 (0a and 14 in hexadecimal): where do you think they came from?
             |  -0  -1  -2  -3  -4  -5  -6  -7  -8  -9  -a  -b  -c  -d  -e  -f
    ---------+-----------------------------------------------------------------
    0x10000- | [48  c7  c0  0a  00  00  00][48  c7  c3  14  00  00  00][48  01
    0x10001- |  d8]
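One way to convince yourself that these bytes really are the program is to hand them to the assembler as raw data. The sketch below uses nasm's db directive to emit exactly the bytes from the hex dump in place of the three mnemonics; the surrounding main, push rbx and pop rbx lines are additions of mine so that it can be built with the earlier script and so that rbx is restored for the caller. Note that your assembler may pick a different but equivalent encoding when given the mnemonics themselves.

```nasm
[bits 64]

[section .text]

[global main]

main:
    push rbx                                    ; rbx must be preserved for our caller
    db 0x48, 0xc7, 0xc0, 0x0a, 0x00, 0x00, 0x00 ; mov rax,10  (0a 00 00 00 is the operand 10)
    db 0x48, 0xc7, 0xc3, 0x14, 0x00, 0x00, 0x00 ; mov rbx,20  (14 00 00 00 is the operand 20)
    db 0x48, 0x01, 0xd8                         ; add rax,rbx
    pop rbx                                     ; restore rbx
    ret                                         ; main returns rax, which now holds 30
```

Running the result followed by echo $? should print 30, the same answer the mnemonic version computes.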
Since the processor needs to keep track of which instruction it's about to execute - you guessed it - it needs a register to store its address. On x64 this is called the RIP register.
Now, let's take a look at how the value stored in each register changes as our program is executed. Each row shows RIP as that instruction is about to execute, and RAX and RBX once it has finished.

    Assembly      Address    RIP        RAX   RBX
    mov rax,10    0x100000   0x100000   10    uninitialised
    mov rbx,20    0x100007   0x100007   10    20
    add rax,rbx   0x10000E   0x10000E   30    20
From this, you should be able to work out what the MOV and ADD instructions do:
- mov <a>,<b>: Stores the value of <b> in <a>.
- add <a>,<b>: Adds the value of <b> to <a>.
And thus, a simple computation is performed by the processor.