Introduction to x86_64 programming
The steps in the tutorial have been tested on a 64-bit Arch Linux installation. This tutorial will contain many simplifications to make it easier to understand. This tutorial expects you to have a good understanding of C, and a basic understanding of process architecture.
A more gentle introduction can be found here: Tutorial/x64 Assembly
First, we need to install an assembler. The choice of assembler is quite important, as there are two main syntaxes for x86(_64) assembly.
For example, to move the value in register rbx to register rax, Intel syntax uses:
mov rax, rbx
whereas AT&T syntax uses:
movq %rbx, %rax
I personally prefer Intel syntax, so that's what this tutorial will use.
Hence, the assembler we will use shall be NASM as it:
- Uses Intel syntax
- Is free (BSD license)
- Has powerful preprocessing capabilities
To install it on Arch Linux, run the following as root:
pacman -S nasm
We will also need gcc and binutils to do linking for us, so type
pacman -S gcc binutils
To test the assembler, first create the following file, called hello.s:
bits 64
section .text
global main
extern puts

main:
    mov rdi,string
    mov rax,puts
    call rax
    xor rax,rax
    ret

string:
    db "hello",0
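You can then assemble, link and run with the commands below. If your gcc builds position-independent executables by default and the link step complains about relocations, adding -no-pie to the gcc command is one possible workaround.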
nasm -felf64 hello.s -o hello.o
gcc hello.o -o hello
chmod +x hello
./hello
And you should see something like:
$ ./hello
hello
How it works
Let's go through the hello world program line by line to see what's happening.
bits 64
This is a directive that tells the assembler our program is 64-bit, i.e. uses x86_64. The assembler will therefore output x86_64 machine code when it assembles our program.
section .text
A program is split into different sections where different data can be stored. The .text section is used to store machine code. We'll also use it to store the hello world string.
global main
Assembly has functions just like any C program! This directive tells the assembler to make the main function we define global. This means that when our program is run, the startup code that gcc links in will be able to find main and know where to start.
extern puts
To print a message to the screen, we need the C standard library function puts. This directive tells the assembler that we won't define the function puts in our program; instead, it has to come from a library.
main:
This is a label. It assigns a name ("main") to an address in memory. In this case, that is the address where the machine code following the label will be found when the program is run. This means it can act like a function: we have named the starting address of a block of machine code.
mov rdi,string
This is the first instruction in our program. All previous lines have contained directives - things that tell the assembler what to do. An assembly instruction, by contrast, is translated directly into a machine code instruction.
We're about to call the puts function to output a string. But the puts function takes an argument: a string. Under the calling convention used on 64-bit Linux, arguments are passed via registers. The first argument to a function is passed in the rdi register. Therefore we need to put the address of the string into the rdi register.
The mov instruction will move a value from one place to another, e.g. from memory to a register. In this case we move a constant value (the address of the string) into a register (rdi). Note that the destination comes first, followed by the source.
rdi and string are the operands to the instruction, and are separated by a comma.
mov is the instruction itself. It's sometimes called a mnemonic, as it's short for the full name of the instruction ("move").
string is another label. We define it later, just before we put the bytes of the string into the program.
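The different kinds of mov described above can be sketched as follows. This is only an illustration: the names value and mov_examples are made up for this example and do not appear in the hello world program, and the file is meant to be assembled and read rather than run on its own.

```nasm
bits 64

section .data
value:  dq 42               ; hypothetical 8-byte variable, used only for illustration

section .text
global mov_examples         ; hypothetical function name

mov_examples:
    mov rax, 5              ; constant -> register
    mov rcx, rax            ; register -> register (destination first, then source)
    mov rdx, [rel value]    ; memory -> register: loads the 8 bytes stored at value
    mov [rel value], rax    ; register -> memory: stores rax at value
    mov rdi, value          ; constant -> register: the address of value, not its contents
    ret
```

Note the difference between the last two forms of operand: square brackets mean "the contents of memory at this address", while a bare label means the address itself, which is what the hello world program passes to puts.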
mov rax,puts
Here we get the address of the function puts and store it in the register rax. We'll use this address to call the function.
call rax
This calls the function at the address in rax, i.e. the puts function. It will take its argument from the rdi register, where we placed the address of the string.
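To see argument passing without involving the C library, here is a minimal sketch of a function written in assembly that takes two arguments. It assumes the usual Linux calling convention, in which the second argument is passed in rsi (the tutorial so far only covers rdi). The function name add_two is hypothetical.

```nasm
bits 64
section .text
global main

; add_two is a hypothetical helper, not part of the original program.
; It expects its first argument in rdi and its second in rsi,
; and leaves the result in rax.
add_two:
    mov rax, rdi        ; rax = first argument
    add rax, rsi        ; rax = first argument + second argument
    ret

main:
    mov rdi, 2          ; first argument
    mov rsi, 3          ; second argument
    call add_two        ; on return, rax holds 2 + 3
    ret                 ; main returns whatever is in rax
```

If this is assembled and linked the same way as hello.s, running it and then typing echo $? in the shell should show the value main returned, since the shell reports a program's exit status there.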
xor rax,rax
The return value of a function is placed in the rax register before it returns. This instruction sets the value of rax to 0 by xoring it with itself.
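As a small sketch, these are a few equivalent ways of setting rax to 0. The three instructions are shown together only for comparison; a real program needs just one of them.

```nasm
bits 64
section .text
global main

main:
    mov rax, 0          ; the obvious way: copy the constant 0 into rax
    xor rax, rax        ; any value xored with itself is 0
    xor eax, eax        ; also clears all of rax: writing the 32-bit half zeroes the upper half
    ret                 ; main returns 0
```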
ret
This instruction returns us from the main function.
string:
This label comes before the string we want to print out.
db "hello",0
db tells the assembler to output bytes directly. So it will output the bytes of the string literal, and then the zero byte to terminate the string.
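A few more db lines, as a sketch of how string literals and numeric values can be mixed. The label names are made up for this example, and the data is placed in a .data section here purely for illustration.

```nasm
bits 64
section .data

greeting:   db "hello", 0                           ; bytes 68 65 6C 6C 6F 00
raw_bytes:  db 0x68, 0x65, 0x6C, 0x6C, 0x6F, 0      ; exactly the same bytes, written as numbers
two_lines:  db "one", 10, "two", 0                  ; 10 is the ASCII code for a newline
```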
Generated machine code
The machine code that is generated from our assembly file looks like this:
48 BF 00 00 00 00 00 00 00 00    mov rdi,0
48 B8 00 00 00 00 00 00 00 00    mov rax,0
FF D0                            call rax
48 31 C0                         xor rax,rax
C3                               ret
68 65 6C 6C 6F 00                "hello"
Wait a minute! What happened to the first two instructions? Why are the values set to 0?
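This is because the assembler does not yet know where string and puts will end up in memory, so it leaves zeros as placeholders and records that these addresses still need to be filled in. They are patched in later, when the program is linked and loaded, to reflect where the addresses actually are in memory at runtime.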
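For comparison, here is a sketch of a variant that avoids storing absolute addresses in the instructions at all. It is not what the program above does: it relies on NASM's rel keyword for addressing relative to the current instruction and on its wrt ..plt syntax for calling a library function, and it adjusts the stack pointer around the call because the calling convention expects the stack to be 16-byte aligned at that point.

```nasm
bits 64
section .text
global main
extern puts

main:
    sub rsp, 8              ; keep the stack 16-byte aligned for the call
    lea rdi, [rel string]   ; address of string, computed relative to this instruction
    call puts wrt ..plt     ; call puts through the procedure linkage table
    add rsp, 8              ; undo the stack adjustment
    xor eax, eax            ; return 0
    ret

string:
    db "hello", 0
```

This form should assemble and link with the same commands as hello.s, including on systems where gcc builds position-independent executables by default.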
See x86 Registers for more information.