In this blog post I'm going to dig deeper into the compiler architecture and internals of the current version. This article can be helpful for people interested in programming languages and compiler internals, but also for potential BL compiler contributors.
Build pipeline
The compilation process consists of various compilation stages organized into one big pipeline. Every stage basically does some job and passes the result to the next stage. The whole process is driven by the builder. The builder creates an assembly and then a unit for every compiled file. Every unit is then passed to the compilation process.
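The driver can be pictured roughly like the sketch below; this is only an illustration of the idea, and the type and function names are assumptions, not the actual BL compiler API:
| #include <stddef.h>

/* Illustrative sketch only; the real builder/assembly/unit types are richer. */
typedef struct Unit     Unit;     /* one compiled source file */
typedef struct Assembly Assembly; /* the whole program, owning all its units */

typedef void (*UnitStage)(Assembly *, Unit *); /* load, lex, parse, MIR gen */
typedef void (*AssemblyStage)(Assembly *);     /* analyze, LLVM IR, object, link */

struct Assembly {
    Unit  **units;
    size_t  unit_count;
};

/* The builder pushes every unit through the per-unit stages, then runs the
   whole-assembly stages once. */
static void builder_compile(Assembly *assembly,
                            const UnitStage *unit_stages, size_t unit_stage_count,
                            const AssemblyStage *assembly_stages, size_t assembly_stage_count)
{
    for (size_t u = 0; u < assembly->unit_count; ++u)
        for (size_t s = 0; s < unit_stage_count; ++s)
            unit_stages[s](assembly, assembly->units[u]);

    for (size_t s = 0; s < assembly_stage_count; ++s)
        assembly_stages[s](assembly);
}
|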
We can use a simple hello world program as an example:
| main :: fn () s32 {
print("Hello world!\n");
return 0;
}
|
1. File loader stage
This is the very first stage; it loads the passed source file into the currently compiled unit. There is nothing special about this stage.
2. Lexer stage
The lexer scans every single character in the source code and creates tokens. You can use the -lex-dump parameter to see the lexer output. Every token is a recognized pattern in the source code with some meaning (number, identifier, operator, text...). Another important job of the lexer is to store information about the token's position in the source file; this information can eventually be used later when the compiler needs to report some error and give a hint to the programmer about what is wrong and where.
Output of the lexer:
| Tokens:
3: ['identifier' 3:1], [':' 3:6], [':' 3:7], ['fn' 3:9], ['(' 3:12], [')' 3:13],
['identifier' 3:15], ['{' 3:19],
4: ['identifier' 4:5], ['(' 4:10], ['string' 4:12], [')' 4:27], [';' 4:28],
5: ['return' 5:5], ['number' 5:12], [';' 5:13],
6: ['}' 6:1],
7: ['end' 7:1],
|
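Each entry in the dump pairs a token kind with its line:column position. A minimal token record could look something like the following sketch (illustrative only, not the actual BL definitions):
| /* Illustrative token record; the kinds and fields are assumptions. */
typedef enum {
    TOKEN_IDENTIFIER,
    TOKEN_NUMBER,
    TOKEN_STRING,
    TOKEN_FN,
    TOKEN_RETURN,
    /* ... one kind per recognized pattern ... */
} TokenKind;

typedef struct {
    TokenKind   kind;
    const char *text; /* the slice of source the token covers */
    int         line; /* position kept for later error reporting, e.g. 3:1 */
    int         col;
} Token;
|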
3. Parser stage
In this stage we take the created sequence of tokens and organize them into the AST. Use the -ast-dump parameter to see the output of the parser. The parser defines the language syntax and can produce various sets of errors when the processed source is incorrect. The AST representation of the code can be passed to the next stage only when everything was parsed correctly.
Output of the parser:
| UBlock <IMPLICIT> 0x7fc824814a50 test.bl
  DeclEntity <3:1> 0x7fc824814ae0 'main' 'immutable'
    ExprLitFn <3:9> 0x7fc824814b20
      Block <3:19> 0x7fc824814c40
        ExprCall <4:10> 0x7fc824814d20
          ExprRef <4:5> 0x7fc824814cd0 'print'
          ExprLitString <4:12> 0x7fc824814d60 Hello world!
        StmtReturn <5:5> 0x7fc824814db0
          ExprLitInt <5:12> 0x7fc824814df0 0
|
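Each node in the dump carries its kind, its source location and its children. In C, a node of this shape might be sketched like this (illustrative only, not the compiler's actual definitions):
| #include <stddef.h>

/* Illustrative AST node matching the dump above; all names are assumed. */
typedef enum {
    AST_UBLOCK,      /* implicit file-level block */
    AST_DECL_ENTITY, /* e.g. the 'main' declaration */
    AST_EXPR_LIT_FN,
    AST_BLOCK,
    AST_EXPR_CALL,
    AST_EXPR_REF,
    AST_EXPR_LIT_STRING,
    AST_STMT_RETURN,
    AST_EXPR_LIT_INT,
} AstKind;

typedef struct Ast Ast;
struct Ast {
    AstKind kind;
    int     line, col; /* the <3:1> style location carried over from tokens */
    Ast   **children;  /* e.g. a call node points at its callee and arguments */
    size_t  child_count;
};
|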
4. MIR generation stage
This stage takes care of MIR generation from AST. The
MIR(Middle Intermediate Representation) is an internal bytecode language similar to
LLVM IR. MIR code is generated
here, main goal of this stage is to make output code more flat and remove all syntax sugar. Basically every single language feature is split into quite small MIR instruction set. One important thing about MIR language is data type processing, we don't have any relations between types and values yet (this is done in analyze stage later) but we can use types in the same way as values here.
Let's say we are going to declare a variable:
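In BL source, such a declaration looks roughly like this:
| my_variable : s32;
|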
Here my_variable is some declaration of a type named s32; we know that s32 should be a type, but we don't have any other information at this stage. The MIR generator produces a simple sequence of instructions for this declaration:
| /* syntax: %<id> <return type> <instruction name> [args] */
...
%16 <unknown> declref s32 /* find declaration named 's32' */
%17 void decl my_variable : %16 /* declare new symbol 'my_variable' of type %16*/
...
|
Now we have a skeleton of this declaration; in the next stage we can execute the declref instruction to get a reference to s32 and create a new variable.
MIR code can be exported into a file with the -emit-mir compiler parameter; our non-analyzed hello world looks like this:
| @main : <unknown> : {
/* This is the entry basic block. */
%entry_7:
/* Declaration of temporary for return value. */
%9 void decl .ret.0 : <unknown>
/* Hello world text as constant string. */
%12 string const {13, "Hello world!"}
/* We want to find 'print' function.*/
%13 <unknown> declref print
/* Call the print. */
%14 <unknown> call %13(%12)
/*
* This is boilerplate generated by 'return 0;'.
* We store constant 0 into the ret.0 temporary and continue to 'exit_8' block.
*/
%15 s32 const 0
%16 <unknown> declref %9
%17 void store %15 -> %16
%18 void br %exit_8
%exit_8:
%10 <unknown> declref %9
%11 void ret %10
}
/*
* This is the type resolver function produced for the 'fn () s32' type of the 'main' function.
* It's never going to be exported to the final executable, but it's executed in the next
* stage. The return type of this function is Type and the value is the actual function type
* of 'main'. It may look like over-engineering to have a separate function for this, but
* keep in mind that function types can be more complex...
*/
@.type : fn() type : {
%entry_2:
%3 <unknown> declref s32
%4 type const fn () %3
%5 void ret %4
}
|
5. Analyze stage
This is the most important and most complicated stage of the compiler. The main goal of the analyze stage is to resolve all data types, variables and functions and connect everything together... The analyzer operates on the MIR code from the previous stage. It iterates over the instructions one by one and calls the analyze_instr function. The possible results of this function call are defined in the following enum:
| typedef enum {
/* Analyze pass failed. */
ANALYZE_FAILED = 0,
/* Analyze pass passed. */
ANALYZE_PASSED = 1,
/* Analyze pass cannot be done because some of the sub-parts have not been
analyzed yet and probably need to be executed during the analyze pass. In
such a case we push the analyzed instruction to the end of the analyze queue. */
ANALYZE_POSTPONE = 2,
/* In this case AnalyzeResult will contain the hash of the desired symbol which
will be satisfied later; the instruction is pushed into the waiting table. */
ANALYZE_WAITING = 3,
} AnalyzeState;
|
All analyzed instructions are organized into an analyze queue, and the analyzer chooses what to do based on the analyze result. When the result is ANALYZE_PASSED we can simply remove the instruction from the queue, but when analysis finishes with ANALYZE_POSTPONE, for example, we put the current instruction at the end of the queue. With the last state, ANALYZE_WAITING, the analyzer must also get the hash of the desired symbol that is needed to complete the analyze pass of the current instruction. In such a case we don't put the current instruction back into the queue but store it in a hash table of waiting queues assigned to the missing symbol. Later, when the missing symbol is completed, every waiting instruction is analyzed again.
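A rough sketch of that driver loop might look like the following; every type and helper here is assumed for illustration and is not taken from the real sources:
| #include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch of the analyzer driver loop; all types and helpers
   below are assumptions made for this example. AnalyzeState is the enum
   shown above. */
typedef struct MirInstr     MirInstr;
typedef struct Queue        Queue;
typedef struct WaitingTable WaitingTable;

typedef struct {
    AnalyzeState state;            /* one of the enum values above */
    uint64_t     waiting_for_hash; /* valid only for ANALYZE_WAITING */
} AnalyzeResult;

AnalyzeResult analyze_instr(MirInstr *instr);
bool          queue_empty(Queue *q);
MirInstr     *queue_pop_front(Queue *q);
void          queue_push_back(Queue *q, MirInstr *instr);
void          waiting_table_push(WaitingTable *t, uint64_t hash, MirInstr *instr);
void          report_error(MirInstr *instr);

static void analyze_all(Queue *queue, WaitingTable *waiting)
{
    while (!queue_empty(queue)) {
        MirInstr     *instr  = queue_pop_front(queue);
        AnalyzeResult result = analyze_instr(instr);

        switch (result.state) {
        case ANALYZE_PASSED:
            break; /* done, the instruction simply leaves the queue */
        case ANALYZE_POSTPONE:
            queue_push_back(queue, instr); /* retry after the rest of the queue */
            break;
        case ANALYZE_WAITING:
            /* park under the missing symbol's hash; re-queued once it appears */
            waiting_table_push(waiting, result.waiting_for_hash, instr);
            break;
        case ANALYZE_FAILED:
            report_error(instr);
            break;
        }
    }
}
|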
Our hello world after the analyze stage has passed looks like this:
| /* analyzed */
@main : fn() s32 : {
%entry_7:
%9 void decl .ret.0 : s32
%8882 ...Any vargs Any {}
%14 s32 call @print({13, "Hello world!"}, %8882)
%16 *s32 declref %9 /* direct */
%17 void store 0 -> %16
%18 void br %exit_8
%exit_8:
%10 *s32 declref %9 /* direct */
%8883 s32 load %10
%11 void ret %8883
} /* comptime */
|
As you can see, all <unknown> types are replaced by actual types. There are also a bunch of automatically generated instructions like vargs and load; those are generated during the analyze pass. The print function needs a format string and vargs as arguments, so the input was slightly modified. Also, a load instruction was generated for the ret.
Another interesting thing is our "hello world" string representation here. In the previous stage this string was represented as a separate string const {13, "Hello world!"} instruction. The analyzer took this instruction and marked it as comptime. Comptime instructions can be directly executed and replaced by the resulting constant during analysis, since their values are directly known and they have no side effects at runtime.
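The folding decision can be pictured roughly like this; the helper names below are assumptions used only for illustration:
| #include <stdbool.h>

/* Illustrative comptime folding step; all names are assumed. If an
   instruction is known at compile time and has no runtime side effects,
   execute it in the VM right away and swap it for a constant. */
typedef struct MirInstr MirInstr;
typedef struct Vm       Vm;

bool      mir_instr_is_comptime(const MirInstr *instr);
bool      mir_instr_has_side_effects(const MirInstr *instr);
MirInstr *vm_execute_to_const(Vm *vm, MirInstr *instr);
void      mir_instr_replace(MirInstr *old_instr, MirInstr *new_instr);

static void try_fold_comptime(Vm *vm, MirInstr *instr)
{
    if (mir_instr_is_comptime(instr) && !mir_instr_has_side_effects(instr))
        mir_instr_replace(instr, vm_execute_to_const(vm, instr));
}
|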
6. VM execution stage
This stage is optional and invoked only when we are going to execute
main in compile-time. To do so use
-run compiler flag. When MIR bytecode is fully analyzed it can be executed by BL interpreter in this stage even before we produce any native binary or LLVM IR. Calls to external libraries are invoked via
dyncall library.
The interpreter use preallocated stack to store all variables and temporary data passed between instructions. For stack data manipulation we use
PUSH and
POP operations. Execution of our hello world program is not so trivial since we use
print function but here you can see begin of
main function execution:
| Executing 'main' in compile time...
- Terminal PUSH RA
- PUSH (4B, 0x1083d1b18) s32
- PUSH (16B, 0x1083d1b28) ...Any
8882 InstrVArgs PUSH (16B, 0x1083d1b48) ...Any
14 InstrCall PUSH RA
14 InstrCall PUSH (4B, 0x1083d1b88) s32
14 InstrCall PUSH (16B, 0x1083d1b98) ...Any
14 InstrCall PUSH (16B, 0x1083d1bb8) string
...
|
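The stack itself can be imagined roughly as follows; this is only a sketch of the idea, and the real VM also handles relative addressing, alignment and the RA (return address) bookkeeping visible in the log:
| #include <stddef.h>
#include <string.h>

/* Illustrative preallocated interpreter stack with PUSH/POP primitives. */
typedef struct {
    char  *memory;   /* one big preallocated block */
    size_t capacity;
    size_t top;      /* current stack-pointer offset */
} VmStack;

/* PUSH reserves 'size' bytes on the stack and copies the value in. */
static void *vm_stack_push(VmStack *stack, const void *data, size_t size)
{
    if (stack->top + size > stack->capacity) return NULL; /* stack overflow */
    void *dest = stack->memory + stack->top;
    if (data) memcpy(dest, data, size);
    stack->top += size;
    return dest;
}

/* POP releases the topmost 'size' bytes and returns a pointer to them. */
static void *vm_stack_pop(VmStack *stack, size_t size)
{
    stack->top -= size;
    return stack->memory + stack->top;
}
|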
7. LLVM IR generation stage
This stage produce
LLVM IR from MIR code. Our MIR code is designed to be simply translated into LLVM's one, so this stage only iterates over all instructions and generate LLVM equivalents via C API. We're not going to describe LLVM syntax here, but as you can see in example code, the LLVM's IR looks quite similar as MIR. We have
main function, two blocks,
store,
load, etc.
| define i32 @main() {
entry:
%0 = alloca i32, align 4
%1 = alloca { i64, %Any* }, align 8
%2 = getelementptr inbounds { i64, %Any* }, { i64, %Any* }* %1, i32 0, i32 0
store i64 0, i64* %2
%3 = getelementptr inbounds { i64, %Any* }, { i64, %Any* }* %1, i32 0, i32 1
store %Any* null, %Any** %3
%4 = alloca %string
%5 = load { i64, %Any* }, { i64, %Any* }* %1
store %string
{ i64 13, i8* getelementptr inbounds ([14 x i8], [14 x i8]* @.str.16, i32 0, i32 0) }, %string* %4
%6 = bitcast %string* %4 to { i64, i64 }*
%7 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %6, i32 0, i32 0
%8 = load i64, i64* %7
%9 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %6, i32 0, i32 1
%10 = alloca { i64, %Any* }
%11 = load i64, i64* %9
store { i64, %Any* } %5, { i64, %Any* }* %10
%12 = bitcast { i64, %Any* }* %10 to { i64, i64 }*
%13 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %12, i32 0, i32 0
%14 = load i64, i64* %13
%15 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %12, i32 0, i32 1
%16 = load i64, i64* %15
%17 = call i32 @print(i64 %8, i64 %11, i64 %14, i64 %16)
store i32 0, i32* %0, align 4
br label %exit
exit: ; preds = %entry
%18 = load i32, i32* %0, align 4
ret i32 %18
}
|
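To give a feel for what "generating LLVM equivalents via the C API" means, here is a small sketch that builds a stripped-down main with the same entry/exit shape as above (just the return-value temporary, without the print call). This is only an illustration, not the actual BL code generator:
| #include <llvm-c/Core.h>

/* Sketch: emit 'define i32 @main()' with an entry and an exit block. */
static LLVMModuleRef emit_hello_module(void)
{
    LLVMModuleRef     module  = LLVMModuleCreateWithName("hello");
    LLVMTypeRef       fn_type = LLVMFunctionType(LLVMInt32Type(), NULL, 0, 0);
    LLVMValueRef      fn      = LLVMAddFunction(module, "main", fn_type);
    LLVMBasicBlockRef entry   = LLVMAppendBasicBlock(fn, "entry");
    LLVMBasicBlockRef exit_bb = LLVMAppendBasicBlock(fn, "exit");

    LLVMBuilderRef builder = LLVMCreateBuilder();

    /* entry: allocate the return temporary, store 0, branch to exit. */
    LLVMPositionBuilderAtEnd(builder, entry);
    LLVMValueRef ret_tmp = LLVMBuildAlloca(builder, LLVMInt32Type(), ".ret.0");
    LLVMBuildStore(builder, LLVMConstInt(LLVMInt32Type(), 0, 0), ret_tmp);
    LLVMBuildBr(builder, exit_bb);

    /* exit: load the temporary and return it. */
    LLVMPositionBuilderAtEnd(builder, exit_bb);
    LLVMBuildRet(builder, LLVMBuildLoad(builder, ret_tmp, ""));

    LLVMDisposeBuilder(builder);
    return module;
}
|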
8. Write object file stage
In this stage we take the generated LLVM IR and produce an object file using the LLVM C API. This file can later be passed to the native linker.
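A minimal sketch of this step through the LLVM C API could look like the following, assuming module already holds the generated IR (error handling is trimmed and the exact options are illustrative):
| #include <llvm-c/Core.h>
#include <llvm-c/Target.h>
#include <llvm-c/TargetMachine.h>

/* Sketch: write 'module' out as a native object file at 'path'. */
static int emit_object_file(LLVMModuleRef module, const char *path)
{
    LLVMInitializeNativeTarget();
    LLVMInitializeNativeAsmPrinter();

    char         *triple = LLVMGetDefaultTargetTriple();
    char         *error  = NULL;
    LLVMTargetRef target;
    if (LLVMGetTargetFromTriple(triple, &target, &error)) return 1;

    LLVMTargetMachineRef tm = LLVMCreateTargetMachine(
        target, triple, "generic", "",
        LLVMCodeGenLevelDefault, LLVMRelocDefault, LLVMCodeModelDefault);

    /* Produces the object file the linker stage will pick up next. */
    return LLVMTargetMachineEmitToFile(tm, module, (char *)path,
                                       LLVMObjectFile, &error) ? 1 : 0;
}
|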
9. Linker stage
This is the last stage; here we link everything together. The platform-specific linker is called with our generated object file and all its dependencies to produce the executable.
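Conceptually, this boils down to building a command line for the platform's linker driver and running it; the snippet below is only a rough illustration (using the system C compiler driver, not the compiler's real linker invocation):
| #include <stdio.h>
#include <stdlib.h>

/* Illustration only: link the object file into an executable by handing
   it to the system compiler driver. */
static int link_executable(const char *object_path, const char *out_path)
{
    char cmd[1024];
    snprintf(cmd, sizeof(cmd), "cc %s -o %s", object_path, out_path);
    return system(cmd);
}
|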
Conclusion
Compilers are hard, but when we divide complex problems into smaller ones, it's not as hard as it seems. However, it's still a lot of programming.
Useful links
- MIR documentation
- Compiler source