A simple C compiler for learning, based on Engineering a Compiler (the 2nd Edition).
Precompiled eac-compiler
binary can be downloaded from the GitHub "Releases" page. As eac-compiler
uses MASM for x64
as its assembler and linker, you need to install the C++ workload from the Visual Studio. Also, you should make sure all environment variables are properly set by running vcvars64.bat
or using the x64 Native Tools Command Prompt
(both of them are shipped with the Visual Studio).
eac-compiler.zip
contains three files. Among them, antlr4-runtime.dll
is the runtime of the ANTLR parser generator. driver.asm
will call the main
function and print the return value of it in decimal to the standard output. driver.asm
will be automatically compiled and linked against our main assembly file, and it should be kept in the same directory as eac-compiler.exe
.
Suppose there is a simple C file named fib.c
that computes the n-th Fibonacci numbers:
int fib(int n) {
if (n <= 2) {
return n - 1;
} else {
return fib(n - 1) + fib(n - 2);
}
}
int main() {
return fib(10); // 34
}
To compile this file, simply run the following command.
> eac-compiler.exe fib.c
eac-compiler
will compile it into main.asm
and combine it with driver.asm
to produce main.exe
. Y.u can run main.exe
to see its output:
> main.exe
> 34
You can also make eac-compiler
to print intermediate results by passing command line flags --ast
, --ssa
, --cfg
, --vasm
, or --asm
, which will print the AST, the SSA IR, the control flow graph (after destructing all Phi functions), pseudo-x64 assembly using virtual registers, and x64 assembly using physical registers.
eac-compiler
's parser is generated by ANTLR and only supports a small subset of the C langauge, the formal grammar is given in the end of this document, which basically covers only arithmetics over the integer type. If you are interested in the implementation of the pareser and other other language features, there is another project Agile-C in which I built a constraint-based type inference C transpiler that covers most frequently-used C features (structs, arrays, pointers, and more) with a hand-written recursive descent parser.
There are two extra points that worth mentioning. First, the FFI code that binds the generated C++ parser to Rust make use of global variables, so when you are running tests, you should disable the multithreading test runner, i.e. always use the command cargo test -- --test-threads=1
instead of the plain cargo test
. Second, compound statments (curly braces), except the one used to delimit the function body, do not introduce new scopes, so we can construct continuous data flow across the whole function to make the data-flow analysis meaningful even for simple test programs.
eac-compiler
performs the data-flow analysis over the original AST and transforms it into the Static Single-Assignment Form (SSA form) IR. The construction and destruction processes follows Chapter 9.3 of the book.
eac-compiler
emits x64 assembly and follows the Windows x64 calling convention. There are good introduction materials for the x64 assembly from Intel and for the Windows x64 calling convention from Microsoft. Among them, the most important parts eac-compiler
uses is listed as below.
- The return value (if any) is located in
rax
. - Arguments are stored in
rcx
,rdx
,r8
,r9
, and then stack, in that order from left to right. There is 32 bytes "shadow space" allocated on the stack to store the first four arguments. rsp
is the stack pointer.rbp
is the stack frame pointer (callee-saved).rcx
,rdx
,r8:r11
are caller-saved general-purpose registers.rbx
,rsi
,rdi
,r12:r15
are callee-saved general-purpose registers.- It is caller's responsibility to clean the stack.
The instruction selection process of eac-compiler
follows Chapter 11.4 (Insturction Selection via Tree-Pattern Matching), but because currently each AST/SSA node is mapped into an unique sequence of assembly code, the process looks pretty like the naive treewalk scheme.
The register allocation process follows the bottom-up local register allocation in Chapter 13.3.2. Because eac-compiler
does not compute the live range information for variables, registers will not be actively freed but just spilt to the stack if all physical registers are occupied.
<non-digit> ::= "A" | "B" | "C" | "D" | "E" | "F" | "G"
| "H" | "I" | "J" | "K" | "L" | "M" | "N"
| "O" | "P" | "Q" | "R" | "S" | "T"
| "U" | "V" | "W" | "X" | "Y" | "Z"
| "a" | "b" | "c" | "d" | "e" | "f" | "g"
| "h" | "i" | "j" | "k" | "l" | "m" | "n"
| "o" | "p" | "q" | "r" | "s" | "t"
| "u" | "v" | "w" | "x" | "y" | "z"
| "_";
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
<identifier> ::= <non-digit> (<digit> | <non-digit>)*;
<number> ::= ["+" | "-"] <digit>+;
<primary-expression> ::= <identifier> | <number> | "(" <expression> ")";
<postfix-expression> ::= <primary-expression> | <postfix-expression> "(" <argument-list> ")";
<argument-list> ::= <expression> | <argument-list> "," <expression>;
<prefix-expression> ::= <postfix-expression> | "!" <postfix-expression> | "-" <postfix-expression>;
<multiplicative-expression> ::= <prefix-expression>
| <multiplicative-expression> "*" <prefix-expression>
| <multiplicative-expression> "/" <prefix-expression>;
<additive-expression> ::= <multiplicative-expression>
| <additive-expression> "+" <multiplicative-expression>
| <additive-expression> "-" <multiplicative-expression>;
<relational-expression> ::= <additive-expression>
| <relational-expression> "<" <additive-expression>
| <relational-expression> ">" <additive-expression>
| <relational-expression> "<=" <additive-expression>
| <relational-expression> ">=" <additive-expression>;
<equality-expression> ::= <relational-expression>
| <equality-expression> "==" <relational-expression>
| <equality-expression> "!=" <relational-expression>;
<logical-AND-expression> ::= <equality-expression>
| <logical-AND-expression> "&&" <equality-expression>;
<logical-OR-expression> ::= <logical-AND-expression>
| <logical-OR-expression> "||" <logical-AND-expression>;
<assignment-expression> ::= <identifier> "=" <logical-OR-expression>;
<expression> ::= <assignment-expression>;
<statement> ::= <declaration-statement>
| <compound-statement>
| <expression-statement>
| <selection-statement>
| <iteration-statement>
| <jump-statement>;
<declaration-statement> ::= "int" <identifier>;
<compound-statement> ::= "{" <statement>* "}";
<expression-statement> ::= <expression> ";";
<selection-statement> ::= "if" "(" <expression> ")" <statement> ["else" <statement>];
<iteration-statement> ::= "while" "(" <expression> ")" <statement>;
<jump-statement> ::= "return" [<expression>] ";";
<function> ::= "void" | "int" <identifier> "(" <parameter-list> ")" <compound-statement>;
<parameter-list> ::= ["int" <identifier>] | <parameter-list> "," ["int" <identifier>];
<program> ::= <function>*;