In this project, you will build the code generation part of a BASIC-to-ASM compiler.
There is no lint
test this week. The point of the lint
test last week was to teach you about undefined behavior, but we don’t feel like it’s as valuable this week. You’re welcome.
TeenyBASIC (Our Source Language)
The BASIC programming language was invented in 1964 as an attempt to make programming more wide-spread across fields besides STEM. It is quite primitive and we’ve made it even simpler for this assignment. Nonetheless, the language you will compile (called “TeenyBASIC”) is a full programming language capable of complex computations.
Syntax and Semantics
- The language has number literals which can represent the range of
int64_t
’s. - The language only has the four major arithmetic operations as built-ins (
+
,-
,*
,/
). - The only variables are single capital letters (
A
,B
, …,Z
) and they are all of typeint64_t
. - The only comparisons are equality (
=
), greater than (>
), and less than (<
). - The language is composed of the following types of statements:
-
Print the value of
expression
tostdout
:PRINT expression
-
Store the value calculated by evaluating
expression
in the variablevariable
:LET variable = expression
-
Execute
statement1
, thenstatement2
, …:statement1 statement2 statement3
-
If
condition
is true, then executetrue_statements
:IF condition true_statements END IF
You can optionally add statements to run if
condition
is false:IF condition true_statements ELSE false_statements END IF
-
Execute
statements
untilcondition
becomes false:WHILE condition statements END WHILE
Example Program
To give you an idea of what the language looks like, the following TeenyBASIC program prints out the first 1000 prime numbers.
LET I = 0
LET P = 2
WHILE I < 1000
LET A = 1
LET T = 2
WHILE T * T < P + 1
IF P / T * T = P
LET A = 0
LET T = P
END IF
LET T = T + 1
END WHILE
IF A = 1
PRINT P
LET I = I + 1
END IF
LET P = P + 1
END WHILE
The Compilation Process
There are three types of files involved in compiling TeenyBASIC into an executable.
First, the programmer writes TeenyBASIC source code (.bas
files in the progs
directory).
Then, the parser (which we have written) converts the source code into a tree of nodes (called an “abstract syntax tree”) which is consumed by your compiler.
Your compiler will output x86-64 assembly code (.s
files in the out
directory).
Finally, the LLVM assembler will convert your assembly files into binaries (in the bin
directory).
You can visualize this process as the graph below:
graph LR;
bas["BASIC Code (.bas)"]
AST[Abstract Syntax Tree]
asm["ASM Code (.s)"]
bin[Executable]
bas--Parser-->AST
AST--Compiler-->asm
asm--LLVM assembler-->bin
Writing the Compiler (50 points)
Reading the Existing Code and Development Strategy
In previous courses, it may have been possible for you to get away without reading and understanding the provided code.
This is VERY not true for this course.
You absolutely need to understand how the AST works and what the struct
s look like.
We have provided an example function at the bottom of ast.c
which prints out an entire AST.
We strongly recommend that you read and understand this code as well as ast.h
before beginning to code yourself.
Your code will likely mirror the recursive structure of the print code; so, it will be extremely valuable to understand how it interacts with the nodes.
Most of the code you will be writing will be in compile.c
.
If you need to edit another file, we will tell you so in that particular stage.
Stage 1: Printing Numbers
We have provided a function called print_int
(defined in runtime/print_int.s
) which prints a TeenyBASIC value.
In this stage, you will have to compile the NUM
, PRINT
, and SEQUENCE
node types (read the provided code to see what we mean by these).
We recommend you start out simply (just to get a program that runs) which means you will likely eventually have to re-write these clauses.
The easiest thing to do is stick the number directly in a register (in the NUM
clause) and call print_int
in the PRINT
clause.
The SEQUENCE
clause is needed to handle multiple print statements, but you can implement it after the single-print tests are passing.
Make sure you use the right register.
Make sure you prefix the constant with a dollar sign.
After you’ve completed this section, running make compile1
should compile and run several test programs.
Make sure the tests output PASSED
before moving on.
A useful debugging strategy is to check the .s
files your compiler produces (in the out
directory).
Stage 2: Adding Numbers
In this stage, you will need to handle the '+'
operator of the BINARY_OP
clause.
For now, return false
for all other binary operators; you’ll fill them in in the next stage.
To implement addition, you will need to choose some invariants for how your compiler outputs code.
In particular, you can store computed values either in a register (we’d recommend %rdi
) or on the stack.
You will, however, need to store intermediary results on the stack (use push
and pop
) because of potential further left recursion.
At a high-level, the addition clause involves five steps:
- Compute the left side of the addition, recursively
- Store the result on the stack
- Compute the right side of the addition, recursively
- Store the results on the stack or in a register
- Use the left and right results as arguments to the
addq
instruction to compute the final result
After you’ve completed this section, running make compile2
should compile and run several test programs.
Make sure the tests output PASSED
before moving on.
Stage 3: Multiplication and Subtraction of Numbers
This stage should be very similar to the previous one.
Multiplication and subtraction are nearly identical to addition with the exception of the actual instruction used.
(You should use imulq
and subq
.)
As usual, run make compile3
to make sure your code passes all the tests.
Stage 4: Division of Numbers
Division is slightly more complicated than the other arithmetic operations, though it does folllow the same generic pattern.
You will want to read documentation for idiv
to make sure you understand how the division instruction works.
Don’t forget to run make compile4
to make sure your code passes all the tests.
Stage 5: Reading and Writing Variables
This stage is conceptually the hardest one in the assignment, because we must work with the stack.
Real compilers attempt to minimize the usage of the stack by choosing registers for each variable (this is called “register allocation”), but we won’t worry about this yet.
Instead, we will allocate space on the stack for all 26 possible variables (this is called “spilling”) at the beginning of basic_main
by setting up a stack frame.
Then, we will assume the variables are in alphabetical order on the stack and load and store from those memory locations as necessary.
In compiler.c
, you may have noticed header()
and footer()
functions which are called at the beginning and end of the program.
You will have to modify these functions during this stage for setup and teardown of a function.
In particular, you will need to add lines to modify and restore %rsp
and %rbp
–taking special care to allocate enough bytes on the stack for 26 64-bit integer variables.
After you write code to allocate and deallocate the stack frame, you should be able to fill in the VAR
and LET
clauses.
After you’re finished, don’t forget to run make compile5
to make sure your code passes all the tests.
Stage 6: Conditionals
Now, you will implement IF
statements.
An IF
statement in TeenyBASIC looks like IF [expr1] [op] [expr2] [statement_true] ELSE [statement_false] END IF
where [statement_true]
should be executed when [expr1] [op] [expr2]
evaluates to true, and [statement_false]
should be executed otherwise.
ELSE [statement_false]
can be omitted, in which case nothing should happen if [expr1] [op] [expr2]
evaluates to false.
[op]
will always be one of equality (=
), less than (<
), or greater than (>
).
You will likely find yourself needing a distinct label or two for each IF
statement.
You may use a global variable as a counter to accomplish this.
We recommend prefixing your labels with IF
.
After you’re finished, don’t forget to run make compile6
to make sure your code passes all the tests.
Stage 7: Loops
Finally, you will implement WHILE
loops.
A WHILE
statement looks like WHILE [expr1] [op] [expr2] [statements] END WHILE
, where [statements]
should be executed while [expr1] [op] [expr2]
evaluates to true.
Like for IF
statements, you can use labels and jumps to translate WHILE
loops.
We recommend prefixing your labels with WHILE
.
The main difference is that you will need to jump backwards to run the loop again.
After you’re finished, don’t forget to run make compile7
to make sure your code passes all the tests.
Optimizing the Generated Assembly (30 points)
There are a ton of optimizations that one can make to the code we’re generating. To get a taste, we’d like you to implement the following two optimizations. The tests for this part will check the speed-up compared to the reference solution on tests that are particularly amenable to these optimizations.
Strength Reduction
As discussed in class, strength reduction replaces multiplications by powers-of-two with much faster shifts. Specifically,
we’d like you to check if a node is of the form [expr] * k
where k
is a power of 2 and replace the multiplication emitted by your compiler with a left-shift.
Run make opt1
to test your optimized assembly’s speed improvement over the reference solution.
Constant Folding
Sometimes, the AST (or part of it) is easier to just evaluate at compile time. In particular, if the operations we’re compiling are all constant arithmetic operations, we can just evaluate the entire sub-tree into a number. For this task, recursively find arithmetic operations that consist of only constants and evaluate them into a constant. Then, instead of emitting code to calculate the constant, emit the constant directly.
Run make opt2
to test your optimized assembly’s speed improvement over the reference solution.
Extra Credit: Register Allocation (+15 points)
Above, we recommended using the stack to store variables and intermediate values computed by expressions. However, reading and writing values on the stack requires memory accesses, which are much slower than accessing registers. Because registers are so fast, real-world compilers store as many values in registers as possible. Algorithms for assigning values to registers can get quite complicated, but you can implement a basic version of register allocation for extra credit.
You should store both variables and intermediate values in registers while there are registers available.
Additional values should be “spilled” onto the stack (either to a pre-allocated stackframe or using push
to extend the bottom of the stack).
Make sure to consider the distinction between caller-save and callee-save registers.
Which ones can store variables and which can store expression values?
Although it is possible to keep track of the stack height as you compile expressions, your compiler will probably need to know how many variables the program uses before it starts printing the assembly. You can “pre-traverse” the AST to find all the variables that are read/written.
Even this simple heurestic should give a 10% to 100% speed improvement on almost all the stage7
tests.
We haven’t included automated tests, so let Adam know if you have implemented this extra credit.