Aaron Gutierrez & Shyam Raghavan
15-411 Final Project
Fall 2015
As current 15-122 TAs, we felt as though an interesting and difficult project would be to retarget the L4 compiler to a minimal adaptation of Clac. Through the implementation of this retargeted compiler, we hoped to learn the following: how to retarget a register-based language to a stack-based virtual machine language; how to minimize the instruction set needed for a normally very large register-based instruction set to a very small stack-based instruction set; and how to develop conventions for a very small language that mimic behaviors of a very large language.
In order to learn more about Clac and the process of compiling to a stack-based language, we researched information about Forth and the development philosophy behind Forth. This helped us determine what modifications were available to make Clac more targetable.
We also researched methods for considering the optimizations of stack-based virtual machines and for register allocation on the stack. While we decided that we would not focus on optimizations and would instead ensure correctness, we did implement a simple register allocator to better utilize the operand stack.
Retargeting the register-based x86-64 assembly language into a stack-based virtual machine language allowed us to delve into the specifics of the implementation of compilers that target stack-based virtual machines (like the Java Virtual Machine). One of the traditional advantages of a stack-based virtual machine language is that the instruction set (and therefore the number of bytes needed to represent each instruction) is very small - a disadvantage (that we clearly rediscovered) is that because the instructions are less specific, more instructions (and the total number of bytes needed to represent them) are needed to achieve the same functionality. Finally, we also needed to develop conventions on function calling, function flow, and memory - because we were retargeting from x86-64 assembly, most of the conventions we developed were inspired by it.
The development of this retargeted compiler, while not immediately useful, helped us learn skills necessary for engineering software for a rapidly-moving, ever-changing target, required us to be very careful in the engineering and the maintenance of conventions for a language and virtual machine, and forced us to consider tradeoffs between performance and correctness with the short amount of time and resources we had to complete the compiler.
For this project, we chose to retarget C0 code to a minimal adaptation of the interpreted language based off of Clac used in 15-122 Principles of Imperative Computation called Clac.
In order to accomplish this, we must minimally alter the Clac runtime to allow for heap-like memory access and allocation. We must also alter the compiler to produce a stack-based internal representation and a back-end Clac-based representation. In sum, this means augmenting the internal representation and the back-end of the compiler.
We aimed to keep the clac language as close to the original as possible. The majority of the work in this project is derived from the difficulty in adapting an internal representation designed to target x86-64 assembly to output clac. While we could have improved performance and optimizations, we chose to instead focus on correctness and developed with the explicit belief that correctness was paramount over performance given our time constraints and the resources available.
The Clac language needed to be extended in order to successfully compile all (or at least a significant portion) of the features needed by the L4 language. We started with a working implementation of the Claculator from the fall 2015 version of the 15-122 assignment, which has user-defined tokens. We were able to implement almost all of the control flow needed using just user-defined tokens, with one exception, noted below.
Our first modification adds a mutable, global array of values to complement the operand stack. To use the memory array, the following instructions were added:
n get
will take the value stored at index n and place it at the top of the operand
stack v n put
will place the value v at index n
in the memory array n alloc
allocates a contiguous block of memory for
n values and places the index of the
first location on the top of the operand stackThese instructions were designed to be in the spirit of the original Clac
language. The get
and put
tokens are similar in behavior to the !
and
@
words, respectively, in the Forth language upon
which Clac is based. The alloc
token serves an
analogous purpose to the same keyword in L4.
Additionally, the remaining comparison operators,
<=, >=, =, !=
were added. These behave as
expected. Bitwise operators were added to support bitwise arithmetic. Bitwise
and (&
), or (|
), and
exclusive or (^
), along with arithmetic shifts
(<<
and >>
),
behave as they do in C0 and L4.
A couple other modifications were made to support programs that do not return
a value. The abort
token causes the Claculator to
raise a SIGABRT
signal, used for assertions.
c0_mem_error
raises a
SIGUSR2
signal, used to indicate memory errors in the
source program, rather than the Claculator.
With the above extensions, we are able to compile almost all programs in the
L4 language, with one large exception. Clac’s only control flow tokens are
if
and skip
. In order to
handle programs that contain loops, the loop guard and loop body become separate
functions. As a result, without further modification, we could not compile
programs that return inside the body of a loop.
One of the features of the calculator that we aimed to preserve is the simplicity of the interpreter loop. Though the introduction of user-defined tokens and the call stack complicates things slightly, for the most part it consists of a queue where each token is considered exactly once. If we were to add backwards skips, the interpreter would become significantly more involved. So in order to support returning from the body of a loop, we made hack that aims to leave the interpreter loop simple.
To overcome that shortcoming we introduced a special
return
token. For most functions,
return
causes the rest of the instruction queue to be
skipped, and the stack is left unmodified. But for functions whose names start
with ‘.’ return
causes the current function
and the calling function to return. If the calling function’s name also
starts with ‘.’, it’s calling function also returns recursively. We are
therefore allowed to use functions like basic blocks and keep the interpreter
loop relatively simple.
In order to better match the behavior of a C0 program, the Claculator was modified to not print a prompt or the operand stack when reading from a file, and to return the top value on the stack, if it exists, when exiting.
Because Clac programs are quite significantly different than x86-64 programs, we had to add a completely new back end to our compiler. We are able to use the same intermediate tree representation for both the x86-64 and the Clac back end. From that point, we add two new steps to generate Clac code.
Our first translation targets a high-level stack-based form that contains variables, branches, and unconditional jumps. This serves mostly as a translation from infix to reverse polish form. An example program in this form can be seen in Figure 1.
From this high-level language, then translate to Clac. Because Clac doesn’t specify how the memory array should be used, we emulate the x86_64 stack discipline. All function arguments are passed on the stack, and temporary variables are assigned from the highest-index down. Our Clac implementation creates a 50,000 element memory array, so we start from index 49,999. To simulate stack frames, we use memory address 0 as a pseudo-stack pointer. For example, the nth local variable for a function is access by:
0 get n + get
The value at address 0 is decremented the appropriate amount at the start of
each function, and increment before every instance of
return
.
Conditional branches like
… exp if stms else stms …
are expanded to
… exp if .1 1 skip .2 …;
: .1 stms ;
: .2 stms ;
Loops such as
… .1 exp if stms goto .1 else …
become
… .1 …;
: .1 exp if .2 1 skip .3 ;
: .2 stms .1 ;
: .3 ;
Translation of arithmetic and stack manipulation expressions are mostly trivial. The final output of our sample program can be seen in Figure 2.
: _c0_foo 0 get 3 - 0 put 0 get 0 + put 2 0 get 0 + get * 0 get 1 + put 0 get 0 + get 0 < 1 = if .6 1 skip .7 0 get 1 + get 0 get 3 + 0 put return ;
: .6 0 0 get 1 + put ;
: .7 ;
: _c0_main 0 get 0 - 0 put 5 _c0_foo 0 get 0 + 0 put return ;
49999 0 put _c0_main
At the end of our Clac source file, we see our minimal runtime, which sets
the stack pointer to the largest index, and calls the function
_c0_main
.
Memory operations were very straight forward to implement. Unlike when
implementing L4 for x86-64, pointers are the same size as integers, because
pointers are just indexes into the memory array. Dereferencing a pointer p is simply
p get get
, and computing array and struct offsets are
greatly simplified by the fact that all small types have size 1.
In order to support array bound checking, the size of the array needs to be
stored in memory. We employ the same strategy as before, storing the size at the
index before the first array element. Bounds checks and null checks are
implemented exactly as before, adding an if statement before all accesses and
conditionally evaluating the new c0_mem_error
token in
Clac if the access is invalid. NULL
is represented by
the value 0.
In the most straightforward translation, the operand stack is rarely employed. To increase utilization of the stack, we implemented Koopman’s algorithm for register allocation on stack machines. The algorithm as implemented is described below.
Compute the set of temps that is used at some point after each line
within basic blocks. This analysis is greatly simplified as the only Clac
instruction that uses a value is get
.
For each temp that is used within a block, when it is first written to, additionally push a copy of the value onto the top of the operand stack.
For subsequent reads, replace the get
instruction with a pick
instruction that moves the
corresponding value to the top of the stack
For simple programs, our register allocator works correctly. However, our
implementation is not correct for all inputs, and time did not permit more
effort to correct our implementation. As such, register allocation is disabled
by default, but can be enabled with the –clac-alloc
flag.
Because of the nature of the stack, values are not always in the same location relative to the top. As such, care must be taken to keep track of where values are located.
0 get 3 - 0 put
15 0 get 0 + put
411 0 get 1 + put
0 get 0 + get
1000
*
0 get 1 + get
+
0 get 3 + 0 put
return
0 get 3 - 0 put
15 1 pick 0 get 0 + put
411 1 pick 0 get 1 + put
2 pick
1000
*
2 pick
+
0 get 3 + 0 put
return
In Figure 3, we see a sample program with and without register allocation.
Notice that we still place values in the memory array even though they are no
longer used. An implementation that recognizes dead
put
instructions is outside the scope of our project,
but would be interesting to study further.
To test our implementation, we reused most of the tests from Lab 5. Any test that used library functions were deleted, as we did not port the full runtime to Clac, but all other features of L4 were implemented.
To run tests against our implementation, we modified the existing testing
framework to pass the necessary flag for our compiler to target Clac. To run the
code and produce output, it then calls our modified Claculator with the compiled
Clac code. The Claculator is linked with a library that raises the correct
signals to test for memory errors, SIGABRT
is
generated using the C0 assert
function, and otherwise
the value returned from the main method is return by the main method of the
Claculator. Thus, we should see exactly the same effects from running our Clac
code using the Claculator and the same code compiled natively for x86-64. As
such, the testing framework took very little modification.
For speed, the Claculator is compiled with the unsafe C0 runtime using the
-O2
flag for gcc. Even so, execution time is
significantly higher than running the native executable. For instance, the gcd
benchmark runs around 500 times slower than our native compiler output with all
optimizations disabled. We could have chosen to increase the timeouts for
testing, but chose not to so that the grading time was still sane.
We explicitly made the decision to not focus on optimizations or performance - this decision was made with the reasoning that (a) the virtual machine implementation of Clac has no way of being possibly as fast as the hardware instruction set given by x86-64 assembly, precisely because the virtual machine implementation of Clac is compiled into x86-64 assembly, (b) stack-based machines are empirically slower than register-based machines because of the larger number of instructions, and (c) with the limited amount of time and resources we had available, we felt as though we needed to focus on either correctness or performance, and chose correctness to be paramount.
While we did make this decision, it is important to note that, as mentioned
earlier, the gcd
benchmark runs about 500 times slower
on the Clac virtual machine than on the native compiler output with all
optimizations disabled. In addition, running on a system with two quad-core
Intel X5560 CPUs running at 2.8Ghz and the default timeouts in the autograder,
208 tests timed out, as opposed to the regression tests, in which 72 tests timed
out.
Again, because this was not the focus of this project, we chose to disregard
performance issues and focus on improving correctness throughout our time
working on the compiler. In the future, to improve the performance of our
Clac-targeted compiler, we could improve the stack-frame addressing system -
rather than executing 5 instructions (0 get i + get
)
to get the ith stack-frame variable,
we could either alter the instruction set to include a stack pointer or improve
the ability of the compiler to decide what needs to be stored on the operand
stack and/or on the stack frame. We could also reduce the amount of dead code
created by the compiler - while function arguments are evaluated from left to
right in C0, the way we were able to implement this was by evaluating them from
left to right, dropping them off the operand stack, and then evaluating them
from right to left in preparation of calling the function (because the calling
convention developed expected the arguments in left-to-right order). In order to
reduce this dead code, we could either alter the calling convention to expect
the arguments in right-to-left order (and then, instead of dropping arguments
upon left-to-right evaluation, we leave it on the operand stack) or alter the
specification that arguments are evaluated in left-to-right order.
Regardless of the improvements we can make to the Clac runtime or the compiled code, we know that we will never be able to improve the performance of the Clac virtual machine to be faster than that of x86-64 assembly. Because the only test cases we fail are those that particularly stress the compiler or the runtime, we feel as though the performance is robust enough for demonstration to curious 15-122 students that wonder why they’re asked to add function definitions to a language like Clac.
On that point, with the single addition of the memory array, a large number
of C0 programs can be successfully compiled to clac. Provided that all functions
only return at the end, the addition of return
to clac
is avoided. With under 20 simple lines of additional code, a 15-122 student
could have a way to run non-trivial programs in their own clac implementation
for testing and exploration purposes.
In terms of qualitative analysis, it appears that Clac code is in fact (although precedent tells us that this is not the case) more concise than x86-64 assembly. This is probably because of the discrepancy between Clac being a virtual machine language and x86-64 assembly being a hardware instruction set.
The hardest areas to implement involved the design and fleshing out of the
conventions having to do with control flow and calling functions. The
difficulties presented themselves because there were no existing conventions,
and many possibilities either allowed for ambiguities or did not ascribe to C0
conventions. For example, while some solutions for things like
if-else
control flow did ascribe to C0 semantics, they
left room for ambiguities in the manner in which code that returned from a
branch was supposed to execute. This particular problem was resolved with the
introduction of a return
token, but decisions like
these forced us to carefully develop a set of guidelines and rules to follow
when designing and writing the compiler. While making these convention decisions
were not easy, the decisions that were made allowed the runtime for Clac to be
(a) minimally altered and (b) very easy to invoke.
In conclusion, while the retargeting of a C0 compiler to Clac virtual machine code is not the most practically useful, we’ve learned quite a few skills that will have very practical applications both in academia and in the software industry.
By completing this compiler, we were able to develop our own calling convention and control flow convention. We also developed a code generation step from an intermediate tree representation to a stack-based intermediate representation. All of these tasks required us to be very deliberate about documentation, to consider tradeoffs between performance and correctness with a limited amount of resources, and develop a piece of software for a rapidly-changing target (like from x86-64 assembly to Clac virtual machine code).
Finally, it has been an amazing experience and a really fun one building the largest single piece of code we’ve seen from start to finish. We’ve enjoyed the class immensely, and we’d like to thank Rob and the course staff for making it such an enjoyable experience.