PA1: Assembler

As with PA0, you will develop and submit this assignment through Github. Here is the link to accept the assignment https://classroom.github.com/a/6hRQ_pnv.

Your job in this assignment is to implement an assembler for Grumpy assembly code. That is, you'll be implementing a program assem that reads assembly code for the Grumpy virtual machine (GrumpyVM) and outputs corresponding bytecode.

Assembly language program        Equivalent bytecode program
     |                                  |
     v          /-------------\         v        
<filename.s> => |    assem    | => <filename.o> 
                \-------------/                 
                     ^| This assignment

The assignment template is a workspace containing two crates:

a library crate called grumpy

a binary crate called assem.

The grumpy library crate is a collection of modules implementing the various components of the Grumpy compiler. For, now it has only three files:

lib.rs -- the crate root file
isa.rs -- Grumpy instruction set types
assemble.rs -- the Grumpy assembler.

To help familiarize yourself with the grumpy crate and its modules, you can run the following command to generate and open its documentation in a browser:

cargo doc --open

For now there are only two modules within grumpy: assemble and isa. In future assignments we'll be adding more.

The template files as provided are not complete. There are three main problems to solve:

Translating Grumpy assembly into an in-memory data structure
Laying out the parsed program in memory in order to resolve labels
Printing the resulting bytecode.

We discuss each in turn.

Parsing

The input to your assembler is a file <filename>.s containing text like:

setframe 0
push Lmain
call
halt
Lmain:
push 3
push 12
binary /
ret

This program returns the result of the operation 12 / 3.

Line by line, the program does the following:

Instruction	Effect
setframe 0	Set the frame pointer to 0
push Lmain	Push the address corresponding to label Lmain onto the stack
call	Call the address corresponding to Lmain
halt	Halt the machine
Lmain:	Declare label Lmain as the address of the following instruction
push 3	Push 3 onto the stack
push 12	Push 12 onto the stack
binary /	Pop 12, pop 3, push (12 / 3)
ret	Return to caller

Your first step is to write functions for reading such programs from a file, storing them in an internal data structure (described below).

The file grumpy/src/isa.rs contains the type of machine instructions that are natively supported by the Grumpy virtual machine:

pub enum Instr {
    /// Push(v): Push value v onto the stack.
    Push(Val),
    /// Pop a value from the stack, discarding it.
    Pop,
    ...

The pub in this type declaration makes Instr a type accessible in files that import the isa module.

If you look closely, you will see that labels don't appear anywhere in the native instruction set. This is because all jumps in GrumpyVM programs must be to statically known instruction locations (indices into the instruction cache). However, Grumpy assembly programs can contain labels, so we define a type of "pseudo-instructions" that augments native instructions with support for labels.

pub enum PInstr {
    /// Label the next instruction.
    PLabel(Label),
    /// Push a label onto the stack.
    PPush(Label),
    /// Native machine instruction.
    PI(Instr),
}

You must write parsing functions for reading Grumpy assembly programs from a file and storing them as arrays of PInstrs (i.e., Vec<PInstr>). The template contains starter code for implementing the FromStr trait for PInstr and related types, which is our recommended approach.

Here's a grammar that defines the set of programs your assembler should accept:

Unary Operations
u ::= neg

Binary Operations
b ::= + | * | - | / | < | ==

Values 
v ::= tt    // The unit value
    | n     // 32-bit two's complement signed integer
    | true  // Boolean literals
    | false 
    | undef // The undefined value

Labels
l ::= L[a-zA-Z0-9]+  // A label is the character 'L' followed by one or more alphanumeric characters
    | _L[a-zA-Z0-9]+ // or "_L" followed by one or more alphanumeric characters ("_L" is typically reserved 
                  // for fresh labels generated by a compiler producing GrumpyVM assembly).

Instructions
i ::= push v     // Push a value
    | push l     // Push a label
    | pop 
    | peek u32   // Where u32 stands for a 32-bit unsigned integer
    | unary u
    | binary b
    | swap
    | alloc
    | set
    | get
    | var u32
    | store u32
    | setframe u32
    | call
    | ret
    | branch
    | halt

Instructions or Labels
il ::= i         // An "il" is either an instruction 
     | l:        // or a label "l" followed by a colon ":", as in "Lmain:".

Grumpy Assembly Programs
p ::= il_1
      il_2 
      ... 
      il_N

Your assembler should be able to handle any assembly program that is valid according to this grammar. If you want additional examples of valid Grumpy assembly programs, take a look at the .s files in the tests directory.

Resolving Labels

A program such as

setframe 0
push Lmain
call
halt
Lmain:
...

contains labels like Lmain that stand in for the addresses of instructions in the program. To produce a program that can be executed on the GrumpyVM, the assembler (in grumpy/src/assemble.rs) needs to resolve the labels to these instruction addresses.

For the example program above, the labels could be resolved as:

Address	Instruction
0	setframe u32(0)
1	push loc(4)
2	call
3	halt
4	push i32(3)
5	push i32(12)
6	binary /
7	ret

Some uses of labels, like push Lmain, point to declarations (Lmain:) that appear only later in the program. One way to deal with this is to make two passes through the program, the first to resolve the addresses of labels, the second to replace the labels with their resolved addresses.

A Note on Instruction Addresses: When addressing instructions, your assembler should use addresses that correspond to the index of the instruction as it appears in the program. That is, the first instruction (above, setframe 0) is at address 0, the second (push loc(4)) is at address 1, and so on. In the bytecode representation of GrumpyVM programs (about which more below), different instructions are not necessarily the same size. Instruction addresses therefore abstract from the physical layout of GrumpyVM programs in bytecode; the GrumpyVM (PA2) will resolve this discrepancy when it decodes GrumpyVM bytecode files.

As a library crate, the grumpy crate is not executable by itself, but it provides the functionality of the Grumpy compiler for use by executable (binary) crates. The executable for this assignment is in the assem binary crate, where we perform all I/O and use functions and traits imported from the grumpy library crate to parse GrumpyVM assembly programs and assemble them to bytecode.

Printing the Bytecode

The output of your assembler should be a new binary file <filename>.o containing the bytecode for <filename>.s with all labels resolved. For the program above, this bytecode is:

000000 0000 0008 0b00 0000 0000 0400 0000 040c
000010 0f00 0100 0000 0300 0100 0000 0c04 030d
000020

as displayed in hexadecimal by the linux command od --endian big -A x -x <filename.o>. The first 6 characters of each line, e.g., 000000, are the addresses in hexadecimal of the beginning of the line. These addresses are output automatically by od; your assembler does not need to generate them.

How do you figure out the byte-level encodings of GrumpyVM's instructions? Your first point of reference should be the GrumpyVM documentation, which describes in detail the encodings of all the instructions (in particular, the section you should focus on is Instruction Bytecode Format). Stepping through the encodings to reverse engineer the .o output above, we get:

Bytes	Meaning
00 00 00 08	`N`, the number of instructions, equals `8`
0b 00 00 00 00	Instruction is `setframe` (`opcode = 11`), argument is `0`
00 04 00 00 00 04	Instruction is `push` (`opcode = 0`), argument is a location (prefix `04`), location is `00 00 00 04`
0c	Instruction is `call` (`opcode = 12`)
0f	Instruction is `halt` (`opcode = 15`)
00 01 00 00 00 03	Instruction is `push` (`opcode = 0`), argument is an i32 (prefix `01`), integer is `00 00 00 03`
00 01 00 00 00 0c	Instruction is `push` (`opcode = 0`), argument is an i32 (prefix `01`), integer is `00 00 00 0c` (`= 12`)
04 03	Instruction is `binary` (`opcode = 4`), argument is operation `/` (code `03`)
0d	Instruction is `ret` (`opcode = 13`)

Hints

GrumpyVM object files begin with a binary encoding of the unsigned 32-bit big-endian integer N, the number of instructions (not the number of bytes!) in the program. If you forget to output N, the GrumpyVM won't be able to execute your .o files.
Make sure to test your assembler by running the GrumpyVM on the object files you generate, for a variety of test programs. We will test your assembler against both student tests and against a test suite that we have designed.
Remember: all integers, etc., should be encoded in big-endian format. Take a look at the Rust documentation, e.g., for u32, to see how to convert integers to byte arrays in big-endian format.

Testing

One way to test your code is to run it on a variety of GrumpyVM assembly (.s) files, ensuring that for each program you output bytecode (a .o file) that's runnable by the GrumpyVM (and on which the VM produces the right result!).

To assist you, we've collected a number of sample programs in the tests/ directory. At a high level:

Files <file>.s are GrumpyVM assembly programs.
Files <file>.o are GrumpyVM bytecode files.

To minimally test your assembler, ensure that for each <file> in tests, the .o file your assembler produces when run on <file>.s is exactly the same as the reference <file>.o that we give you.

Be sure to test your program on the Ubuntu Prime machines before submitting to work out any bugs.

Specifics

Include at least one novel test in your code in the form of a GrumpyVM assembly program. This will be worth 10% of your grade.
Implement a program assem that reads Grumpy assembly programs from <filename>.s and outputs the resulting bytecodes to <filename>.o. You can assume that <filename>.s is the first argument to assem (the zeroth being your assembler's path). You may implement the assembler however you wish (although most everything except I/O and basic plumbing should be in the grumpy crate), but the starter code provided in grumpy/src/ suggests a reasonable strategy.
Make sure you update your Cargo.toml files (assem/Cargo.toml and grumpy/Cargo.toml). The authors section should be

authors = ["YOUR NAME <YOUR EMAIL>"]

On valid Grumpy assembly programs, assem should return exit code 0. On invalid programs, your assembler should exit with a nonzero exit code.
You must include at least 3 non-trivial unit tests in your code.
Submit your assignment through Github on or by the due date. Make sure you've pushed all your code to the Github repository (master branch). Please include your name and the word "submission" somewhere in the final commit message.

Submission

Submit the URL of your github repo to Blackboard, which should include your Github ID. If not, make sure to identify your Github ID in submission comments.

Helpful sites:

https://oucompilers.github.io/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1.md

1.md

PA1: Assembler

Parsing

Resolving Labels

Printing the Bytecode

Hints

Testing

Specifics

Submission

Helpful sites:

Files

1.md

Latest commit

History

1.md

File metadata and controls

PA1: Assembler

Parsing

Resolving Labels

Printing the Bytecode

Hints

Testing

Specifics

Submission

Helpful sites: