-
Notifications
You must be signed in to change notification settings - Fork 7
Frontend
The frontend converts a validated abstract syntax tree (AST) in ESTree format to IR. The input AST is validated, in the sense that Source restriction checks are assumed to have already been made. (In practice, Sourceror Driver uses js-slang to parse the user program, so it is indeed validated already.)
In converting the ESTree to the IR format, the tree structure is largely preserved. The main things that are done in the frontend are:
- Variable names are converted to their locations (i.e. local/global variable indices)
- Some local variables are put into synthesised struct fields so that they can be referenced (a local variable needs to be referenced if it is captured by a function in a manner where modifications might be made after the capture point (in Java terms, this means that the local variable is not "effectively final")
- Function closures are explicitly created
- Top-level code is transformed as per usual top-level transformation rules
Source has five primitive types: undefined, number, boolean, string, function. If we count the "unassigned" value as a separate primitive (which is the case for the IR), then there are six primitive types.
Sourceror's IR introduces a way for creating arbitrary new record types, which are composed of other types. These record types are similar to Java objects — they are always placed in the heap (unless lifetime analysis moves them elsewhere) and held as references. They are pure data types — there is no OOP support of any kind, but individual fields of a record can appear in the IR whenever variable names are allowed to appear (for example, the equivalent of x.a = y.b;
is expressible in the IR). They are strongly typed — it is possible to have two different record types that contain the same list of constituent types, and it is possible to be recursive (e.g. we can define a type that is morally equivalent to L = Pair<T, L>
, which represents an infinite list). However it is debatable whether being strongly typed in this sense was a good decision. In Sourceror, such record types are called Structs.
The existence of Structs allows us to express local variables that need to be referenced in a closure — we simply synthesise a struct to place them inside. In practice, every scope gets its own struct type, and its fields contain the local variables that might be captured. (Currently all fields, whether captured or not, go into a struct; this is inefficient and should be changed in the future.)
Structs may also be useful in the future to implement pairs and arrays. There is a hope that an "extended" version of Source might be implemented — one that defines some syntax for creating records in the language. This would be enabled when compiling libraries, so that pairs (and perhaps some of the stuff in external libraries) can be implemented without any compiler magic. Source arrays may also be implemented using structs once Sourceror supports builtin raw arrays (a raw array is an unresizeable array, which is kind of like a Java array). (Note that Source arrays, like JavaScript arrays, need to be automatically growable.)
In general, the frontend is a depth-first traversal of the ESTree with a "Context" parameter. The context parameter contains the current map from names to locations, list of environments (i.e. structs that represents some scopes) visible to the current location, logging functionality, and a few other things.
The traversal is designed as methods on the context object (so we kind of do context.parse_function(estree_func) -> ir::Func
). This was somewhat of an experiment, on hindsight it might have been better to make the context be an explicit parameter like the design of the backend. The problem with the current design is that the context object "knows too much" — it is involved in almost everything.
When entering a scope, names and variables are added to the context; when leaving a scope, those names and variables are removed so the original context is restored.
The error handling is rather messy in this crate. Errors have a severity level that should be specified.
In general, we want to emit an error as soon as we know that there is an error — this allows us to provide the most specific information about the error. As a rule of thumb, if a function returns Option<T>
for some T
, it means that no error has been emitted yet (so the caller is free to decide if it is an error). However, if a function returns Result<T, Error>
for some T
(Error
is an alias for ()
), it means that the error (if any) has already been emitted (so the caller should not emit another error message (though it might emit a note or hint)).
Every scope (i.e. every block) gets its own struct to place local variables in. Currently, every local variable goes into a struct. However, it is expected that in the future, only captured variables that require references will be placed in a struct — other local variables simply exist on their own. This should lead to significant speed-up in execution.
We use a Source-like pseudocode to show the examples. In reality, they are encoded in the ESTree format and the IR format respectively.
Functions prefixed with double underscore (__
) in the examples below are for exposition only. In reality, there is a separate way of encoding them directly into the IR format.
Ignoring the synthesis of structs for simplicity, name resolution looks something like this:
ESTree | IR |
---|---|
const x = 2; const y = x + 5; const z = x * y; { const ans = z; return ans; } |
__locals(3) // register 3 locals local#0 = 2; local#1 = local#0 + 5; local#2 = local#0 * local#1; { __locals(1) local#3 = local#2; return local#3; } |
__locals(n)
is a magic function that declares n
locals that live until the end of the current scope.
Each variable name is converted to its location. The Context is used to maintain a map from name to location, adding/overwriting entries as we enter a scope, and removing/restoring entries as we leave the corresponding scope.
When we need to synthesise structs to place locals (because these locals might need to be referenced by a closure), we do something like this:
ESTree | IR |
---|---|
let x = 2; const f = (y) => { return x + y; } x = 3; f(10); // should return 13 |
__locals(1) // register 1 local // make new record with two fields on heaplocal#0 = __alloc_object(2); local#0.field#0 = 2; local#0.field#1 = __make_func(local#0, …) local#0.field#0 = 3; __call(local#0.field#1, 10); |
__make_func()
is a magic function that will return a function with its closure. So local#0
is put inside the closure of the function f
, so that we can refer to local#0.field#0
from within f
.
Since the heap is garbage collected, the object referenced by local#0
will live for as long as the value returned by __make_func()
lives.
When a function (e.g. (y) => { return x + y; }
) is parsed, a Function is created, and it is a pack containing a function pointer and a second pointer. Function is not like a struct — it is placed directly on the stack and does not have the reference semantics of structs. The "second pointer" is a type-erased value that will be used as the first parameter when making an indirect call to this Function. We say that it is type-erased because it is impossible to recover its type (unless the function held by the function pointer has some way of communicating the type back to the caller, or you happen to be using a GC implementation where you can interrogate the type of a value). In theory you can store anything that fits in an i32 in the second pointer (because it is treated as an opaque value until it is passed into the function being called), but currently we only store structs in the second pointer. Specifically, we synthesise a new struct type for every function definition, and this new struct type contains pointers to all environments referenceable from inside the function. (In the future, we will not store pointers that we will never use (see the "Synthesising structs" section).)
ESTree | IR |
---|---|
let x = 2; for (let i = 0; i < 10; i = i + 1) { const y = i; const f = (z) => { return x + y * z; } } |
The red boxes represent data that are stored on the stack. The orange boxes represent data that are stored on the heap. The white labels on top of each set of orange boxes represent the names of the structs. Names don't actually exist in the output binary. Whether each struct type has an "id" in the binary is an implementation detail of the heap manager (GC) and is not accessible to the IR.
In the future, it is hoped that optimisations will allow the IR to be transformed into something like this:
ESTree | Box diagram | Optimisation |
---|---|---|
let x = 2; for (let i = 0; i < 10; i = i + 1) { const y = i; const f = (z) => { return x + y * z; } } |
None | |
let x = 2; for (let i = 0; i < 10; i = i + 1) { const y = i; const f = (z) => { return x + y * z; } } |
y is never modified after the capture point so we can _copy_ the value of y into the closure struct instead of holding a reference to local#1 . We save an indirection when accessing y from the closure, and we unlock further optimisations.
|
|
let x = 2; for (let i = 0; i < 10; i = i + 1) { const y = i; const f = (z) => { return x + y * z; } } |
local#1 is never referenced by any closure so the local variables y and f can be placed on the stack itself. This saves an indirection, but more importantly it allows a WebAssembly implementation to store y and f in registers instead of in addressable memory, giving us a performance boost.
|
It is an open design question of whether such optimisations should be done in the frontend or in the IR (in a more general manner, because we would not know whether a struct is synthesised for a function closure).