-
Notifications
You must be signed in to change notification settings - Fork 7
REPL
Sourceror now supports a REPL, and is integrated with cadet-frontend.
The REPL allows the user to run additional code, making use of existing variables in the initial code. Consider the following sample execution:
Initial code
const val = 2;
function call(f, x) {
return f(x + val);
}
Initial code output
undefined
REPL #1
const y = val + 5;
REPL #1 output
7
REPL #2
call(x => x * x, 10) + y;
REPL #2 output
151
The above example illustrates the following hard requirements of a REPL:
- There may be an unlimited number of REPL invocations
- Variables (and functions) from earlier code (i.e. the initial code and earlier REPL invocations) may be used in each REPL invocation, and their values are "remembered" from previous invocations
- New functions (including arrow expressions) may be defined in the REPL, and they can be called by earlier code
There is another soft requirement for REPLs: Earlier code should not be re-run by a REPL invocation, because that would lead to observable slowness if the earlier code takes a noticeable amount of time. This is a soft requirement because it's not strictly necessary (since it's possible to simply delete the repeated portion of the output before showing it to the user), but is bad for usability because it would get slower as more REPL invocations are made.
WebAssembly does not have native REPL or dynamic linking support. Sharing linear memory between WebAssembly instances is not yet standardized, but even if it is standardized, we would still not be able to pass function pointers between modules (which is required for new functions to be callable by earlier code).
Observe that the state of a WebAssembly instance is limited only to the content of its linear memory and its global variables. This means that if we have a "lightly modified" version of the WebAssembly code from the previous instance, we can load the linear memory and global variables from the previous instance into the new instance before running it, and it would probably be able to understand the content of the linear memory and global variables, and hopefully continue running from there.
(Note: Local variables (on the stack) need not be injected into the new instance because it is not possible to access them from subsequent REPL invocations.)
On a high level, we append the new REPL code to the old program and recompile the combined program in its entirety, but we omit top-level code that were from the old program. Since the top-level is the only code that is actually invoked when running the program, omitting the top-level code from the old program means that we don't re-run what we have run previously.
As an optimisation, Sourceror archives bits of the input program as it goes through the various stages of compilation (specifically, the front-end parse state and the IR are archived), to make recompiling the combined program faster and to minimise possible incompatibility in the new code (see later section on compatibility).
Sourceror's C-like IR generates a new function from the top-level code of the input program, and tags this function as the "entry point" (i.e. the function that will be invoked from JavaScript to kick off execution). The generation of this function is modified to take into consideration only the new code from the REPL, so that we do not re-run the old program. All functions from the old program are copied into the new program, so that those functions can still be called by the new program.
Sourceror Driver handles the saving and restoration of the linear memory and global variables.
How can we be sure that the new program understands the linear memory and global variables from the old program?
We say that a new instance is compatible with the previous instance if the new instance is able to run with the old linear memory and global variables. So the question becomes "What additions to the old program preserve compatibility?"
Consider the content of the linear memory, as explained in Backend#content-of-linear-memory. There are three sections in the linear memory: the stack, heap, and static data (string pool and error locations):
- The stack has a fixed size, and is empty at the end of each invocation, so it doesn't matter.
- The heap would contain various objects, including live ones (i.e. those referenced (possibly transitively) from a global variable). Objects in the heap cannot be moved around arbitrarily, since there may be references to them from global variables or other objects. The ability to move objects around depends on the GC implementation, and we would prefer not to mess around with that.
- The static data refers to data in the linear memory that will not change during the course of the program. Currently, this contains two things: the string pool (containing all string constants) and the error locations (20-byte blocks containing information for possible runtime type errors). Every failable typecasting operation (including indirect function calls) carry a pointer to one of these blocks, and the data from the block will be passed to the host (JavaScript) error handler when an error occurs. String constants cannot be moved because globals and objects may hold a reference to them.
Since the heap and string constants cannot be moved, it follows that we cannot add any data to the static data region of the linear memory.
Another consideration is the WebAssembly table (containing the list of indirect functions that may be called). Since the index of indirect functions may be stored in the heap or globals (for example when storing a Source "function" object), we cannot move existing functions around in the table. However, we can add new functions to the end of the table, and we make use of this.
In summary, if we do not modify the static data (that is generated at compilation time) and do not modify existing table entries, the new program will be compatible with the old one.
To avoid modifying the table, we ensure the the backend generates table entries in order of encounter, and it processes functions in order. This ensures that new table entries are placed after the old ones.
Hence, REPL code that do not have string constants and do not have failable typecasts will work automatically.
For now, we sidestep this problem by placing one dummy error location block into the static data initially, and add a pointer to this block every time we need to compile an error location into code from the REPL. This means that there is no helpful error location for errors arising from the REPL code. Note that this is not transitive - if REPL code calls some initial code and the failure occurs inside the initial code, then the correct error location will still be displayed.
Since we cannot add to static memory from the REPL, we construct string constants on the heap as they are encountered at runtime. The series of instructions to write each character to the heap memory is encoded directly into the program. This means that:
- There is no string interning done for string constants from the REPL
- String constants make a heap allocation
- String constants take linear time to create, and cost linear space of instruction memory
Import resolution behaves in a somewhat peculiar way in Sourceror due to overload resolution and the way the dependency graph is built. The table might also get reordered depending on how imports are ordered. In theory it should be possible to make imports work from the REPL, but it may be difficult to do so. For now, we disable imports from the REPL (an error will be shown to the user).
Some GCs (notably Cheney) make use of indirect functions, and generate a few indirect functions per struct type. These functions need to be stored in the WebAssembly table since they are indirect. Since environments synthesise new structs, the number of indirect functions in Cheney may grow after being recompiled, which would throw off the indices of all subsequent functions in the table. To work around this, we simply pre-declare a large number of entries (currently set to 4096) initially, and leave the unused ones empty.
A side-effect of this implementation is that the initial code is "different" from the REPL code, and each REPL invocation must have some associated initial code. This means that users cannot invoke the REPL without first invoking the initial code (even if empty). It's certainly possible to compile and run an empty initial code first if the user wants to run REPL code without previously running the initial code, but this has not been implemented.