-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scheme for parallel/background vat execution #6447
Comments
Some issues raised in today's kernel meeting:
|
Messages pipelined to a promise are queued into the deciding vat, but that vat might resolve the promise (to an object on some other vat) before the delivery finally executes. Currently we detect this just before the moment of delivery and requeue the VDO onto the target-object owning vat's queue. We've talked about proactively yanking all such VDOs out of the deciding vat's queue at the moment of resolution instead. For this scheme, yanking it from the worker's queue is more difficult. One possibility is for the deciding vat to react to VDOs that target an unknown promise to just reflect them right back out into a |
vref/kref translation happens on the kernel side, independent of the time of delivery/syscall, which changes our synchronous execution model:
|
We actually have some other ones, like timer polls, which do seem like it relies on a synchronous result. I also see some devices like |
Yeah this approach would require changes to our syscall/device model.
Bundlecaps would need to be rewritten as read-only immutable "blobcaps", described above, instead of using The timer device is no longer intended for general access: only the timer vat should get a reference, and all userspace code should interact with the timer vat instead
These other devices are pretty specialized and would need new approaches.
I think I wrote the first version of |
Worker StateLet's look at what state each worker needs to record, and what features that state is supporting. Delivery ExecutionVats live inside orthogonally-persistent JavaScript runtimes. The primary state is contained within the object graph (the JavaScript "heap") of a running XS engine (although see about the "Vatstore" below). To process a delivery like --image: deliveries and in-RAM xsnap worker state If this program could remain running forever, we might not need anything else. But sooner or later the host computer will get rebooted, and/or there might be too many simultaneous workers and we need to shut some down to make room for more active/important ones. So we need a way to preserve the Writing a heap snapshot is expensive, so we don't do it after every delivery. Instead, we record a "transcript" of what happens during execution, allowing us to snapshot only once every few hundred deliveries. When we need to reload the worker, we start from the most recent snapshot, then re-execute the transcript to bring the worker back up-to-date, relying on the strictly deterministic execution model of our vats. Each transcript entry allows a unidirectional transition from one The transcript entries include both the delivery and a record of all the syscalls made by the worker during its execution, plus their results. Syscalls (especially The heap snapshot interval drives a tradeoff between normal-runtime performance (more frequent snapshots slows down delivery execution) and restart performance (more transcript entries to re-execute). --image: deliveries, RAM worker, transcripts, heap snapshots We want to do as little replay work as possible, so we always start from the most recent snapshot available. That means each time we write out (and commit) a new heap snapshot, we can delete the older transcript entries, as well as any older snapshots. "Sleeper Agent" replay-based upgradeWe plan to support two forms of vat upgrade. We have implemented the first, named "baggage-style", which uses the This is independent of retaining transcript entries for worker-reload/replay purposes. We do not retain syscalls or syscall results forever, because a future sleeper-agent -style replay can make different syscalls. revert()Sometimes, we need to roll back a vat to some recent state. There are two circumstances that might require this. The first is a failed baggage-style upgrade. We currently implement these upgrades with a The second would be if an arbitrary delivery causes some sort of invariant or metering failure, and we decide the best course of action is to pretend that delivery never happened. We don't have any way to express this right now without also terminating the vat: both metering failures and vats requesting self-termination ( If we support this, we'll need to retain heap snapshots and transcript entries for longer than before. We wouldn't keep them around forever (some new We don't anticipate needing to revert a vat back more than a single delivery: the execution model becomes much more difficult to reason about otherwise. We might choose to disallow --picture: transcripts, snapshots, retirement schedule VatstoreTo reduce RAM requirements for vat workers, we try to keep all the inactive high-cardinality state on disk. Vats use "virtual objects" via APIs like These live in a key/value backing store named the "vatstore", accessed with APIs Worker replay (e.g. when bringing a worker back online) does not reference the vatstore: it uses recorded Reverting the vat to a previous state requires that the vatstore also be reverted to a previous state. One approach would be a multi-version vatstore:
Another approach would be record an undo buffer (thanks to @Tartuffo for the insight):
The multi-version vatstore adds more overhead to queries and modifications, and would take some experimentation to see how much that affects performance. The undo/redo buffer might be more efficient if the buffer remains short (we retire deliveries quickly, giving up the ability to revert() very far), and may provide better support for worker-state snapshots (see below). --picture: vatstore, reverse deltas, retirement window Worker-state snapshotsThe final constraint comes from the need for Tendermint/Cosmos "state-sync" snapshots. These enable a brand new validator to download recent data from other validators, check hashes to ensure its correctness, then initialize their state from the download instead of replaying the entire (months/years-long) history of the chain. Chains which keep their entire state in the Cosmos-SDK IAVL tree will get this for free, because IAVL is a multi-version Merklized key-value DB. The root hash of every version is included as the block's Our chain holds considerable state outside the IAVL tree. To add our vat worker state into this system, we need several features:
(kernel state must be included too, but we'll defer that to later: there's only one kernel, so the problem should be easier to solve) First, we define the overall state to include only the deliveries that have been collected by the kernel within that particular block. The worker may have seen newer "head start" deliveries, which may or may not have been executed yet, but the state-sync snapshot will ignore these. The new validator's kernel will be responsible for re-submitting them when it starts the new worker. Now, to rebuild a worker at that state-sync point, we need:
Of these, old
The multi-version vatstore DB (with the The new validator's bringup work would be minimized if we could somehow perform an A third approach might be record a vatstore snapshot at the same time as we write out the XS snapshot. Then, to rebuild a later state, we perform a special kind of replay that draws from the deliveries, not the transcript entries, and uses the real vatstore DB. This would increase the periodic cost of snapshot writes (dumping the whole vatstore is even more expensive than writing an XS snapshot), but would be simpler, and wouldn't require forward deltas. The new validator would have to replay more expensive work (delivery execution, rather than merely applying DB deltas), but perhaps there would be less of it. Kernel-state snapshotsThe kernel state includes run-queues, c-lists, object/promise tables, and meta-information about each vat. It does not include vat heap snapshots, vatstore contents, completed deliveries, or transcript entries. The "head-start" deliveries (where the kernel has committed to the order of their delivery, and has provided the worker with a copy so it can get ahead, but has not yet collected the results) live in the kernel's queues (in particular the per-vat input queue), and will be (re-)submitted to workers in a brand new validator. We never need to revert the kernel state (we only commit on block boundaries, and our chain has finality, so we never unwind a block), making this task simpler than with vats. I think we can follow a similar approach: periodically write out the entire kernel DB into a single snapshot, write out a forward delta after every block, hash the most recent snapshot plus the hashes of the deltas to compute a "kernel state hash". We include both the kernel state hash and the vat worker hashes into a single swingset hash, and submit that to the cosmos-sdk side (i.e. write it into an IAVL slot) at the end of every block. The vat worker hashes will only change for vats that have been touched during a block: delivery execution does not change their state hash, but collecting results does. If/when someone asks for a state-sync snapshot, the kernel must spend some time (in the "background") pulling deltas from the DB and writing them to retrievable chunks. The previously-published hash can be used to validate the contents of these chunks. |
The multi-version DB is related to a concept of "soft deletes". I saw an article (and some discussion) that described some common practices for this, and one thing that jumped out at me was the idea of using a separate VIEW to hold the "current version" table, to reduces the chances of a mistake (if accessing code fails to pay attention to the |
The
The first 3 use the results from chain sends, however only |
@FUDCo and I were talking about how to get vatstore operations out of the transcript, and how the main reason they're in there is so that we can replay a vat from the snapshot point back up to the present without consulting the vatstore data (which is too new to be referenced by the That would look like:
We'd wind up with a bunch of outstanding state in the WAL file, which degrades performance when it gets too big, so we'd need to measure that and see how bad it gets. It would be limited by the computrons-before-heap-snapshot limit, which might be sufficient to keep the perf degradation low enough. A tricky part would be that all other state (especially whatever transcript parts need to be store) must go into a different DB, which we do commit after each delivery. And that opens up the question of hangovers when we crash between the commit points of the two DBs. But this might be much simpler than trying to build a versioned DB out of SQLite, either by using a schema with generations and deletion tombstones, or by accumulating a bunch of forward/reverse deltas. |
I love the idea. My main concern is ensuring consistency on replay and whether we want to abandon that. Right now we quickly detect XS or liveslots bugs because they result in vatstore anachrophobias. Without vatstore in transcript, the execution will still diverge but we'll only notice later when something more drastic happens (e.g. different promise resolution order). This may be mitigated by putting delivery metering under consensus check. I'm wondering however if we shouldn't keep a separate non-exported/local vatstore transcript which could be used to enforce consistency on replay. This could be optional and enabled only on nodes we run? It would also allow us to generate "full transcripts" so that we can replay in tools without implementing an actual vatstore. |
I gave this some thoughts, and along the need to replicate vatstore in vstorage, this actually makes things a bit more complicated than I hoped. Since I believe we should look at moving vatstore in its own DB separately / before parallel execution, I've written up the details in the existing issue for this: #6254 (comment) |
Notes on parallel execution of vatsThis is my take on doing this, largely written (at Brian's suggestion) without close study of the discussions above so as to avoid unconsciously prejudicing my thoughts and possibly missing problems or solutions. OverviewVats interact with the kernel at three points: dispatch, syscalls, and crank completion. Dispatch and crank completion are paired: (1) The kernel transmits a delivery to the vat. The delivery typically, though not necessarily, represents a message from another vat, but it can also represent requests or notifications from the kernel directly. All deliveries originate with entries on the kernel's run queue, which are transmitted to vats in FIFO order (modulo having multiple run queues with different priorities, which, while affecting the order in which messages get delivered, does not affect the fundamental logic of the kernel-vat interaction -- plus, of course, multiple prioritized kernel run queues are not yet actually implemented). (2) The vat executes, while the kernel simply waits (albeit processing syscalls, about which more shortly). (3) The vat transmits a completion to the kernel. The completion indicates the status of the dispatch (i.e., success or failure) along with a precis of resources (computrons) consumed. At any given time, either the kernel is executing or one particular vat is executing, even though the kernel and the vat (in fact, all the various vats) may each be in separate processes (or, potentially, even on separate machines). This costs us in two ways: First, only one thing at a time gets to execute, even though we have the resources to execute N things at a time. Second, each exchange of control from kernel to vat or from vat to kernel introduces significant, unavoidable latency (this latency applies to syscalls as well as to the dispatch/completion handshake). The opportunities for parallelism here should be obvious, both between the kernel and the vat(s) and among the various vats themselves. The highly synchronous nature of the current vat-kernel interaction exists mostly to simplify implementation, principally in service of preserving determinism in the order of execution of the work specified by the run queue and in the management of vat state that is held in the kernel's database. However, it should be possible to achieve the same ordering though cleverness rather than brute force, and thereby enable things to run considerably faster. The sequence of deliveries transmitted into a vat can be treated as a FIFO stream. The sequence of completions transmitted out of a vat can also be treated as a FIFO stream. As long as the FIFO ordering is maintained*, there is no reason in principle that vats cannot be made free running and no reason the streams themselves cannot be pipelined. The things in the current implementation that would prevent this from just working out of the box are few and enumerable: (1) syscalls, (2) platform failures (either of processes or of the communications channels between them), and (3) cross-vat event ordering. (* There is an additional possible complication in ensuring that the vat itself maintains the proper FIFO ordering without engaging in "causal lookahead"; this will be addressed in a separate section below.) SyscallsSyscalls represent the other important point of kernel-vat contact aside from delivery and completion. However, the syscalls that a vat makes do not, for the most part, actually require synchronous coordination, since they do not typically have a return value. They can, instead, be viewed simply as outputs from the vat that could, in principle, be bundled with the completion (though potentially taking advantage of pipelining opportunities, i.e., start transmitting the "completion" before it is actually complete). The exceptions to this are (a) the various vatstore requests, which currently require synchronous access to the kernel database, and (b) the 'invoke' syscall, which currently entails a synchronous interaction with the device being invoked. We can address (a) by giving each vat its own vatstore database. Since operations on one vat's vatstore do not interact in any way with another vat's interactions with its vatstore (nor do vatstore operations interact in any way with the kernel's own access to its own durable state) this ought to be straightforward. It also provides a further opportunity to speed up computation since the logically independent database operations of unrelated entities can then be allowed to overlap arbitrarily without consequence. Addressing (b) is more complicated. I can see two general strategies for dealing with this: (i) make all device interactions fully asynchronous, thereby allowing them to be made an ordinary part of the regular dispatch/completion flow, or (ii) bind a device exclusively to a given vat for the duration of their interaction, so that even though their interaction is nominally synchronous, it does not flow through the kernel (or other vats) while it is happening -- even if this entailed an interprocess handshake between the vat and the device's pseudo-vat (much like the current vat-kernel interaction), it would still be a net win because most vats don't interact with devices at all and even those that do don't do it much compared to the volume of other things that they do. Platform failuresI believe issues of process or communications failure can be dealt with by adopting a Waterken-like protocol between kernel and vat. Although the specific contents and meaning of the communications between vat and kernel are quite asymmetric, the communications relationship itself can be largely symmetric: each side retains a record of what it has transmitted, which it retains until it receives acknowledgement that its transmissions have been successfully received and durably recorded by its counterpart (we can leave open for now the design question of whether the acknowledgements themselves are specific acts or whether they piggyback on top of the other normal acts of communication that flow between the two parties). If one side or the other of the relationship crashes, upon restart and reconnection the first thing each party does is communicate to the other the identity (e.g., sequence number) of the last thing it remembers successfully receiving, whereupon its counterpart resumes transmission from the point in the stream corresponding to the first thing that was lost. As long as vats are engineered to atomically record their end-of-crank state together with their crank completion communications, I think the issues of database commit synchronization between vat and kernel are handled correctly without any additional special logic. If the communications channel between kernel and vat is broken and then reconnected (which I don't think is currently a thing that can happen but that could change) all the same logic should apply -- from the kernel's perspective, a vat restarting and a vat reconnecting in many ways look the same. An unanswered design question is whether we would ever even want to have such a "reconnection" operation and if so which side of the kernel/vat relationship would be responsible for doing it. In particular, if a vat loses contact with the kernel, I think the best course is probably for it to unilaterally kill its own process (abandoning whatever crank is running) and wait for the kernel to restart it, rather than trying to do some kind of special communications resumption dance. A related open question is what the appropriate policy for dealing with actual kernel failure (as opposed to mere connection loss) should be. If the kernel process dies unexpectedly, what should the vats do? My intuition is that this should be treated by them as a form of loss-of-communications and so follow the corresponding exit-on-connection-loss strategy (i.e., ALL the vats kill themselves). When the kernel restarts, it can assume there are no vats running and that part of its job will be to restart all the vat processes. This seems like the cleanest approach, since in this world we don't have to make the kernel responsible for surveying its environment on startup to determine what other processes are running. However, it does raise one dangling issue that I'm not entirely sure how to deal with: what do we do if the kernel dies, restarts, and attempts to restart the vats, all before some vat processes have noticed and are therefor still running? I think some kind of incarnation number scheme may be able to deal with this; the main problem that I see is that the (old) running vat process still has possession of the vat's database. This may be the biggest unsolved problem in all of this. Another piece of making vat processes more autonomous is that alongside moving the vatstore into the vat's own database, I think we should also make the vat itself responsible for its snapshot and transcript stores. The jobs of loading from a snapshot and replaying from a transcript should be given to Liveslots rather than the kernel, with the kernel's role in vat restart being to initiate the operation rather than to orchestrate it. Another related open question is how aggressively a vat should record the communications that are sent to it. Since, in the simplest version of the scheme I'm proposing, deliveries are immediately streamed to vats, the vats could, in principle, durably record those messages as soon (more or less) as they are received, thus minimizing retransmission on restart (I believe, possibly mistakenly, that this is the approach Waterken follows). However, since we expect vat process failures and, therefor, restarts, to be very rare compared to normal, successful cranks, it's probably not the ideal strategy to optimize for the restart case. Further, since the communications pathway is likely an IPC connection rather than, say, a TCP session running over the internet, the communications channel is probably highly reliable as well. Thus it might make the most sense for a vat to record an incoming delivery at the same time it records the associated crank completion. I believe this would maximize speed. To my sensibilities the primary virtue of more aggressive message capture is to allow a vat's message backlog to pile up on disk rather than consuming RAM, which might minimize the need for more complicated backpressure schemes. Cross-vat event orderingIn the world described above there is no longer a component that corresponds to the kernel run queue per se, at least as it has been hitherto realized. Instead, each vat would have its own local input queue. However, the kernel does need to track not only the order of deliveries on a per-vat basis but the total order of deliveries across all vats, so as to maintain the illusion of synchronous execution of the (now entirely virtual) run queue. For each delivery that is issued to some vat, the kernel will expect to eventually receive a corresponding completion, and these completions must be made to appear to happen in the same order as the original deliveries were issued, even if they in fact arrive back at the kernel in an entirely different order. For example, if the kernel sends message #1 to vat A, then message #2 to vat B, then message #3 to vat A, but then vat A reports completions for messages #1 and #3 prior to when vat B reports completion for message #2, the kernel must nevertheless act on these completions in order 1, 2, 3. This is because acting on these completions may entail issuing additional deliveries. Message order must be preserved so that even if kernels run by different validators see completions for the same message in different orders from each other, the chain consensus will still reflect a common ordering. I'm sure we will need some kind of interesting kernel data structure for this, which I look forward to programming. Preventing causal lookaheadIf in some cases deliveries can get pipelined to a vat more quickly than it can process them, the vat has the opportunity to look into the future of its message stream prior to handling any particular message. This is almost certainly a thing we don't want to allow. However, we already trust Liveslots to represent the kernel's interests even though it runs on the vat side of the kernel/vat process boundary. One of the things that Liveslots will have to do is ensure that a given crank execution releases agency prior to the next delivery being given to user code. This is probably the path of least resistance for the Liveslots implementation anyway (i.e., it's how things would likely end up working even if we weren't paying attention), but we should nevertheless remain aware of the need to maintain this invariant, especially if we start getting very clever in our efforts to speed things up. |
That analysis seem mostly in line with my own understanding.
That is a very interesting suggestion. We had already considered 2 types of devices: readonly data (e.g. bundles) and write-only (async). This would effectively give us back read-write devices under certain conditions. We could start with devices that are only ever allowed to be in the clist of at most one vat before going onto a more complicated dance of acquiring a "lock" (which could be layered on top of single vat ownership).
Strongly agreed, except for the detail that I see "liveslots" as the piece of the vat logic that runs inside the JS engine executing the vat's user code, and as such would not be the party responsible for handling that part of the vat state. I do see the need for some vat logic to run independently of and handle communication with the kernel. Where this code executes is TBD. It would be great to have it run in the same process as the JS engine running the vat code to avoid further transmission overheads (this is one reason I'm somewhat looking at running XS as a wasm module).
I think it may not be as simple. The main problem is that the vat and kernel state would be in 2 separate processes, with their own independent DB connections (and likely independent DB themselves). As such there is no atomic commit point. We basically have to deal with hangover inconsistency between the kernel and the vat commit points.
I think this is mostly the concern of #5025. In that world I don't believe there should remain a presumptive total ordering of messages. There must be a deterministic effective ordering, but it does not have to be dependent on the relative order of deliveries between vats.
I would argue that order "1, 3, 2" or "2, 1, 3" are just as valid. The important part is that this order must be deterministic between validators. In the way I see it, the host embedding swingset will have some input in deciding the effective order between vats. |
Indeed, both orders are arguably valid, but in order to achieve determinism we need an objective rule for all validators to follow that everybody can understand, and matching delivery order seems like a good, clean one. But my main thinking was that this scheme makes the overt behavior of the parallelized swingset match the current implementation, so that the apparent nature of causality (as exhibited in slogs and other historical records) does not change at the point when we switch over to a parallelized implementation. It also allows us to swap in a non-parallelized implementation for debugging purposes without disruption. |
I expect the switch away from a global run queue and to per vat queues to happen first before this parallel execution change. That said a mode to have the actual vat executions be globally sequential and match the kernel/host deterministically chosen order of processing the vats input/output queues is indeed valuable for debugging and likely a matter of local configuration (given we take this as a requirement in our implementation). Edit: To summarize my point of view, I expect that any observable changes to the processing order of messages be a consequence of moving to per vat queues, and that subsequently switching to a parallelized execution does not introduce further changes to the order of message processing. |
Note to self: the VBANK_GRAB device invocation returns a balance of some sort (maybe of the account being modified?). I don't know if we use the result, but if so, that might be a device read which interferes with this scheme, and we might need to find a way to do without the result. e.g.
|
@warner all these go through the bridge device, and the chain storage vat, which means they're already async and in a different crank. In #6741 I make the bridge handling fully async so that the bridge outbound can return a result in a separate block. TLDR all bridge devices usages are compatible with background execution, even the ones that return a result |
TIL about SQLite support for concurrent transactions, documented at https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begin_concurrent.md It behaves as you'd expect: the second txn fails at COMMIT time if it overlaps with an earlier one, and "overlap" happens when both touch the same table or index at "nearby" keys (so in a large table, two modifications of random keys will probably not overlap, but modifying two sequential keys will almost certainly overlap). I'm not sure how we could use this, but it's worth remembering. If every vat's Or the queues with which the kernel and worker talk to each other could be placed in a shared DB file, with separate tables for the kernel-to-worker and the worker-to-kernel directions. In that world, the netstring pipe might only be used to trigger the other side to read from the incoming table (OTOH updating the "tail" pointer, to delete messages, would require more coordination, not unlike what the real comms vat mailbox protocol does: kernel tells worker how far it's read, worker deletes the stale entries when it gets a chance). The utility of that probably depends upon how exactly the atomicity domain should be shaped. If the kernel has its own DB, in which it remembers both "I've told the worker about delivery 4" and all the other changes that the kernel is tracking, then sending something to the worker may or may not have been received and committed by the worker. Or, if the kernel tracks this memory in the worker's DB, then the kernel might not have committed some related state by that point. Each extra DB means an extra set of interrupt-between-commits cases to analyze. |
I am skeptical of any parallelization scheme that involves going back to the kernel process for every vatStore operation. I would really like the interface between kernel and vat to be akin to a machine to machine channel, similar to CapTP |
One open question is how the kernel-side scheduler should work. There will be an in-consensus "kernel order", a sequence of events like "process kernel IO input event 4" (eg a timer or bridge update), "deliver (retire) vat 5 input-queue event 6", or "process vat 5 output-queue event 7". The vat input events have hopefully been executed earlier, but this is the point at which we demand their results, and any replica which didn't manage to execute that input by now will be forced to block until it is complete. @erights and I talked today about whether we'd benefit from sampled non-determinism instead of an algorithmically in-consensus scheme. The latter would mean all replicas would share some policy, by which they would decide (independently) which input/output events must be processed/retired when. The former would mean that the block proposer would decide, perhaps by a special kind of transaction. There would be exactly one txn of this type in each block, and it would dictate which events must be processed, and in which order. One possible benefit is that the order might be more realistic: the block proposer could sample their own workers, find out which deliveries had been processed already, and then declare that set to be the official one. If other replicas had similar hardware, and their local (non-deterministic) worker schedulers had made similar progress, then they would also have the same deliveries done by "background" execution by the time the block needed to be executed, and they could vote immediately, instead of stalling for "foreground" execution to catch up. Of course, this means the block proposer is in a favored position. Also, we must guard against the proposer asking for the impossible. Replicas should not vote for a block that demands they retire invalid deliveries. But I think this introduces some lock-step-ness into the schedule: if block-A executes delivery-1, and delivery-1 creates delivery-2, then the proposer of block-A might demand the retirement of both delivery-1 and delivery-2. But a replica only knows about delivery-1 so far (it won't discover -2 until it executes -1, and it hasn't pre-executed that far yet). So the replica cannot vote for the proposal until it has executed delivery-1 and become convinced that delivery-2 could be executed, otherwise it might be approving a chain-halting action. And it cannot vote against the proposal until it knows for sure that delivery-2 can never exist (perhaps because some alternative delivery got enqueued with the same event number). So I'm not sure how to safely take advantage of this. |
This ticket documents a design that should allow swingset to execute deliveries on multiple vats in parallel, continuously (even while the kernel is not inside a
controller.run()
, such as during the 5 seconds that Tendermint/Cosmos-SDK spends doing voting/consensus work), while still maintaining a consensus order of deliveries. We'd like this to improve performance over our current scheme, which only executes one vat delivery at a time, single-threaded, and only withincontroller.run
.The first section defines the "CDP" model, upon which the rest is based. The second section describes how to rewrite the vat worker to fit this model, and how DB commits are arranged to avoid "hangover inconsistency".
CDPs: Communicating Deterministic Processes
This section defines a "CDP" (Communicating Deterministic Process). This is a refinement of CSP (where the S means "Sequential") and the Waterken model, designed to tolerate processes being terminated at unpredictable times, but also to make more opportunities for parallelism.
A CDP is an abstract process which evolves through a sequence of "steps", each of which accepts an input and produces an output. At its core, the CDP is defined by a completely deterministic transition function (of input and previous state) which produces an output and the new state.
The CDP might exist as a standalone process, or it might share a process with another CDP. Each CDP has it's own commit points, and interact in such a way that one CDP can crash without undermining the assumptions and reliances that other CDPs might have on it.
All blockchains with finalization (e.g. Cosmos-SDK -based chains) behave like a CDP, in which each step is called a "block" or "blockHeight", and the inputs are the contents of the finalized block (mostly the set of transactions to be executed). In the Agoric system, the host application (cosmic-swingset / agd) acts as one CDP, the swingset kernel acts as a second (because of the separate kerneldb commit point), and there are additional CDPs for each of the N active vats.
State vs Runtime Incarnation
We split the CDP into "state" and "runtime". The state is long-lived, created when the CDP is initialized, and surviving until we really don't need it anymore. The state is recorded in a database, and each commit records one or more steps (never a partial step). The state can only be changed by an active runtime.
The runtime is a live OS process (either standalone or embedded in some other process, possibly shared with other CDPs). At any moment, each CDP has either exactly one active runtime (the CDP is currently "online"), or zero ("offline"). CDPs may be kept offline if they are not currently in use (because keeping one around costs some RAM even if nobody talks to it), or they may be preemptively brought online (because starting one costs some CPU time and latency, slowing down response time). All the API calls are made on a runtime, so offline CDPs cannot be manipulated.
Each time a runtime is created, we get a new "runtime incarnation". An CDP's current incarnation may remember slightly more than its persistent state, e.g. steps that it has performed but which have not yet been committed. When an incarnation is shut down or killed, the next incarnation may not remember those steps.
Inputs and Ouptuts
The
submitInput(stepNum, input)
method is used to submit a step input, andcollectOutput(stepNum)
is used to collect a step output. The type/shape of the inputs and outputs are specific to the CDP and its state transformation function.submitInput()
does not block, returns nothing, and is not obligated to execute the given step right away. It can be called multiple times (without interveningcollectOutput
calls) to fill the execution pipeline, allowing the CDP to perform work in the background or in parallel with other processes. Within any single incarnation (and barring arevert
, described below), the sequence ofsubmitInput()
calls must use sequentialstepNum
values: no gaps, no repeats.collectOutput(stepNum)
is an async function whose return Promise does not resolve until the given step has been executed. If the CDP managed to process the requested step in the background,collectOutput
may resolve immediately, otherwise the caller must wait until enough steps have been executed to produce the requested answer. It is an error to collect an output of astepNum
for which the CDP has never been given an input (although the input may have been given to an earlier incarnation). As withsubmitInput
, within a single incarnation, the outputs must be collected from sequentialstepNum
values, with no gaps or repeats.In general, the caller should provide inputs as soon as they are known (and well before the outputs are needed), to keep the pipeline full and maximize the opportunities for efficient parallel execution. If each
submitInput()
is followed by an immediateawait collectOutput()
for the samestepNum
, there will be few opportunities for parallelism.In the Agoric system, the kernel CDP only performs one step (block) at a time, but vats may be able to execute many steps (deliveries) in parallel.
Replayed Steps
When a CDP runtime is brought online, its DB will remember some number of inputs, and some number of outputs, based upon which steps had been completed before the last
commit()
performed by its predecessor.The caller is using a different database, with different commit points. So the caller may have submitted an input and collected and output for e.g. step 5, but then crashed before it was able to perform its own commit. When the caller starts back up again, it will re-submit the input for step 5. This is called a "replayed input", and CDPs must tolerate them.
There are two cases. In the first, the CDP received
submitInput(5, input5)
but crashed before acommit()
. In this case, the second incarnation can trivially tolerate the replay, because it does not remember the original. We rely upon the caller submitting exactly the same input, to avoid confusion.In the other, the CDP did
commit()
after receivinginput5
, and perhaps others. We call the highest such committed input "highestSubmittedInputStep". We track another number named "highestRetiredStep" (whereretire
is described below).When a CDP runtime is started, the new incarnation's first
submitInput()
call will accept anystepNum
in a range from "highestRetiredStep" (exclusive) to "highestSubmittedInputStep" (inclusive). The secondsubmitInput()
call must use astepNum
exactly one higher than the first one, etc.If the CDP detects that a
submitInput()
call is replaying an input, it refrains from actually executing the step. The "next expected stepNum" counter is incremented (or initialized), but no other state changes are made. The caller cannot tell, but CDP merely "pretends" to perform the step.A CDP should ideally have a way to compare the replayed inputs it receives against their originals, to throw an error if they diverge. This might help detect errors made by the caller. A single hash of the input would be sufficient to detect divergence and throw an error, however to actually diagnose the problem, it would be better to remember the complete input (so the error message can display both original and the faulty replay). However given that a replay of an uncommitted original cannot be detected by the CDP, it's not worth putting too much energy or code into remembering the originals.
Likewise, the caller may ask for
collectOutput(7)
from one incarnation, then crash before persisting the results. When the caller restarts, it creates a new CDP runtime incarnation, and asks it (again) forcollectOutput(7)
. The CDP must return the same outputs that its predecessor did. As above, the firstcollectOutput()
call that the new incarnation receives is allowed to use astepNum
that meetshighestRetiredStep < stepNum <= highestSubmittedInputStep
, and all successive calls within that incarnation must use the next highest sequentialstepNum
. The CDP also trackshighestCollectedOutputStep
.This means that each time a step is executed, the CDP store the full output into its DB, so it can report a consistent+stable value to the caller (now, soon, or from some future incarnation).
Commit
To improve performance, the CDP is not obligated to commit its DB after every single step. It must never commit the partial states that occur in the middle of a step, but it is allowed to commit at any point between two complete steps.
But to enable the caller to rely upon the stability of the outputs, the caller can insist upon a commit point, by calling the CDP's
commit()
API. This async method will not resolve until the state and outputs of every step up to and includinghighestCollectedOutputStep
has been committed. The CDP may have performed more work (whose output hasn't been collected yet): if so, these extra steps are committed too. But the caller is only allowed to rely upon the retention of the steps that it has collected.In general, the CDP should use DB transactions or nested transactions to ensure that only complete steps might be committed, but refrain from doing an actual commit until requested by
commit()
. If the RAM requirements of the uncommitted base transaction becomes an issue, it can perform an unsolicited commit without causing problems, but minimizing commits (and their slowfsync()
calls) will generally yield better performance.Note: this is specifically intended to take advantage of SQLite's fast "WAL" mode, with
PRAGMA synchronous = NORMAL
. In this mode, the DB only records changes when a transaction is closed (preventing partial-step commits), but the DB is vulnerable to power failures and host OS crashes until a "checkpoint" occurs. These checkpoints happen spontaneously when the WAL grows large enough (so we must tolerate them), but can be forced with aPRAGMA wal_checkpoint(FULL)
. The CDPcommit()
API should force a checkpoint before returning.Retire
To keep the CDP's "old-outputs storage" obligation bounded, the caller tells the CDP when it is safe to "retire" a step.
Once the caller collects the outputs of e.g. step 7, it will perform some other work, and then eventually commit its own state. Once the caller commits, it can use
retire(7)
to inform the CDP that it will never ask for the output of step 7 again (nor will it attempt to submit the inputs for step 7). The CDP can then delete the saved output for that step, as well as whatever diagnostic information it retained about the inputs (perhaps a hash, perhaps the full input data). The caller cannot safely retire a step until it has itself committed, otherwise a crash (after thecollectOutput(7)
but before the callers' commit) would violate the caller's promise. The CDP won't necessarily commit its memory of the retirement promptly (it is not obligated to do so until thecommit()
API is called), but since it is allowed to commit earlier, the caller must not reveal its intention toretire()
until it is really ready for the old data to go away.The caller might collect the outputs of multiple steps (7, 8, 9) before it gets to a convenient commit point. In that case, it only has to call
retire(9)
, and that will implicitly retire all earlier steps.A call to
retire(9)
will also sethighestRetiredStep = 9
. This value lives in the DB, but is not deserving of its own commit. The value will be committed as a side effect of the next step-execution orcommit()
-triggered DB commit.In general, the a non-pipelining caller will execute in a loop like the following (suppose the recorded
stepNum
in the first pass is 6, so the first input submitted is for step 7):stepNum
stepNum
for whichcollectOutput()
was called and processed)cdp.retire(stepNum)
stepNum += 1
cdp.submitInput(stepNum)
await collectOutput(stepNum)
await commit()
The caller/CDP might crash at various points in this loop:
submitInput(7)
but beforecollectOutput
: the CDP committed state may or may not include the execution ofstepNum
7, so the new caller incarnation will replay the old input, which will either be ignored (if the CDP did execute and commit it), or re-executed (if it did not manage to commit it)collectOutput
but beforecommit
: again the CDP committed state may or may not include that step 7, same cases as abovecommit
but before the caller commits: the CDP state definitely includes step 7, so the new caller incarnation will request a replay, which the CDP will ignore. Thecdp.commit()
might commit extra steps (if it managed to make process on some pipelined inputs) or might be a NOP.retire
: the CDP has definitely not been told it is safe to retire step 7, so it will still be in the CDP DB. The new caller incarnation will start by retiring step 7, then submitting inputs for step 8retire(7)
but beforesubmitInput(8)
: the change tohighestRetiredStep
may or may not have been committed, but the consequence is minimal: slightly higher DB usage until the next loop reaches a commit pointBy performing the
retire()
at the beginning of the loop, the caller will start each incarnation by clearing out the unnecessary data from its predecessor.Revert
Some CDPs are one-way, but others can support the
revert(backToStepNum)
operation. If supported, the caller can instruct the CDP to revert its internal state back to that of some previous step, at which point the caller is allowed to submit different inputs (and can observe different outputs) than before.revert
is generally expensive, and should only be used for exceptional cases. In the Agoric system, the primary example is an important vat whose latest delivery causes an internal fault, which we think we can repair through some sort of vat upgrade process. By reverting the vat to the state just before the fatal delivery, we can arrange for the next delivery to be something which changes/upgrades the internal state (to fix the bug). Another possibility is a metering fault (too much computation), where we don't want the vat to get away with using more computrons than it paid for, so we unwind the expensive delivery and require the vat to pay their compute bills before allowing that delivery to be repeated (note there are some questionable economics here, but the example is still useful). An Agoric vat worker would implementrevert
by discarding thexsnap
worker and building a new one from the most recent heap snapshot (which must have been made before the target step), and omitting a few deliveries from the end of the transcript it replays (everything past the target step). It would also delete the omitted transcript entries, so they don't accidentally reappear in some future incarnation.There is a limit to how many steps can be reverted:
backToStepNum > highestRetiredStep
, wherehighestRetiredStep
is updated by eachretire()
call. A revert-capable CDP must remember (or be capable of regenerating) all old states back tohighestRetiredStep + 1
. For the Agoric vat worker, that means we must retain at least one heap snapshot withstepNum <= highestRetiredStep + 1
(which might cause us to remember multiple heap snapshots for a single vat).API Summary
The library that provides a CDP with a given transition function will provide the following API:
initialize(stateDirectory, initialParameters)
: create an empty database in the given directory, then populate it with some initial state based upon the parametersruntime = createRuntime(stateDirectory)
: start a "runtime" object, on which other API calls can be made. Multiple runtimes may not exist simultaneously.runtime.submitInput(stepNum, input)
: submit input for eventual execution. Does not wait for execution. Does not return anything.runtime.collectOutput(stepNum) -> Promise<output>
: request the step to be executed promptly, and waits for the output to be returnedruntime.commit() -> Promise<void>
: request the worker commit at least all previously-collected steps to the DB before returning, so the caller can rely upon the results being present in a later incarnation. The worker is allowed to commit on any step boundary, but is not obligated to do so until thiscommit()
is invoked.runtime.retire(upToStepNum)
: inform the worker that the given step (and all previous steps) can be forgotten, so the worker can delete the state required to remember them. The actual deletion will not touch the persistent state until the next commit happens. Does not wait for commit, does not return anything.runtime.revert(backToStepNum) -> Promise<void>
: tell the worker to revert tobackToStepNum
and commit immediately. The resulting worker will behave as if the last calls it saw weresubmitInput(stepNum)
,collectOutput(stepNum)
, andcommit()
, thus forgetting about previous calls for>= stepNum+1
retire(upToStepNum)
, they may not userevert()
beyond that point:backToStepNum > upToStepNum
Swingset Kernel as a CDP
Now that we have the CDP framework, how are the various Agoric/Swingset layers expressed?
The cosmic-swingset layer is already pretty well suited. As a block is processed, our cosmos-sdk Handlers push "actions" like
bridgeInbound
anddeliverInbound
(defined in https://github.com/Agoric/agoric-sdk/blob/master/packages/cosmic-swingset/src/action-types.js) into a queue on the Golang side. When we get to the END_BLOCK event (https://github.com/Agoric/agoric-sdk/blob/master/packages/cosmic-swingset/src/launch-chain.js), the JS code submits these actions into the kernel, and then runs the kernel until it reaches therunPolicy
limit. As the kernel runs, it emits responses (chainSend
) like storage writes. Betweenchain-main.js
andlaunch-chain.js
, the cosmic-swingset package has code that tolerates one block's worth of replay, to handle the case where the kernel commits its DB but the process crashes before cosmos-sdk can commit its own.We can thus define cosmic-swingset's kernel as a non-reverting CDP whose "steps" are blocks, whose inputs are the "actions", and whose outputs are the
chainSend
messages. The host application performs steps without any sort of pipelining: eachsubmitInput
is immediately followed by a matchingawait collectOutput()
, then anawait commit()
. Theretire
method is omitted, and the kernel is implicitly allowed to forget everything beyond one step in the past.One constraint is that the outputs/
chainSend
calls are not allowed to have return values. The most likely example would be a chain-storage read, however I don't think we currently do any of those (there might be a method to return the externally-queryable chain-storage IAVL/RPC path: we'd need to implement that in different way).Swingset Vat as a CDP
To take advantage of the parallelism offered by CDPs, we need to make some changes to the kernel/worker boundary:
vatstore
, inside that DBsyscall.vatstoreGet
/etc are no longer routed to the kernel, but are handled locally by the workerWith
vatstoreGet
/vatstoreGetNext
/etc out of the way,syscall.callNow
(i.e. device-node invocation) is the only remaining syscall with a non-trivial return value. In practice, we have two kinds of device-node interaction:getBundle
readsThe chain-storage writes do not have a return value, so we really only need an answer for
getBundle
. I think we can manage this by following through on "blobcaps" (#46), which would be a new form of kref/vref (e.g.bNN
andvb-NN
), that refers to some sequence of immutable bytes. The kernel needs to pay attention to the vrefs being translated in a VatDeliveryObject and notice when a new one has been added to the c-list. That delivery should include a copy of the data, so the worker can stash it in its DB for later access. We'll probably need refcounts on these, or (given how infrequently we use them) we might just decide to declare that vat workers remember the blob forever.Once that's done, we can express each vat as a CDP where:
VatDeliveryObject
To satisfy other compatibility/stability requirements, I think we'll wind up with all of the vat worker's components (endo/lockdown, supervisor, liveslots) living in a single
swingset-vat-worker-xsnap-1
package. The kernel will no longer send the lockdown/supervisor bundles into the worker; instead they'll be bundled by the worker library itself and fed to xsnap as needed. The worker's DB will include any heap snapshots and transcripts that it needs to satisfy the CDP API, as well as additional calls to perform a "sleeper-agent" full-transcript-replay kind of upgrade (#1691).The
-1
suffix is a version identifier that combines everything which could influence the deterministic behavior. Once you've deployed a vat with the-1
worker, its behavior will remain stable despite any changes on the kernel side. No new versions ofswingset-vat-worker-xsnap-1
will change that behavior: at most they will fix/improve things that do not threaten determinism. To make any significant changes (upgrade liveslots, new version of XS, etc), we must release a new package namedswingset-vat-worker-xsnap-2
, and (baggage-style) upgrade vats to use it. The first swingset kernel package will depend upon the-1
package, the second can depend upon both-1
and-2
, the third can depend solely upon-2
(and all vat upgrades must be completed before you can upgrade a deployment from the second kernel package to the third).Kernel Scheduler
Our improved scheduling plans call for each vat to have an input queue (of deliveries), and an output queue (of delivery results, including message-like syscalls like
syscall.send
andsyscall.resolve
). The kernel will also have a single inbound queue for device input events likebridge-inbound
anddeliver-inbound
and something for the timer device. The kernel scheduler will decide which queue to service next, based on criteria like priority, how many messages a vat has emitted recently, how many messages have been delivered into the vat, etc. Hopefully we can find an algorithm that allows short multi-vat operations to complete fairly promptly, but larger ones to be scheduled fairly.The scheduler will be invoked multiple times during the processing of a block. The input will be the size of all the queues (2*numVats+1) and the computron budget remaining. The output will be which queue to service, and how many items to process. If the output is "none", the block ends and control returns to the host.
When a vat's input queue is serviced, the kernel performs a "delivery crank", whose nominal behavior is to pull the VDO from the head of the input queue and deliver it to the vat, which produces zero or more output events, all of which are pushed onto the tail of the output queue. When the output queue is serviced, the kernel performs a "routing crank", which examines the item and routes any
send
s to some other vat's input queue (or enqueues them on a promise queue in the kernelDB), and process anyresolve
s by updating the kernel promise table and maybe enqueueingnotify
events onto input queues (as well as transferring any queued messages to the new target vat). When the kernel's singular inbound queue is serviced, it also performs a routing crank, to route the message to some device or handling vat.The scheduler's decisions form a total ordering of all input/output events. This ordering is in-consensus, and each event should produce a hash (like the current
crankHash
) which can be folded into the shared state, so divergence is detected quickly. The total order is also broken up into blocks, and the kernel does not run (nor can it change state) between the blocks, which is when the consensus algorithm does its voting work.That ordering is too constrained to allow much room for performance improvements, but fortunately it is merely the kernel's required order. We can achieve parallel execution between vats, and also perform execution during the voting time, by allowing each vat to run independently, and only observing a partial order that fits within the kernel-wide total order. The vat charges ahead and executes as many deliveries as it can (out of the ones it has been given so far), and the kernel merely refrains from collecting the results until the appointed moment.
To take advantage of this, as soon as the routing crank pushes a VDO onto the tail of vat-A's input queue, it should immediately call
workerA.submitInput()
with the VDO, instead of waiting for it to reach the head of that queue. That allows the worker to execute the crank in the background, unsynchronized with anything else, well in advance of when the scheduler actually decides to process that item on the input queue. We might fiddle with thenice
settings on each worker to influence the OS kernel's execution priority, but in general we just want to fill each worker with work.Suppose VDOs 1/2/3 are pushed onto the input queue. All three have been
submitInput()
ed to the worker, but none have been collected. At some point, the scheduler will process the input queue and "deliver" VDO-1 to the vat. What it really does at that point is to callawait collectOutput(1)
, which tells the worker "ok now I really do need that execution", and blocks if the worker didn't manage to get that far yet. The output is pushed onto the output queue for later processing, however if we can process that output quickly (andsubmitInput()
the results to some other vat quickly), that will keep the pipelines as full as possible.There is a tradeoff between fairness (which wants to limit the servicing of a deep output-queue), and using parallelism to improve performance (by feeding more work to other vats). There is also a tradeoff between head-of-line latency (which is minimized by only doing one large-scale operation at a time) and performance (having lots of tasks available to keep the pipelines full).
At some point the kernel scheduler decides that we've done enough work for the block. The workers continue on in the background, but the kernek won't call
collectOutput()
again until the next block. Instead, the kernel callscommit()
(in parallel) on all workers with which it interacted during this block. That ensures that all the previoussubmitInput
andcollectOutput
data will not be forgotten by a worker which crashes in the meantime. Once all workers have completed their commit, the kernel can commit its own DB (which means it will never again submit or collect those steps). Then the kernel can provide results (chain-storage events, etc) back to the host application.When the kernel starts the next block (actually when sends the first message of the block to any given vat), it can send a
retire()
to enable the worker to free up the deliveries that are now safely committed by the kernel. (The kernel could do this as soon as it finishes its own commit, but it saves time on the critical path to defer this until after the host app has regained control).Rewinding Deliveries
Sometimes, the kernel must be able to rewind a delivery. The main use case we have so far is the
dispatch.stopVat
just before a vat upgrade: the old vat is stopped, the new vat (with new code) is started, but if the new vat fails to launch, we want to revert the old vat to the point just beforestopVat
, as if the upgrade were never requested. We can also imagine some forms of metering faults or invariant checks that would justify unwinding a delivery.The kernel won't know whether or not the delivery should be unwound until sometime after the delivery ouputs are collected. But the kernel can refrain from calling
retire()
until it knows we no longer need to roll back that far.The vat worker will support
revert(backToStepNum)
by retaining at least one heap snapshot withstepNum <= highestRetiredStep + 1
. When asked to revert, the worker will kill off any currentxsnap
process, delete all inputs and outputs beyondbackToStepNum
from its DB, commit the changes, then start up normally. This will load the most recent heap snapshot into a freshxsnap
and resume execution of the saved inputs. By this re-execution process "early" (before the now-deleted inputs), we wind up with a worker state equivalent to the old step, ready to proceed in a different direction.(Another option is to not restart the worker right away, and instead wait until the kernel scheduler decides to submit a new input or collect a previous output. However that might induce latency in a later block that could have been paid for during earlier idle time.)
Rewinding a worker is not cheap: it takes about as much work as evicting the worker (i.e. to limit concurrent memory consumption) and immediately restarting it. That expense must be considered when deciding how to implement higher-level fault management procedures.
Vat Upgrade, Worker Replacement
Our "baggage-style" vat upgrade process works by shutting down the old worker with a
stopVat()
delivery, to let it flush any remaining data to its vatstore DB, then launching a new worker with new code, and sending it astartVat()
to spin everything back up. The two workers have independent transcripts and run different code: any of SES/xsnap/supervisor/liveslots/vat-bundle/contract-bundle can be changed across the upgrade. Only the vat's identity and the vat store survive (plus all the kernel-side data like c-lists and responsibility for exported kernel objects/promises).To complicate matters, if either the
stopVat
orstartVat
fail, we want to abandon the upgrade, and rewind everything to the point just before thestopVat
. So the upgrade process must retain enough data to support the rewind.Within the CDP model, the CDP state is retained (with modifications), but the CDP runtime worker is replaced.
I think that means our worker must be aware of upgrade, and must be prepared to cooperate with a new worker version (
swingset-vat-worker-xsnap-2
). The two workers will share a SQLite DB, and the-2
version must not make any changes/migrations that would prevent the-1
from taking over, at least not until the new version is known to be successful.This suggests:
upgradeCount
)transcripts
DB table has bothupgradeCount
anddeliveryNum
columns, to retain old transcriptsDELETE FROM transcripts WHERE ugpradeCount=oldCount
once the ugprade is successful, or we might retain it for future sleeper-agent upgrades that need the full multi-version transcriptupgradeCount
: step numbers are cross-upgradestopVat
is delivered and collected, andcommit()
called, then the worker instance is shut downupgradeCount
in the DB and starts working with a new empty transcriptstartVat
and immediately waits for the outputcommit()
and a special "you can abandon the old version now" APIabandonUpgrade()
, which deletes the changes and returns the DB to the previous (post-v1-stopVat
) state, and shut down the v2 workerrevert
thestopVat
, and now we're back in the original statestartVat
must not commit any changes to the vatstore until we we're beyond the possibility of reverting to v1We may also want an API to report back the transcripts of each upgraded version, concatenated together, to facilite a sleeper-agent -style "deep replay" upgrade.
The text was updated successfully, but these errors were encountered: