Non traumatic major XS upgrades #7855
Comments
Minor nit:
That's not quite right. Liveslots sees organic gc but then does various things to hide it from user code.
Nope, liveslots no longer sees organic GC because we couldn't trust liveslots to correctly hide organic gc impacts from the kernel (in which syscalls are made). We have always trusted liveslots to hide all gc (organic or forced) from user code.
Then what are those uses of |
They are only cleared out during forced gc (bringOutYourDead and snapshots). See #6784 (comment) Edit: I updated the issue here to hopefully clarify the gc revealing story.
Thought: if a majority of a quorum of validators approves the results of a replay, the others could get the results via state sync rather than replaying themselves. If replays of different vats can be executed independently, you might be able to get some additional scaling by farming out different vats to different subsets of the validator population.
Unfortunately for consensus, we're in an all or nothing situation. A single validator needs to come up with all the right answers. There is no way to vote partially on the result.
Yeah, this would be something like a mainnet 4 thing, when we start branching off interweaving sub chains and whatnot for scaling. I could imagine entities bidding for which vats should get priority in upgrade much as we anticipate bidding for priority in message delivery.
A note that some changes to XS may end up having spec-mandated execution differences, which would be directly observable by the program. While unlikely, this highlights that a replay-based upgrade is not 100% foolproof, and that only an XS upgrade requiring a restart/upgrade of the vat is safe (see #8405). More details in #6929 (comment)
What is the Problem Being Solved?
#6361 describes conditions under which we can upgrade XS in a chain upgrade and have all vats use that version of XS going forward. The main expectation is that snapshots are at least compatible: the new version of XS can load a snapshot produced by the old version of XS and keep executing as previously recorded.
The main problem is incompatible snapshots, such as when a major or minor version update of XS occurs, or when new globals are implemented by XS. All the other requirements are believed to be possible already: the execution as seen by the transcript in newer versions of XS will be the same as what was recorded in the previous version.
#6361 references using multiple versions of XS (further defined in #6596) and performing vat upgrades to switch vats to the newer version. This issue explores an alternative that neither introduces any upgrade trauma nor requires distributing multiple versions of XS.
Pre-requisite knowledge on the current implementation
While liveslots's implementation still reveals organic gc, in #7498 (and its follow-up #7552) we've basically hidden organic gc from liveslots. In #7558 we make sure that the effects of snapshots (which perform a full forced gc) are not observable in transcripts after the snapshot is taken. We believe that together, these changes make our vat transcripts fully independent of any differences in engine allocation behavior.
In #7484 we introduced transcript entries that capture snapshot information (hashes) in the transcript. This makes the transcript somewhat dependent on the version of XS, but these are not actual deliveries, so they can be handled.
There is still the possibility that metering limits would cause a single crank to fail where it previously succeeded, but that is currently unlikely.
With the introduction of state sync (#7225), validators may not have the full transcript content of previous spans (between the latest incarnation start and the latest snapshot taken). However, the hashes of previous spans are kept in the swing-store to support repopulating these historical transcript entries.
Newer versions of XS may introduce new intrinsics. In general these new intrinsics should not impact code execution; however, our current SES version is sensitive to new well-known symbols (endojs/endo#1577), and thus would fail on new XS versions that add any unsupported symbols.
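To illustrate how a retained span hash lets a validator accept repopulated entries, here is a minimal sketch. The rolling SHA-256 scheme and the names `span_hash` and `repopulate_span` are assumptions made for illustration; the actual swing-store hashing and API may differ.

```python
import hashlib


def span_hash(entries: list[str]) -> str:
    """Fold transcript entries into one rolling SHA-256 digest.

    Hypothetical scheme: each step hashes the previous digest plus the entry.
    """
    digest = ""
    for entry in entries:
        digest = hashlib.sha256((digest + entry).encode("utf-8")).hexdigest()
    return digest


def repopulate_span(stored_hash: str, candidate_entries: list[str]) -> list[str]:
    """Accept externally supplied entries only if they match the stored hash."""
    if span_hash(candidate_entries) != stored_hash:
        raise ValueError("candidate entries do not match the stored span hash")
    return list(candidate_entries)
```

Under this assumption, a node that only kept the hash can fetch the historical entries from any source and still detect tampering.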
Description of the Design
The general idea is to rely on vat transcript replays to regenerate the snapshots and transcript span hashes.
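A rough sketch of that idea, under stated assumptions: `Entry`, `replay`, `deliver` and `take_snapshot` are hypothetical names, not the SwingSet API. Deliveries are re-executed on the new engine and must produce the recorded syscalls, while snapshot entries (which are not deliveries) are regenerated with the new engine's hash.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Entry:
    kind: str                       # "delivery" or "snapshot" (hypothetical kinds)
    payload: str                    # delivery body, or the recorded snapshot hash
    syscalls: tuple[str, ...] = ()  # syscalls recorded for a delivery


def replay(
    transcript: list[Entry],
    deliver: Callable[[str], Iterable[str]],
    take_snapshot: Callable[[], str],
) -> list[Entry]:
    """Replay a transcript on a new engine and regenerate snapshot entries."""
    regenerated = []
    for entry in transcript:
        if entry.kind == "snapshot":
            # The old hash depends on the old engine version, so replace it.
            regenerated.append(Entry("snapshot", take_snapshot()))
            continue
        observed = tuple(deliver(entry.payload))
        if observed != entry.syscalls:
            raise RuntimeError(f"replay diverged on delivery {entry.payload!r}")
        regenerated.append(entry)
    return regenerated
```

Raising on the first divergent delivery reflects the expectation that execution on the new engine matches what was recorded; any mismatch means the replay result cannot be used.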
We believe that validators are ok with an upgrade taking some reasonable amount of time to complete (on the order of multiple minutes, likely less than an hour). As such we may be able to perform at least part of this vat transcript replay during the upgrade, but we likely want to streamline the process by making it possible to preprocess some of the replay task.
Replay and regeneration of transcript
The regeneration process would be roughly as follows:
Offline pre-processing
This rough process allows doing partial replays of transcripts which can be later resumed. If applied as a pseudo-diff, it also allows the transcript to keep growing after being exported for offline processing:
Other replay considerations
To mitigate XS changes that impact the execution, it may be possible to change the lockdown or supervisor bundles used when replaying the vat (see #6929 for validation of new XS versions).
Security Considerations
All validators should perform these steps independently. If they share the "offline" data with each other, the chain is vulnerable to corruption. This is not too much of a concern as this process is verifiable.
Since the recomputed hashes would be captured in the swingstore export to the cosmos DB, a supermajority of validators must arrive at identical replay results for the upgrade to succeed.
Scaling Considerations
The replay of multiple vats can be performed in parallel to speed up the restart process.
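A minimal sketch of replaying vats concurrently, assuming each vat's replay is independent of the others. `replay_all` and `replay_one` are hypothetical names; a real implementation would more likely drive separate worker processes than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def replay_all(
    transcripts: dict[str, list[str]],
    replay_one: Callable[[list[str]], str],
    max_workers: int = 4,
) -> dict[str, str]:
    """Replay every vat's transcript in parallel and collect per-vat results."""
    vat_ids = sorted(transcripts)  # fixed order keeps the output deterministic
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with vat_ids.
        results = pool.map(lambda vat_id: replay_one(transcripts[vat_id]), vat_ids)
        return dict(zip(vat_ids, results))
```

Because the result for each vat depends only on that vat's transcript, the parallel run should yield the same mapping as a sequential one.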
The offline partial pre-processing reduces the time needed to replay during the actual upgrade.
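One way to picture resumable partial replay is a checkpoint that records how far the offline run got. This is a sketch under assumptions: `Checkpoint`, `advance`, the rolling hash and the integer `state` are hypothetical stand-ins for the real snapshot and transcript machinery.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


def _roll(digest: str, entry: str) -> str:
    """One step of a hypothetical rolling SHA-256 over transcript entries."""
    return hashlib.sha256((digest + entry).encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Checkpoint:
    position: int = 0      # number of entries already replayed
    prefix_hash: str = ""  # rolling hash of those entries
    state: int = 0         # stand-in for the engine state or snapshot


def advance(
    checkpoint: Checkpoint,
    transcript: list[str],
    apply: Callable[[int, str], int],
) -> Checkpoint:
    """Resume a replay from a checkpoint over a possibly longer transcript."""
    digest = ""
    for entry in transcript[: checkpoint.position]:
        digest = _roll(digest, entry)
    if digest != checkpoint.prefix_hash:
        raise ValueError("transcript prefix changed since the checkpoint")
    state, digest = checkpoint.state, checkpoint.prefix_hash
    for entry in transcript[checkpoint.position:]:
        state = apply(state, entry)
        digest = _roll(digest, entry)
    return Checkpoint(len(transcript), digest, state)
```

The prefix check is what lets the transcript keep growing after export: only entries appended since the checkpoint need replaying at upgrade time, and a checkpoint taken against a different prefix is rejected.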
Test Plan
TBD, but likely using the Docker-based upgrade testing framework, verifying various scenarios such as offline processing capturing partial (older) vat transcripts, or a vat being upgraded after the capture is made.