fix(SwingSet): Don't send stopVat during upgrade #7244
Hm, in the old version, a hard-meter overrun or liveslots internal error during `stopVat` would have unwound the upgrade, leaving the vat in the pre-upgrade state (but also still receiving messages).

In this new version, the same thing happening during the final `bringOutYourDead` will panic the kernel, since errors thrown during this `processUpgradeVat` are in a different category than errors happening within the worker.

I'm trying to figure out how I feel about that. On one hand, kernel panics are bad. Userspace should never be able to prevent kernel progress. We tolerate slightly more from liveslots, to give it agency to perform end-of-crank work, but even then, a dropped promise or infinite loop in most of `liveslots.js` would either return an error or trip the hard meter (only `dispatch()` itself has the ability to stall the kernel forever), meaning the worst it can do is get itself terminated. This change grants an effective `syscall.panicKernel()` to liveslots by making an illegal syscall during the final BOYD.

On the second hand, this isn't directly extending that power to userspace. The worst case I can think of is if userspace managed to build up such a large collection of unreachable objects that the BOYD crank overran the hard-meter limit. Userspace could do that at any time, without this change, but previously that could only cause the vat to be terminated (and userspace can pull that off trivially with an infinite loop, or `vatPowers.terminate()`). With this change, there's a tiny window during upgrade when careful preparation could panic the kernel.

On the third hand, if something goes wrong during upgrade, what can we do?

A vat-fatal error during upgrade means the vat was already in trouble, especially now that we're only doing BOYD and no other unusual deliveries (imagine a bug in the now-removed `stopVat` which only broke upgrade: that would have been super annoying). If this BOYD kills the vat, it would have died soon anyways.

Although, if the vat is marked as critical and it terminates during some delivery (not necessarily the upgrade BOYD, nor any BOYD, just a regular message), the kernel will panic (to prevent the termination from being committed), and we're talking about waking up in a state where we've upgraded the kernel code to do something different, like upgrade the vat before allowing any other messages through. In that case, if our emergency ahead-of-the-line upgrade fails, we probably do want to re-panic the kernel, rather than re-delivering the fatal message, under the "death before confusion" rule.

That suggests that `isCritical` should play a role, or at least that any "error during upgrade causes vat termination" case should be amended with ".. unless `isCritical`, in which case panic the kernel".

If the vat is not marked `isCritical`, and we aren't in some emergency manually-driven-recovery situation, then what's the best way to handle a problem during upgrade? I've assumed we should just unwind it, because that covers the "version 2 is busted" case (which feels like the most likely one). Terminating the vat feels harsh, and panicking the kernel is clearly too much.

If the vat is marked `isCritical` but we're in a normal userspace-driven upgrade, then unwinding the upgrade and letting them try again with a better version feels appropriate. There are regular messages (already queued up) that will get delivered to the old version: upgrade is async from the POV of the parent vat, so we aren't violating any rules by not doing the upgrade at all (equivalent to deferring the upgrade forever).

So that leaves the emergency-manual-upgrade case. We'll have this code running (in the bulldozer release), then we imagine a kernel-panicking critical vat failure. We'll have some late nights frantically figuring out what the problem is, and how to recover from it. We'll have an opportunity to change the kernel code (as well as the rest of the release), then we'll deploy that release, and the kernel will wake up, and some special new code (that isn't being written today) will run, which will probably perform an ahead-of-the-queue upgrade of the vat that's in trouble. If that upgrade fails, then immediately panicking the kernel is probably best: if we didn't catch it in testing, at least the validators will report failure as they try to come back up, and we'll spend another couple of late nights trying to figure out what went wrong.

I think this means that the upgrade-fails-now-what response wants to be "unwind" in most cases, and only "panic" if we're doing this emergency manual thing, and we haven't written the emergency manual thing yet. And we probably don't know enough to define that case yet. So rather than adding a half-baked `isEmergencyUpgradeSoPanicIsBetter` flag, let's leave a comment reminding ourselves of this analysis, and pointing out that some emergency upgrade situations might warrant panic-on-error. And then change this code to unwind the upgrade in case of error-during-BOYD, to match the previous behavior of error-during-stopVat.

That means retaining the old `if (stopVatResults.terminate)` clause, but looking at `boydResults` instead.
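For illustration only, a minimal sketch of that "unwind on error-during-BOYD" policy might look like the following. This is not the actual `kernel.js` code: the function name `chooseUpgradeStep` and the returned `{ action, reason }` shape are hypothetical, and the only thing taken from the discussion is that a truthy `terminate` field on `boydResults` signals a vat-fatal error.

```javascript
// Hypothetical sketch, NOT the real SwingSet kernel code.
// Assumption: `boydResults.terminate` is truthy when the final
// bringOutYourDead delivery hit a vat-fatal error (meter overrun,
// liveslots error, illegal syscall).
function chooseUpgradeStep(boydResults) {
  if (boydResults.terminate) {
    // Unwind the upgrade, leaving the vat in its pre-upgrade state,
    // matching the previous error-during-stopVat behavior.
    // NOTE: some future emergency-upgrade path (not yet written) might
    // warrant panicking the kernel here instead of unwinding.
    return { action: 'unwind', reason: boydResults.terminate };
  }
  // No vat-fatal error: continue with the rest of the upgrade.
  return { action: 'proceed' };
}

// Usage: a clean BOYD proceeds, a failed one unwinds.
console.log(chooseUpgradeStep({}).action); // proceed
console.log(chooseUpgradeStep({ terminate: 'meter overrun' }).action); // unwind
```

Keeping the panic case as a comment rather than a flag follows the suggestion above of not defining the emergency path before it is understood.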
This reasoning makes sense to me, and I've done my best to condense it into the source code. Please re-review the latest changes to kernel.js.