-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(SwingSet): Don't send stopVat during upgrade #7244
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #7244 +/- ##
==========================================
+ Coverage 71.04% 71.33% +0.29%
==========================================
Files 450 450
Lines 86477 86126 -351
Branches 3 3
==========================================
+ Hits 61434 61435 +1
+ Misses 24975 24623 -352
Partials 68 68
|
Datadog ReportBranch report: ❌ ❌ Failed Tests (2)
|
c1e09d0
to
8324ea2
Compare
@warner Turns out the unexpectedly-reachable import was being sent back in by the post-upgrade scripted interactions. I've added clarifying comments, and this is ready for full review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Behavior change to make if BOYD fails during upgrade, plus some minor nits. Nice job!
const boydStatus = await deliverAndLogToVat(vatID, boydKD, boydVD); | ||
const boydResults = deliveryCrankResults(vatID, boydStatus, false); | ||
(!boydResults.abort && !boydResults.terminate) || | ||
Fail`Unexpected abort/terminate result from upgrade-internal bringOutYourDead: ${boydResults}`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, in the old version, a hard-meter overrun or liveslots internal error during stopVat
would have unwound the upgrade, leaving the vat in the pre-upgrade state (but also still receiving messages).
In this new version, the same happening during the final bringOutYourDead
will panic the kernel, since errors thrown during this processUpgradeVat
are in a different category than errors happening within the worker.
I'm trying to figure out how I feel about that. On one hand, kernel panics are bad. Userspace should never be able to prevent kernel progress. We tolerate slightly more from liveslots, to give it agency to perform end-of-crank work, but even then, a dropped promise or infinite loop in most of liveslots.js
would either return an error or trip the hard meter (only dispatch()
itself has the ability to stall the kernel forever), meaning the worst it can do is get itself terminated. This change grants an effective syscall.panicKernel()
to liveslots by making an illegal syscall during the final BOYD.
On the second hand, this isn't directly extending that power to userspace. The worst case I can think of is if userspace managed to build up such a large collection of unreachable objects that the BOYD crank overran the hard-meter limit. Userspace could do that at any time, without this change, but previously that could only cause the vat to be terminated (and userspace can pull that off trivially with an infinite loop, or vatPowers.terminate()
). With this change, there's a tiny window during upgrade when careful preparation could panic the kernel.
On the third hand, if something goes wrong during during upgrade, what can we do?
- 1: unwind the upgrade
- 2: terminate the vat
- 3: panic the kernel
A vat-fatal error during upgrade means the vat was already in trouble, especially now that we're only doing BOYD and no other unusual deliveries (imagine a bug in the now-removed stopVat
, which only broke upgrade, that would have been super annoying). If this BOYD kills the vat, it would have died soon anyways.
Although, if the vat is marked as critical and it terminates during some delivery (not necessarily the upgrade BOYD, nor any BOYD, just a regular message), the kernel will panic (to prevent the termination from being committed), and we're talking about waking up in a state where we've upgraded the kernel code to do something different, like upgrade the vat before allowing any other messages through. In that case, if our emergency ahead-of-the-line upgrade fails, we probably do want to re-panic the kernel, rather than re-delivering the fatal message, under the "death before confusion" rule.
That suggests that isCritical
should play a role, or at least that any "error during upgrade causes vat termination" case should be amended with ".. unless isCritical
, in which case panic the kernel".
If the vat is not marked isCritical
, and we aren't in some emergency manually-driven-recovery situation, then what's the best way to handle a problem during upgrade? I've assumed we should just unwind it, because that covers the "version 2 is busted" case (which feels like the most likely one). Terminating the vat feels harsh, and panicing the kernel is clearly too much.
If the vat is marked isCritical
but we're in a normal userspace-driven upgrade, then unwinding the upgrade and letting them try again with a better version feels appropriate. There are regular messages (already queued up) that will get delivered to the old version: upgrade is async from the POV of the parent vat, so we aren't violating any rules by not doing the upgrade at all (equivalent to deferring the upgrade forever).
So that leaves the emergency-manual-upgrade case. We'll have this code running (in the bulldozer release), then we imagine a kernel-panicing critical vat failure. We'll have some late nights frantically figuring out what the problem is, and how to recover from it. We'll have an opportunity to change the kernel code (as well as the rest of the release), then we'll deploy that release, and the kernel will wake up, and some special new code (that isn't being written today) will run, which will probably perform an ahead-of-the-queue upgrade of the vat that's in trouble. If that upgrade fails, then immediately panicing the kernel is probably best: if we didn't catch it testing, at least the validators will report failure as they try to come back up, and we'll spend another couple of late nights trying to figure out what went wrong.
I think this means that the upgrade-fails-now-what response wants to be "unwind" in most cases, and only "panic" if we're doing this emergency manual thing, and we haven't written the emergency manual thing yet. And we probably don't know enough to define that case yet. So rather than adding a half-baked isEmergencyUpgradeSoPanicIsBetter
flag, let's leave a comment reminding ourselves of this analysis, and pointing out that some emergency upgrade situations might warrant panic-on-error. And then change this code to unwind the upgrade in case of error-during-BOYD, to match the previous behavior of error-during-stopVat.
That means retaining the old if (stopVatResults.terminate)
clause, but looking at boydResults
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reasoning makes sense to me, and I've done my best to condense it into the source code. Please re-review the latest changes to kernel.js.
// This used to be MUCH more extensive, but GC was cut to the bone | ||
// in commits like 91480dee8e48ae26c39c420febf73b93deba6ea5 | ||
// basically reverting 1cfbeaa3c925d0f8502edfb313ecb12a1cab5eac | ||
// (see also #5342 and #6650). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding those notes. It'd also be nice to have a pointer to the git commit-ID from shortly before this lands, so exclude the various test changes that have happened since those two commits. Either cite this PR number (from which we can figure it out via github history), or sample current trunk's commit-ID (ec6a7f4eded916afb8c927638b4a699e2ffd99c9
as of this morning, which is close enough).
closes: #6650
closes: #7001
TODO:
Description
Replaces upgrade-time
stopVat
withbringOutYourDead
and leaves explanatory comments for the removed functionality. Also includes simplification and expanded coverage of some SwingSet tests.Security and Scaling Considerations
We're knowingly accepting some storage leakage in exchange for not relying on correct behavior from potentially broken vats. The former issue will hopefully be cleaned up in the future as part of fixing issues like #7212.
Documentation Considerations
None not already covered as part of this PR.
Testing Considerations
I believe the tests in this PR are sufficient to cover the changes.