Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(SwingSet): Don't send stopVat during upgrade #7244

Merged
merged 13 commits into from
Apr 4, 2023

Conversation

gibson042
Copy link
Member

@gibson042 gibson042 commented Mar 27, 2023

closes: #6650
closes: #7001

TODO:

  • Fix or abandon failing assertions relating to dropping imports

Description

Replaces upgrade-time stopVat with bringOutYourDead and leaves explanatory comments for the removed functionality. Also includes simplification and expanded coverage of some SwingSet tests.

Security and Scaling Considerations

We're knowingly accepting some storage leakage in exchange for not relying on correct behavior from potentially broken vats. The former issue will hopefully be cleaned up in the future as part of fixing issues like #7212.

Documentation Considerations

None not already covered as part of this PR.

Testing Considerations

I believe the tests in this PR are sufficient to cover the changes.

@gibson042 gibson042 requested a review from warner March 27, 2023 03:24
@gibson042 gibson042 marked this pull request as draft March 27, 2023 03:24
@codecov
Copy link

codecov bot commented Mar 27, 2023

Codecov Report

Merging #7244 (7b718d7) into master (8497399) will increase coverage by 0.29%.
The diff coverage is 19.48%.

❗ Current head 7b718d7 differs from pull request most recent head 0efcb48. Consider uploading reports for the commit 0efcb48 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7244      +/-   ##
==========================================
+ Coverage   71.04%   71.33%   +0.29%     
==========================================
  Files         450      450              
  Lines       86477    86126     -351     
  Branches        3        3              
==========================================
+ Hits        61434    61435       +1     
+ Misses      24975    24623     -352     
  Partials       68       68              
Impacted Files Coverage Δ
packages/SwingSet/src/kernel/kernel.js 67.63% <1.61%> (-0.04%) ⬇️
packages/SwingSet/src/kernel/state/kernelKeeper.js 94.49% <90.90%> (+0.09%) ⬆️
packages/swingset-liveslots/src/liveslots.js 93.76% <100.00%> (+1.69%) ⬆️

... and 3 files with indirect coverage changes

Impacted file tree graph

@datadog-full-agoric
Copy link

datadog-full-agoric bot commented Mar 27, 2023

Datadog Report

Branch report: gibson-7001-upgrade-vat-kernel-boyd
Commit report: 3e1cbed

agoric-sdk: 2 Failed (0 Known Flaky), 0 New Flaky, 3443 Passed, 57 Skipped, 26m 58.26s Wall Time

❌ Failed Tests (2)

  • upgrade › upgrade › vat upgrade - local without automatic GC - agoric.SwingSet - Details

    Expand for error
     ---
         name: AssertionError
         message: import-32 reachability after upgrade
         assertion: is
         values:
           'Difference (- actual, + expected):': |-
             - true
             + false
         at: |-
           verifyObjectTracking (packages/SwingSet/test/upgrade/test-upgrade.js:421:9)
     ...
    
  • upgrade › upgrade › vat upgrade - xsnap without automatic GC - agoric.SwingSet - Details

    Expand for error
     ---
         name: AssertionError
         message: import-32 reachability after upgrade
         assertion: is
         values:
           'Difference (- actual, + expected):': |-
             - true
             + false
         at: |-
           verifyObjectTracking (packages/SwingSet/test/upgrade/test-upgrade.js:421:9)
     ...
    

@gibson042 gibson042 force-pushed the gibson-7001-upgrade-vat-kernel-boyd branch from c1e09d0 to 8324ea2 Compare March 28, 2023 22:13
@gibson042 gibson042 marked this pull request as ready for review March 28, 2023 22:23
@gibson042
Copy link
Member Author

@warner Turns out the unexpectedly-reachable import was being sent back in by the post-upgrade scripted interactions. I've added clarifying comments, and this is ready for full review!

Copy link
Member

@warner warner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavior change to make if BOYD fails during upgrade, plus some minor nits. Nice job!

packages/SwingSet/src/kernel/kernel.js Outdated Show resolved Hide resolved
const boydStatus = await deliverAndLogToVat(vatID, boydKD, boydVD);
const boydResults = deliveryCrankResults(vatID, boydStatus, false);
(!boydResults.abort && !boydResults.terminate) ||
Fail`Unexpected abort/terminate result from upgrade-internal bringOutYourDead: ${boydResults}`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, in the old version, a hard-meter overrun or liveslots internal error during stopVat would have unwound the upgrade, leaving the vat in the pre-upgrade state (but also still receiving messages).

In this new version, the same happening during the final bringOutYourDead will panic the kernel, since errors thrown during this processUpgradeVat are in a different category than errors happening within the worker.

I'm trying to figure out how I feel about that. On one hand, kernel panics are bad. Userspace should never be able to prevent kernel progress. We tolerate slightly more from liveslots, to give it agency to perform end-of-crank work, but even then, a dropped promise or infinite loop in most of liveslots.js would either return an error or trip the hard meter (only dispatch() itself has the ability to stall the kernel forever), meaning the worst it can do is get itself terminated. This change grants an effective syscall.panicKernel() to liveslots by making an illegal syscall during the final BOYD.

On the second hand, this isn't directly extending that power to userspace. The worst case I can think of is if userspace managed to build up such a large collection of unreachable objects that the BOYD crank overran the hard-meter limit. Userspace could do that at any time, without this change, but previously that could only cause the vat to be terminated (and userspace can pull that off trivially with an infinite loop, or vatPowers.terminate()). With this change, there's a tiny window during upgrade when careful preparation could panic the kernel.

On the third hand, if something goes wrong during during upgrade, what can we do?

  • 1: unwind the upgrade
  • 2: terminate the vat
  • 3: panic the kernel

A vat-fatal error during upgrade means the vat was already in trouble, especially now that we're only doing BOYD and no other unusual deliveries (imagine a bug in the now-removed stopVat, which only broke upgrade, that would have been super annoying). If this BOYD kills the vat, it would have died soon anyways.

Although, if the vat is marked as critical and it terminates during some delivery (not necessarily the upgrade BOYD, nor any BOYD, just a regular message), the kernel will panic (to prevent the termination from being committed), and we're talking about waking up in a state where we've upgraded the kernel code to do something different, like upgrade the vat before allowing any other messages through. In that case, if our emergency ahead-of-the-line upgrade fails, we probably do want to re-panic the kernel, rather than re-delivering the fatal message, under the "death before confusion" rule.

That suggests that isCritical should play a role, or at least that any "error during upgrade causes vat termination" case should be amended with ".. unless isCritical, in which case panic the kernel".

If the vat is not marked isCritical, and we aren't in some emergency manually-driven-recovery situation, then what's the best way to handle a problem during upgrade? I've assumed we should just unwind it, because that covers the "version 2 is busted" case (which feels like the most likely one). Terminating the vat feels harsh, and panicing the kernel is clearly too much.

If the vat is marked isCritical but we're in a normal userspace-driven upgrade, then unwinding the upgrade and letting them try again with a better version feels appropriate. There are regular messages (already queued up) that will get delivered to the old version: upgrade is async from the POV of the parent vat, so we aren't violating any rules by not doing the upgrade at all (equivalent to deferring the upgrade forever).

So that leaves the emergency-manual-upgrade case. We'll have this code running (in the bulldozer release), then we imagine a kernel-panicing critical vat failure. We'll have some late nights frantically figuring out what the problem is, and how to recover from it. We'll have an opportunity to change the kernel code (as well as the rest of the release), then we'll deploy that release, and the kernel will wake up, and some special new code (that isn't being written today) will run, which will probably perform an ahead-of-the-queue upgrade of the vat that's in trouble. If that upgrade fails, then immediately panicing the kernel is probably best: if we didn't catch it testing, at least the validators will report failure as they try to come back up, and we'll spend another couple of late nights trying to figure out what went wrong.

I think this means that the upgrade-fails-now-what response wants to be "unwind" in most cases, and only "panic" if we're doing this emergency manual thing, and we haven't written the emergency manual thing yet. And we probably don't know enough to define that case yet. So rather than adding a half-baked isEmergencyUpgradeSoPanicIsBetter flag, let's leave a comment reminding ourselves of this analysis, and pointing out that some emergency upgrade situations might warrant panic-on-error. And then change this code to unwind the upgrade in case of error-during-BOYD, to match the previous behavior of error-during-stopVat.

That means retaining the old if (stopVatResults.terminate) clause, but looking at boydResults instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reasoning makes sense to me, and I've done my best to condense it into the source code. Please re-review the latest changes to kernel.js.

packages/swingset-liveslots/src/liveslots.js Outdated Show resolved Hide resolved
// This used to be MUCH more extensive, but GC was cut to the bone
// in commits like 91480dee8e48ae26c39c420febf73b93deba6ea5
// basically reverting 1cfbeaa3c925d0f8502edfb313ecb12a1cab5eac
// (see also #5342 and #6650).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding those notes. It'd also be nice to have a pointer to the git commit-ID from shortly before this lands, so exclude the various test changes that have happened since those two commits. Either cite this PR number (from which we can figure it out via github history), or sample current trunk's commit-ID (ec6a7f4eded916afb8c927638b4a699e2ffd99c9 as of this morning, which is close enough).

@gibson042 gibson042 added the automerge:no-update (expert!) Automatically merge without updates label Apr 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge:no-update (expert!) Automatically merge without updates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add BOYD to upgradeVat sequence remove stopVat(): allow vat upgrade without participation of old vat version
2 participants