fix(SwingSet): Don't send stopVat during upgrade #7244

gibson042 · 2023-03-27T03:24:35Z

closes: #6650
closes: #7001

TODO:

Fix or abandon failing assertions relating to dropping imports

Description

Replaces upgrade-time stopVat with bringOutYourDead and leaves explanatory comments for the removed functionality. Also includes simplification and expanded coverage of some SwingSet tests.

Security and Scaling Considerations

We're knowingly accepting some storage leakage in exchange for not relying on correct behavior from potentially broken vats. The former issue will hopefully be cleaned up in the future as part of fixing issues like #7212.

Documentation Considerations

None not already covered as part of this PR.

Testing Considerations

I believe the tests in this PR are sufficient to cover the changes.

codecov · 2023-03-27T03:28:26Z

Codecov Report

Merging #7244 (7b718d7) into master (8497399) will increase coverage by 0.29%.
The diff coverage is 19.48%.

❗ Current head 7b718d7 differs from pull request most recent head 0efcb48. Consider uploading reports for the commit 0efcb48 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7244      +/-   ##
==========================================
+ Coverage   71.04%   71.33%   +0.29%     
==========================================
  Files         450      450              
  Lines       86477    86126     -351     
  Branches        3        3              
==========================================
+ Hits        61434    61435       +1     
+ Misses      24975    24623     -352     
  Partials       68       68

Impacted Files	Coverage Δ
packages/SwingSet/src/kernel/kernel.js	`67.63% <1.61%> (-0.04%)`	⬇️
packages/SwingSet/src/kernel/state/kernelKeeper.js	`94.49% <90.90%> (+0.09%)`	⬆️
packages/swingset-liveslots/src/liveslots.js	`93.76% <100.00%> (+1.69%)`	⬆️

... and 3 files with indirect coverage changes

datadog-full-agoric · 2023-03-27T03:33:31Z

Datadog Report

Branch report: gibson-7001-upgrade-vat-kernel-boyd
Commit report: 3e1cbed

❌ agoric-sdk: 2 Failed (0 Known Flaky), 0 New Flaky, 3443 Passed, 57 Skipped, 26m 58.26s Wall Time

❌ Failed Tests (2)

upgrade › upgrade › vat upgrade - local without automatic GC - agoric.SwingSet

 ---
     name: AssertionError
     message: import-32 reachability after upgrade
     assertion: is
     values:
       'Difference (- actual, + expected):': |-
         - true
         + false
     at: |-
       verifyObjectTracking (packages/SwingSet/test/upgrade/test-upgrade.js:421:9)
 ...

upgrade › upgrade › vat upgrade - xsnap without automatic GC - agoric.SwingSet

 ---
     name: AssertionError
     message: import-32 reachability after upgrade
     assertion: is
     values:
       'Difference (- actual, + expected):': |-
         - true
         + false
     at: |-
       verifyObjectTracking (packages/SwingSet/test/upgrade/test-upgrade.js:421:9)
 ...

Fixes #7001

Fixes #6650

gibson042 · 2023-03-28T22:26:00Z

@warner Turns out the unexpectedly-reachable import was being sent back in by the post-upgrade scripted interactions. I've added clarifying comments, and this is ready for full review!

warner

Behavior change to make if BOYD fails during upgrade, plus some minor nits. Nice job!

packages/SwingSet/src/kernel/kernel.js

warner · 2023-03-30T18:25:44Z

packages/SwingSet/src/kernel/kernel.js

+    const boydStatus = await deliverAndLogToVat(vatID, boydKD, boydVD);
+    const boydResults = deliveryCrankResults(vatID, boydStatus, false);
+    (!boydResults.abort && !boydResults.terminate) ||
+      Fail`Unexpected abort/terminate result from upgrade-internal bringOutYourDead: ${boydResults}`;


Hm, in the old version, a hard-meter overrun or liveslots internal error during stopVat would have unwound the upgrade, leaving the vat in the pre-upgrade state (but also still receiving messages).

In this new version, the same happening during the final bringOutYourDead will panic the kernel, since errors thrown during this processUpgradeVat are in a different category than errors happening within the worker.
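The guard in the diff above can be illustrated with a small, self-contained sketch. This is not the kernel's code: `Fail` here is a local stand-in for the throwing template tag used in the diff, and `checkBoydResults` is a hypothetical name introduced only for this example. The point it shows is that a terminate result becomes a thrown exception on the kernel side rather than an error reported by the worker.

```javascript
// Local stand-in for a throwing template tag (assumption: the real `Fail`
// throws an Error built from the template; its details are omitted here).
const Fail = (strings, ...values) => {
  const message = strings.reduce(
    (acc, s, i) =>
      acc + s + (i < values.length ? JSON.stringify(values[i]) : ''),
    '',
  );
  throw Error(message);
};

// Hypothetical wrapper with the same shape as the guard in the diff:
// throw unless the BOYD delivery ended without an abort/terminate result.
const checkBoydResults = boydResults => {
  (!boydResults.abort && !boydResults.terminate) ||
    Fail`Unexpected abort/terminate result from upgrade-internal bringOutYourDead: ${boydResults}`;
  return boydResults;
};
```

Because the exception is raised by the caller rather than reported by the worker, nothing in this sketch distinguishes "the vat misbehaved" from "the kernel is broken", which is the concern raised here.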

I'm trying to figure out how I feel about that. On one hand, kernel panics are bad. Userspace should never be able to prevent kernel progress. We tolerate slightly more from liveslots, to give it agency to perform end-of-crank work, but even then, a dropped promise or infinite loop in most of liveslots.js would either return an error or trip the hard meter (only dispatch() itself has the ability to stall the kernel forever), meaning the worst it can do is get itself terminated. This change effectively grants liveslots a syscall.panicKernel(): it could panic the kernel by making an illegal syscall during the final BOYD.

On the second hand, this isn't directly extending that power to userspace. The worst case I can think of is if userspace managed to build up such a large collection of unreachable objects that the BOYD crank overran the hard-meter limit. Userspace could do that at any time, without this change, but previously that could only cause the vat to be terminated (and userspace can pull that off trivially with an infinite loop, or vatPowers.terminate()). With this change, there's a tiny window during upgrade when careful preparation could panic the kernel.

On the third hand, if something goes wrong during upgrade, what can we do?

1: unwind the upgrade

2: terminate the vat

3: panic the kernel

A vat-fatal error during upgrade means the vat was already in trouble, especially now that we're only doing BOYD and no other unusual deliveries (imagine a bug in the now-removed stopVat, which only broke upgrade, that would have been super annoying). If this BOYD kills the vat, it would have died soon anyways.

Although, if the vat is marked as critical and it terminates during some delivery (not necessarily the upgrade BOYD, nor any BOYD, just a regular message), the kernel will panic (to prevent the termination from being committed), and we're talking about waking up in a state where we've upgraded the kernel code to do something different, like upgrade the vat before allowing any other messages through. In that case, if our emergency ahead-of-the-line upgrade fails, we probably do want to re-panic the kernel, rather than re-delivering the fatal message, under the "death before confusion" rule.

That suggests that isCritical should play a role, or at least that any "error during upgrade causes vat termination" case should be amended with ".. unless isCritical, in which case panic the kernel".

If the vat is not marked isCritical, and we aren't in some emergency manually-driven-recovery situation, then what's the best way to handle a problem during upgrade? I've assumed we should just unwind it, because that covers the "version 2 is busted" case (which feels like the most likely one). Terminating the vat feels harsh, and panicking the kernel is clearly too much.

If the vat is marked isCritical but we're in a normal userspace-driven upgrade, then unwinding the upgrade and letting them try again with a better version feels appropriate. There are regular messages (already queued up) that will get delivered to the old version: upgrade is async from the POV of the parent vat, so we aren't violating any rules by not doing the upgrade at all (equivalent to deferring the upgrade forever).

So that leaves the emergency-manual-upgrade case. We'll have this code running (in the bulldozer release), then we imagine a kernel-panicking critical vat failure. We'll have some late nights frantically figuring out what the problem is, and how to recover from it. We'll have an opportunity to change the kernel code (as well as the rest of the release), then we'll deploy that release, and the kernel will wake up, and some special new code (that isn't being written today) will run, which will probably perform an ahead-of-the-queue upgrade of the vat that's in trouble. If that upgrade fails, then immediately panicking the kernel is probably best: if we didn't catch it in testing, at least the validators will report failure as they try to come back up, and we'll spend another couple of late nights trying to figure out what went wrong.

I think this means that the upgrade-fails-now-what response wants to be "unwind" in most cases, and only "panic" if we're doing this emergency manual thing, and we haven't written the emergency manual thing yet. And we probably don't know enough to define that case yet. So rather than adding a half-baked isEmergencyUpgradeSoPanicIsBetter flag, let's leave a comment reminding ourselves of this analysis, and pointing out that some emergency upgrade situations might warrant panic-on-error. And then change this code to unwind the upgrade in case of error-during-BOYD, to match the previous behavior of error-during-stopVat.

That means retaining the old if (stopVatResults.terminate) clause, but looking at boydResults instead.
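The conclusion above can be condensed into a rough sketch. Everything in it is hypothetical: `chooseUpgradeFailureResponse`, `finishUpgradeBoyd`, and the callback signatures are invented for illustration, and `isEmergencyUpgrade` is exactly the kind of flag this review recommends not adding yet. It only makes the two cases visible side by side.

```javascript
// Hypothetical policy summary: unwind by default; panic only in the
// not-yet-designed emergency manual upgrade case.
const chooseUpgradeFailureResponse = ({ isEmergencyUpgrade = false } = {}) =>
  isEmergencyUpgrade ? 'panic' : 'unwind';

// Hypothetical shape of the suggested change: inspect the BOYD result the
// way the old code inspected the stopVat result, and unwind the upgrade
// instead of throwing. The callbacks are placeholders, not kernel APIs.
const finishUpgradeBoyd = (boydResults, { abortUpgrade, continueUpgrade }) => {
  if (boydResults.terminate) {
    // Leave the vat in its pre-upgrade state and report why.
    return abortUpgrade(boydResults.terminate);
  }
  return continueUpgrade();
};

// Usage: a terminate result from the final BOYD unwinds the upgrade.
const outcome = finishUpgradeBoyd(
  { terminate: { info: 'hard meter overrun' } },
  {
    abortUpgrade: cause => ({ upgraded: false, cause }),
    continueUpgrade: () => ({ upgraded: true }),
  },
);
console.log(chooseUpgradeFailureResponse(), outcome.upgraded);
```

Keeping the policy as a comment rather than a flag, as suggested, avoids committing to an emergency-upgrade design before that case is understood.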

This reasoning makes sense to me, and I've done my best to condense it into the source code. Please re-review the latest changes to kernel.js.

packages/swingset-liveslots/src/liveslots.js

warner · 2023-03-30T19:03:50Z

packages/SwingSet/test/upgrade/test-upgrade.js

+  // This used to be MUCH more extensive, but GC was cut to the bone
+  // in commits like 91480dee8e48ae26c39c420febf73b93deba6ea5
+  // basically reverting 1cfbeaa3c925d0f8502edfb313ecb12a1cab5eac
+  // (see also #5342 and #6650).


Thanks for adding those notes. It'd also be nice to have a pointer to the git commit-ID from shortly before this lands, to exclude the various test changes that have happened since those two commits. Either cite this PR number (from which we can figure it out via github history), or sample current trunk's commit-ID (ec6a7f4eded916afb8c927638b4a699e2ffd99c9 as of this morning, which is close enough).

test(SwingSet): Clean up testUpgrade

aa6518b

gibson042 requested a review from warner March 27, 2023 03:24

gibson042 marked this pull request as draft March 27, 2023 03:24

gibson042 added 7 commits March 28, 2023 18:10

test(SwingSet): Improve testUpgrade GC coverage

a82124c

refactor(SwingSet): Use natural sort in kernel.dump() output ordering

03f8518

test(SwingSet): Add coverage for kernel delivery of vat-upgrade BOYD

95e6215

fix(SwingSet): Add BOYD to vat upgrade

be73588

Fixes #7001

test(SwingSet): Use new options to simplify a test

61d235f

fix(SwingSet): Don't send stopVat during upgrade

5cc47d2

Fixes #6650

test(SwingSet): Serialize some test execution to dodge incomplete V8 GC

8324ea2

gibson042 force-pushed the gibson-7001-upgrade-vat-kernel-boyd branch from c1e09d0 to 8324ea2 Compare March 28, 2023 22:13

gibson042 marked this pull request as ready for review March 28, 2023 22:23

gibson042 mentioned this pull request Mar 29, 2023

[LOW PRIORITY] chore: Improve liveslots-related types and documentation #7265

Open

warner requested changes Mar 30, 2023

gibson042 added 4 commits April 4, 2023 13:32

refactor(SwingSet): Introduce an abortUpgrade helper

8998e4a

fix(SwingSet): Abort-and-unwind vat upgrade upon any internal-deliver…

aed1282

…y failure cf. #7244 (comment)

test(SwingSet): Add a PR reference for future recovery of tests

0fbb5c4

chore(swingset-liveslots): Replace the stopVat error with a log message

7b718d7

warner approved these changes Apr 4, 2023

gibson042 added the automerge:no-update (expert!) Automatically merge without updates label Apr 4, 2023

Merge branch 'master' into gibson-7001-upgrade-vat-kernel-boyd

0efcb48

mergify bot merged commit 954b3dc into master Apr 4, 2023

mergify bot deleted the gibson-7001-upgrade-vat-kernel-boyd branch April 4, 2023 22:02

warner mentioned this pull request Apr 7, 2023

liveslots double-free of virtual objects/collections when RAM and vdata pillars drop on same BOYD #7353

Closed

mhofman mentioned this pull request May 8, 2024

Leak of durable objects through old virtual and heap references after upgrade #9338

Open
