Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(SwingSet): Don't send stopVat during upgrade #7244

Merged
merged 13 commits into from
Apr 4, 2023
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 1 addition & 3 deletions packages/SwingSet/docs/garbage-collection.md
Original file line number Diff line number Diff line change
Expand Up @@ -360,9 +360,7 @@ If the last importing vat had previously called `syscall.retireImport`, there wi

## syscall.abandonExport processing

During vat upgrade, the last delivery made to the old version is a special `dispatch.stopVat()`. This instructs the soon-to-be-deleted worker to destroy anything that will not survive the upgrade. All Remotables will be abandoned here, because they live solely in RAM, and the RAM heap does not survive. All non-durable virtual objects will also be abandoned, since only durable ones survive. We may also drop otherwise-durable objects which were only referenced by abandoned non-durable objects.

During this phase, the vat performs a `syscall.abandonExports()` with all the abandoned vrefs. The kernel reacts to this in roughly the same way it reacts to the entire vat being terminated (which it is, sort of, at least the heap is being terminated). Each vref is deleted from the exporting vat's c-list, because *that* vat isn't going to be referencing it any more. The kernel object table is updated to clear the `owner` field: while the kernel object retains its identity, it is now orphaned, and any messages sent to it will be rejected (by the kernel) with a "vat terminated" error.
Although not currently used (see [#6650](https://github.com/Agoric/agoric-sdk/issues/6650)), the kernel still supports vat issuance of `syscall.abandonExports()` with vrefs to be abandoned as part of vat upgrade (i.e., Remotables that live solely in the RAM heap and non-durable merely-virtual objects, and ideally also otherwise-durable objects not stored in baggage and referenced only by other abandoned objects). The kernel reacts to this in roughly the same way it reacts to the entire vat being terminated (which it is, sort of, at least the heap is being terminated). Each vref is deleted from the exporting vat's c-list, because *that* vat isn't going to be referencing it any more. The kernel object table is updated to clear the `owner` field: while the kernel object retains its identity, it is now orphaned, and any messages sent to it will be rejected (by the kernel) with a "vat terminated" error.

No other work needs to be done. Any importing vats will continue to hold their reference as before. They can only tell that the object has been abandoned if they try to send it a message. Eventually, if all the importing vats drop their reference, and nothing else in the kernel is holding one, the kernel object entry will be deleted. In this case, no `dispatch.retireExports` is sent to the old exporting vat, since it's already been removed from their c-list.

Expand Down
77 changes: 28 additions & 49 deletions packages/SwingSet/src/kernel/kernel.js
Original file line number Diff line number Diff line change
Expand Up @@ -802,7 +802,6 @@ export default function buildKernel(
}

/**
*
* @param {RunQueueEventUpgradeVat} message
* @returns {Promise<CrankResults>}
*/
Expand All @@ -825,21 +824,6 @@ export default function buildKernel(
incarnationNumber: vatKeeper.getIncarnationNumber(),
};
const disconnectionCapData = kser(disconnectObject);
/** @type { import('../types-external.js').KernelDeliveryStopVat } */
const stopVatKD = harden(['stopVat', disconnectionCapData]);
const stopVatVD = vatWarehouse.kernelDeliveryToVatDelivery(
vatID,
stopVatKD,
);
const stopVatStatus = await deliverAndLogToVat(vatID, stopVatKD, stopVatVD);
const stopVatResults = deliveryCrankResults(vatID, stopVatStatus, false);

// We don't meter stopVat, since no user code is running, but we
// still report computrons to the runPolicy
let { computrons } = stopVatResults; // BigInt or undefined
if (computrons !== undefined) {
assert.typeof(computrons, 'bigint');
}

/**
* Make a method-arguments structure representing failure
Expand All @@ -856,41 +840,36 @@ export default function buildKernel(
return ['vatUpgradeCallback', [upgradeID, false, error]];
};

// cleanup on behalf of the worker
// This used to be handled by a stopVat delivery to the vat itself,
// but the implementation of that was cut to the bone
// in commits like 91480dee8e48ae26c39c420febf73b93deba6ea5
// basically reverting 1cfbeaa3c925d0f8502edfb313ecb12a1cab5eac
// and then ultimately moved to the kernel in a MUCH diminished form
// (see #5342 and #6650, and testUpgrade in
// {@link ../../test/upgrade/test-upgrade.js}).
// We hope to eventually add back correct sophisticated logic
// by e.g. having liveslots sweep the database when restoring a vat.

// send BOYD so the terminating vat has one last chance to clean
// up, drop imports, and delete durable data.
// If a vat is so broken it can't do BOYD, we can make this optional.
/** @type { import('../types-external.js').KernelDeliveryBringOutYourDead } */
const boydKD = harden(['bringOutYourDead']);
const boydVD = vatWarehouse.kernelDeliveryToVatDelivery(vatID, boydKD);
const boydStatus = await deliverAndLogToVat(vatID, boydKD, boydVD);
const boydResults = deliveryCrankResults(vatID, boydStatus, false);
(!boydResults.abort && !boydResults.terminate) ||
gibson042 marked this conversation as resolved.
Show resolved Hide resolved
Fail`Unexpected abort/terminate result from upgrade-internal bringOutYourDead: ${boydResults}`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, in the old version, a hard-meter overrun or liveslots internal error during stopVat would have unwound the upgrade, leaving the vat in the pre-upgrade state (but also still receiving messages).

In this new version, the same happening during the final bringOutYourDead will panic the kernel, since errors thrown during this processUpgradeVat are in a different category than errors happening within the worker.

I'm trying to figure out how I feel about that. On one hand, kernel panics are bad. Userspace should never be able to prevent kernel progress. We tolerate slightly more from liveslots, to give it agency to perform end-of-crank work, but even then, a dropped promise or infinite loop in most of liveslots.js would either return an error or trip the hard meter (only dispatch() itself has the ability to stall the kernel forever), meaning the worst it can do is get itself terminated. This change grants an effective syscall.panicKernel() to liveslots by making an illegal syscall during the final BOYD.

On the second hand, this isn't directly extending that power to userspace. The worst case I can think of is if userspace managed to build up such a large collection of unreachable objects that the BOYD crank overran the hard-meter limit. Userspace could do that at any time, without this change, but previously that could only cause the vat to be terminated (and userspace can pull that off trivially with an infinite loop, or vatPowers.terminate()). With this change, there's a tiny window during upgrade when careful preparation could panic the kernel.

On the third hand, if something goes wrong during during upgrade, what can we do?

  • 1: unwind the upgrade
  • 2: terminate the vat
  • 3: panic the kernel

A vat-fatal error during upgrade means the vat was already in trouble, especially now that we're only doing BOYD and no other unusual deliveries (imagine a bug in the now-removed stopVat, which only broke upgrade, that would have been super annoying). If this BOYD kills the vat, it would have died soon anyways.

Although, if the vat is marked as critical and it terminates during some delivery (not necessarily the upgrade BOYD, nor any BOYD, just a regular message), the kernel will panic (to prevent the termination from being committed), and we're talking about waking up in a state where we've upgraded the kernel code to do something different, like upgrade the vat before allowing any other messages through. In that case, if our emergency ahead-of-the-line upgrade fails, we probably do want to re-panic the kernel, rather than re-delivering the fatal message, under the "death before confusion" rule.

That suggests that isCritical should play a role, or at least that any "error during upgrade causes vat termination" case should be amended with ".. unless isCritical, in which case panic the kernel".

If the vat is not marked isCritical, and we aren't in some emergency manually-driven-recovery situation, then what's the best way to handle a problem during upgrade? I've assumed we should just unwind it, because that covers the "version 2 is busted" case (which feels like the most likely one). Terminating the vat feels harsh, and panicing the kernel is clearly too much.

If the vat is marked isCritical but we're in a normal userspace-driven upgrade, then unwinding the upgrade and letting them try again with a better version feels appropriate. There are regular messages (already queued up) that will get delivered to the old version: upgrade is async from the POV of the parent vat, so we aren't violating any rules by not doing the upgrade at all (equivalent to deferring the upgrade forever).

So that leaves the emergency-manual-upgrade case. We'll have this code running (in the bulldozer release), then we imagine a kernel-panicing critical vat failure. We'll have some late nights frantically figuring out what the problem is, and how to recover from it. We'll have an opportunity to change the kernel code (as well as the rest of the release), then we'll deploy that release, and the kernel will wake up, and some special new code (that isn't being written today) will run, which will probably perform an ahead-of-the-queue upgrade of the vat that's in trouble. If that upgrade fails, then immediately panicing the kernel is probably best: if we didn't catch it testing, at least the validators will report failure as they try to come back up, and we'll spend another couple of late nights trying to figure out what went wrong.

I think this means that the upgrade-fails-now-what response wants to be "unwind" in most cases, and only "panic" if we're doing this emergency manual thing, and we haven't written the emergency manual thing yet. And we probably don't know enough to define that case yet. So rather than adding a half-baked isEmergencyUpgradeSoPanicIsBetter flag, let's leave a comment reminding ourselves of this analysis, and pointing out that some emergency upgrade situations might warrant panic-on-error. And then change this code to unwind the upgrade in case of error-during-BOYD, to match the previous behavior of error-during-stopVat.

That means retaining the old if (stopVatResults.terminate) clause, but looking at boydResults instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reasoning makes sense to me, and I've done my best to condense it into the source code. Please re-review the latest changes to kernel.js.


// we don't meter bringOutYourDead since no user code is running, but we
// still report computrons to the runPolicy
let { computrons } = boydResults;
computrons === undefined || assert.typeof(computrons, 'bigint');

// TODO: if/when we implement vat pause/suspend, and if
// deliveryCrankResults changes to not use .terminate to indicate
// a problem, this should change to match: where we would normally
// pause/suspend a vat for a delivery error, here we want to
// unwind the upgrade.

if (stopVatResults.terminate) {
// get rid of the worker, so the next delivery to this vat will
// re-create one from the previous state
// eslint-disable-next-line @jessie.js/no-nested-await
await vatWarehouse.stopWorker(vatID);

// notify vat-admin of the failed upgrade
const vatAdminMethargs = makeFailureMethargs(
stopVatResults.terminate.info,
);

// we still report computrons to the runPolicy
const results = harden({
...stopVatResults,
computrons,
abort: true, // always unwind
consumeMessage: true, // don't repeat the upgrade
terminate: undefined, // do *not* terminate the vat
vatAdminMethargs,
});
return results;
}

// stopVat succeeded, now we finish cleanup on behalf of the worker

// TODO: send BOYD so the terminating vat has one last chance to clean
// up, drop imports, and delete durable data.
// If a vat is so broken it can't do BOYD, we can make that optional.
// https://github.com/Agoric/agoric-sdk/issues/7001
// a problem, we might want to unwind the upgrade here.

// reject all promises for which the vat was decider
for (const kpid of kernelKeeper.enumeratePromisesByDecider(vatID)) {
Expand Down
11 changes: 11 additions & 0 deletions packages/SwingSet/src/kernel/state/kernelKeeper.js
Original file line number Diff line number Diff line change
Expand Up @@ -1483,6 +1483,17 @@ export default function makeKernelKeeper(kernelStorage, kernelSlog) {
}

function compareStrings(a, b) {
// natural-sort strings having a shared prefix followed by digits
// (e.g., 'ko42' and 'ko100')
const [_a, aPrefix, aDigits] = /^(\D+)(\d+)$/.exec(a) || [];
if (aPrefix) {
const [_b, bPrefix, bDigits] = /^(\D+)(\d+)$/.exec(b) || [];
if (bPrefix === aPrefix) {
return compareNumbers(aDigits, bDigits);
}
}

// otherwise use the default string ordering
if (a > b) {
return 1;
}
Expand Down
4 changes: 2 additions & 2 deletions packages/SwingSet/test/bootstrap-relay.js
Original file line number Diff line number Diff line change
Expand Up @@ -36,10 +36,10 @@ export const buildRootObject = () => {

getTimer: async () => encodePassable(timer),

getVatRoot: async ({ name }) => {
getVatRoot: async ({ name, rawOutput = false }) => {
const vat = vatData.get(name) || Fail`unknown vat name: ${q(name)}`;
const { root } = vat;
return encodePassable(root);
return rawOutput ? root : encodePassable(root);
},

createVat: async ({ name, bundleCapName, vatParameters = {} }) => {
Expand Down
21 changes: 14 additions & 7 deletions packages/SwingSet/test/upgrade/bootstrap-scripted-upgrade.js
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ import { Far } from '@endo/marshal';
import { assert } from '@agoric/assert';
import { makePromiseKit } from '@endo/promise-kit';

import { NUM_SENSORS } from './num-sensors.js';
const NUM_SENSORS = 39;

const insistMissing = (ref, isCollection = false) => {
let p;
Expand All @@ -27,9 +27,12 @@ export const buildRootObject = () => {
let ulrikRoot;
let ulrikAdmin;
const marker = Far('marker', {});
// sensors are numbered from '1' to match the vobj o+N/1.. vrefs
// for debugging, this array starts with a dummy element so
// the vref of each contained object in an importing vat
// (o-NN where NN starts at 1) is aligned with its index
/** @type {[string, ...object]} */
const importSensors = ['skip0'];
for (let i = 1; i < NUM_SENSORS + 1; i += 1) {
for (let i = 1; i <= NUM_SENSORS; i += 1) {
importSensors.push(Far(`import-${i}`, {}));
}
const { promise, resolve } = makePromiseKit();
Expand Down Expand Up @@ -64,13 +67,17 @@ export const buildRootObject = () => {

dur = await E(ulrikRoot).getDurandal({ d1: 'd1' });
const d1arg = await E(dur).get();
assert.equal(d1arg.d1, 'd1');
// poor man's deepEqual
// (each enumerable own property as a [key, value] pair)
assert.equal(JSON.stringify(Object.entries(d1arg)), '[["d1","d1"]]');

// give ver1 a promise that won't be resolved until ver2
// give version 1 a promise that won't be resolved until after upgrade
await E(ulrikRoot).acceptPromise(promise);
const { p1 } = await E(ulrikRoot).getEternalPromise();

// get promises that never resolve (and will be rejected at upgrade)
const [p1] = await E(ulrikRoot).getEternalPromiseInArray();
p1.catch(() => 'hush');
const p2 = E(ulrikRoot).returnEternalPromise(); // never resolves
const p2 = E(ulrikRoot).getEternalPromise();
p2.catch(() => 'hush');

return { version, data, p1, p2, retain, ...parameters };
Expand Down
1 change: 0 additions & 1 deletion packages/SwingSet/test/upgrade/num-sensors.js

This file was deleted.

Loading