feat(swingset): get a small upgrade to work #4927

warner · 2022-03-26T03:53:38Z

This enables a partial vat upgrade test to work. It makes a lot of
assumptions and is missing a lot of functionality, but it's a concrete
step forward. Limitations include:

if the stopVat or startVat fail somehow, the vat will be left in an
unusable state, and the caller will not be notified
stopVat does not clean up the non-durable objects
stopVat does not reject the lingering promises
startVat does not assert that userspace reconnects all Kinds
nothing about reconnecting Kinds is tested at all
nothing about remaining durable objects is tested at all
the kernel does not unsubscribe the vat from remaining promises
transcript pointers are cleared, but the contents are not deleted
the snapshot decref/deletion is not tested
stopVat is metered, but probably shouldn't be
startVat metering needs more thought

refs #1848

warner · 2022-03-26T03:54:55Z

@FUDCo there are a lot of TODOs in these changes, but they should all be in code that's only exercised when a vat is upgraded. I'm thinking that it's better to land incremental functionality than to wait until everything is polished, so we can start playing with it sooner.

The next step will be to have stopVat reject any lingering promises, and add tests of the same.

FUDCo

Approving because I don't want to block your progress, but a couple of concerns in the comments below.

FUDCo · 2022-03-28T00:57:18Z

packages/SwingSet/src/kernel/kernel.js

+    for (const kref of vatParameters.slots) {
+      kernelKeeper.decrementRefCount(kref, 'upgrade-vat-event');
+    }
+    const status2 = await deliverAndLogToVat(vatID, kd2, vd2);


I'm a little nervous about doing two deliverAndLogToVat calls within a single crank. While there's nothing specifically I can point at that seems wrong per se, up until now cranks and deliveries have been in 1:1 correspondence. In particular, we should be extra careful that there's nothing in deliverAndLogToVat that bakes in assumptions that it is at the start of a crank on the way in or at the end of a crank on the way out.

Yep, this is the first time we're taking advantage of the difference between "crank" and "delivery". I think vatWarehouse.deliverToVat and the slogger is entirely ready for it, but the part that I know needs some polish is on the delivery results. The only sort of errors we've dealt with so far are all the kinds that terminate the vat. We're now entering into territory where e.g. startVat will soon be able to notice that userspace failed to reconnect all of the durable Kinds, and if that happens we want to roll back the entire upgrade. We'll need a response that says "hey caller, the worker is wedged and no longer viable, but if you abortCrank and jettison the worker, you could resume from the earlier snapshot without problems". My plan is to get roll-back-failed-upgrade working after getting the successful upgrade paths done, at which point I'll be looking more closely at these delivery results and their error modes.

FUDCo · 2022-03-28T01:03:28Z

packages/SwingSet/src/kernel/state/vatKeeper.js

+    }
+    // TODO: same rollback concern
+    // TODO: streamStore.deleteStream(transcriptStream);
+    const newStart = streamStore.STREAM_START;


Do we really want to reset the stream pointer to the start position or merely remove the prior transcript entries? Seems like for debugging purposes it would be nicer to maintain the continuity of transcript position across the upgrade, so that, for example, if one was looking at a historical record of a transcript there would be no ambiguity about which incarnation of the vat code it was associated with.

Yeah, I was wondering about that too. The conclusion I came to was that the transcript is intimately tied to the bundleID being executed: it doesn't make sense to retain a transcript without also retaining the bundle. It's not like you can replay the concatenated transcripts on top of only the original source bundle and get to the current state.

To rebuild the vat from nothing requires a data structure that looks like "load bundle1, execute these N1 transcript entries, then shutdown and load bundle2, then execute these other N2 entries, etc". To do that properly we'd need to put the bundleID in the startVat delivery, and record the stopVat deliveries too. And you'd need to retain the source bundles for all the bundleIDs in all the startVats in the extended transcript.

That's not a bad design, but I haven't stopped to think about how long it would take to make the necessary changes to achieve it. I'll spend a bit evaluating that before I land this PR. At the very least I'll sketch out what the complete design would be so I can compare.

Performance-wise, I was hoping to delete as much data as possible, and being able to delete the transcript and the old bundles is appealing. A new validator who's trying to catch up doesn't need to replay the transcripts from bundle-1: it's sufficient for them to replay from just the most recent startVat. That helps their catchup performance a lot, especially if we manage periodic "null upgrades" whose main job is to discard the transcripts. But of course we don't have to delete the historical transcripts to achieve this, as long as new validators can merkle-root-validate the most recent transcript segment without needing to fetch all of the earlier ones too (i.e. we need to be judicious with our merkle tree layout).

Deleting the old transcript sounds fine to me. I was just thinking that having continuity of the crank numbers in the historical timeline would be helpful when debugging timelines that cross an update boundary.

It looks like it's non-trivial to make this change now. We could change startVat to include the bundleID, but the manager would have to snoop that and rewrite the delivery to include the whole bundle (or, better, send a setBundle command with the bundle contents before sending the deliver(startVat) command). But given the way the supervisor is currently written, that's too late, because the supervisor needs the full bundle to define buildVatNamespace before passing it into makeLiveSlots. We'd have to change makeLiveSlots to return a setBuildVatNamespace, sort of like how it used to return a setBuildRootObject.

I'll open a new ticket to see if we can come back to this later. There's an open policy question as to whether we should retain the historical transcripts from the very beginning of time, or just enough to enable #1691 -style replay (which, to be clear, cannot benefit from replay beyond the most recent version, which is a significant downside of the null-upgrade step).

I created #4940 to track this idea.

This enables a partial vat upgrade test to work. It makes a lot of assumptions and is missing a lot of functionality, but it's a concrete step forward. Limitations include: * if the stopVat or startVat fail somehow, the vat will be left in an unusable state, and the caller will not be notified * stopVat does not clean up the non-durable objects * stopVat does not reject the lingering promises * startVat does not assert that userspace reconnects all Kinds * nothing about reconnecting Kinds is tested at all * nothing about remaining durable objects is tested at all * the kernel does not unsubscribe the vat from remaining promises * transcript pointers are cleared, but the contents are not deleted * the snapshot decref/deletion is not tested * stopVat is metered, but probably shouldn't be * startVat metering needs more thought refs #1848

warner added the SwingSet package: SwingSet label Mar 26, 2022

warner requested a review from FUDCo as a code owner March 26, 2022 03:53

warner self-assigned this Mar 26, 2022

warner mentioned this pull request Mar 26, 2022

kernel API for upgrading vats #1848

Closed

warner force-pushed the 1848-restart-vat branch 3 times, most recently from 9122c6d to fed609c Compare March 27, 2022 00:24

FUDCo approved these changes Mar 28, 2022

View reviewed changes

warner force-pushed the 1848-initialize-store-kind-info-earlier branch from 30eef46 to 84ff2e3 Compare March 28, 2022 06:12

warner force-pushed the 1848-restart-vat branch from fed609c to 25ecef5 Compare March 28, 2022 06:12

warner force-pushed the 1848-initialize-store-kind-info-earlier branch from 84ff2e3 to 28e500b Compare March 28, 2022 06:33

Base automatically changed from 1848-initialize-store-kind-info-earlier to master March 28, 2022 06:48

warner force-pushed the 1848-restart-vat branch from 25ecef5 to a7d996b Compare March 28, 2022 06:55

warner mentioned this pull request Mar 28, 2022

retain transcripts from all versions of a vat, or just the most recent one? #4940

Closed

warner merged commit f438597 into master Mar 28, 2022

warner deleted the 1848-restart-vat branch March 28, 2022 07:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(swingset): get a small upgrade to work #4927

feat(swingset): get a small upgrade to work #4927

warner commented Mar 26, 2022

warner commented Mar 26, 2022

FUDCo left a comment

FUDCo Mar 28, 2022

warner Mar 28, 2022

FUDCo Mar 28, 2022

warner Mar 28, 2022

FUDCo Mar 28, 2022

warner Mar 28, 2022

warner Mar 28, 2022

feat(swingset): get a small upgrade to work #4927

feat(swingset): get a small upgrade to work #4927

Conversation

warner commented Mar 26, 2022

warner commented Mar 26, 2022

FUDCo left a comment

Choose a reason for hiding this comment

FUDCo Mar 28, 2022

Choose a reason for hiding this comment

warner Mar 28, 2022

Choose a reason for hiding this comment

FUDCo Mar 28, 2022

Choose a reason for hiding this comment

warner Mar 28, 2022

Choose a reason for hiding this comment

FUDCo Mar 28, 2022

Choose a reason for hiding this comment

warner Mar 28, 2022

Choose a reason for hiding this comment

warner Mar 28, 2022

Choose a reason for hiding this comment