Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(swingset): get a small upgrade to work #4927

Merged
merged 1 commit into from
Mar 28, 2022
Merged

Conversation

warner
Copy link
Member

@warner warner commented Mar 26, 2022

This enables a partial vat upgrade test to work. It makes a lot of
assumptions and is missing a lot of functionality, but it's a concrete
step forward. Limitations include:

  • if the stopVat or startVat fail somehow, the vat will be left in an
    unusable state, and the caller will not be notified
  • stopVat does not clean up the non-durable objects
  • stopVat does not reject the lingering promises
  • startVat does not assert that userspace reconnects all Kinds
  • nothing about reconnecting Kinds is tested at all
  • nothing about remaining durable objects is tested at all
  • the kernel does not unsubscribe the vat from remaining promises
  • transcript pointers are cleared, but the contents are not deleted
  • the snapshot decref/deletion is not tested
  • stopVat is metered, but probably shouldn't be
  • startVat metering needs more thought

refs #1848

@warner warner added the SwingSet package: SwingSet label Mar 26, 2022
@warner warner requested a review from FUDCo as a code owner March 26, 2022 03:53
@warner warner self-assigned this Mar 26, 2022
@warner
Copy link
Member Author

warner commented Mar 26, 2022

@FUDCo there are a lot of TODOs in these changes, but they should all be in code that's only exercised when a vat is upgraded. I'm thinking that it's better to land incremental functionality than to wait until everything is polished, so we can start playing with it sooner.

The next step will be to have stopVat reject any lingering promises, and add tests of the same.

@warner warner force-pushed the 1848-restart-vat branch 3 times, most recently from 9122c6d to fed609c Compare March 27, 2022 00:24
Copy link
Contributor

@FUDCo FUDCo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving because I don't want to block your progress, but a couple of concerns in the comments below.

for (const kref of vatParameters.slots) {
kernelKeeper.decrementRefCount(kref, 'upgrade-vat-event');
}
const status2 = await deliverAndLogToVat(vatID, kd2, vd2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little nervous about doing two deliverAndLogToVat calls within a single crank. While there's nothing specifically I can point at that seems wrong per se, up until now cranks and deliveries have been in 1:1 correspondence. In particular, we should be extra careful that there's nothing in deliverAndLogToVat that bakes in assumptions that it is at the start of a crank on the way in or at the end of a crank on the way out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this is the first time we're taking advantage of the difference between "crank" and "delivery". I think vatWarehouse.deliverToVat and the slogger is entirely ready for it, but the part that I know needs some polish is on the delivery results. The only sort of errors we've dealt with so far are all the kinds that terminate the vat. We're now entering into territory where e.g. startVat will soon be able to notice that userspace failed to reconnect all of the durable Kinds, and if that happens we want to roll back the entire upgrade. We'll need a response that says "hey caller, the worker is wedged and no longer viable, but if you abortCrank and jettison the worker, you could resume from the earlier snapshot without problems". My plan is to get roll-back-failed-upgrade working after getting the successful upgrade paths done, at which point I'll be looking more closely at these delivery results and their error modes.

}
// TODO: same rollback concern
// TODO: streamStore.deleteStream(transcriptStream);
const newStart = streamStore.STREAM_START;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to reset the stream pointer to the start position or merely remove the prior transcript entries? Seems like for debugging purposes it would be nicer to maintain the continuity of transcript position across the upgrade, so that, for example, if one was looking at a historical record of a transcript there would be no ambiguity about which incarnation of the vat code it was associated with.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was wondering about that too. The conclusion I came to was that the transcript is intimately tied to the bundleID being executed: it doesn't make sense to retain a transcript without also retaining the bundle. It's not like you can replay the concatenated transcripts on top of only the original source bundle and get to the current state.

To rebuild the vat from nothing requires a data structure that looks like "load bundle1, execute these N1 transcript entries, then shutdown and load bundle2, then execute these other N2 entries, etc". To do that properly we'd need to put the bundleID in the startVat delivery, and record the stopVat deliveries too. And you'd need to retain the source bundles for all the bundleIDs in all the startVats in the extended transcript.

That's not a bad design, but I haven't stopped to think about how long it would take to make the necessary changes to achieve it. I'll spend a bit evaluating that before I land this PR. At the very least I'll sketch out what the complete design would be so I can compare.

Performance-wise, I was hoping to delete as much data as possible, and being able to delete the transcript and the old bundles is appealing. A new validator who's trying to catch up doesn't need to replay the transcripts from bundle-1: it's sufficient for them to replay from just the most recent startVat. That helps their catchup performance a lot, especially if we manage periodic "null upgrades" whose main job is to discard the transcripts. But of course we don't have to delete the historical transcripts to achieve this, as long as new validators can merkle-root-validate the most recent transcript segment without needing to fetch all of the earlier ones too (i.e. we need to be judicious with our merkle tree layout).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting the old transcript sounds fine to me. I was just thinking that having continuity of the crank numbers in the historical timeline would be helpful when debugging timelines that cross an update boundary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it's non-trivial to make this change now. We could change startVat to include the bundleID, but the manager would have to snoop that and rewrite the delivery to include the whole bundle (or, better, send a setBundle command with the bundle contents before sending the deliver(startVat) command). But given the way the supervisor is currently written, that's too late, because the supervisor needs the full bundle to define buildVatNamespace before passing it into makeLiveSlots. We'd have to change makeLiveSlots to return a setBuildVatNamespace, sort of like how it used to return a setBuildRootObject.

I'll open a new ticket to see if we can come back to this later. There's an open policy question as to whether we should retain the historical transcripts from the very beginning of time, or just enough to enable #1691 -style replay (which, to be clear, cannot benefit from replay beyond the most recent version, which is a significant downside of the null-upgrade step).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created #4940 to track this idea.

@warner warner force-pushed the 1848-initialize-store-kind-info-earlier branch from 30eef46 to 84ff2e3 Compare March 28, 2022 06:12
@warner warner force-pushed the 1848-restart-vat branch from fed609c to 25ecef5 Compare March 28, 2022 06:12
@warner warner force-pushed the 1848-initialize-store-kind-info-earlier branch from 84ff2e3 to 28e500b Compare March 28, 2022 06:33
Base automatically changed from 1848-initialize-store-kind-info-earlier to master March 28, 2022 06:48
This enables a partial vat upgrade test to work. It makes a lot of
assumptions and is missing a lot of functionality, but it's a concrete
step forward. Limitations include:

* if the stopVat or startVat fail somehow, the vat will be left in an
unusable state, and the caller will not be notified
* stopVat does not clean up the non-durable objects
* stopVat does not reject the lingering promises
* startVat does not assert that userspace reconnects all Kinds
* nothing about reconnecting Kinds is tested at all
* nothing about remaining durable objects is tested at all
* the kernel does not unsubscribe the vat from remaining promises
* transcript pointers are cleared, but the contents are not deleted
* the snapshot decref/deletion is not tested
* stopVat is metered, but probably shouldn't be
* startVat metering needs more thought

refs #1848
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants