-
Notifications
You must be signed in to change notification settings - Fork 226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(swingset): get a small upgrade to work #4927
Conversation
@FUDCo there are a lot of TODOs in these changes, but they should all be in code that's only exercised when a vat is upgraded. I'm thinking that it's better to land incremental functionality than to wait until everything is polished, so we can start playing with it sooner. The next step will be to have |
9122c6d
to
fed609c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving because I don't want to block your progress, but a couple of concerns in the comments below.
for (const kref of vatParameters.slots) { | ||
kernelKeeper.decrementRefCount(kref, 'upgrade-vat-event'); | ||
} | ||
const status2 = await deliverAndLogToVat(vatID, kd2, vd2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little nervous about doing two deliverAndLogToVat
calls within a single crank. While there's nothing specifically I can point at that seems wrong per se, up until now cranks and deliveries have been in 1:1 correspondence. In particular, we should be extra careful that there's nothing in deliverAndLogToVat
that bakes in assumptions that it is at the start of a crank on the way in or at the end of a crank on the way out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this is the first time we're taking advantage of the difference between "crank" and "delivery". I think vatWarehouse.deliverToVat
and the slogger is entirely ready for it, but the part that I know needs some polish is on the delivery results. The only sort of errors we've dealt with so far are all the kinds that terminate the vat. We're now entering into territory where e.g. startVat
will soon be able to notice that userspace failed to reconnect all of the durable Kinds, and if that happens we want to roll back the entire upgrade. We'll need a response that says "hey caller, the worker is wedged and no longer viable, but if you abortCrank
and jettison the worker, you could resume from the earlier snapshot without problems". My plan is to get roll-back-failed-upgrade working after getting the successful upgrade paths done, at which point I'll be looking more closely at these delivery results and their error modes.
} | ||
// TODO: same rollback concern | ||
// TODO: streamStore.deleteStream(transcriptStream); | ||
const newStart = streamStore.STREAM_START; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to reset the stream pointer to the start position or merely remove the prior transcript entries? Seems like for debugging purposes it would be nicer to maintain the continuity of transcript position across the upgrade, so that, for example, if one was looking at a historical record of a transcript there would be no ambiguity about which incarnation of the vat code it was associated with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was wondering about that too. The conclusion I came to was that the transcript is intimately tied to the bundleID being executed: it doesn't make sense to retain a transcript without also retaining the bundle. It's not like you can replay the concatenated transcripts on top of only the original source bundle and get to the current state.
To rebuild the vat from nothing requires a data structure that looks like "load bundle1, execute these N1 transcript entries, then shutdown and load bundle2, then execute these other N2 entries, etc". To do that properly we'd need to put the bundleID in the startVat
delivery, and record the stopVat
deliveries too. And you'd need to retain the source bundles for all the bundleIDs in all the startVat
s in the extended transcript.
That's not a bad design, but I haven't stopped to think about how long it would take to make the necessary changes to achieve it. I'll spend a bit evaluating that before I land this PR. At the very least I'll sketch out what the complete design would be so I can compare.
Performance-wise, I was hoping to delete as much data as possible, and being able to delete the transcript and the old bundles is appealing. A new validator who's trying to catch up doesn't need to replay the transcripts from bundle-1: it's sufficient for them to replay from just the most recent startVat
. That helps their catchup performance a lot, especially if we manage periodic "null upgrades" whose main job is to discard the transcripts. But of course we don't have to delete the historical transcripts to achieve this, as long as new validators can merkle-root-validate the most recent transcript segment without needing to fetch all of the earlier ones too (i.e. we need to be judicious with our merkle tree layout).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleting the old transcript sounds fine to me. I was just thinking that having continuity of the crank numbers in the historical timeline would be helpful when debugging timelines that cross an update boundary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like it's non-trivial to make this change now. We could change startVat
to include the bundleID, but the manager would have to snoop that and rewrite the delivery to include the whole bundle (or, better, send a setBundle
command with the bundle contents before sending the deliver(startVat)
command). But given the way the supervisor is currently written, that's too late, because the supervisor needs the full bundle to define buildVatNamespace
before passing it into makeLiveSlots
. We'd have to change makeLiveSlots
to return a setBuildVatNamespace
, sort of like how it used to return a setBuildRootObject
.
I'll open a new ticket to see if we can come back to this later. There's an open policy question as to whether we should retain the historical transcripts from the very beginning of time, or just enough to enable #1691 -style replay (which, to be clear, cannot benefit from replay beyond the most recent version, which is a significant downside of the null-upgrade step).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created #4940 to track this idea.
30eef46
to
84ff2e3
Compare
fed609c
to
25ecef5
Compare
84ff2e3
to
28e500b
Compare
This enables a partial vat upgrade test to work. It makes a lot of assumptions and is missing a lot of functionality, but it's a concrete step forward. Limitations include: * if the stopVat or startVat fail somehow, the vat will be left in an unusable state, and the caller will not be notified * stopVat does not clean up the non-durable objects * stopVat does not reject the lingering promises * startVat does not assert that userspace reconnects all Kinds * nothing about reconnecting Kinds is tested at all * nothing about remaining durable objects is tested at all * the kernel does not unsubscribe the vat from remaining promises * transcript pointers are cleared, but the contents are not deleted * the snapshot decref/deletion is not tested * stopVat is metered, but probably shouldn't be * startVat metering needs more thought refs #1848
25ecef5
to
a7d996b
Compare
This enables a partial vat upgrade test to work. It makes a lot of
assumptions and is missing a lot of functionality, but it's a concrete
step forward. Limitations include:
unusable state, and the caller will not be notified
refs #1848