-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(SwingSet): force reload worker from snapshot #7561
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, modulo some name suggestions.
@@ -5,6 +5,32 @@ import { xsnap, recordXSnap } from '@agoric/xsnap'; | |||
|
|||
const NETSTRING_MAX_CHUNK_SIZE = 12_000_000; | |||
|
|||
/** | |||
* @typedef {object} StartXSnapLoadFromBundleDetails |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW the types I set up in the transcript for this case are TranscriptDeliveryInitializeWorkerOptions
(with { source: { bundleID }, workerOptions: WorkerOptions }
, where the XSnapWorkerOptions
flavor of WorkerOptions
has { type: 'xsnap', bundleIDs }
). This file shouldn't reference "Transcript" anything, but we should name the internal pieces somehow, and share those between here and the vat-warehouse.js transcript management, because the goal is for vat-warehouse.js or vat-loader.js to notice that the transcript starts with initialize-worker
and then pass the second arg of the entry mostly-intact to this function.
Although.. startXSnap
is not the piece that handles vat bundles, they need to be sent as a command payload, after the supervisor is running (they must be evaluated by liveslots, which must first prepare the vatGlobals
for it). So I don't think StartXSnapLoadFromBundleDetails
is quite right, maybe more like "initialize worker" or "empty worker". But this is fine for now, we'll refine this later as we clean up the vat-loader path to follow the transcript instructions more directly.
For load-from-snapshot, the corresponding thing is a "loadConfig", defined as part of the TranscriptDeliveryLoadSnapshot
type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 2 high level cases:
- start xsnap process from scratch and send command to load some bundles
- start xsnap process from snapshot (which really should be start from scratch and send command to load a snapshot, but that's an improvement for another day)
That's what options.load
attempts to capture. The other top level options for startXSnap
should be independent of how the load is performed, which means the bundleIDs
have no reason to appear in the top-level startXSnap
options if we're loading from a snapshot. I did not have time to change the plumbing for the load from bundle path, but the bundleIDs
should move under the options.load = { source: 'bundle', bundlesIDs };
.
Although..
startXSnap
is not the piece that handles vat bundles, they need to be sent as a command payload, after the supervisor is running (they must be evaluated by liveslots
I am confused, from what I gather, the bundleIDs
are the lockdown and supervisor bundles. This file does not deal with the vat's source bundle at all from what I can tell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok, the singular "bundle" in the old StartXSnapLoadFromBundleDetails
name made me think the (singular) vat bundle was involved, and that was at odds with the "load" verb (which I associate with snapshots). I definitely agree that this first "from scratch" case is the only one where the helper bundleIDs (plural) should appear.
And now I see that you're aiming for a single argument which captures both the "from scratch (with helper bundles)" case and the "from snapshot (either 'here is a stream' or 'go ask snapstore')" case, where I was thinking about separate arguments, so finding a single name for the combined one (and a union type) is awkward.
Let's not change it now, but in a later cleanup, let's consider splitting startXSnap
into two functions, one for from-scratch, the other for load-snapshot: I've been thinking about doing the same at the manager-subprocess-xsnap.js and vat-warehouse.js layers. The vat-warehouse.js ensureVatOnline
will look at the first entry in the transcript and then either vector off into the "from scratch" path (using initOptions
, which includes workerOptions
and the helper bundleIDs), or the "from snapshot" path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we should consider always starting a blank worker, then a command to load bundle, or another command to load a snapshot. At least at the xsnap layer it would make sense.
return { snapshotStream, snapshotDescription }; | ||
} | ||
|
||
return undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this last case an error? When would loadDetails
be defined but not have a usable .source
? Maybe this should throw instead.
Or.. .source: 'bundle'
means "don't load a snapshot"? I guess that works, but I think I'd rather have the caller not provide details.load
at all, rather than suggesting that the vat bundle (which must be processed much later, and not by startXSnap
) is useful to appear here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These would be the "init" bundles, aka lockdown and supervisor, and are the ones this file explicitely deal with. I will try to make that more obvious.
* @param {string} name | ||
* @param {object} details | ||
* @param {import('../types-external.js').BundleID[]} details.bundleIDs | ||
* @param {(request: Uint8Array) => Promise<Uint8Array>} details.handleCommand | ||
* @param {boolean} [details.metered] | ||
* @param {boolean} [details.reload] | ||
* @param {StartXSnapLoadDetails} [details.load] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd like this option to include "snapshot" in the name, maybe defailts.snapshotInfo
or snapshotLoadInfo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. This is meant to inform the mode in which the worker is created: from scratch and load supervisor and lockdown bundles, or from snapshot. Ultimately it is meant to be mandatory, but the bundleIDs plumbing was a bit too much to change in this PR.
* @returns {Promise<void>} | ||
*/ | ||
async function saveSnapshot(manager) { | ||
async function saveSnapshot(manager, keepWorker) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest flipping the polarity and calling the option restartWorker
. For some reason "keepWorker" doesn't tell me what the two possibilities are as much as something with "restart" in the name. Like, if keepWorker=false
, does it mean we're killing the worker entirely, and just relying upon the next delivery to page it back in? "restartWorker=true" tells me that we're going to kill it and start a new one right away, which means that not doing that might be an option. I dunno, maybe just me.
Also, not something we can do anything about now, but it's a bit weird that vatKeeper, being a frontend for the DB and in a subdirectory named state/
, has an option that talks about how to manage child processes. Ideally vatKeeper would be only about state, and the relationship between vat-warehouse (the caller) and vatKeeper and the manager would let vatKeeper not take this option (and actually not take manager
either). Like, maybe vat-warehouse gets a snapshotReadStream from the manager, and passes it into vatKeeper.saveSnapshot. Something to think about for later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am very, very allergic to positional boolean parameters that default to true. I guess we need too decide which state is the default, and which state is set by policy.
However I do see the possible confusion with "page out the worker".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, good point, I'll think about potential future cleanups for this, maybe I can find a word that triggers the right association in my mind but also has a good "false" default. (I'm still not used to JS being so casual about missing arguments, and haven't internalized the idea that omitting the argument is an accepted way of saying "false", which then obviously conflicts with needing a default of "true").
@@ -236,6 +236,7 @@ export {}; | |||
* | |||
* @typedef {object} VatWarehousePolicy | |||
* @property { number } [maxVatsOnline] Limit the number of simultaneous workers | |||
* @property { boolean } [keepWorkerOnSnapshot] Disable reloading from snapshot when one is taken |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, maybe name it restartWorkerOnSnapshot
function makeSnapshot(snapPos, snapStore) { | ||
return snapStore.saveSnapshot( | ||
vatID, | ||
async function makeSnapshot(snapPos, snapStore, keepWorker) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, restartWorker
} | ||
|
||
/** @type {AsyncGenerator<Uint8Array, void, void>[]} */ | ||
const [loadStream, saveStream] = synchronizedTee(snapshotStream, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might name these restartWorkerStream
and dbSaveStream
(or saveToDBStream
), the "load vs save" directions are a bit ambiguous because they don't tell you it's the replacement worker doing the loading.
}; | ||
|
||
kernelSlog.write({ | ||
type: 'heap-snapshotStream-load', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've generally been avoiding camelCase in slog type:
values, so maybe heap-snapshot-stream-load
.
Is this slog event in addition to the save-snapshot
we were producing previously? In that case, if we're distinguishing between just "we saved a snapshot with hash X" and "oh BTW we restarted the worker", then this type
should look a bit more like the load-snapshot
event. Maybe heap-snapshot-reload
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no I have no idea how this snuck in, looks like a rebase conflict resolution gona bad! It was meant to be heap-snapshot-load
since it is exactly that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can change this later, but I think the new version will write the following slog items:
heap-snapshot-load
(first, beacuse it happens when vatKeepersaveSnapshot()
callsmanager.makeSnapshot()
)heap-snapshot-save
(second, back in vatKeepersaveSnapshot()
)
which feels backwards. When we get to a cleanup pass, I'd like to remove the heap-snapshot-load
, because we now have a restartWorker
item to the heap-snapshot-save
slog entry, and I don't want us to confuse this case with a "worker got evicted and now vat-warehouse is bringing it back online" case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's a good point.
@@ -243,7 +243,8 @@ export function makeVatWarehouse({ | |||
panic, | |||
warehousePolicy, | |||
}) { | |||
const { maxVatsOnline = 50 } = warehousePolicy || {}; | |||
const { maxVatsOnline = 50, keepWorkerOnSnapshot = false } = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, restartWorkerOnSnapshot = true
f990398
to
ac46574
Compare
case 'snapStore': { | ||
if (!snapStore) { | ||
// Fallback to load from bundles | ||
return undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note for future: this should be an error, if the caller expected a load-snapshot
but the implementation somehow failed to provide a snapStore
(which probably isn't even a thing anymore, the notion that snapStore
was optional is pretty old), then a blank worker is not a functional substitute
(the original did the same thing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course, this whole clause will go away if/when the vat-loader gives us from: 'snapshotStream'
instead of from: 'snapStore'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but the goal of this commit was as a pure refactor, and some tests seemed to rely on this behavior.
Hopefully we can clean all that up in a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good.. we can do a cleanup pass later, none of the functionality needs to change
I'll change that one types-external.js comment to be correct, then land this. Thanks!
@@ -236,6 +236,7 @@ export {}; | |||
* | |||
* @typedef {object} VatWarehousePolicy | |||
* @property { number } [maxVatsOnline] Limit the number of simultaneous workers | |||
* @property { boolean } [restartWorkerOnSnapshot] Disable reloading from snapshot when one is taken |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comment is now backwards: I'll add a fixup commit to correct it
4efb9f0
to
9061e20
Compare
9061e20
to
dd79275
Compare
dd79275
to
a1e0d09
Compare
closes: #6943
Description
Introduce a runtime option (enabled by default), to force reload a worker when a snapshot is taken.
This is done in a streaming fashion by tee-ing the snapshot stream into the snapStore and a new xsnap worker, then substituting the old worker transparently.
Security Considerations
This increases the likelihood that validators will stay in consensus at the cost of potentially hiding reload based defects in xsnap. To mitigate the latter, this is a runtime option that can be disabled, and we validate new xsnap versions with the replay tool which controls and compares the behavior of reload from snapshot.
Scaling Considerations
This would reduce memory fragmentation for large vats, potentially mitigating issues like #6661.
Documentation Considerations
Types for new runtime option. Not documented otherwise.
Testing Considerations
Modified the vat warehouse unit test to run both versions (reload and keep).
Manually tested on a loadgen branch (need update to support transient multiple process per worker).
(removed a
#loadgen-branch: mhofman/fix-shutdown-and-multi-processcomment here)