Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(SwingSet): force reload worker from snapshot #7561

Merged
merged 4 commits into from
Apr 30, 2023

Conversation

mhofman
Copy link
Member

@mhofman mhofman commented Apr 30, 2023

closes: #6943

Description

Introduce a runtime option (enabled by default), to force reload a worker when a snapshot is taken.
This is done in a streaming fashion by tee-ing the snapshot stream into the snapStore and a new xsnap worker, then substituting the old worker transparently.

Security Considerations

This increases the likelihood that validators will stay in consensus at the cost of potentially hiding reload based defects in xsnap. To mitigate the latter, this is a runtime option that can be disabled, and we validate new xsnap versions with the replay tool which controls and compares the behavior of reload from snapshot.

Scaling Considerations

This would reduce memory fragmentation for large vats, potentially mitigating issues like #6661.

Documentation Considerations

Types for new runtime option. Not documented otherwise.

Testing Considerations

Modified the vat warehouse unit test to run both versions (reload and keep).
Manually tested on a loadgen branch (need update to support transient multiple process per worker).

(removed a # loadgen-branch: mhofman/fix-shutdown-and-multi-process comment here)

@mhofman mhofman requested a review from warner April 30, 2023 04:30
Copy link
Member

@warner warner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, modulo some name suggestions.

@@ -5,6 +5,32 @@ import { xsnap, recordXSnap } from '@agoric/xsnap';

const NETSTRING_MAX_CHUNK_SIZE = 12_000_000;

/**
* @typedef {object} StartXSnapLoadFromBundleDetails
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW the types I set up in the transcript for this case are TranscriptDeliveryInitializeWorkerOptions (with { source: { bundleID }, workerOptions: WorkerOptions }, where the XSnapWorkerOptions flavor of WorkerOptions has { type: 'xsnap', bundleIDs }). This file shouldn't reference "Transcript" anything, but we should name the internal pieces somehow, and share those between here and the vat-warehouse.js transcript management, because the goal is for vat-warehouse.js or vat-loader.js to notice that the transcript starts with initialize-worker and then pass the second arg of the entry mostly-intact to this function.

Although.. startXSnap is not the piece that handles vat bundles, they need to be sent as a command payload, after the supervisor is running (they must be evaluated by liveslots, which must first prepare the vatGlobals for it). So I don't think StartXSnapLoadFromBundleDetails is quite right, maybe more like "initialize worker" or "empty worker". But this is fine for now, we'll refine this later as we clean up the vat-loader path to follow the transcript instructions more directly.

For load-from-snapshot, the corresponding thing is a "loadConfig", defined as part of the TranscriptDeliveryLoadSnapshot type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 high level cases:

  • start xsnap process from scratch and send command to load some bundles
  • start xsnap process from snapshot (which really should be start from scratch and send command to load a snapshot, but that's an improvement for another day)

That's what options.load attempts to capture. The other top level options for startXSnap should be independent of how the load is performed, which means the bundleIDs have no reason to appear in the top-level startXSnap options if we're loading from a snapshot. I did not have time to change the plumbing for the load from bundle path, but the bundleIDs should move under the options.load = { source: 'bundle', bundlesIDs };.

Although.. startXSnap is not the piece that handles vat bundles, they need to be sent as a command payload, after the supervisor is running (they must be evaluated by liveslots

I am confused, from what I gather, the bundleIDs are the lockdown and supervisor bundles. This file does not deal with the vat's source bundle at all from what I can tell.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, the singular "bundle" in the old StartXSnapLoadFromBundleDetails name made me think the (singular) vat bundle was involved, and that was at odds with the "load" verb (which I associate with snapshots). I definitely agree that this first "from scratch" case is the only one where the helper bundleIDs (plural) should appear.

And now I see that you're aiming for a single argument which captures both the "from scratch (with helper bundles)" case and the "from snapshot (either 'here is a stream' or 'go ask snapstore')" case, where I was thinking about separate arguments, so finding a single name for the combined one (and a union type) is awkward.

Let's not change it now, but in a later cleanup, let's consider splitting startXSnap into two functions, one for from-scratch, the other for load-snapshot: I've been thinking about doing the same at the manager-subprocess-xsnap.js and vat-warehouse.js layers. The vat-warehouse.js ensureVatOnline will look at the first entry in the transcript and then either vector off into the "from scratch" path (using initOptions, which includes workerOptions and the helper bundleIDs), or the "from snapshot" path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should consider always starting a blank worker, then a command to load bundle, or another command to load a snapshot. At least at the xsnap layer it would make sense.

return { snapshotStream, snapshotDescription };
}

return undefined;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this last case an error? When would loadDetails be defined but not have a usable .source? Maybe this should throw instead.

Or.. .source: 'bundle' means "don't load a snapshot"? I guess that works, but I think I'd rather have the caller not provide details.load at all, rather than suggesting that the vat bundle (which must be processed much later, and not by startXSnap) is useful to appear here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These would be the "init" bundles, aka lockdown and supervisor, and are the ones this file explicitely deal with. I will try to make that more obvious.

* @param {string} name
* @param {object} details
* @param {import('../types-external.js').BundleID[]} details.bundleIDs
* @param {(request: Uint8Array) => Promise<Uint8Array>} details.handleCommand
* @param {boolean} [details.metered]
* @param {boolean} [details.reload]
* @param {StartXSnapLoadDetails} [details.load]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd like this option to include "snapshot" in the name, maybe defailts.snapshotInfo or snapshotLoadInfo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This is meant to inform the mode in which the worker is created: from scratch and load supervisor and lockdown bundles, or from snapshot. Ultimately it is meant to be mandatory, but the bundleIDs plumbing was a bit too much to change in this PR.

* @returns {Promise<void>}
*/
async function saveSnapshot(manager) {
async function saveSnapshot(manager, keepWorker) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest flipping the polarity and calling the option restartWorker. For some reason "keepWorker" doesn't tell me what the two possibilities are as much as something with "restart" in the name. Like, if keepWorker=false, does it mean we're killing the worker entirely, and just relying upon the next delivery to page it back in? "restartWorker=true" tells me that we're going to kill it and start a new one right away, which means that not doing that might be an option. I dunno, maybe just me.

Also, not something we can do anything about now, but it's a bit weird that vatKeeper, being a frontend for the DB and in a subdirectory named state/, has an option that talks about how to manage child processes. Ideally vatKeeper would be only about state, and the relationship between vat-warehouse (the caller) and vatKeeper and the manager would let vatKeeper not take this option (and actually not take manager either). Like, maybe vat-warehouse gets a snapshotReadStream from the manager, and passes it into vatKeeper.saveSnapshot. Something to think about for later.

Copy link
Member Author

@mhofman mhofman Apr 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very, very allergic to positional boolean parameters that default to true. I guess we need too decide which state is the default, and which state is set by policy.

However I do see the possible confusion with "page out the worker".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, good point, I'll think about potential future cleanups for this, maybe I can find a word that triggers the right association in my mind but also has a good "false" default. (I'm still not used to JS being so casual about missing arguments, and haven't internalized the idea that omitting the argument is an accepted way of saying "false", which then obviously conflicts with needing a default of "true").

@@ -236,6 +236,7 @@ export {};
*
* @typedef {object} VatWarehousePolicy
* @property { number } [maxVatsOnline] Limit the number of simultaneous workers
* @property { boolean } [keepWorkerOnSnapshot] Disable reloading from snapshot when one is taken
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, maybe name it restartWorkerOnSnapshot

function makeSnapshot(snapPos, snapStore) {
return snapStore.saveSnapshot(
vatID,
async function makeSnapshot(snapPos, snapStore, keepWorker) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, restartWorker

}

/** @type {AsyncGenerator<Uint8Array, void, void>[]} */
const [loadStream, saveStream] = synchronizedTee(snapshotStream, 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might name these restartWorkerStream and dbSaveStream (or saveToDBStream), the "load vs save" directions are a bit ambiguous because they don't tell you it's the replacement worker doing the loading.

};

kernelSlog.write({
type: 'heap-snapshotStream-load',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've generally been avoiding camelCase in slog type: values, so maybe heap-snapshot-stream-load.

Is this slog event in addition to the save-snapshot we were producing previously? In that case, if we're distinguishing between just "we saved a snapshot with hash X" and "oh BTW we restarted the worker", then this type should look a bit more like the load-snapshot event. Maybe heap-snapshot-reload ?

Copy link
Member Author

@mhofman mhofman Apr 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no I have no idea how this snuck in, looks like a rebase conflict resolution gona bad! It was meant to be heap-snapshot-load since it is exactly that!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can change this later, but I think the new version will write the following slog items:

  • heap-snapshot-load (first, beacuse it happens when vatKeeper saveSnapshot() calls manager.makeSnapshot())
  • heap-snapshot-save (second, back in vatKeeper saveSnapshot())

which feels backwards. When we get to a cleanup pass, I'd like to remove the heap-snapshot-load, because we now have a restartWorker item to the heap-snapshot-save slog entry, and I don't want us to confuse this case with a "worker got evicted and now vat-warehouse is bringing it back online" case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a good point.

@@ -243,7 +243,8 @@ export function makeVatWarehouse({
panic,
warehousePolicy,
}) {
const { maxVatsOnline = 50 } = warehousePolicy || {};
const { maxVatsOnline = 50, keepWorkerOnSnapshot = false } =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, restartWorkerOnSnapshot = true

@mhofman mhofman added force:integration Force integration tests to run on PR and removed force:integration Force integration tests to run on PR labels Apr 30, 2023
@mhofman mhofman force-pushed the mhofman/6943-force-reload branch from f990398 to ac46574 Compare April 30, 2023 15:18
case 'snapStore': {
if (!snapStore) {
// Fallback to load from bundles
return undefined;
Copy link
Member

@warner warner Apr 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note for future: this should be an error, if the caller expected a load-snapshot but the implementation somehow failed to provide a snapStore (which probably isn't even a thing anymore, the notion that snapStore was optional is pretty old), then a blank worker is not a functional substitute

(the original did the same thing)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course, this whole clause will go away if/when the vat-loader gives us from: 'snapshotStream' instead of from: 'snapStore'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but the goal of this commit was as a pure refactor, and some tests seemed to rely on this behavior.

Hopefully we can clean all that up in a follow up.

Copy link
Member

@warner warner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good.. we can do a cleanup pass later, none of the functionality needs to change

I'll change that one types-external.js comment to be correct, then land this. Thanks!

@@ -236,6 +236,7 @@ export {};
*
* @typedef {object} VatWarehousePolicy
* @property { number } [maxVatsOnline] Limit the number of simultaneous workers
* @property { boolean } [restartWorkerOnSnapshot] Disable reloading from snapshot when one is taken
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment is now backwards: I'll add a fixup commit to correct it

@warner warner added SwingSet package: SwingSet automerge:rebase Automatically rebase updates, then merge and removed automerge:rebase Automatically rebase updates, then merge labels Apr 30, 2023
@warner warner force-pushed the mhofman/6943-force-reload branch from 4efb9f0 to 9061e20 Compare April 30, 2023 19:45
@warner warner added the automerge:rebase Automatically rebase updates, then merge label Apr 30, 2023
@mhofman mhofman force-pushed the mhofman/6943-force-reload branch from 9061e20 to dd79275 Compare April 30, 2023 20:05
@warner warner force-pushed the mhofman/6943-force-reload branch from dd79275 to a1e0d09 Compare April 30, 2023 20:20
@warner warner added automerge:rebase Automatically rebase updates, then merge and removed automerge:rebase Automatically rebase updates, then merge labels Apr 30, 2023
@mergify mergify bot merged commit 317cfb6 into master Apr 30, 2023
@mergify mergify bot deleted the mhofman/6943-force-reload branch April 30, 2023 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge:rebase Automatically rebase updates, then merge force:integration Force integration tests to run on PR SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Force xsnap reload from snapshot after writing each snapshot
2 participants