feat(swingset): allow slow deletion of terminated vats #9227

Merged
merged 1 commit into from
Aug 13, 2024

Conversation

warner
Copy link
Member

@warner warner commented Apr 12, 2024

This introduces new runPolicy() controls which enable "slow
termination" of vats. When configured, terminated vats are immediately
dead (all promises are rejected, all new messages go splat, they never
run again), but the vat's state is deleted slowly, one piece at a
time. This makes it safe to terminate large vats, with a long history,
lots of c-list imports/exports, or large vatstore tables, without fear
of causing an overload (by e.g. dropping 100k references all in a
single crank).

See docs/run-policy.md for details and configuration instructions.

Also changes swing-store to enable budget-limited deletion of vat
transcripts and snapshots.

refs #8928

@warner warner added SwingSet package: SwingSet swing-store labels Apr 12, 2024
@warner warner force-pushed the warner/8928-terminate-vats-slowly branch from 7841696 to 55452ec Compare April 12, 2024 14:02
Copy link

cloudflare-workers-and-pages bot commented Apr 12, 2024

Deploying agoric-sdk with Cloudflare Pages

Latest commit: 9ac2ef0
Status: ✅  Deploy successful!
Preview URL: https://56ae4a6e.agoric-sdk.pages.dev
Branch Preview URL: https://warner-8928-terminate-vats-s.agoric-sdk.pages.dev

@warner warner force-pushed the warner/8928-terminate-vats-slowly branch from 5d23745 to b3beede Compare April 12, 2024 20:49
@warner warner marked this pull request as ready for review April 13, 2024 04:09
@warner warner requested a review from mhofman April 13, 2024 04:09
@warner warner force-pushed the warner/8928-terminate-vats-slowly branch from b3beede to 701a1a2 Compare April 13, 2024 04:10
@warner warner force-pushed the warner/8980-boyd-scheduler branch from 62aa511 to 0fe9f39 Compare April 13, 2024 04:10
@warner warner mentioned this pull request Apr 13, 2024
@warner warner force-pushed the warner/8928-terminate-vats-slowly branch from 701a1a2 to c3299e5 Compare April 15, 2024 15:36
@warner warner force-pushed the warner/8980-boyd-scheduler branch from 0fe9f39 to 967e458 Compare April 15, 2024 15:36
@warner
Copy link
Member Author

warner commented Apr 15, 2024

Note: in addition to having the kernel spread c-list deletion processing over time (to spread out the GC consequences in other vats), I had to change the swing-store to let the kernel spread transcript/snapshot deletion over time (to limit the size of the DB txn). The swingstore work is in the first commit of this PR, the kernel side is in the second.

The swingstore needs to maintain the invariant that exports and imports still work. I arranged it so that transcript spans are deleted starting at the highest startPos (ORDER BY startPos DESC), so the isCurrent=1 record is the very first one deleted. And then I changed the export code to ignore any vat which is missing an isCurrent=1 record. The result is that we'll omit partially-deleted transcripts from any exports, so the import code won't ever observe a partial transcript, so its assertComplete() checks will not fail. Without that, any exports created after the first span deletion but before the final span deletion would be unimportable in mode='replay' or mode='archival'.

The snapshots are still deleted oldest-first (ORDER BY snapPos ASC), since the snapStore's assertComplete does not care about old snapshots.
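
The ordering argument for transcript spans can be sketched with a small in-memory model. This is only an illustration: the real swing-store uses SQLite, and `makeSpanStore`, `deleteSomeSpans`, and `exportableVats` are hypothetical names invented for this sketch.

```javascript
// Hypothetical in-memory stand-in for the transcriptSpans table.
// The real swing-store keeps these rows in SQLite.
function makeSpanStore(rows) {
  // rows: [{ vatID, startPos, isCurrent }]
  let spans = rows.map(row => ({ ...row }));

  // Delete up to `budget` spans for one vat, highest startPos first
  // (the analogue of ORDER BY startPos DESC LIMIT budget).
  function deleteSomeSpans(vatID, budget) {
    const doomed = spans
      .filter(span => span.vatID === vatID)
      .sort((a, b) => b.startPos - a.startPos)
      .slice(0, budget);
    spans = spans.filter(span => !doomed.includes(span));
    const remaining = spans.some(span => span.vatID === vatID);
    return { done: !remaining, cleanups: doomed.length };
  }

  // Export skips any vat that has no isCurrent=1 span, so a partially
  // deleted transcript never shows up in an export.
  function exportableVats() {
    const current = spans.filter(span => span.isCurrent === 1);
    return [...new Set(current.map(span => span.vatID))];
  }

  return { deleteSomeSpans, exportableVats };
}

const spanStore = makeSpanStore([
  { vatID: 'v1', startPos: 0, isCurrent: null },
  { vatID: 'v1', startPos: 200, isCurrent: null },
  { vatID: 'v1', startPos: 400, isCurrent: 1 },
  { vatID: 'v2', startPos: 0, isCurrent: 1 },
]);

// The first budgeted call removes v1's isCurrent=1 span, so v1 drops out
// of exports immediately, even though two older spans remain.
const firstPass = spanStore.deleteSomeSpans('v1', 1);
const exportableAfterFirstPass = spanStore.exportableVats();
console.log(firstPass, exportableAfterFirstPass);
```

Because the newest span goes first, an export taken at any point during the drain sees either the whole transcript or none of it.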

The resulting data-deletion and export-size profiles, starting from the block where the vat is terminated, will look like:

| when | SQL contents | export contents |
| --- | --- | --- |
| vat terminated | (vats.terminated).push(vatID) | everything |
| draining kvStore | kvStore shrinks | kvStore shrinks |
| drained kvStore | kvStore empty | no kvStore for that vatID |
| draining heap snapshots | old/unpopulated snapstore rows deleted | old IAVL records removed but inUse=1 artifact remains |
| last heap snapshot | the only populated snapstore row deleted | the only snapshot artifact is removed |
| drained snapstore | no snapshot data | no snapshot artifacts or IAVL records |
| draining transcripts | | |
| latest span deleted | up-to-200 items deleted, one IAVL deletion | immediately stops including the whole transcript; export is now minimum size |
| N-1 span deleted | 200 items deleted, one IAVL deletion | no change |
| .. earliest span deleted | 200 items deleted, final IAVL deletion | no change |
| deletion complete | (vats.terminated).remove(vatID) | IAVL shadow of vats.terminated changed |

@warner warner assigned warner and mhofman and unassigned warner Apr 16, 2024
@warner warner force-pushed the warner/8928-terminate-vats-slowly branch from c3299e5 to a31549a Compare April 23, 2024 19:03
@warner warner force-pushed the warner/8980-boyd-scheduler branch from 967e458 to 402811a Compare April 23, 2024 19:03
Copy link
Member

@mhofman mhofman left a comment

Preliminary review of the first commit introducing budgeted deletion.

I think we should have a first commit changing the semantics of termination to set inUse/isCurrent to null for the active snapshot/span, and assert in the deletion function that there is no active snapshot/span before proceeding. Then a second commit can introduce an optional budgeted deletion, which I believe it should do in a consistent order (either old to new, or opposite, but not mix and match).

It would also avoid unnecessarily exporting snapshot/transcript span artifacts while their slow deletion is in progress (since the kv entries are processed first).

Comment on lines 392 to 397
// Unlike transcripts, here we delete the oldest snapshots first,
// to simplify the logic: we delete the only inUse=1 snapshot
// last, and then immediately delete the .current record, at which
// point we're done. This has a side-effect of keeping the unused
// snapshot in the export artifacts longer, but it doesn't seem
// worth fixing.
Member

I find it a little weird to do things in a different order. It also causes the transcript and snapshot metadata entries to be inconsistent with each other.

*
* @param {string} vatID
* @param {number} budget
* @returns {{ done: boolean, cleanups: number }}
Member

This kind of interface really feels like a generator.

Member Author

It does, but I didn't find a way to take advantage of that fact.

An actual function* generator wouldn't work, of course, because the process can be killed and a new process started while the deletion is going on, and a real generator would lose state when the application is rebooted.

And, I think changing the function signature to match that of a normal function* generator is only an improvement if the caller gets to use for..of syntax, but as long as the snapStore function is doing internal iteration (deleting more than one thing per call), the vatKeeper.js deleteSnapshotsAndTranscripts() caller is only going to call it once per block (per terminated vat), so there's no good place for a for..of loop. (the real loop is higher up, with one iteration per block).

To get one, we'd need to change snapStore's deleteSomeVatSnapshots into maybeDeleteOneVatSnapshot, to delete at most one per call, and then have vatKeeper's deleteSnapshotsAndTranscripts() use a for..of loop. We'd still need to return whether a cleanup was done or not, and have the caller accumulate them, so deleteSnapshotsAndTranscripts knows when to switch from snapshots to transcripts. maybeDeleteOneVatSnapshot would always make one DB query (with a LIMIT 1) to get which snapshot to delete, if any. Then it either returns, or does a second DB query to delete the one row, and a third to noteExport the deletion, making the cost 3 small DB queries until all the snapshots are gone, then 1 small DB query each block until all the transcripts are gone (since we always check for remaining snapshots on the way to checking for transcript spans).

That's compared to the current cost (with a budget of 5) of one moderate-sized query every time (using LIMIT 5, returning anywhere from 0 to 5 rows), followed by 0 to 5 noteExports, maybe followed by a single DELETE query removing 1 to 5 rows at once.

And we'd need snapStore to expose maybeDeleteOneVatSnapshot separately from deleteVatSnapshots (unlimited), so the latter could do queries without LIMIT constraints, and delete everything in one shot.

In general, it de-amortizes the DB queries, because to make use of the iterator, we have to move responsibility for doing more than one deletion (per block) up into vatKeeper, which then can't give a hint to swingstore about how many deletions are coming up, so it could query them in a batch.
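
A minimal sketch of the trade-off described above, assuming a hypothetical in-memory `durableSnapshots` map in place of the real snapStore tables. Each call does one batched deletion and reports `{ done, cleanups }`, and all progress lives in durable storage rather than in a suspended generator frame, so the outer per-block loop could resume in a freshly started process.

```javascript
// Hypothetical stand-in for durable storage: this Map models the only state
// that would survive a process restart.
const durableSnapshots = new Map([['v1', [10, 20, 30, 40, 50, 60, 70]]]);

// Budget-limited deletion, oldest snapshot first. Progress is recorded in
// durable storage, so nothing is lost if the process dies between calls.
function deleteSomeVatSnapshots(vatID, budget) {
  const remaining = durableSnapshots.get(vatID) || [];
  const deleted = remaining.splice(0, budget); // models one batched query
  return { done: remaining.length === 0, cleanups: deleted.length };
}

// One call per block. Each loop iteration models a separate block, which
// could just as well run after a restart.
function runBlocks(vatID, budgetPerBlock) {
  const perBlock = [];
  for (;;) {
    const { done, cleanups } = deleteSomeVatSnapshots(vatID, budgetPerBlock);
    perBlock.push(cleanups);
    if (done) return perBlock;
  }
}

const cleanupsPerBlock = runBlocks('v1', 5);
console.log(cleanupsPerBlock);
```

The caller never iterates over individual deletions; it only learns how much work each block did and whether anything is left.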

Comment on lines 452 to 454
// if you didn't set a budget, you won't be counting deletions
return { done: true, cleanups: 0 };
Member

we could fairly easily return deletions.length from deleteAllVatSnapshots to be consistent

Comment on lines 415 to 419
// if we reach here, the last sqlDeleteOneVatSnapshot() in that
// loop had deleted the inUse=1 snapshot and the corresponding
// snapshotMetadataKey, so now it is time to delete the .current
// record and inform the kernel that we're done
noteExport(currentSnapshotMetadataKey({ vatID }), undefined);
Member

So I'm wondering if:

  • vat termination should not set the inUse snapshot to null and remove the .current snapshot marker
  • assert when we call deleteAllVatSnapshots that there are no inUse snapshots for the vat

Comment on lines 356 to 357
// isCurrent=1 span first, which causes export to ignore the
// entire vat (good, since it's deleted)
Member

Can you clarify? That does not sound good. An export / import in the middle of a slow prune must reconstitute the partially deleted swing-store so that the slow deletion can continue in consensus on that restored node.

I think we need to be careful differentiating items and spans metadata here.

My understanding is that the isCurrent only impacts the artifacts yielded during export, and the completeness checks of items during import. Yielding no artifacts and skipping checks is indeed consistent and the right behavior, and since the metadata is always restored, the pruning behavior will be the same on restore.

That said, I am uneasy about relying on the deletion operation to affect the completeness checks. Imagine we switched things around and started deleting from the oldest span. The operational check would fail. I believe that vat termination should explicitly "close" the span (set isCurrent = null), and only allow deletion of transcripts for which there is no current span. In the future this could be modified to slowly delete transcripts of old incarnations by just adding a constraint on incarnation number on the queries.
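
The "close first, then delete" proposal could look roughly like the following. This is a sketch over an in-memory model, not the swing-store code: `makeTranscriptModel`, `addSpan`, and `deleteSomeSpans` are hypothetical helpers, and only the idea of clearing isCurrent at termination and asserting before deletion comes from the comment above.

```javascript
// Hypothetical in-memory model of per-vat transcript spans.
function makeTranscriptModel() {
  const spans = new Map(); // vatID -> [{ startPos, isCurrent }]

  function addSpan(vatID, startPos) {
    const list = spans.get(vatID) || [];
    for (const span of list) span.isCurrent = null; // only the newest is current
    list.push({ startPos, isCurrent: 1 });
    spans.set(vatID, list);
  }

  // Called at vat termination: explicitly "close" the current span.
  function stopUsingTranscript(vatID) {
    for (const span of spans.get(vatID) || []) span.isCurrent = null;
  }

  // Deletion refuses to run while any span is still current, so the order
  // of deletion no longer matters to export/import completeness checks.
  function deleteSomeSpans(vatID, budget) {
    const list = spans.get(vatID) || [];
    if (list.some(span => span.isCurrent === 1)) {
      throw Error(`vat ${vatID} still has a current span`);
    }
    const deleted = list.splice(0, budget); // oldest-first is now safe too
    return { done: list.length === 0, cleanups: deleted.length };
  }

  return { addSpan, stopUsingTranscript, deleteSomeSpans };
}

const transcripts = makeTranscriptModel();
transcripts.addSpan('v1', 0);
transcripts.addSpan('v1', 200);

let refusedWhileCurrent = false;
try {
  transcripts.deleteSomeSpans('v1', 1);
} catch (err) {
  refusedWhileCurrent = true; // deletion before termination is rejected
}

transcripts.stopUsingTranscript('v1');
const afterClose = transcripts.deleteSomeSpans('v1', 1);
console.log(refusedWhileCurrent, afterClose);
```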

Comment on lines 397 to 446
// no budget? no accounting.
return { done: true, cleanups: 0 };
Member

Same here, we could return deletions.length from deleteAllVatTranscripts

@warner
Copy link
Member Author

warner commented May 10, 2024

> I think we should have a first commit changing the semantics of termination to set inUse/isCurrent to null for the active snapshot/span, and assert in the deletion function that there is no active snapshot/span before proceeding. Then a second commit can introduce an optional budgeted deletion, which I believe it should do in a consistent order (either old to new, or opposite, but not mix and match).

Hm, in general I like it, now I'm trying to walk through how that would work.

Say we terminate a vat in block 1, then start deleting parts of it in block 2, and continue on through block 100. We delete the kvStore entries first (say blocks 2-40), then the snapshots (say blocks 41-50), then the transcript spans/items (say blocks 51-100).

I think you're aiming to have swingstore exports immediately stop including artifacts for the terminated vat as of block 1. No transcript span artifacts, no snapshot artifacts. The exports at that point continue to have export-data for everything. We start losing export-data for kvStore entries during 2-40, but an export at block 40 still has all the snapshot export-data (hashes), plus transcript span records. Then in 41-50 we start seeing fewer and fewer snapshot export-data records, and in 51-100 we start losing transcript span records, until by block 100 we see no export-data records for anything related to the now-fully-deleted vat.

We can't afford to delete all the export-data records during block 1, since they're all shadowed into IAVL, which we're protecting/rate-limiting just as much as SQLite. But we want getArtifactNames() to not include names of artifacts that could be produced (the rows are still present), but which are being suppressed because the vat was terminated.

transcriptStore.js getArtifactNames() does that already, with the initial sqlGetCurrentSpanMetadata query (which filters on isCurrent=1), in all modes except debug. snapStore.js behaves the same way, filtering on inUse=1.

Ok, so clearing the inUse/isCurrent flag, or deleting that one row, will suffice to prune the artifacts from exports immediately. But we need to make sure the importer won't think this is a broken import (missing artifact names that the export-data says should be present).

For the snapStore, assertComplete uses sqlListPrunedCurrentSnapshots, which only pays attention to inUse=1, so it won't complain. For transcriptStore, it uses sqlGetCurrentSpanMetadata, which likewise only looks at isCurrent=1.

So.. I think it would just work? We delete the inUse/isCurrent record when the vat is terminated, and we immediately stop observing that vat's heap/transcript-span artifact names or artifacts in the export. The importer would import them if they were present, but it won't complain if they are not.

Then, slowly, we delete the actual DB items, budget-limited, until they're all gone, at which point vatKeeper learns that there was nothing left to delete, and it deletes the record that says the vat was still being deleted. We still delete a transcriptSpans row and its matching group of 200-ish transcriptItems rows as an atomic unit, so the DB remains consistent, and we remove one export-data row for each span, so we eventually clear out the IAVL data.

@warner
Copy link
Member Author

warner commented May 10, 2024

Now let's see, should we delete the inUse/isCurrent entry, or should we set the flag to NULL and then delete the entry along with all the rest? I think it simplifies the deleteSome code if there are no special cases, so I'm inclined to delete the current entry at the time of vat termination, so the only thing left for the rate-limited API is to delete the old non-inUse/isCurrent rows.

I bet it would work to clear the flag too, but the IAVL .current export-data row would need to be deleted specially.

@mhofman
Copy link
Member

mhofman commented May 10, 2024

> So.. I think it would just work?

That was my conclusion as well.

> should we delete the inUse/isCurrent entry, or should we set the flag to NULL and then delete the entry along with all the rest?

I was thinking of setting it to NULL.

> I bet it would work to clear the flag too, but the IAVL .current export-data row would need to be deleted specially.

Yes, I think that's the only change needed (and on import making sure we don't choke if there is no .current)

@aj-agoric aj-agoric assigned warner and unassigned mhofman May 14, 2024
@warner
Copy link
Member Author

warner commented Jun 10, 2024

I was able to make this work by clearing the isCurrent/inUse flag at the time stopUsingTranscript / stopUsingLatestSpan is called. I had to do more export-data work than expected: clearing isCurrent is really transforming a "current" transcript span into a historical one, which means deleting one export-data key (.current) and adding another (.startPos for the last span, which is now closed and short). I added more tests of that code.
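
The export-data bookkeeping described here can be sketched as a key swap. The key layout below (`transcript.${vatID}.current` and `transcript.${vatID}.${startPos}`) and the record fields are assumptions made for illustration, and `closeCurrentSpan` is a hypothetical helper, not the swing-store function.

```javascript
// Assumed export-data key layout (illustrative only): one `.current` key for
// the open span, and one `.${startPos}` key for each closed span.
const currentKey = vatID => `transcript.${vatID}.current`;
const historicalKey = (vatID, startPos) => `transcript.${vatID}.${startPos}`;

// Turn the "current" span into a historical one: delete the `.current`
// export-data key and add a `.startPos` key for the now-closed span.
function closeCurrentSpan(exportData, vatID) {
  const record = exportData.get(currentKey(vatID));
  if (record === undefined) return false; // nothing to close
  exportData.delete(currentKey(vatID));
  exportData.set(historicalKey(vatID, record.startPos), record);
  return true;
}

const exportData = new Map([
  ['transcript.v1.0', { startPos: 0, endPos: 200 }],
  ['transcript.v1.current', { startPos: 200, endPos: 237 }],
]);
const closedSpan = closeCurrentSpan(exportData, 'v1');
console.log(closedSpan, [...exportData.keys()]);
```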

@warner warner force-pushed the warner/8980-boyd-scheduler branch from 402811a to 3200c52 Compare June 10, 2024 17:35
@warner warner force-pushed the warner/8928-terminate-vats-slowly branch from a31549a to 13ea1dc Compare June 10, 2024 17:35
@warner warner changed the base branch from warner/8980-boyd-scheduler to warner/8980-swingstore-deletion-budget June 10, 2024 17:37
@warner warner requested a review from mhofman June 10, 2024 17:43
@warner warner changed the title feat(swingset): allow slow termination of vats feat(swingset): allow slow deletion of terminated vats Jun 11, 2024
@warner warner force-pushed the warner/8928-terminate-vats-slowly branch from fa87abe to 4ad9689 Compare July 2, 2024 06:03
@warner warner force-pushed the warner/8980-swingstore-deletion-budget branch from 1ba5547 to afe1c29 Compare July 10, 2024 17:05
@warner warner force-pushed the warner/8928-terminate-vats-slowly branch from 4ad9689 to 0140249 Compare July 10, 2024 17:05
@warner
Copy link
Member Author

warner commented Jul 10, 2024

My most recent rebase+edit includes a fix for the way we call the runPolicy: previously the didCleanups() call omitted the actual number of cleanups, which prevented me from doing a real integration. The new version also includes a test of that invocation (which I just forgot about the first time through, else I would have caught this earlier).

@warner warner requested a review from gibson042 July 10, 2024 17:11
@warner warner force-pushed the warner/8980-swingstore-deletion-budget branch from afe1c29 to 49ba122 Compare July 11, 2024 04:38
@warner warner force-pushed the warner/8928-terminate-vats-slowly branch from 0140249 to 71ede77 Compare July 11, 2024 04:38
@warner warner force-pushed the warner/8980-swingstore-deletion-budget branch from 49ba122 to 77c46d1 Compare July 11, 2024 23:51
@warner warner force-pushed the warner/8928-terminate-vats-slowly branch from 71ede77 to 06ccc1f Compare July 11, 2024 23:51
Copy link
Member

@gibson042 gibson042 left a comment

Documentation comments; code review to follow.

packages/SwingSet/docs/run-policy.md (outdated, resolved)
packages/SwingSet/docs/run-policy.md (outdated, resolved)
packages/SwingSet/docs/run-policy.md (outdated, resolved)
packages/SwingSet/docs/run-policy.md (outdated, resolved)
packages/SwingSet/docs/run-policy.md (outdated, resolved)
packages/SwingSet/docs/run-policy.md (outdated, resolved)
packages/SwingSet/src/lib/runPolicies.js (resolved)
packages/SwingSet/docs/run-policy.md (outdated, resolved)
packages/SwingSet/docs/run-policy.md (outdated, resolved)
packages/SwingSet/src/lib/runPolicies.js (outdated, resolved)
@gibson042 gibson042 self-requested a review July 22, 2024 11:50
Copy link
Member

@gibson042 gibson042 left a comment

LGTM, with the usual code style suggestions. And thanks again for the testing.

packages/SwingSet/docs/run-policy.md (outdated, resolved)
packages/SwingSet/src/kernel/kernel.js (outdated, resolved)
packages/SwingSet/src/kernel/kernel.js (outdated, resolved)
packages/SwingSet/src/kernel/kernel.js (outdated, resolved)
Comment on lines 1814 to 1820
// false, or an object with optional .budget
if (allowCleanup) {
  assert.typeof(allowCleanup, 'object');
  if (allowCleanup.budget) {
    assert.typeof(allowCleanup.budget, 'number');
  }
}
Member

This seems like another opportunity for @endo/patterns.

Suggested change
- // false, or an object with optional .budget
- if (allowCleanup) {
-   assert.typeof(allowCleanup, 'object');
-   if (allowCleanup.budget) {
-     assert.typeof(allowCleanup.budget, 'number');
-   }
- }
+ mustMatch(harden(allowCleanup), allowCleanupShape);

with

const allowCleanupShape = M.or(
  // Prohibit cleanup.
  false,
  // Allow cleanup, optionally with a limiting budget.
  M.splitRecord({}, { budget: M.number() }, M.record()),
);

Member Author

applied, but do we have any benchmarks on how expensive the more general pattern-matching approach is?

Member

Here's an ad hoc measurement:

$ node esbench.mjs -h V8 -b 5 \
  -M @endo/init \
  -i 'import { M, mustMatch } from "@endo/patterns"; Object.assign(globalThis, { M, mustMatch });' \
  -s '
    const allowCleanupShape = M.or(
      false,
      M.splitRecord({}, { budget: M.number() }, M.record()),
    );
    const checkManual = allowCleanup => {
      if (allowCleanup) {
        assert.typeof(allowCleanup, "object");
        if (allowCleanup.budget) assert.typeof(allowCleanup.budget, "number");
      }
    };
    const makeObj = () => harden({ budget: 42 });
  ' \
  'checkManual(false)' 'checkManual(makeObj())' \
  'mustMatch(false, allowCleanupShape)' 'mustMatch(makeObj(), allowCleanupShape)'
#### V8
checkManual(false) (0) 19161.615353858455 ops/ms after 195 491520-count samples
checkManual(makeObj()) (0) 610.7172827172827 ops/ms after 199 15360-count samples
mustMatch(false, allowCleanupShape) (0) 426.1556886227545 ops/ms after 278 7680-count samples
mustMatch(makeObj(), allowCleanupShape) (0) 17.63226192852865 ops/ms after 368 240-count samples

It's roughly 35x to 45x slower in these measurements, but I think not of significance anywhere we're writing to a database. And I'll be improving it over the next few months anyway: I can already triple the speed for false input and double it for object input with some tweaks to endo:

mustMatch(false, allowCleanupShape) (0) 1227.981345769487 ops/ms after 120 30720-count samples
mustMatch(makeObj(), allowCleanupShape) (0) 38.56 ops/ms after 482 240-count samples

Member Author

Thanks for checking! So, converting into actually-comparable units, and focusing on the most-common "false" case, and ignoring the effects of the extra harden() call:

  • manual: 52ns / op
  • allowCleanupShape: 2.3us / op

I know I can't legitimately complain unless I also measure the DB speed and show that it's not the limiting factor, and I'm not prepared to do that now, but I really worry about how these (admittedly tidy) helpers mask significant runtime.

I don't know where your instincts suggest that DB interaction is slow, but just in case it makes a difference, note that we only commit once per block, so most of the DB writes made by the kernel will just get parked in RAM for a while. And I expect that SQLite will be smart about serving a lot of reads from a cache.

packages/SwingSet/src/kernel/state/kernelKeeper.js Outdated Show resolved Hide resolved
packages/SwingSet/src/kernel/state/kernelKeeper.js Outdated Show resolved Hide resolved
packages/SwingSet/src/kernel/state/vatKeeper.js Outdated Show resolved Hide resolved
@warner
Copy link
Member Author

warner commented Aug 10, 2024

I've been trying to think of a better runPolicy budget-setting API. I'll apply it in a different PR, before I do the cosmic-swingset integration. Here's the plan:

  • if runPolicy?.allowCleanup() is falsy, the kernel is allowed to do unlimited cleanup
  • otherwise, it should return a budget, with a shape like { exports: 5, imports: 5, kv: 50, snapshots: 5, transcripts: 5 }, which provides separate budgets for each phase
  • cleanupAfterTerminatedVat will exit after any category is exhausted. And because it processes each phase separately, the worst-case budget overrun would be if we have 4 exports to clean up, 4 imports, 49 kv, etc, in which case it would do all the work in a single crank, but the total would still be reasonably small
  • runPolicy?.didCleanup(counts) is given an object with { exports, imports, etc }, and maybe also a .total
    • if it returns false, then controller.run() exits, as before

Host-apps that are ok with complete+immediate cleanup can either omit .allowCleanup or have it return false.

Host-apps that want to allow a little bit of cleanup in each controller.run() can configure a policy where allowCleanup() returns a budget up until the point that didCleanup() is called, which should flip a switch to make allowCleanup() return false until the next run. Their didCleanup() should return true, to allow the GC/BOYD consequences of cleanup to execute promptly.
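
A rough sketch of that "one batch per run" policy, assuming (as in this paragraph) that a `false` answer from `allowCleanup()` stops further cleanup for the rest of the run. The factory name `makeOneBatchCleanupPolicy` is hypothetical, the budget shape is the one proposed in the plan above rather than a settled API, and a real runPolicy would need its other methods as well.

```javascript
// Sketch of a policy that permits one batch of cleanup per controller.run().
// Only the cleanup-related methods are shown.
function makeOneBatchCleanupPolicy(budget) {
  let cleanedThisRun = false;
  return {
    // Offer a budget until cleanup has happened once, then refuse.
    allowCleanup: () => (cleanedThisRun ? false : { ...budget }),
    // Note that cleanup happened, and return true so the run continues and
    // the GC/BOYD consequences of the cleanup can execute promptly.
    didCleanup: _counts => {
      cleanedThisRun = true;
      return true;
    },
  };
}

const policy = makeOneBatchCleanupPolicy({
  exports: 5,
  imports: 5,
  kv: 50,
  snapshots: 5,
  transcripts: 5,
});

const beforeCleanup = policy.allowCleanup();
const keepRunning = policy.didCleanup({ exports: 5, imports: 0, kv: 0 });
const afterCleanup = policy.allowCleanup();
console.log(beforeCleanup, keepRunning, afterCleanup);
```

In this sketch the host would build a fresh policy object for each `controller.run()`, which is what resets the switch for the next run.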

Host-apps (like ours) which want to only perform cleanup during empty blocks should do their initial runs with allowCleanup: () => false, and measure the crankComplete calls to sense whether work was done or not. Then, if they decide that no other work was done and cleanup is allowed, they should do another run with a separate policy that allows cleanup, and whose didCleanup() returns true, for the same reasons as above.

Some concerns:

  • if we add a new phase in the future, how should we handle backwards compatibility with the allowCleanup budgets that omit that property? should we include .default or something that fills in the missing values? should we start with that, and have { default: 5, kv: 50 } to express our initial budget?
  • similar, but with fewer consequences, if didCleanup() starts passing more properties than before, will it be ok if the older host-app runPolicy ignores those other properties? passing a .total might help future-proof it, but it raises questions about what the host-app is going to do with those numbers anyways.
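
One way the `.default` idea could work is sketched below. This is an illustration of the open question, not a decided design: `resolveBudget` and the `PHASES` list are hypothetical, and the phase names are taken from the example budget in the plan above.

```javascript
// Phase names from the example budget; a future kernel might add more.
const PHASES = ['exports', 'imports', 'kv', 'snapshots', 'transcripts'];

// Fill in every phase the kernel knows about: an explicit per-phase value
// wins, otherwise the host's `.default` applies.
function resolveBudget(budget, phases = PHASES) {
  const resolved = {};
  for (const phase of phases) {
    resolved[phase] =
      budget[phase] !== undefined ? budget[phase] : budget.default;
  }
  return resolved;
}

// A host written today...
const hostBudget = { default: 5, kv: 50 };
const today = resolveBudget(hostBudget);

// ...still yields a usable value if a later kernel adds a phase.
const later = resolveBudget(hostBudget, [...PHASES, 'someFuturePhase']);
console.log(today, later.someFuturePhase);
```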

cc @gibson042 @mhofman

@warner warner force-pushed the warner/8980-swingstore-deletion-budget branch from c862b45 to 227dafa Compare August 10, 2024 23:57
@warner warner force-pushed the warner/8928-terminate-vats-slowly branch 2 times, most recently from 5c8afcc to 47fe6e9 Compare August 11, 2024 16:38
@warner warner force-pushed the warner/8980-swingstore-deletion-budget branch from 227dafa to 486dfbb Compare August 11, 2024 16:38
@warner warner force-pushed the warner/8928-terminate-vats-slowly branch 3 times, most recently from fc2718e to 3648e55 Compare August 12, 2024 22:01
@warner warner force-pushed the warner/8980-swingstore-deletion-budget branch from 486dfbb to f6787e8 Compare August 12, 2024 22:01
@warner warner force-pushed the warner/8928-terminate-vats-slowly branch from 3648e55 to 1dcb1d2 Compare August 12, 2024 22:54
@warner warner force-pushed the warner/8980-swingstore-deletion-budget branch from f6787e8 to c43bf63 Compare August 12, 2024 22:54
Base automatically changed from warner/8980-swingstore-deletion-budget to master August 12, 2024 23:50
This introduces new `runPolicy()` controls which enable "slow
termination" of vats. When configured, terminated vats are immediately
dead (all promises are rejected, all new messages go splat, they never
run again), but the vat's state is deleted slowly, one piece at a
time. This makes it safe to terminate large vats, with a long history,
lots of c-list imports/exports, or large vatstore tables, without fear
of causing an overload (by e.g. dropping 100k references all in a
single crank).

See docs/run-policy.md for details and configuration instructions.

The kernelKeeper is upgraded from v1 to v2, to add a new
'vats.terminated' key, which tracks the vats that have been terminated
but not yet completely deleted. NOTE: deployed applications must use
`upgradeSwingset()` when using this kernel version for the first time.

Also refactor vatKeeper.deleteSnapshotsAndTranscripts() into two
separate methods, to fix a bug that hid in the combination: if the
snapshot deletion phase exhausted our budget, we'd call
deleteVatTranscripts() with a budget of 0, which was interpreted as
"unlimited", and deleted all the transcript spans in a single burst.

refs #8928

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>
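
The budget-of-0 pitfall mentioned in the commit message can be reproduced in a few lines. This sketch only illustrates the failure mode; the function names are hypothetical, and the explicit `undefined` check shown here is one possible guard, whereas the commit itself describes fixing the bug by splitting the combined method in two.

```javascript
// Buggy convention: any falsy budget (including 0) means "unlimited".
function deleteWithFalsyCheck(items, budget) {
  const count = budget ? Math.min(budget, items.length) : items.length;
  return items.splice(0, count).length;
}

// Stricter convention: only `undefined` means "unlimited"; 0 deletes nothing.
function deleteWithExplicitCheck(items, budget) {
  const count =
    budget === undefined ? items.length : Math.min(budget, items.length);
  return items.splice(0, count).length;
}

// Two phases sharing one budget: snapshots first, then transcript spans
// with whatever budget is left over (which may be exactly 0).
function cleanup(deleteFn, snapshots, spans, budget) {
  const snapshotCleanups = deleteFn(snapshots, budget);
  const spanCleanups = deleteFn(spans, budget - snapshotCleanups);
  return { snapshotCleanups, spanCleanups };
}

const makeItems = n => Array.from({ length: n }, (_, i) => i);

// With the falsy check, an exhausted budget deletes every span in one burst.
const buggy = cleanup(deleteWithFalsyCheck, makeItems(5), makeItems(100), 5);
// With the explicit check, the exhausted budget deletes no spans.
const guarded = cleanup(deleteWithExplicitCheck, makeItems(5), makeItems(100), 5);
console.log(buggy, guarded);
```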
@warner warner force-pushed the warner/8928-terminate-vats-slowly branch from 1dcb1d2 to 9ac2ef0 Compare August 12, 2024 23:51
@warner warner added the automerge:rebase Automatically rebase updates, then merge label Aug 12, 2024
@mergify mergify bot merged commit 1f816f9 into master Aug 13, 2024
90 checks passed
@mergify mergify bot deleted the warner/8928-terminate-vats-slowly branch August 13, 2024 00:48