-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
snapshot / BOYD interval based on computrons #6786
Comments
As requested by @dckc, here is a graph and histogram of the execution time between snapshots as captured by the replay tool for all vats on mainnet. |
In conjunction with this, we should implement @mhofman 's earlier suggestion that we use the transcript to record the act of writing a snapshot, since they force a GC pass, which changes the state of the worker. If we tied BOYD and snapshotting together (always doing them at the same time), then the transcript will look like:
The Thinking about the improved scheduler and a per-vat input queue, what we'd really do is push these two deliveries onto the front of the vat-input queue, and then let them be delivered in their own time (so not necessarily right away). They'd happen before anything else on that vat's input queue, but we could allow other vats to get time before returning attention to the one that needs a snapshot written. If the snapshot is called for because this particular vat just did a whole lot of work, then maybe we don't want to spend even more time on this vat right away. One tricky bit is to make sure we don't accidentally re-schedule a BOYD or snapshot because of the computrons used by the BOYD or snapshot.. we need some per-vat state to remember that we pushed the pair onto the input queue already. We need some durable per-vat state anyways, to remember the cumulative computron count, since the kernel might be rebooted, e.g. just before the BOYD/snapshot was to have been scheduled. Those counters cannot be only in RAM. The resulting transcript "spans" (#6702/#6775) will always end with a BOYD and a If we incorporate vat-upgrade into this transcript-tracked scheme, then the last span of an incarnation will end with a BOYD and an |
In the future improved vat scheduler / parallelization, I think makeSnapshot / BOYD should be driven entirely (but still deterministically) by the vat CDP itself. A logical way of thinking about the kernel -> vat input is that it focuses on direct actions such as message send, promise resolve, retire export, upgrade/exit. The vat -> kernel output is a result of execution derived from these input, such as message send, promise resolve and subscribe, but also drop import, etc. The fact that snapshots are made by the CDP is an implementation detail that shouldn't cross this boundary, asides from the abstract artifacts that the CDP produces for state-sync. Similarly, I consider BOYD an internal concern of the CDP in how it find its garbage. It is of course observable in some output of the CDP, which is why it needs to be deterministic. It raises the question of how the result of BYOD should be included in the vat output queue? A way to approach it would be to simply roll the results of BYOD into the output of the delivery that triggered the BYOD. The main concern is the apparent disconnection of this output from the input. The parametrization is interesting. This is IMO something we want to control by a governance parameter, so once the kernel is updated it needs to inform the vat CDP of the new parameter. That new parameter should be processed in the same input queue since the vat may or may not have processed earlier units of work. That means vat parameter changes are like upgrade/exit commands. Similarly the computron storage should be something kept logically inside the vat CDP. |
Thinking more about this, I hope we do not count BYOD meter usage against the vat? BYOD is intrinsically a gc related operation which should be exempt from metering. We can "charge" the vat through other means, like the amount of "time" each object is retained, which would be higher if a vat produces a lot of garbage and thus a good proxy for the expense of running BYOD. |
This would be nice to have, but not strictly needed for the vaults/bulldozer release, and we think we can add it later without causing too many problems. We should definitely include computrons in the consensus state, since this will make us even more sensitive to metering differences. |
We should also consider telling the |
A further tweak on this would be to let the host schedule reaping of eligible vats. Performing a snapshot in the middle of a block is potentially disruptive to the block's execution. It also raises the possibility of a snapshot being taken twice for the same vat in a block, which is useless. BOYD in the middle of an execution is also not the optimal time, and it'd be better to perform reaping when quiescent. If snapshots are performed right after reaping, they would also be smaller. The host could schedule a reaping (which also triggers a snapshot) after the kernel has run to completion on an inbound queue items for a block. To deal with maxed out blocks, the host could gain knowledge about the need of reaping for vats, and pre-allocate "compute time" in the run policy to perform them at the end. Edit: the layer is the tricky bit here: vat gc is a swingset concern that may not make sense to expose to the host, even though I'd argue it should be accounted for in the run policy, and we cannot rely on execution computrons to measure its cost (similar to how the cost of taking heap snapshots should be accounted for). However on the other side, block boundaries and time passing is a concern of the host, which may not make sense to expose to swingset directly. |
We might consider an additional limit based on the concatenated size of the transcript span, to avoid problems like #8325 where the swingstore export artifact (the transcript span, expressed as newline-concatenated transcript entries) was larger than the cosmos-sdk state-sync import size limit. That limit was originally 64MB, but #8325 raises it to a new value. If we want to avoid the possibility of running up against the new value in the future, we could have swingset force a new span if the total span size was getting close. @mhofman observed only a single large span, v1-bootstrap, and it was only about 65MB, just barely over the limit. It was that big because of the large amount of data we provided in the bootstrap() " |
`dispatch.bringOutYourDead()`, aka "reap", triggers garbage collection inside a vat, and gives it a chance to drop imported c-list vrefs that are no longer referenced by anything inside the vat. Previously, each vat has a configurable parameter named `reapInterval`, which defaults to a kernel-wide `defaultReapInterval` (but can be set separately for each vat). This defaults to 1, mainly for unit testing, but real applications set it to something like 200. This caused BOYD to happen once every 200 deliveries, plus an extra BOYD just before we save an XS heap-state snapshot. This commit switches to a "dirt"-based BOYD scheduler, wherein we consider the vat to get more and more dirty as it does work, and eventually it reaches a `reapDirtThreshold` that triggers the BOYD (which resets the dirt counter). We continue to track `dirt.deliveries` as before, with the same defaults. But we add a new `dirt.gcKrefs` counter, which is incremented by the krefs we submit to the vat in GC deliveries. For example, calling `dispatch.dropImports([kref1, kref2])` would increase `dirt.gcKrefs` by two. The `reapDirtThreshold.gcKrefs` limit defaults to 20. For normal use patterns, this will trigger a BOYD after ten krefs have been dropped and retired. We choose this value to allow the #8928 slow vat termination process to trigger BOYD frequently enough to keep the BOYD cranks small: since these will be happening constantly (in the "background"), we don't want them to take more than 500ms or so. Given the current size of the large vats that #8928 seeks to terminate, 10 krefs seems like a reasonable limit. And of course we don't want to perform too many BOYDs, so `gcKrefs: 20` is about the smallest threshold we'd want to use. External APIs continue to accept `reapInterval`, and now also accept `reapGCKrefs`. * kernel config record * takes `config.defaultReapInterval` and `defaultReapGCKrefs` * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs` * `controller.changeKernelOptions()` still takes `defaultReapInterval` but now also accepts `defaultReapGCKrefs` The APIs available to userspace code (through `vatAdminSvc`) are unchanged (partially due to upgrade/backwards-compatibility limitations), and continue to only support setting `reapInterval`. Internally, this just modifies `reapDirtThreshold.deliveries`. * `E(vatAdminSvc).createVat(bcap, { reapInterval })` * `E(adminNode).upgrade(bcap, { reapInterval })` * `E(adminNode).changeOptions({ reapInterval })` Internally, the kernel-wide state records `defaultReapDirtThreshold` instead of `defaultReapInterval`, and each vat records `.reapDirtThreshold` in their `vNN.options` key instead of `vNN.reapInterval`. The current dirt level is recorded in `vNN.reapDirt`. The kernel will automatically upgrade both the kernel-wide and the per-vat state upon the first reboot with the new kernel code. The old `reapCountdown` value is used to initialize the vat's `reapDirt.deliveries` counter, so the upgrade shouldn't disrupt the existing schedule. Vats which used `reapInterval = 'never'` (eg comms) will get a `reapDirtThreshold` of all 'never' values, so they continue to inhibit BOYD. Otherwise, all vats get a `threshold.gcKrefs` of 20. We do not track dirt when the corresponding threshold is 'never', to avoid incrementing the comms dirt counters forever. This design leaves room for adding `.computrons` to the dirt record, as well as tracking a separate `snapshotDirt` counter (to trigger XS heap snapshots, ala #6786). We add `reapDirtThreshold.computrons`, but do not yet expose an API to set it. Future work includes: * upgrade vat-vat-admin to let userspace set `reapDirtThreshold` New tests were added to exercise the upgrade process, and other tests were updated to match the new internal initialization pattern. We now reset the dirt counter upon any BOYD, so this also happens to help with #8665 (doing a `reapAllVats()` resets the delivery counters, so future BOYDs will be delayed, which is what we want). But we should still change `controller.reapAllVats()` to avoid BOYDs on vats which haven't received any deliveries. closes #8980
`dispatch.bringOutYourDead()`, aka "reap", triggers garbage collection inside a vat, and gives it a chance to drop imported c-list vrefs that are no longer referenced by anything inside the vat. Previously, each vat has a configurable parameter named `reapInterval`, which defaults to a kernel-wide `defaultReapInterval` (but can be set separately for each vat). This defaults to 1, mainly for unit testing, but real applications set it to something like 200. This caused BOYD to happen once every 200 deliveries, plus an extra BOYD just before we save an XS heap-state snapshot. This commit switches to a "dirt"-based BOYD scheduler, wherein we consider the vat to get more and more dirty as it does work, and eventually it reaches a `reapDirtThreshold` that triggers the BOYD (which resets the dirt counter). We continue to track `dirt.deliveries` as before, with the same defaults. But we add a new `dirt.gcKrefs` counter, which is incremented by the krefs we submit to the vat in GC deliveries. For example, calling `dispatch.dropImports([kref1, kref2])` would increase `dirt.gcKrefs` by two. The `reapDirtThreshold.gcKrefs` limit defaults to 20. For normal use patterns, this will trigger a BOYD after ten krefs have been dropped and retired. We choose this value to allow the #8928 slow vat termination process to trigger BOYD frequently enough to keep the BOYD cranks small: since these will be happening constantly (in the "background"), we don't want them to take more than 500ms or so. Given the current size of the large vats that #8928 seeks to terminate, 10 krefs seems like a reasonable limit. And of course we don't want to perform too many BOYDs, so `gcKrefs: 20` is about the smallest threshold we'd want to use. External APIs continue to accept `reapInterval`, and now also accept `reapGCKrefs`. * kernel config record * takes `config.defaultReapInterval` and `defaultReapGCKrefs` * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs` * `controller.changeKernelOptions()` still takes `defaultReapInterval` but now also accepts `defaultReapGCKrefs` The APIs available to userspace code (through `vatAdminSvc`) are unchanged (partially due to upgrade/backwards-compatibility limitations), and continue to only support setting `reapInterval`. Internally, this just modifies `reapDirtThreshold.deliveries`. * `E(vatAdminSvc).createVat(bcap, { reapInterval })` * `E(adminNode).upgrade(bcap, { reapInterval })` * `E(adminNode).changeOptions({ reapInterval })` Internally, the kernel-wide state records `defaultReapDirtThreshold` instead of `defaultReapInterval`, and each vat records `.reapDirtThreshold` in their `vNN.options` key instead of `vNN.reapInterval`. The current dirt level is recorded in `vNN.reapDirt`. The kernel will automatically upgrade both the kernel-wide and the per-vat state upon the first reboot with the new kernel code. The old `reapCountdown` value is used to initialize the vat's `reapDirt.deliveries` counter, so the upgrade shouldn't disrupt the existing schedule. Vats which used `reapInterval = 'never'` (eg comms) will get a `reapDirtThreshold` of all 'never' values, so they continue to inhibit BOYD. Otherwise, all vats get a `threshold.gcKrefs` of 20. We do not track dirt when the corresponding threshold is 'never', to avoid incrementing the comms dirt counters forever. This design leaves room for adding `.computrons` to the dirt record, as well as tracking a separate `snapshotDirt` counter (to trigger XS heap snapshots, ala #6786). We add `reapDirtThreshold.computrons`, but do not yet expose an API to set it. Future work includes: * upgrade vat-vat-admin to let userspace set `reapDirtThreshold` New tests were added to exercise the upgrade process, and other tests were updated to match the new internal initialization pattern. We now reset the dirt counter upon any BOYD, so this also happens to help with #8665 (doing a `reapAllVats()` resets the delivery counters, so future BOYDs will be delayed, which is what we want). But we should still change `controller.reapAllVats()` to avoid BOYDs on vats which haven't received any deliveries. closes #8980
NOTE: deployed kernels require a new `upgradeSwingset()` call upon (at least) first boot after upgrading to this version of the kernel code. See below for details. `dispatch.bringOutYourDead()`, aka "reap", triggers garbage collection inside a vat, and gives it a chance to drop imported c-list vrefs that are no longer referenced by anything inside the vat. Previously, each vat has a configurable parameter named `reapInterval`, which defaults to a kernel-wide `defaultReapInterval` (but can be set separately for each vat). This defaults to 1, mainly for unit testing, but real applications set it to something like 1000. This caused BOYD to happen once every 1000 deliveries, plus an extra BOYD just before we save an XS heap-state snapshot. This commit switches to a "dirt"-based BOYD scheduler, wherein we consider the vat to get more and more dirty as it does work, and eventually it reaches a `reapDirtThreshold` that triggers the BOYD (which resets the dirt counter). We continue to track `dirt.deliveries` as before, with the same defaults. But we add a new `dirt.gcKrefs` counter, which is incremented by the krefs we submit to the vat in GC deliveries. For example, calling `dispatch.dropImports([kref1, kref2])` would increase `dirt.gcKrefs` by two. The `reapDirtThreshold.gcKrefs` limit defaults to 20. For normal use patterns, this will trigger a BOYD after ten krefs have been dropped and retired. We choose this value to allow the #8928 slow vat termination process to trigger BOYD frequently enough to keep the BOYD cranks small: since these will be happening constantly (in the "background"), we don't want them to take more than 500ms or so. Given the current size of the large vats that #8928 seeks to terminate, 10 krefs seems like a reasonable limit. And of course we don't want to perform too many BOYDs, so `gcKrefs: 20` is about the smallest threshold we'd want to use. External APIs continue to accept `reapInterval`, and now also accept `reapGCKrefs`, and `neverReap` (a boolean which inhibits all BOYD, even new forms of dirt added in the future). * kernel config record * takes `config.defaultReapInterval` and `defaultReapGCKrefs` * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs` and `.neverReap` * `controller.changeKernelOptions()` still takes `defaultReapInterval` but now also accepts `defaultReapGCKrefs` The APIs available to userspace code (through `vatAdminSvc`) are unchanged (partially due to upgrade/backwards-compatibility limitations), and continue to only support setting `reapInterval`. Internally, this just modifies `reapDirtThreshold.deliveries`. * `E(vatAdminSvc).createVat(bcap, { reapInterval })` * `E(adminNode).upgrade(bcap, { reapInterval })` * `E(adminNode).changeOptions({ reapInterval })` Internally, the kernel-wide state records `defaultReapDirtThreshold` instead of `defaultReapInterval`, and each vat records `.reapDirtThreshold` in their `vNN.options` key instead of `vNN.reapInterval`. The vat-level records override the kernel-wide values. The current dirt level is recorded in `vNN.reapDirt`. NOTE: deployed kernels require explicit state upgrade, with: ```js import { upgradeSwingset } from '@agoric/swingset-vat'; .. upgradeSwingset(kernelStorage); ``` This must be called after upgrading to the new kernel code/release, and before calling `buildVatController()`. It is safe to call on every reboot (it will only modify the swingstore when the kernel version has changed). If changes are made, the host application is responsible for commiting them, as well as recording any export-data updates (if the host configured the swingstore with an export-data callback). During this upgrade, the old `reapCountdown` value is used to initialize the vat's `reapDirt.deliveries` counter, so the upgrade shouldn't disrupt the existing schedule. Vats which used `reapInterval = 'never'` (eg comms) will get a `reapDirtThreshold.never = true`, so they continue to inhibit BOYD. Any per-vat settings that match the kernel-wide settings are removed, allowing the kernel values to take precedence (as well as changes to the kernel-wide values; i.e. the per-vat settings are not sticky). We do not track dirt when the corresponding threshold is 'never', or if `neverReap` is true, to avoid incrementing the comms dirt counters forever. This design leaves room for adding `.computrons` to the dirt record, as well as tracking a separate `snapshotDirt` counter (to trigger XS heap snapshots, ala #6786). We add `reapDirtThreshold.computrons`, but do not yet expose an API to set it. Future work includes: * upgrade vat-vat-admin to let userspace set `reapDirtThreshold` New tests were added to exercise the upgrade process, and other tests were updated to match the new internal initialization pattern. We now reset the dirt counter upon any BOYD, so this also happens to help with #8665 (doing a `reapAllVats()` resets the delivery counters, so future BOYDs will be delayed, which is what we want). But we should still change `controller.reapAllVats()` to avoid BOYDs on vats which haven't received any deliveries. closes #8980
fix(swingset): use "dirt" to schedule vat reap/bringOutYourDead NOTE: deployed kernels require a new `upgradeSwingset()` call upon (at least) first boot after upgrading to this version of the kernel code. `dispatch.bringOutYourDead()`, aka "reap", triggers garbage collection inside a vat, and gives it a chance to drop imported c-list vrefs that are no longer referenced by anything inside the vat. Previously, each vat has a configurable parameter named `reapInterval`, which defaults to a kernel-wide `defaultReapInterval` (but can be set separately for each vat). This defaults to 1, mainly for unit testing, but real applications set it to something like 1000. This caused BOYD to happen once every 1000 deliveries, plus an extra BOYD just before we save an XS heap-state snapshot. This commit switches to a "dirt"-based BOYD scheduler, wherein we consider the vat to get more and more dirty as it does work, and eventually it reaches a `reapDirtThreshold` that triggers the BOYD (which resets the dirt counter). We continue to track `dirt.deliveries` as before, with the same defaults. But we add a new `dirt.gcKrefs` counter, which is incremented by the krefs we submit to the vat in GC deliveries. For example, calling `dispatch.dropImports([kref1, kref2])` would increase `dirt.gcKrefs` by two. The `reapDirtThreshold.gcKrefs` limit defaults to 20. For normal use patterns, this will trigger a BOYD after ten krefs have been dropped and retired. We choose this value to allow the #8928 slow vat termination process to trigger BOYD frequently enough to keep the BOYD cranks small: since these will be happening constantly (in the "background"), we don't want them to take more than 500ms or so. Given the current size of the large vats that #8928 seeks to terminate, 10 krefs seems like a reasonable limit. And of course we don't want to perform too many BOYDs, so `gcKrefs: 20` is about the smallest threshold we'd want to use. External APIs continue to accept `reapInterval`, and now also accept `reapGCKrefs`, and `neverReap` (a boolean which inhibits all BOYD, even new forms of dirt added in the future). * kernel config record * takes `config.defaultReapInterval` and `defaultReapGCKrefs` * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs` and `.neverReap` * `controller.changeKernelOptions()` still takes `defaultReapInterval` but now also accepts `defaultReapGCKrefs` The APIs available to userspace code (through `vatAdminSvc`) are unchanged (partially due to upgrade/backwards-compatibility limitations), and continue to only support setting `reapInterval`. Internally, this just modifies `reapDirtThreshold.deliveries`. * `E(vatAdminSvc).createVat(bcap, { reapInterval })` * `E(adminNode).upgrade(bcap, { reapInterval })` * `E(adminNode).changeOptions({ reapInterval })` Internally, the kernel-wide state records `defaultReapDirtThreshold` instead of `defaultReapInterval`, and each vat records `.reapDirtThreshold` in their `vNN.options` key instead of `vNN.reapInterval`. The vat-level records override the kernel-wide values. The current dirt level is recorded in `vNN.reapDirt`. NOTE: deployed kernels require explicit state upgrade, with: ```js import { upgradeSwingset } from '@agoric/swingset-vat'; .. upgradeSwingset(kernelStorage); ``` This must be called after upgrading to the new kernel code/release, and before calling `buildVatController()`. It is safe to call on every reboot (it will only modify the swingstore when the kernel version has changed). If changes are made, the host application is responsible for commiting them, as well as recording any export-data updates (if the host configured the swingstore with an export-data callback). During this upgrade, the old `reapCountdown` value is used to initialize the vat's `reapDirt.deliveries` counter, so the upgrade shouldn't disrupt the existing schedule. Vats which used `reapInterval = 'never'` (eg comms) will get a `reapDirtThreshold.never = true`, so they continue to inhibit BOYD. Any per-vat settings that match the kernel-wide settings are removed, allowing the kernel values to take precedence (as well as changes to the kernel-wide values; i.e. the per-vat settings are not sticky). We do not track dirt when the corresponding threshold is 'never', or if `neverReap` is true, to avoid incrementing the comms dirt counters forever. This design leaves room for adding `.computrons` to the dirt record, as well as tracking a separate `snapshotDirt` counter (to trigger XS heap snapshots, ala #6786). We add `reapDirtThreshold.computrons`, but do not yet expose an API to set it. Future work includes: * upgrade vat-vat-admin to let userspace set `reapDirtThreshold` New tests were added to exercise the upgrade process, and other tests were updated to match the new internal initialization pattern. We now reset the dirt counter upon any BOYD, so this also happens to help with #8665 (doing a `reapAllVats()` resets the delivery counters, so future BOYDs will be delayed, which is what we want). But we should still change `controller.reapAllVats()` to avoid BOYDs on vats which haven't received any deliveries. closes #8980
NOTE: deployed kernels require a new `upgradeSwingset()` call upon (at least) first boot after upgrading to this version of the kernel code. See below for details. `dispatch.bringOutYourDead()`, aka "reap", triggers garbage collection inside a vat, and gives it a chance to drop imported c-list vrefs that are no longer referenced by anything inside the vat. Previously, each vat has a configurable parameter named `reapInterval`, which defaults to a kernel-wide `defaultReapInterval` (but can be set separately for each vat). This defaults to 1, mainly for unit testing, but real applications set it to something like 1000. This caused BOYD to happen once every 1000 deliveries, plus an extra BOYD just before we save an XS heap-state snapshot. This commit switches to a "dirt"-based BOYD scheduler, wherein we consider the vat to get more and more dirty as it does work, and eventually it reaches a `reapDirtThreshold` that triggers the BOYD (which resets the dirt counter). We continue to track `dirt.deliveries` as before, with the same defaults. But we add a new `dirt.gcKrefs` counter, which is incremented by the krefs we submit to the vat in GC deliveries. For example, calling `dispatch.dropImports([kref1, kref2])` would increase `dirt.gcKrefs` by two. The `reapDirtThreshold.gcKrefs` limit defaults to 20. For normal use patterns, this will trigger a BOYD after ten krefs have been dropped and retired. We choose this value to allow the #8928 slow vat termination process to trigger BOYD frequently enough to keep the BOYD cranks small: since these will be happening constantly (in the "background"), we don't want them to take more than 500ms or so. Given the current size of the large vats that #8928 seeks to terminate, 10 krefs seems like a reasonable limit. And of course we don't want to perform too many BOYDs, so `gcKrefs: 20` is about the smallest threshold we'd want to use. External APIs continue to accept `reapInterval`, and now also accept `reapGCKrefs`, and `neverReap` (a boolean which inhibits all BOYD, even new forms of dirt added in the future). * kernel config record * takes `config.defaultReapInterval` and `defaultReapGCKrefs` * takes `vat.NAME.creationOptions.reapInterval` and `.reapGCKrefs` and `.neverReap` * `controller.changeKernelOptions()` still takes `defaultReapInterval` but now also accepts `defaultReapGCKrefs` The APIs available to userspace code (through `vatAdminSvc`) are unchanged (partially due to upgrade/backwards-compatibility limitations), and continue to only support setting `reapInterval`. Internally, this just modifies `reapDirtThreshold.deliveries`. * `E(vatAdminSvc).createVat(bcap, { reapInterval })` * `E(adminNode).upgrade(bcap, { reapInterval })` * `E(adminNode).changeOptions({ reapInterval })` Internally, the kernel-wide state records `defaultReapDirtThreshold` instead of `defaultReapInterval`, and each vat records `.reapDirtThreshold` in their `vNN.options` key instead of `vNN.reapInterval`. The vat-level records override the kernel-wide values. The current dirt level is recorded in `vNN.reapDirt`. NOTE: deployed kernels require explicit state upgrade, with: ```js import { upgradeSwingset } from '@agoric/swingset-vat'; .. upgradeSwingset(kernelStorage); ``` This must be called after upgrading to the new kernel code/release, and before calling `buildVatController()`. It is safe to call on every reboot (it will only modify the swingstore when the kernel version has changed). If changes are made, the host application is responsible for commiting them, as well as recording any export-data updates (if the host configured the swingstore with an export-data callback). During this upgrade, the old `reapCountdown` value is used to initialize the vat's `reapDirt.deliveries` counter, so the upgrade shouldn't disrupt the existing schedule. Vats which used `reapInterval = 'never'` (eg comms) will get a `reapDirtThreshold.never = true`, so they continue to inhibit BOYD. Any per-vat settings that match the kernel-wide settings are removed, allowing the kernel values to take precedence (as well as changes to the kernel-wide values; i.e. the per-vat settings are not sticky). We do not track dirt when the corresponding threshold is 'never', or if `neverReap` is true, to avoid incrementing the comms dirt counters forever. This design leaves room for adding `.computrons` to the dirt record, as well as tracking a separate `snapshotDirt` counter (to trigger XS heap snapshots, ala #6786). We add `reapDirtThreshold.computrons`, but do not yet expose an API to set it. Future work includes: * upgrade vat-vat-admin to let userspace set `reapDirtThreshold` New tests were added to exercise the upgrade process, and other tests were updated to match the new internal initialization pattern. We now reset the dirt counter upon any BOYD, so this also happens to help with #8665 (doing a `reapAllVats()` resets the delivery counters, so future BOYDs will be delayed, which is what we want). But we should still change `controller.reapAllVats()` to avoid BOYDs on vats which haven't received any deliveries. closes #8980
What is the Problem Being Solved?
The amount of time it takes to replay from snapshot varies widely based on the kind of work performed since the last snapshot. Similarly the amount of collectible objects depend on the amount of garbage created/released by the vat. These vary between vats, and can also vary over time for the same vat.
Currently our snapshot and BringOutYourDead interval is expressed in terms of number of deliveries to the vat. A closer approximation would be to use the amount of computrons spent by the vat, which is already tangentially in consensus through the run policy of the host, and is proposed to be more firmly in consensus through #6770.
We should also consider tying these operations together, aka perform a snapshot only immediately after bringOutYourDead so that we don't capture garbage in snapshots.
Description of the Design
An in consensus parameter would define the snapshot/bringOutYourDead interval in terms of computrons instead of deliveries.
Updating this parameter probably has interactions with parallel execution (#6447).
Security Considerations
None I can think of
Test Plan
Yes
The text was updated successfully, but these errors were encountered: