gc/snapshots/metering vs consensus #3830

warner · 2021-09-15T18:27:11Z

What is the Problem Being Solved?

We've gone back and forth about how much local variation we can accommodate and still maintain consensus in a distributed swingset (i.e. the chain). For various reasons, it would be nice if each validator could make local decisions about:

- the exact version of XS to use, to allow bugfixes or performance improvements to be deployed incrementally, instead of requiring a "flag day" (a simultaneous upgrade of all validators across the entire chain)
- when to "page out" a vat (i.e. kill the xsnap process): this saves memory at the expense of time spent loading the vat back in later, and different validators may have different amounts of memory
  - paging a vat out means it must some day be paged back in, at the latest by the time a message must be delivered to that vat, which gives each validator another cache-policy decision to make
- when to record a heap snapshot of any given vat (making the subsequent page-in faster, by removing/skipping transcript entries), at the cost of more disk IO
- when to perform/allow/force GC within a vat, which affects memory usage
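To make the "local decisions" above concrete, here is a minimal sketch of what a per-validator policy and a page-out cache could look like. Every name here (`LocalVatPolicy`, `VatCache`, the field names) is hypothetical and chosen for illustration; none of it is an existing SwingSet API. It assumes a simple least-recently-used rule for choosing which vat to page out.

```typescript
// Hypothetical per-validator settings. These are local choices and are
// meant to stay outside of consensus.
interface LocalVatPolicy {
  xsVersion: string; // which XS build this validator runs
  maxVatsInMemory: number; // page out vats beyond this count
  snapshotInterval: number; // deliveries between heap snapshots
}

// Tracks which vats are paged in, evicting the least recently used one
// when the local memory budget is exceeded.
class VatCache {
  private readonly policy: LocalVatPolicy;
  private online: string[] = []; // least recently used first

  constructor(policy: LocalVatPolicy) {
    this.policy = policy;
  }

  // Record a delivery to vatID (paging it in if needed) and return the
  // vat IDs that should now be paged out.
  touch(vatID: string): string[] {
    this.online = this.online.filter(id => id !== vatID);
    this.online.push(vatID);
    const evicted: string[] = [];
    while (this.online.length > this.policy.maxVatsInMemory) {
      evicted.push(this.online.shift() as string);
    }
    return evicted;
  }

  isOnline(vatID: string): boolean {
    return this.online.includes(vatID);
  }
}
```

Two validators with different `maxVatsInMemory` values would page vats in and out at different times. The point of this issue is that such differences must not leak into anything the validators vote on.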

Our primary requirement is that all validators in a consensus machine actually maintain consensus: they agree upon some pre-defined subset of their activity, and that subset is sufficient to capture the overall state that users care about (e.g. token balances, governance outcomes, etc). We can exclude minor things from that subset if they cannot cause variations in the major things.
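As a rough illustration of "agreeing on a pre-defined subset", the sketch below computes a digest over only the consensus-relevant part of a validator's state and ignores the local-only fields. The `ValidatorState` shape and function names are assumptions made up for this example, and the FNV-1a hash is a non-cryptographic stand-in for whatever hash the real chain uses.

```typescript
// Hypothetical state shape: `balances` is in consensus, the rest is local.
interface ValidatorState {
  balances: Record<string, number>;
  pagedOutVats: string[]; // local cache decision
  snapshotCount: number; // local disk-IO decision
}

// Serialize only the consensus subset, in a canonical (sorted) order so
// that key insertion order cannot cause divergence.
function consensusView(state: ValidatorState): string {
  return Object.keys(state.balances)
    .sort()
    .map(account => `${account}=${state.balances[account]}`)
    .join(';');
}

// 32-bit FNV-1a, used here only as a small self-contained stand-in for a
// real cryptographic hash.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

function consensusDigest(state: ValidatorState): number {
  return fnv1a(consensusView(state));
}
```

Two validators that differ only in `pagedOutVats` or `snapshotCount` produce the same digest, while any difference in `balances` changes the serialized view. The hard part, discussed next, is making sure the excluded fields can never influence the included ones.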

Metering makes this especially tricky, because so much of a vat's activity is subject to the CPU and memory meters. For example, if we want to exclude the UNREACHABLE-vs-COLLECTED-vs-FINALIZED state of an Object from consensus (allowing variation in the timing of GC), we must also exclude (from metering) the behavior of any code which is influenced by that state distinction. Like writing cryptographic code whose memory accesses and timing do not depend upon secret data, this requires tremendous care, as well as a deep understanding of how the underlying engine behaves, and is not generally covered by automated testing (making it fragile).

The basic decision tree I've figured out so far looks like this:

The green boxes/circles indicate choices that we've already made, or which are pretty obvious. We certainly must allow validators to restart the process. We know that taking an XS snapshot affects the GC behavior (it does a forced GC just before writing the snapshot). We're already using the deadSet and an "unmetered box" to conceal the consequences of GC.
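The deadSet / "unmetered box" idea can be sketched roughly as follows. This is a simplified simulation under stated assumptions, not the actual liveslots or xsnap code: the `Meter` class, `runUnmetered`, and `simulateDelivery` are hypothetical names, and finalizer timing is modeled by explicit lists rather than by a real garbage collector.

```typescript
// Hypothetical meter: charges are ignored while inside an unmetered box.
class Meter {
  used = 0;
  private suspended = 0;

  charge(units: number): void {
    if (this.suspended === 0) {
      this.used += units;
    }
  }

  runUnmetered<T>(fn: () => T): T {
    this.suspended += 1;
    try {
      return fn();
    } finally {
      this.suspended -= 1;
    }
  }
}

// Simulate one delivery. `early` are objects finalized mid-delivery,
// `late` are objects finalized by the forced GC at the end of the delivery.
function simulateDelivery(
  workUnits: number,
  early: string[],
  late: string[],
): { used: number; dropped: string[] } {
  const meter = new Meter();
  const deadSet = new Set<string>();
  // Finalizer callback: only records the vref, and does so unmetered.
  const onFinalize = (vref: string): void => {
    meter.runUnmetered(() => {
      meter.charge(1); // ignored: we are inside the unmetered box
      deadSet.add(vref);
    });
  };

  early.forEach(onFinalize); // GC happened to run during the delivery
  meter.charge(workUnits); // metered user-level work
  late.forEach(onFinalize); // forced GC at end of delivery

  // Process the deadSet in a deterministic (sorted) order, still unmetered.
  const dropped = meter.runUnmetered(() => Array.from(deadSet).sort());
  return { used: meter.used, dropped };
}
```

Under these assumptions, a validator whose engine finalized `o+2` early and one that finalized everything at the end both report the same meter usage and the same sorted list of dropped objects. This only holds if every finalizer has run by the end of the delivery, which is why the forced GC matters.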

We're uncertain whether the GC behavior of a reloaded (post-read) snapshot is identical to the original (post-write) process: this was previously not the case, because the "headroom" was reduced during the reload process, but recent changes to XS (in particular using mmap instead of malloc, and writing the size of the mmap-ed slab into the snapshot) may have changed this. We're uncertain whether finalizers can run spontaneously (and prefer not to rely upon the opposite). We don't know whether it's possible to use C hooks to disable CPU metering during finalization.

This ticket is to explain and explore the options we have. It's related to #1872 and #2615.

warner added the enhancement and SwingSet labels Sep 15, 2021

Tartuffo added the needs-design label Feb 2, 2022

mhofman mentioned this issue Apr 10, 2023

Better represent heap cost in run policy #7373

Open
