idea for giving each worker its own (local) SQLite vatStore DB #6254

Open
warner opened this issue Sep 17, 2022 · 17 comments
Labels
enhancement (New feature or request) · needs-design · performance (Performance related issues) · SwingSet (package: SwingSet)

Comments


warner commented Sep 17, 2022

What is the Problem Being Solved?

One of our apparent performance bottlenecks is the kernel/worker RPC overhead multiplied by the large number of vatStore syscalls triggered by extensive use of virtual objects. Each VO that gets paged in will cause a bunch of vatstoreGet calls to provide the data. This is compounded by the large number of virtual reference counts that must be tracked to correctly know when a virtual object can be released/deleted. Each time a Representative is referenced (e.g. VO foo.behavior() does state.bar = rep, which must serialize(state), which adds a virtual refcount from foo to rep; also e.g. if rep is added to a virtual collection), we must increment a refcount that lives in the DB, so we need a vatstoreGet() plus a vatstoreSet(). Each syscall requires a VatSyscallObject (e.g. ['vatstoreSet', 'vom.o+15/97', 'capdata..']) to be serialized, encoded into a netstring, and written through a pipe to the kernel process. The kernel must receive the string, decode the netstring, parse the VSO, translate it into a KernelSyscallObject (trivial for vatstore, unlike syscall.send which requires vref-to-kref conversion/allocation), then execute it (kvStore.set('v1.vs.vom.o+15/97', 'capdata')). Then the kernel builds a KernelSyscallResult, translates it into a VatSyscallResult, serializes it, encodes that into a netstring, and writes it to the pipe. The worker then reads from the pipe, decodes the netstring, parses the result (which is always ['ok']), and returns from vatstoreSet().

This whole vatstoreSet() process was measured to take about 730us (each) on the "ollinet" testnet, which uses Google Cloud hardware, and is considered to perform comparably to what typical validators might use. vatstoreGet took about 470us. This can consume considerable time during a delivery that creates/references/deletes a lot of virtual objects. Our current benchmark operation (a PSM trade) performed 439x vatstoreGet, 224x vatstoreSet, and 45x vatstoreDelete operations, summed across all vats. In contrast, it only did 46x syscall.sends and 49x syscall.resolves. If we could magically reduce the cost of all vatstore operations to zero, the 761ms we spent doing syscalls could be reduced to 337ms, a 45% improvement.

To support parallelism (some day), we carefully designed the vatstore to be local to each vat: no other vat is allowed to read from or write to it. This suggests another opportunity for performance improvement: give each worker a private SQLite database to back its vatStore, instead of using syscalls to access it. SQLite can hold everything in RAM (or somewhere appropriate) until commit, so I'd expect each vatstore operation to take less than a microsecond.
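
For a sense of what that would look like, here is a minimal sketch of a worker-local vatstore backed by better-sqlite3; the table name, key names, and file path are illustrative, not the real schema:

```js
// Sketch only: a worker-local vatstore backed by better-sqlite3.
// Table/column names and the file path are hypothetical, not the real schema.
import Database from 'better-sqlite3';

const db = new Database('vat-v1-vatstore.sqlite');
db.pragma('journal_mode = WAL');
db.exec('CREATE TABLE IF NOT EXISTS kvStore (key TEXT PRIMARY KEY, value TEXT)');

const getStmt = db.prepare('SELECT value FROM kvStore WHERE key = ?');
const setStmt = db.prepare(
  `INSERT INTO kvStore (key, value) VALUES (?, ?)
     ON CONFLICT(key) DO UPDATE SET value = excluded.value`,
);
const deleteStmt = db.prepare('DELETE FROM kvStore WHERE key = ?');

export const vatstore = {
  get: key => getStmt.get(key)?.value,
  set: (key, value) => setStmt.run(key, value),
  delete: key => deleteStmt.run(key),
};
```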

Synchronizing Multiple DBs

Our application consists of lots of separate components, each with their own state that needs to be durable. If all the components used the same DB instance, they could all commit together, atomically. But they don't. Currently, the swing-store "snapstore" component (SQLite) commits first, then the swing-store kvStore (LMDB) commits, then the host-application cosmos IAVL tree (LevelDB) commits. #3087 is about merging all the swing-store DBs into a single SQLite instance, to reduce the scope of this problem, but it does not attempt to address the IAVL-vs-swingstore gap.

Each new DB (specifically each new commit point) creates a new window of time during which the application might crash when one DB has committed but the next one has not. If this happens, when the application starts back up again, DB-1 (e.g. snapstore) will remember things that DB-2 (e.g. kvStore) does not.

We work around this by recording data in DB-2 that tells us what data to look for in DB-1. For example, we add a new heap snapshot into snapstore under a particular key (the "snapshotID"). Then we add a record to kvStore that says which snapshotID to use for a given vat. We do not delete the old snapshot until much later. When we restart, we read the ID out of kvStore, and then read that snapshot out of snapstore.

Now imagine a block which moves vatA from snapshot1 to snapshot2. If both snapstore and kvStore manage to commit before a crash, the restart will read "snapshot2" out of kvStore, and we'll read snapshot2 out of snapstore, and we'll launch the worker with snapshot2.

If we crash after snapstore commits but before kvStore commits, restart will read "snapshot1" out of kvStore, and we'll read snapshot1 out of snapstore, and launch the worker with snapshot1. The snapstore will still contain snapshot2, but it won't be used. The vat will be sent the same deliveries as last time, causing it to advance, and we'll record a snapshot again, and it will create snapshot2 again (since everything is deterministic). As long as snapstore tolerates the insertion of an existing snapshot without complaint, the DB writes will look just the same as they did the previous time, and if both DBs manage to commit, we'll wind up with the same committed state as above.

And if neither DB manages to commit before the crash, we'll wake up with only snapshot1 in snapstore, and "use snapshot1" in kvstore.

We do the opposite set of contortions to avoid deleting snapshot1 from snapstore until we know that the kvstore record has been updated. This is trickier, and requires a "pending deletes" key to be tracked.

Commit/Rollback of Vat State

The durable state of each vat is captured by the initial vat bundle and the transcript (barring upgrades, which require a record of the new vat bundles, and a "deep transcript" that indicates where the upgrades occurred). This can be used to (slowly) reconstruct the worker at any point in its history. For restart efficiency, we added heap snapshots, making the state artifact (latest snapshot + post-snapshot transcript). To reduce memory usage (and make upgrades easier), we introduced virtual/durable objects and vatstore, making the state artifact (latest snapshot + post-snapshot transcript + vatstore contents).

The application's durable state advances by one block at a time, inside of which each vat advances by zero or more deliveries.

Within a block, the buffered state advances by one delivery at a time (to any vat). This state is not made durable (written to the DB) until the block is committed, which is controlled by the host application: to avoid hangover inconsistency, outbound IO is embargoed until everything is committed. But it only contains complete deliveries. The buffered state consists of the durable state plus a set of deltas held in the "block buffer", which receives updates from the "crank buffer" only after the kernel decides to commit to the delivery.

The runtime state of each vat exists in a worker (xsnap) and the vatstore. Making a delivery to a worker irrevocably advances the worker state, and also makes kernel-side changes which are buffered in the crank buffer.

Certain vat errors will cause the delivery to be unwound. Most of these also terminate the vat, e.g. a vat-fatal syscall or a metering fault. This discards the crankBuffer, kills the worker, deletes all the kernel-side vat state, and rejects the vat's orphaned promises.

However some situations call for the vat to survive and be rewound to an earlier state. E.g. a failed upgrade wants to pretend that the upgrade was never requested. This means we kill the worker but do not terminate the vat. A replacement worker will be started when the next delivery is made, and it will start from the previous durable state (heap snapshot plus post-snapshot transcript, which will not include the discarded delivery).

The application might crash at any moment. The durable state must act as if it only advances one block at a time, even if it is recorded in multiple databases that don't all commit at the same time.

Commit/Rollback of per-vat Vatstore DB

If we give each vat a local SQLite DB for its vatstore, we must decide when this DB performs its commits. We want to simulate a global commit, even though the local DBs might commit much earlier than the kernel-side state.

The rough idea is:

  • each vat has a separate SQLite DB, with a table for the vatstore
  • workers still do syscall.send and syscall.resolve (and the GC syscalls), but not syscall.vatstoreSet/Get/Delete
  • each worker gets filesystem access to their vat's DB
  • when a delivery starts, some layer within the worker opens a new transaction, and vatstore operations are buffered therein
  • when the delivery finishes, the worker notifies the kernel, but does not commit the transaction
  • the kernel decides whether the delivery should be unwound or not
    • if unwound, the kernel instructs the worker to abort the DB transaction, then kills the worker
      • the kernel also discards the kernel-side crank buffer contents, so syscall.send/etc are unwound
      • the worker will be rebuilt from the earlier (buffered) state upon the next delivery
    • if not, the kernel instructs the worker to somehow commit the DB transaction, or at least not discard the transaction

We need the worker to effectively manage two levels of commit/transactions. If a delivery is rewound, we want to roll back to the most recent committed delivery. But we don't want the effective durable state to include any deliveries at all until the block is committed. And the final/real DB commit necessarily happens somewhat before the kernel's DB is committed, so we must manage the gap correctly.
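
SQLite's savepoints give us this kind of nesting: a long-lived block-level transaction with a per-delivery SAVEPOINT inside it. A rough sketch, assuming better-sqlite3 and hypothetical function names; it shows the two levels only, not how the gap against the kernel's own commit is managed:

```js
// Sketch: per-delivery savepoints nested inside a per-block transaction.
// Function names (beginBlock, etc.) are illustrative, not an existing API,
// and `db` is assumed to be an open better-sqlite3 connection.
function beginBlock(db) {
  db.exec('BEGIN'); // block-level transaction, committed only at block commit
}
function beginDelivery(db) {
  db.exec('SAVEPOINT delivery'); // delivery-level rollback point
}
function endDelivery(db, { unwind }) {
  if (unwind) {
    db.exec('ROLLBACK TO delivery'); // discard this delivery's vatstore writes
  }
  db.exec('RELEASE delivery'); // fold the (possibly rolled-back) savepoint away
}
function commitBlock(db) {
  db.exec('COMMIT'); // everything since beginBlock becomes durable
}
```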

One category of approach is to build a SQLite table that includes generation numbers for each vatstore record. So when userspace/liveslots asks for key=abc, the layer below actually does SELECT value WHERE key='abc' AND generation<=42. The committed-but-unwound changes would be ignored because they'd have a higher generation number than what is considered "current". Deletes would require tombstones. When a version is really committed, we could delete the older versions (so maybe we'd want a not_current column, set to true if/when we write a newer version, so we can clean up all the old versions with a single DELETE WHERE not_current=true statement). There are a lot of fussy details to figure out.
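
A hedged sketch of what a generation-numbered table could look like; the schema, column names, and cleanup policy are all illustrative:

```js
// Sketch: a generation-numbered vatstore table with delete tombstones.
// Schema, column names, and cleanup policy are illustrative only.
import Database from 'better-sqlite3';
const db = new Database('vat-v1-vatstore.sqlite');

db.exec(`
  CREATE TABLE IF NOT EXISTS kvStore (
    key TEXT NOT NULL,
    generation INTEGER NOT NULL,
    value TEXT,                             -- NULL acts as a delete tombstone
    not_current INTEGER NOT NULL DEFAULT 0, -- set when a newer version is written
    PRIMARY KEY (key, generation)
  )
`);

// Read the newest record at or below the currently-committed generation.
const getStmt = db.prepare(`
  SELECT value FROM kvStore
   WHERE key = ? AND generation <= ?
   ORDER BY generation DESC LIMIT 1
`);
const get = (key, committedGeneration) =>
  getStmt.get(key, committedGeneration)?.value ?? undefined;

// Once a generation is truly committed, superseded rows can be dropped
// with a single statement.
const cleanup = db.prepare('DELETE FROM kvStore WHERE not_current = 1');
```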

Another category of approach is to use a similar trick to the snapstore/kvStore synchronization. In this scheme, the vat records the highest deliveryNum that it has processed. The worker commits its DB as soon as the kernel decides we aren't rewinding, and if the app crashes after that commit but before the block commit, when we restart, the kernel won't know that the worker remembers those deliveries. So, if the kernel tells the worker to execute a delivery that it has already performed, the worker pretends to execute but actually ignores it (because the vat's userspace state is already there).

But, now the kernel needs to know about the syscalls that happened in the previous (lost) execution, so the worker must remember those, and re-deliver them to the kernel. As @FUDCo pointed out, this is interestingly parallel to the way that the kernel performs transcript replay, where the vat really does the work, and it's the kernel which merely pretends to execute the syscalls.

To pull that off, the worker would need to (a sketch follows the list):

  • check that the deliveries being made are really identical to the recorded ones
    • it would be sufficient to record a hash of the VatDeliveryObjects, but (like transcript replay) the diagnostics would be improved by recording the full VDO and printing the mismatches
  • remember the VatSyscallObjects that were executed by a previous delivery
    • these must be recorded in full
  • check that the syscall results coming back from the kernel are really identical to the recorded ones
    • again, a hash would be sufficient, but retaining the full results would provide better diagnostics
  • the vat needs to remember the deliveries until the kernel's records have been committed to disk (when the block buffer is flushed)
    • that probably means the kernel does a new kind of delivery that says "all deliveries up to N have been committed, you are now safe to delete them from your records (but retain anything >N)"
    • it might be able to do this during the first delivery after a block commit
    • or maybe every delivery should just include N
    • this would require the kernel to be more aware of block boundaries than it currently is

It also suggests that maybe the entire transcript should live in the vat's DB, since the deliveries being checked are basically the transcript. And that suggests that the supervisor could be responsible for doing transcript replay: the supervisor should wake up (from a heap snapshot), read the current transcript head from the DB, compare it to a value in RAM (from the snapshot), and load+replay deliveries until they match, and only then inform the kernel that it is open for business.

One tricky part here is the lack of engine separation between the supervisor (or the part that is deciding to pretend-execute deliveries) and the liveslots+userspace code that actually receives the deliveries. By sharing an engine, activity on one side might influence GC and metering/gas usage on the other. Since we now require GC behavior to be part of consensus (so e.g. reanimateCollection() will perform schemata/label vatstore fetches in a deterministic fashion), we could not afford to have the validator which receives the delivery once ("for real") behave differently from the validator that crashes and replays/ignores it.

Description of the Design

Security Considerations

Test Plan

@warner warner added the enhancement (New feature or request), SwingSet (package: SwingSet), and performance (Performance related issues) labels Sep 17, 2022

warner commented Oct 19, 2022

One potential downside would be an expansion of DB state for the snapStore, since it could no longer de-duplicate XS heap snapshots between multiple vats which manage to get the same heap state. Until we have zygotes (#2268), this might happen because we write an initial heap snapshot on delivery 2, just after we startVat a ZCF vat and then evalContract its contract bundle, so nothing has differentiated yet, and every instance of a given contract goes through the same handful of initial states. A shared snapStore would dedup these, but one-DB-per-vat would not, costing us maybe 1-2 MB per instance. At least until we've done 1000 deliveries and the worker produces a replacement snapshot, which will certainly have diverged entirely.

When we implement zygotes, we'll want to think about how these snapshots are stored. I've been thinking that zygotes should be regular vats, to which we just agree to not send any messages (thus preserving their state, unmodified from one clone to the next). But I think @dtribble has argued compellingly that we should have a special "freeze" or "zygote-ify" operation that effectively terminates the original instance, and simultaneously creates an artifact that can be used for subsequent clones. There are details to manage w.r.t. the c-list (the now-frozen zygote's imports must still be kept alive, so the zygote gets refcounts even though you can't send messages to it anymore). But if we have an explicit operation for this, then it could also trigger a cross-DB copy. The old parent-vat's worker is told to build a heap snapshot (along with everything else it needs, like an initialized vatstore) and then just DUMP its entire DB into a file, which is then compressed and saved into a blob in the kernelDB's new zygoteStore. And then the clone operation initializes the new vat's DB from the zygote's copy (heap snapshot and all), instead of an init() function.


FUDCo commented Oct 19, 2022

If the underlying OS supports copy-on-write (which I think the ones we care about do) then we can just use the database file itself. No need to mess with DUMP gymnastics.


warner commented Nov 1, 2022

Are you thinking about the cp --reflink=always option on linux? I'm not sure I want to rely on that:

  • it'll depend on the OS and host filesystem: I just tried it on Linux under both ext4 and zfs, and got an error on both
  • I don't think we'd want to shell out to /bin/cp anyways, which means finding a module that implements the same Linux syscalls (the FICLONE ioctls), which means dipping down into C, ugh
  • the DB's backing file is not designed to be copied at arbitrary moments. I know it's meant to tolerate surprise power failures, process halts, and kernel halts, but I think whatever notion of atomic sampling that Linux might provide for --reflink is not on a particularly well-tested SQLite path
    • in fact copying the backing file in the middle of a transaction is item 1.2 on their list of ways to corrupt a SQLite DB
    • things might be safer if we close all open DB connections first, basically prompting a flush.. that might put the file/files in a safer state
  • SQLite in the desired WAL mode has two files, and I don't know what happens if we clone them at slightly different instants (this looks like item 1.4 on their corruption list)

I just noticed SQLite has a proper "safely clone a DB" operation, and it can run in the background, which might help (although that introduces some uncertainty about when the zygote could be used, we might need to finesse that somehow). It probably doesn't use any kernel-based speedup tricks though. And the output is not going to be as deterministic as a DUMP/sort/hash.
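
For reference, that "safely clone a DB" operation is SQLite's online backup API, which better-sqlite3 exposes as db.backup(). A sketch, with hypothetical file names and an arbitrary pacing number:

```js
// Sketch: cloning a vat's DB with SQLite's online backup API via better-sqlite3.
// File names are hypothetical and the pacing number is arbitrary.
import Database from 'better-sqlite3';

async function cloneVatDb(srcPath, destPath) {
  const src = new Database(srcPath);
  // db.backup() copies the database incrementally, so the source stays
  // usable (and writable) while the clone proceeds in the background.
  await src.backup(destPath, {
    progress: ({ totalPages, remainingPages }) => {
      console.log(`cloned ${totalPages - remainingPages}/${totalPages} pages`);
      return 200; // pages to copy before yielding again
    },
  });
  src.close();
}
```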


mhofman commented Nov 1, 2022

I think most DBs have a notion of starting a full read operation at T1 while other write transactions proceed, and getting a consistent view of only the data that was there at T1. LMDB sure does.

The only thing we need to figure out is how to account for the time it might take for a full dump using this kind of mechanism to complete.
Edit: and as you mention, the format may not be deterministic either, but that's easily solved by loading the copy and using an actual DUMP from it. It simply adds to the time between trigger and collection.

I was thinking we could trigger a dump at a deterministic time, at which point all vats and the kernel are in charge of generating state capture at that time and keep a full transcript to replay from that time.

At a later deterministic time, this data would be atomically collected, hashed, and stored. We likely could arrange the data and hashes in such a way that they're incrementally generated, and so that this atomic operation doesn't stop the world for a significant amount of time.


warner commented Nov 9, 2022

@mhofman noticed SQLite's snapshot operation: https://www.sqlite.org/c3ref/snapshot_get.html , which allows limited forms of reading historical data rather than the current contents. If we were lucky, this might help with our need to dump/hash the contents of the DB "in the background", while the foreground continues to write newer values into the DB.

However, I think the limitations make it unlikely to work for us. From the man pages, it seems like the basic API is to call sqlite3_snapshot_get() to create an in-RAM object of type sqlite3_snapshot, after which you can use sqlite3_snapshot_open() to start a new read txn from that state (rather than the current state). This relies on WAL mode (in which changes are written to a journal, not directly to the DB), and probably works by reading from the middle of the WAL file instead of the end. As a result, the limitations are:

  • we're using WAL mode (not a problem, we want that anyways)
  • the process hasn't rebooted since the snapshot object was created (unfortunate, but probably no worse than what native cosmos-sdk state-sync requires)
    • IAVL can return data from arbitrarily-old non-pruned blockHeights, but the cosmos-sdk "write out a state-sync snapshot" function is only launched immediately after a commit
    • so if you happen to reboot the validator/follower node after the snapshot height but before the snapshot finishes writing, that node will not resume the snapshot write later
    • I imagine validators are rebooted infrequently, and losing out on a snapshot or two isn't a big deal
    • this could be fixed with more code
    • in contrast, the SQLite API doesn't provide a way to name snapshots: the sqlite3_snapshot object (in RAM) is precious, and once the process that called get terminates, the snapshot cannot be re-acquired by a different/subsequent process
  • the snapshot is invalidated when the WAL file is deleted (a "checkpoint" occurs)
    • this happens automatically every once in a while, when the WAL file gets too large: we'd have to disable that, or accept the probabilistic failure of snapshot writes if/when they occur while the WAL file is close to the auto-checkpoint threshold
    • DB writes are not durable against power failure until a checkpoint happens, so we might be forcing a checkpoint once per block for basic safety reasons, preventing us from making use of snapshots

There is a sqlite3_snapshot_recover() function that purports to allow access to snapshots from a WAL file after all connections have closed, but I don't see how it could be used: it's not returning a list of discovered sqlite3_snapshot objects, so the process would need to have those objects from an earlier run, which means it used to have a DB connection, and I don't know why it could have lost that connection. It's conceivable that there's some way to use this API to retrieve snapshots from an earlier process, which might help the "resume writing DB snapshot after a validator reboot". But I doubt it.

Note: the snapshot feature is normally disabled, and requires the compile-time SQLITE_ENABLE_SNAPSHOT flag to enable it.


mhofman commented Nov 9, 2022

Looks like I wrote a similar analysis (but with slightly different conclusions) simultaneously (we didn't open a write lock): #5542 (comment)


mhofman commented Nov 9, 2022

Regarding the sqlite3_snapshot struct, it seems to be independent of heap state, and is internally copied in the first place. The tests seem to exercise this over multiple DB connections.

Regarding checkpoints, I'm dubious they're needed for most shutdown recovery. My understanding is that they're mostly a form of "compaction", and that if we really want power-loss resilience, WAL writes can be fsync'd the same way checkpoints are by setting PRAGMA synchronous to FULL. That said, in the presence of snapshots, maybe it's acceptable to require a power-loss recovery to restart from snapshot?


mhofman commented Nov 9, 2022

I posted the question regarding saving of the snapshot struct on the SQLite forums: https://sqlite.org/forum/forumpost/f7a325b6d3


mhofman commented Mar 13, 2023

In #6447 (comment), @warner discusses the idea @FUDCo came up with: the vatstore DB is just committed when taking a snapshot, and unlike other syscalls that are just checked against the transcript, vatstore operations are instead performed again on replay.

If the vatstore operations stay within the worker, they don't cross the boundary with the kernel, and are not recorded in the transcript in the first place.

However we still have the problem introduced by multiple commit points: if we commit the vatstore DB when making a worker heap snapshot, and the kernel exits before committing the block where that snapshot occurred, then the next time the worker starts it will get the previous heap snapshot paired with a vatstore from the future that no longer corresponds.

We also have the problem of capturing the content of the vatstore in a verifiable way for state-sync (#3769). In #6773 we introduce an "export stream" of the kvStore into vstorage. If vatstore is removed from the kvStore, we lose that mechanism.

Both issues can be solved in tandem: a main vatstore DB which commits after the kernel DB has committed, and a "journal" DB recording vatstore changes (updates/deletions) since the last main DB commit, which is itself committed when performing a heap snapshot.

The "journal" DB can be used to populate the kernel's "export stream" when a snapshot is taken, ensuring that the vatstore is part of state-sync.

The flow would be as follows (a sketch of the journal-replay step appears after the list):

  • when starting a worker, the kernel provides:
  • when starting, the worker:
    • starts loading the heap snapshot
    • checks if the start vatstore nonce matches the current nonce saved in the vatstore DB
    • if it doesn't match, perform replay steps:
      • check if the current nonce saved in the vatstore DB exists as a table in the "journal" DB, panic if it doesn't
      • start a savepoint in the vatstore DB
      • replay the "journal" table into the main vatstore DB
      • read the "to nonce" from the "journal" table, save it as the current nonce in the main vatstore DB.
      • check if the latest nonce matches the start nonce, repeat these replay steps if it doesn't
      • if this is a committed snapshot, commit the vatstore DB (releasing the savepoints)
    • performs the "post snapshot steps":
      • start a new savepoint in the main vatstore DB named after the start vatstore nonce
      • starts a new transaction in the "journal" DB
      • create a new table in the "journal" DB named after the start vatstore nonce
  • when performing a snapshot:
    • the kernel remembers the current vatstore nonce it previously sent to the worker
    • the kernel generates a new vatstore nonce
    • the kernel sends a snapshot command to the worker, the new vatstore nonce it generated, and whether to initiate the post snapshot step (if the same worker will be used to execute further deliveries instead of a new worker, see Force xsnap reload from snapshot after writing each snapshot #6943)
    • the worker saves the vatstore nonce in the current "journal" table as the "to nonce", and commits the transaction
    • the worker saves the vatstore nonce as the new current nonce in the main vatstore DB
    • if the worker will not survive, roll back the vatstore DB (releasing the worker's write lock)
    • the worker starts streaming the heap snapshot
    • if the worker survives, it performs the "post snapshot steps" (same as above)
      • start a new savepoint in the main vatstore DB named after the vatstore nonce
      • start a new transaction in the "journal" DB
      • create a new table in the "journal" DB named after the vatstore nonce
    • the kernel reads the heap snapshot and saves it in the snapstore
    • the kernel opens a read transaction on the vatstore "journal", opens the table named after the previous nonce it remembered, and notes the changes in its "noteExport" mechanism to replicate changes in vstorage
    • if the kernel reloads the worker from snapshot, it spawns a new worker, streams the snapshot to the new worker, and provides it with the latest nonce
  • when committing a block which includes at least one new heap snapshot
    • the kernel sends a "commit vatstore" command to the worker with the latest vatstore nonce
    • the worker rolls back the main vatstore DB to that named savepoint
    • the worker commits the main vatstore DB (releasing savepoints)
    • start a savepoint in the vatstore DB
    • replays the "journal" table corresponding to the vatstore nonce into the main vatstore DB
    • delete other tables from the "journal" DB.
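
A compressed sketch of the journal-replay step from the list above, assuming both DBs are open better-sqlite3 handles; the journal_<nonce> table naming, the NULL-as-delete convention, and the nonce bookkeeping are illustrative:

```js
// Sketch: replaying one "journal" table (changes made since the last main-DB
// commit) into the main vatstore DB. The journal_<nonce> table naming and the
// NULL-as-delete convention are illustrative.
function replayJournal(mainDb, journalDb, nonce) {
  mainDb.exec(`SAVEPOINT replay_${nonce}`);
  const set = mainDb.prepare(
    `INSERT INTO kvStore (key, value) VALUES (?, ?)
       ON CONFLICT(key) DO UPDATE SET value = excluded.value`,
  );
  const del = mainDb.prepare('DELETE FROM kvStore WHERE key = ?');
  const changes = journalDb
    .prepare(`SELECT key, value FROM journal_${nonce}`)
    .all();
  for (const { key, value } of changes) {
    if (value === null) {
      del.run(key); // a NULL value records a deletion
    } else {
      set.run(key, value);
    }
  }
  // ...then read the "to nonce" out of the journal, store it as the current
  // nonce in the main DB, and repeat until it matches the start nonce.
}
```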

This is a little heavy, but uses standard SQLite mechanisms. If we want to get a little more fringe, the following optimizations are possible:

  • Leverage the WAL file, read transactions and checkpointing to handle commits of the main vatstore DB, a-la LiteStream
    • kernel starts a read transaction on the vatstore DB before starting a snapshot
    • worker actually performs a commit of the vatstore DB during the snapshot, which will be written in the WAL (cannot checkpoint because of open read transaction).
    • when committing the block, the kernel instructs the worker, which performs a checkpoint.
    • Some handwave to make sure we checkpointed till the latest commit
    • delete the "journal" tables that are no longer needed (as previously)
    • When starting a worker, ignore/delete the WAL file
  • Once we can commit the vatstore when the snapshot is taken, instead of a manual "journal" DB, we could use SQLite Sessions to generate a changeset.


mhofman commented Mar 21, 2023

If we remove the vatstore syscalls from the transcript, we'll also have to capture the state, or at least a hash of the state, of the vatStore as it was at the time of a vat upgrade in order to support the Manchurian style upgrades (#1691). Without that, we wouldn't be able to start a replay from baggage as we'd have lost the baggage.


mhofman commented Apr 5, 2023

Both issues can be solved in tandem: a main vatstore DB which commits after the kernel DB has committed, and a "journal" DB recording vatstore changes (updates/deletions) since the last main DB commit, which is itself committed when performing a heap snapshot.

Maybe we can take a slightly different approach and automate the creation of this journal using a SQLite TRIGGER. The docs have an example of using that mechanism for application undo/redo.
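
A sketch of what such triggers could look like, loosely modeled on the undo/redo example in the SQLite docs; the table and trigger names are illustrative:

```js
// Sketch: triggers that mirror every vatstore change into a journal table,
// loosely modeled on the SQLite undo/redo example. Table and trigger names
// are illustrative.
import Database from 'better-sqlite3';
const db = new Database('vat-v1-vatstore.sqlite');

db.exec(`
  CREATE TABLE IF NOT EXISTS journal (key TEXT, value TEXT);

  CREATE TRIGGER IF NOT EXISTS kv_insert AFTER INSERT ON kvStore BEGIN
    INSERT INTO journal (key, value) VALUES (NEW.key, NEW.value);
  END;
  CREATE TRIGGER IF NOT EXISTS kv_update AFTER UPDATE ON kvStore BEGIN
    INSERT INTO journal (key, value) VALUES (NEW.key, NEW.value);
  END;
  CREATE TRIGGER IF NOT EXISTS kv_delete AFTER DELETE ON kvStore BEGIN
    INSERT INTO journal (key, value) VALUES (OLD.key, NULL);
  END;
`);
```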


FUDCo commented Apr 5, 2023

Triggers terrify me. Triggers are the relational database version of shared-state concurrency: a swamp of chaos and confusion that people naively imagine they understand and can master, but they are almost always wrong.


warner commented Jun 7, 2023

I was thinking more about @FUDCo 's "don't commit vatstore DB until heap snapshot" idea, and it occurred to me that making a full copy of the table might not be too expensive. Each vat would have one DB with the transcript (deliveries, results, which ones were collected by the kernel so far, retirements, etc), and a second DB with the heap snapshots and the vatstore data. But the second DB would have two tables for vatstore: an open one with the latest contents (committed after every delivery), and a second one with a full copy of the vatstore contents as of the last heap snapshot point.

The end-of-span sequence would be something like (a code sketch follows the list):

  • perform delivery, note that the computrons used take us over the "time to snapshot" threshold
  • DROP TABLE oldVatstoreContents
  • CREATE TABLE oldVatstoreContents (key STRING PRIMARY KEY, value STRING)
  • INSERT INTO oldVatstoreContents SELECT * FROM vatstoreContents
  • read the old vatstore contents, build a canonical hash, stash that in the DB somewhere too
  • create snapshot, INSERT INTO snapshots (?..)
  • COMMIT
  • insert write-snapshot transcript entry into other DB
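
Expressed with better-sqlite3, that sequence might look roughly like this; the metadata and snapshots tables and the hashing scheme are illustrative, not the actual swingstore schema:

```js
// Sketch: the end-of-span sequence above, expressed with better-sqlite3.
// Table names (vatstoreContents, metadata, snapshots) and the hashing scheme
// are illustrative, not the actual swingstore schema.
import Database from 'better-sqlite3';
import { createHash } from 'node:crypto';

const db = new Database('vat-v1-state.sqlite');

const endOfSpan = db.transaction(snapshotBlob => {
  db.exec('DROP TABLE IF EXISTS oldVatstoreContents');
  db.exec('CREATE TABLE oldVatstoreContents (key TEXT PRIMARY KEY, value TEXT)');
  db.exec('INSERT INTO oldVatstoreContents SELECT key, value FROM vatstoreContents');

  // Build a canonical hash by walking the copy in key order.
  const h = createHash('sha256');
  for (const { key, value } of db
    .prepare('SELECT key, value FROM oldVatstoreContents ORDER BY key')
    .iterate()) {
    h.update(`${key}\n${value}\n`);
  }
  db.prepare('INSERT OR REPLACE INTO metadata (name, value) VALUES (?, ?)').run(
    'oldVatstoreHash',
    h.digest('hex'),
  );
  db.prepare('INSERT INTO snapshots (snapBlob) VALUES (?)').run(snapshotBlob);
}); // calling endOfSpan(blob) runs all of the above in one transaction
```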

Every time the kernel collects a delivery result, it also gets the hash of the most recent oldVatstoreContents (which doesn't change), and the cumulative hash of the transcript entries that have been collected so far (which does). These are put into the IAVL tree as the "swingstore export data" equivalent, enough hashes to validate the vat state as of the collected deliveryNum. If the block finishes and the host app then decides it's time to make a state-sync snapshot, it asks the worker for an artifact with the complete oldVatstoreContents, and another artifact with the transcript up to and including the last collected entry.

To make sure the vat doesn't run too far ahead and give up the previous contents too early, maybe the vat should stop doing execution once a heap snapshot is created, and not resume until the triggering delivery is collected. Or, maybe we could collect multiple sets of old contents (indexed by the deliveryNum that triggered the corresponding heap snapshot), and figure out what sort of retirement policy would work.


mhofman commented Jun 7, 2023

A full copy feels like it would make the cost of snapshotting proportional to the size of the vat store, which we're telling users to prefer over the heap. Also, separating transcripts and heap snapshots into separate DBs makes me feel uneasy.

Furthermore, generating a hash of the vat store content ourselves by reading through the whole thing seems wrong. I'd much prefer relying on an IAVL shadowing to do that for us.


mhofman commented Jun 7, 2023

Thinking about this more, I don't see how @warner's proposal above solves the hangover problem. In particular, what happens if we fail before the write-snapshot entry is committed in the kernel DB, but after we've committed the vat DB containing the heap snapshot and vatStore? We basically end up with a vat from the future that will be confused by replayed messages from the kernel.


mhofman commented Nov 23, 2023

The current implementation exports all vat store entries to the cosmos DB, which is unsustainable (causes slowness in state-sync, pruning, etc.)

We need to move to a place where the vat store is treated as an artifact verified by cosmos DB, like transcripts and heap snapshots. Unlike those artifacts, the vatStore can replace existing entries, so generating a hash is more complex.

I am becoming convinced that the solution should be:

  • represent an exported vat store as a starting point artifact, and a sequence of "journals" artifacts (sets/deletes applied to the previous state)
  • with a deterministic schedule, compact the vat store export: apply the journals up to a certain point, and do a full export to make that the new starting point artifact. The consumed journals can then be pruned.
    • At the slowest schedule, the compaction must be done when upgrading a vat, to create a capture of the baggage (these artifact hashes should likely be pinned). This can result in an ever growing list of journal files that must be kept around if the vat is not upgraded. During restore, this could result in a long replay of vat store journals.
    • At the fastest schedule, as soon as a new heap snapshot is taken, and a journal artifact is finalized for the span, the compaction can start asynchronously to generate a new starting point artifact. That compaction only needs to be ready for the next journal finalization / next span completion. This would result in the exported vat store state always having 2 artifacts: the starting point, and a single journal file for the most recently finalized span.


mhofman commented Dec 7, 2023

After discussing this more, the export-data motivation for making vatStore an artifact is weaker than it seemed: a similar order of magnitude of keys appears on the kernel side (e.g. c-list entries), which means it wouldn't sufficiently reduce the number of entries in the cosmos DB (it would reduce the size, however, but that's less of a problem).
