
move snapstore (XS heap snapshots) into SQLite #6742

Closed
warner opened this issue Jan 3, 2023 · 3 comments
Labels: enhancement (New feature or request), SwingSet (package: SwingSet), vaults_triage (DO NOT USE)

warner (Member) commented Jan 3, 2023

What is the Problem Being Solved?

The next step of #3087 is to move snapStore into SQLite too: this is the component of swing-store that holds XS heap snapshots. These heap snapshots are files, 2-20MB when compressed, created by xsnap when it is instructed to write out the state of its heap. The xsnap process can be launched from a snapshot instead of an empty heap, which saves a lot of time (no need to replay the entire history of the vat).

Currently, swing-store holds these in a dedicated directory (one file per snapshot), in which each file is named after the SHA256 hash of its uncompressed contents (.agoric/data/ag-cosmos-chain-state/xs-snapshots/${HASH}.gz). The kvStore holds a JSON blob with { snapshotID, startPos } in the local.v$NN.lastSnapshot key, to keep track of the vatID->snapshot mapping. It also holds local.snapshot.$id = JSON(vatIDs..) to track the snapshot->vatIDs direction (remember that two vats might converge and use the same snapshot, e.g. newly-created ZCF vats running the same contract that have not diverged significantly yet).

The one-file-per-snapshot approach effectively creates a distinct database, whose commit semantics are based upon an atomic rename (creating the HASH.gz file) and some eventual unlink() syscall that deletes the file. These commit points are different from those of the kvStore which references the files, requiring some annoying interlocks to ensure that (1) we always add the file before adding the kvStore reference, and (2) we never delete the file before committing the removal of the last kvStore reference.

It would be a lot cleaner to record both the vat-to-snapshot mapping and the snapshots themselves in the same atomicity domain. Basically two tables:

CREATE TABLE heapSnapshots (
 id TEXT,
 compressed BLOB,
 PRIMARY KEY (id)
)

CREATE TABLE vatHeaps (
 vatID TEXT,
 snapshotID TEXT, -- maybe add a FOREIGN KEY constraint
 PRIMARY KEY (vatID)
)

During commit, or just after changing a vatHeaps entry, we can scan heapSnapshots for unreferenced heaps and delete them.
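
A minimal sketch of that pruning step, assuming a better-sqlite3 handle (which is what swing-store uses elsewhere) and the two tables above; the helper name and upsert/prune pairing are illustrative, not a final API:

import Database from 'better-sqlite3';

// Illustrative helper: point a vat at a (possibly new) snapshot, then drop
// any heapSnapshots row that no vatHeaps entry references. Both statements
// run in one SQLite transaction, so there is a single commit point.
function updateVatHeap(db: Database.Database, vatID: string, snapshotID: string) {
  db.transaction(() => {
    db.prepare(
      `INSERT INTO vatHeaps (vatID, snapshotID) VALUES (?, ?)
       ON CONFLICT(vatID) DO UPDATE SET snapshotID = excluded.snapshotID`,
    ).run(vatID, snapshotID);
    db.prepare(
      `DELETE FROM heapSnapshots
       WHERE id NOT IN (SELECT snapshotID FROM vatHeaps)`,
    ).run();
  })();
}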

This will interact with @mhofman's work to make xsnap read/write its heap by streaming it over a pipe, rather than writing it to a file. That also removes the need for xsnap to have access to the filesystem, which will help with the jail work in #2386.

Description of the Design

In addition to the new tables, the swingStore.snapStore component will need a somewhat different API: one pair of methods to read/write snapshots (doing some streaming thing, maybe an AsyncIterator of uncompressed chunks), and a separate pair to either assign a vatID->snapshotID mapping or clear it (e.g. when upgrading or terminating a vat).
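
A rough sketch of what that surface might look like (TypeScript; all names here are illustrative, not the final swing-store API):

// Hypothetical snapStore API shape; every name is illustrative.
interface SnapStore {
  // Write path: consume a stream of uncompressed chunks from xsnap,
  // returning the hash-derived snapshotID once the blob is committed.
  saveSnapshot(chunks: AsyncIterable<Uint8Array>): Promise<string>;
  // Read path: yield uncompressed chunks suitable for feeding back to xsnap.
  loadSnapshot(snapshotID: string): AsyncIterable<Uint8Array>;
  // Mapping: bind or clear a vat's current snapshot.
  setVatSnapshot(vatID: string, snapshotID: string): void;
  clearVatSnapshot(vatID: string): void; // e.g. on vat upgrade or termination
}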

The kvStore keys (local.v$NN.lastSnapshot and local.snapshot.$id) will go away, in favor of proper cross-table foreign keys. The startPos field from lastSnapshot needs to be tracked next to the snapshotID: the possibility of convergence means that two different vats might conceivably arrive at the same heap snapshot but on different deliveryNums. This should probably coordinate with the streamStore, so they're all using a matched deliveryNum or transcript entry index.

Efficiency Considerations

We've had some concerns about putting large blobs in SQLite. I (warner) am pretty sure this will be fine. I found one article (https://www.sqlite.org/intern-v-extern-blob.html) examining read-speed differences between external files and in-DB BLOBs; for the (default) 4kiB pages we use, it reports that external files can be read about twice as fast as in-DB blobs. I ran the tests on a follower node (SSD filesystem) and found the same difference. But note that we're talking about 544MB/s for in-DB blobs vs 965MB/s for files on disk, so a typical 2MB compressed snapshot is going to load in a millisecond or two, and the extra speed isn't going to matter.

$ ./kvtest init x1.db --count 1000 --size 10000000  # 1000 snapshots of 10MB each
$ ./kvtest export x1.db dir  # copy all blobs to files in dir/
$ ./kvtest run x1.db --count 1000 --blob-api
SQLite version: 3.40.1
--count 1000 --max-id 1000 --asc
--cache-size 1000 --jmode delete
--mmap 0 --blob-api
Database page size: 4096
Total elapsed time: 18.372
Microseconds per BLOB read: 18372.000
Content read rate: 544.3 MB/s
$ ./kvtest run dir --count 1000 --blob-api
--count 1000 --max-id 1000 --asc
Total elapsed time: 10.365
Microseconds per BLOB read: 10365.000
Content read rate: 964.8 MB/s

Using blobs from the DB will require slightly more memory, because the SQLite API doesn't provide streaming access to blob contents (each blob is delivered as a single large span of memory), whereas pulling files from disk could read just enough bytes to decompress the next chunk. So while we start a worker from a heap snapshot, the kernel process will briefly require 2-20MB of RAM to hold the compressed snapshot data; this will be freed once decompression is complete. Note that we don't need to hold a copy of the decompressed data: we can stream that out as fast as the xsnap process can accept it, and never need to hold more than a reasonably-sized buffer.
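
A sketch of that read path, assuming better-sqlite3 for the blob fetch and Node's zlib for streaming decompression (workerStdin stands in for however the xsnap pipe ends up being wired):

import { Readable } from 'stream';
import { pipeline } from 'stream/promises';
import { createGunzip } from 'zlib';
import Database from 'better-sqlite3';

// Sketch: the compressed blob is materialized in memory (SQLite hands it
// back as one buffer), but the decompressed bytes are streamed to the xsnap
// process, so we never hold the full uncompressed heap image.
async function streamSnapshotToWorker(
  db: Database.Database,
  snapshotID: string,
  workerStdin: NodeJS.WritableStream, // hypothetical xsnap snapshot-load pipe
) {
  const row = db
    .prepare(`SELECT compressed FROM heapSnapshots WHERE id = ?`)
    .get(snapshotID) as { compressed: Buffer } | undefined;
  if (!row) throw new Error(`no snapshot ${snapshotID}`);
  // 2-20MB of compressed data lives in RAM only for the duration of this call.
  await pipeline(Readable.from([row.compressed]), createGunzip(), workerStdin);
}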

Debugging Considerations

We might want a switch to disable the "delete unused snapshots" code, for archive nodes (@mhofman has found it awfully useful to be able to retain all heap snapshots for later forensics). To help correlate these with vats, maybe we should have a table of historical (vatID, lastPos) -> snapshotID mappings: each time we update the main table, we also add an entry to this debugging table. The debug-table entries must not keep snapshots alive, so either they aren't FOREIGN KEYs, or the table only exists when the "delete unused snapshots" code is disabled, so the constraint is never violated.
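
One possible shape for that debug table (the vatHeapHistory name is hypothetical; note the deliberate lack of a FOREIGN KEY so pruning heapSnapshots never violates a constraint):

db.exec(`
  CREATE TABLE IF NOT EXISTS vatHeapHistory (
    -- one row per (vatID, lastPos) -> snapshotID assignment ever made;
    -- forensic only, so rows here never keep a snapshot alive
    vatID TEXT,
    lastPos INTEGER,
    snapshotID TEXT, -- intentionally NOT a FOREIGN KEY
    PRIMARY KEY (vatID, lastPos)
  )
`);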

Security Considerations

Shouldn't be any.

Test Plan

Unit tests.

warner added the enhancement and SwingSet labels on Jan 3, 2023
mhofman (Member) commented Jan 3, 2023

Quick observation: the temporary buffering of the compressed snapshot would also need to happen when making a snapshot, since the same no-streaming-into-the-DB limitation applies, along with the inability to know the hash of the snapshot (for the primary ID) until the stream is complete. The latter could be solved by using a primary ID generated randomly or incrementally, but that would likely require making sure this primary ID is internal and not used in any consensus paths. However, all that is unnecessary if there is no way to stream blobs from the DB.

Btw we could imagine a chunking mechanism to avoid holding full compressed snapshots in memory, but that's effectively re-implementing streaming.

I am also unconvinced that we need to store identical snapshots in the same table entry. This feels like an unnecessary optimization, where the potential space savings are not worth the complexity costs.

warner (Member, Author) commented Jan 3, 2023

Good points. I'm not worried about the RAM on the snapshot-write side (or at least I'm equally non-worried about the write and read sides). So I think we read the stream from xsnap, feed each chunk into both the hasher and the compressor, accumulate the compressed data in RAM, and then, when the stream is done, write the large compressed blob into the DB under its hash name.
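
A sketch of that write path, hashing and compressing in one pass with Node's crypto and zlib (the function name and DB wiring are illustrative):

import { createHash } from 'crypto';
import { createGzip } from 'zlib';
import Database from 'better-sqlite3';

// Sketch: hash the uncompressed stream from xsnap while accumulating the
// compressed form in RAM, then store the blob under its hash-derived ID.
async function saveSnapshot(
  db: Database.Database,
  chunks: AsyncIterable<Uint8Array>, // uncompressed stream from xsnap
): Promise<string> {
  const hasher = createHash('sha256');
  const gzip = createGzip();
  const compressed: Buffer[] = [];
  gzip.on('data', (buf: Buffer) => compressed.push(buf));
  const done = new Promise<void>((resolve, reject) => {
    gzip.on('end', resolve);
    gzip.on('error', reject);
  });
  for await (const chunk of chunks) {
    hasher.update(chunk); // hash sees uncompressed bytes
    gzip.write(chunk); // backpressure ignored; we buffer the output anyway
  }
  gzip.end();
  await done;
  const snapshotID = hasher.digest('hex');
  db.prepare(`INSERT OR REPLACE INTO heapSnapshots (id, compressed) VALUES (?, ?)`)
    .run(snapshotID, Buffer.concat(compressed));
  return snapshotID;
}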

I agree that de-duplicating snapshots is not an important use case (and the practical chances of convergence are pretty low, especially if we update our "when do we take the first snapshot" code to make sure it includes all the deliveries we do during contract startup, which will probably make them diverge completely). I'm a big fan of hash-named files, but if we're saving them as blobs, then we might as well just use CREATE TABLE heapSnapshots (vatID TEXT, compressedSnapshot BLOB, startPos INTEGER), with maybe a separate debug table for historical values (that would cause a bit more churn during updates, since the history table and the real table wouldn't share data, but I doubt that's a big deal).

mhofman (Member) commented Jan 3, 2023

Was thinking we could simply select all rows for a particular vatID, sort by startPos, and only use the last one when loading from a snapshot. That way removing old rows is simply a matter of pruning, which can be host-defined.
Edit: to support updates, why not store the incarnation in a column and index on vatID+incarnation?
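
Loading would then be something like this sketch (single-table layout from the previous comment, better-sqlite3 db handle and vatID assumed in scope):

// Read the newest snapshot row for a vat; older rows become a host-policy
// pruning question rather than a correctness one.
const row = db
  .prepare(
    `SELECT compressedSnapshot, startPos FROM heapSnapshots
     WHERE vatID = ?
     ORDER BY startPos DESC
     LIMIT 1`,
  )
  .get(vatID);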

Also, we probably should still store the computed hash of the uncompressed data along with the blob of the compressed data, since we will need it for consensus, state sync, and debuggability.

For state sync, however, we did talk a few weeks ago about being able to mark a snapshot as "in use" by the host application while the state-sync artifacts are being generated. The goal is to avoid doing expensive operations when initiating a state-sync snapshot, and instead leave that to the asynchronous processing, which can span blocks if necessary. If we don't do reference counting on snapshot IDs and go with vatID+startPos instead, we may need the state-sync logic to constrain XS snapshot pruning. Or we could just go the route of creating a read transaction on this table for state-sync purposes.

ivanlei added the vaults_triage (DO NOT USE) label on Jan 3, 2023
FUDCo added commits that referenced this issue on Jan 6, Jan 12, Jan 14, and Jan 18, 2023:
…tore

This is phase 1 of #6742.  These changes cease storing snapshots in
files but instead keep them in a new table in the swingstore SQLite
database.  However, in this commit, snapshot tracking metadata is
still managed the old way using entries in the kvstore, rather than
being integrated directly into the snapshots table.
mergify bot closed this as completed in 4e0f679 on Jan 19, 2023