
swing-store should keep historical artifacts outside of SQLite DB #9389

Open
Tracked by #10032
mhofman opened this issue May 20, 2024 · 7 comments
Labels: enhancement (New feature or request), swing-store

Comments

@mhofman (Member) commented May 20, 2024

What is the Problem Being Solved?

For #9174 we're making retention of historical transcripts configurable. While most nodes would no longer keep historical transcripts by default, we'd want archive nodes to keep them for potential future uses (like replay-based upgrades, #7855). However, these are conceptually static artifacts that do not strictly need to be stored in the SQLite DB. Furthermore, sharing them with validators that have pruned them would be easier if they were simply stored on disk.

Description of the Design

Based on #8318 and the related work in #8693, a transcript span would, when rolled over, be compressed and stored as a file on disk alongside the SQLite DB. Similarly, if historical heap snapshots are kept by configuration (#9386), they would be stored as compressed files instead. The files can be flushed to disk immediately, before the swingstore itself is committed.

The files can be named based on the "export artifact" naming scheme used for state-sync.
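For illustration, here is a minimal sketch of such a naming helper, assuming artifacts follow the state-sync export naming convention `transcript.${vatID}.${startPos}.${endPos}`, with a `.gz` suffix added for the compressed on-disk copy (the exact names here are assumptions, not a spec):

```js
// Sketch only: assumes the export-artifact naming convention
// `transcript.${vatID}.${startPos}.${endPos}`; the `.gz` suffix marks the
// compressed file kept on disk alongside the SQLite DB.
function spanArtifactName(vatID, startPos, endPos) {
  return `transcript.${vatID}.${startPos}.${endPos}.gz`;
}

// e.g. spanArtifactName('v1', 0, 200) === 'transcript.v1.0.200.gz'
```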

Security Considerations

None

Scaling Considerations

Reduce growth of SQLite DB size

Test Plan

TBD

Upgrade Considerations

Host side of the swingset kernel stack, not upgrade sensitive, but deployed as part of a chain software upgrade.

@warner (Member) commented Sep 5, 2024

We had a discussion today about "keep" vs "archive", which might make us want to split this ticket into two separate ones, or maybe change its scope a little bit.

"archive" goal

My personal goal here is to reduce the disk space usage on the follower I run, without losing the historical spans entirely (I want to have a personal archive, for analysis purposes). What I want is an app.toml or environment-variable swingstore mode to:

  • after span rollover, swingstore writes a single file out to a secondary directory
    • it should use the standard "write to a tempfile and atomically rename at the end" trick, but it should either use a distinctive suffix or prefix, or put the tempfiles in a neighboring directory, such that there's a simple glob pattern that matches only complete span files
    • the contents should be compressed on their way out, ideally in .gz format (e.g. pipe through gzip or use a GZipFileWriter that includes the full .gz header), so I can `zcat filename` the output and work with the JSON lines.
    • then it should delete the rows from transcriptItems, to save space on the .sqlite file

Then I can write a script that uploads the contents of that secondary directory to S3 or an external bulk drive, and then deletes the files that were copied. Something like `mv dir/* .../upload-dir; rsync .../upload-dir/ remote:archive-dir/; rm .../upload-dir/*` in a cronjob.

For my goal, I don't need the swingstore to retain access to these archive files. The swingstore would be configured to delete the old spans (keepTranscripts: false), but it would just stash this archival copy in an easy-to-offload form just before deletion. I don't feel strongly about having the tool fflush() those files before starting the deletion process: doing that would smack of inventing our own database, which I want to avoid.
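A minimal sketch of that write path (the directory layout, names, and helper signature are illustrative assumptions, not the actual swingstore code):

```js
// Sketch only: gzip the span into a tempfile in a neighboring directory,
// then atomically rename it into the archive directory, so a glob like
// `archive/*.gz` matches only complete span files.
import { createWriteStream } from 'node:fs';
import { mkdir, rename } from 'node:fs/promises';
import { createGzip } from 'node:zlib';
import { pipeline } from 'node:stream/promises';
import { Readable } from 'node:stream';
import path from 'node:path';

async function archiveSpan(archiveDir, name, itemLines) {
  const tmpDir = path.join(archiveDir, 'tmp'); // neighboring dir for tempfiles
  await mkdir(tmpDir, { recursive: true });
  const tmpPath = path.join(tmpDir, `${name}.gz`);
  // Full .gz framing, so `zcat` can recover the JSON lines directly.
  await pipeline(
    Readable.from(itemLines),
    createGzip(),
    createWriteStream(tmpPath),
  );
  // rename() is atomic within one filesystem; readers never see a partial file.
  await rename(tmpPath, path.join(archiveDir, `${name}.gz`));
}
```

After the rename, this mode would go on to delete the corresponding transcriptItems rows, as described in the list above.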

"keep" goal

From our conversation today (and @mhofman please correct me if I get this wrong), @mhofman's intention was to have the swingstore retain the old spans (assuming keepTranscripts: true), just store them in an alternate location. We'd want mostly the same atomicity requirements as SQLite's commit(), which means:

  • the backing store filename should include the span hash, so if we write the file, crash before commit, restart, and the (non-cosmos) host app tells the kernel to do something differently than last time, the replacement span rollover will emit a different filename, and we'll just have a leftover/unused span file that won't cause much harm
  • we should fflush() the span file before moving on to the DELETE FROM transcriptItems step, to ensure the data survives at least as well as the .sqlite file

The hope would be to improve performance of the SQLite DB by reducing its size, but not to reduce the size of the swingstore overall: we'd just be moving some items out of the DB and into (retained, protected, flushed) individual files.
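A minimal sketch of that ordering, assuming a better-sqlite3 handle and hypothetical file/table/column names (not the actual swingstore code):

```js
// Sketch only: write and fsync the retained span file before deleting the
// rows, so the file is at least as durable as the .sqlite data it replaces.
// Table and column names here are assumptions for illustration.
import { open } from 'node:fs/promises';

async function rolloverKeepingSpan(db, spanFilePath, spanBytes, vatID, endPos) {
  const fh = await open(spanFilePath, 'w');
  try {
    await fh.writeFile(spanBytes);
    await fh.sync(); // flush to disk before touching the DB
  } finally {
    await fh.close();
  }
  // Only now remove the historical items; the enclosing SQLite transaction
  // still commits (or aborts) later, as usual.
  db.prepare(
    'DELETE FROM transcriptItems WHERE vatID = ? AND position < ?',
  ).run(vatID, endPos);
}
```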

Another potential benefit of this storage mode would be to make state-sync import/export faster. Export might be faster if we could hand off (perhaps hardlinking??) the backing files as state-sync artifacts directly, instead of needing to read from the DB and then write to the disk, or writing to a pipe. Import might be faster if "import" just meant dropping the file in the secondary directory.

My counter-arguments are:

  • I'm not convinced shrinking the SQLite file will significantly improve its performance. I expect removing transcriptItem rows will yield a log(N) speedup of fetching items (e.g. for vat replay), but we only do about 200 of those per vat, and only at kernel startup and vat page-in time.
  • If the state-sync export protocol wants files (as opposed to a pipe), then we'd be comparing a file copy (OS kernel read + OS kernel write) against a DB export (DB read + OS kernel write). And I'd bet (ok maybe only $10) that SQLite would win: it's already got a lot of metadata in RAM, and as the SQLite folks say, they're not competing with MySQL, they're competing with open(2).
    • OTOH, this might be a place where the log(N) of the bloated transcriptItems index bites us, and this would hurt both (DB read + OS kernel write) to a file, and (DB read + OS kernel write) to a pipe
  • The state-sync import protocol still needs to verify that the imported contents match the hash, which means their contents must be read, which means we can't really just drop them into place.
  • landing the transcript-span compression PR (Implement transcript span compression in SwingStore #8693) would shrink the index nearly as well as moving these items out of the DB, giving us most of the potential performance improvements, without changing the durability/atomicity properties
  • finally, we started swingstore with transcript files, not in a database, because we thought we should keep bulk data out of LMDB; when we moved to SQLite, we realized that life really gets simpler when everything is in the database. I don't think we should second-guess the performance of the DB without careful timing measurements, or build a lot of mechanism to avoid using the DB for data that we really want to keep.

next steps

One outcome might be to change the stated intention of this ticket, maybe by renaming it to something like "add a mode to archive deleted historical spans in separate files".

Another might be to make a new ticket with that title, and have @gibson042 look into implementing that sooner, and defer implementing this one ("keep") until/unless we have some performance numbers to show it would be better, and worth the added complexity.

I'm happy either way; I don't want to hijack @mhofman's ticket, but the "keep" behavior is not what I had in mind, and we had some confusion as to what the short-term development goal was. We should nail that down so @gibson042 can work on the right thing (and mark it as closing the right ticket: if this ticket retains @mhofman's intent, and @gibson042 implements archive files, then his work should not close this ticket).

@mhofman (Member, Author) commented Sep 5, 2024

  • the backing store filename should include the span hash

Not strictly needed. If the replacement span has the same start/end positions, it will just overwrite the earlier file; if it has different positions, it will be a different file.

Import might be faster if "import" just meant dropping the file in the secondary directory.

My use case is not so much performance of state-sync import, but the ability to easily reconstitute a "replay" or "archive" mode swing-store from an "operational" one by simply copying some static files. In particular, I hope the replay tool for XS acceptance testing (#6929) could avoid using a really large SQLite file.

  • we started swingstore with transcript files

Did we? I thought transcripts were always in the DB.

  • we realized that life really gets simpler when everything is in the database.

I agree it's vastly simpler for the current span transcript entries. I'm not convinced storing heap snapshots in the DB made things that much simpler, which is roughly the equivalent here.

  • or build a lot of mechanism to avoid using the DB for data that we really want to keep.

Maybe this is something we should talk about with archive node operators, to understand the direction of the cosmos ecosystem here. We currently don't have a runtime need to read this data; it really is fully at rest for current use cases. I expect that archival operations could be optimized by keeping this historical data in separate, non-mutating files.

@warner (Member) commented Sep 6, 2024

  • the backing store filename should include the span hash

Not strictly needed. If the replacement span has the same start/end positions, it will just overwrite the earlier file; if it has different positions, it will be a different file.

Yeah... ok, I see what you mean. The fflush() means the backing-store file might be ahead of the SQLite DB, but never behind (assuming the filesystem maintains ordering of flushes to different files, which I guess the .wal file depends upon anyways). So you might wind up with multiple files with the same startPos but different endPos (if e.g. dirt-based snapshotting means the first pass and the second pass decide to snapshot at different deliveryNums), or, if you happen to snapshot at the same point, you overwrite the original file with the newer (and more likely to reach commit()) contents. In either case, once you make it to the commit() of SQLite, the correct file with the correct contents should already be in place and flushed.

Import might be faster if "import" just meant dropping the file in the secondary directory.

My use case is not so much performance of state-sync import, but the ability to easily reconstitute a "replay" or "archive" mode swing-store from an "operational" one by simply copying some static files. In particular, I hope the replay tool for XS acceptance testing (#6929) could avoid using a really large SQLite file.

Gotcha. Composition of primary and archival data with `cp` rather than `INSERT INTO`.

  • we started swingstore with transcript files

Did we? I thought transcripts were always in the DB.

This was very early, in the LMDB days (May 2021), where we stored the transcript in a flat file and the offset in LMDB. It used a thing named "StreamStore" (see `function writeStreamItem(streamName, item, position)`).

I agree it's vastly simpler for the current span transcript entries. I'm not convinced storing heap snapshots in the DB made things that much simpler, which is roughly the equivalent here.

Hm, maybe. I know managing refcounts on heap snapshots is a lot more robust now: previously we had to be careful about not deleting the old snapshot too early, because we might crash before the DB commit and the restart needs the previous snapshot to start the replay from. Now the lifetime of the snapshot is exactly equal to the lifetime of the span that needs it.

Maybe this is something we should talk about with archive node operators, and understand the direction of the cosmos ecosystem here. We currently don't have a runtime need to read this data. It really is fully at rest for current use cases. I expect that archival operations could be optimized by keeping this historical data in separate non mutating files.

Fair point.

Ok, I'll make a separate ticket for the non-DB write-only thing, in case that's easy to implement in the short term, and then let's continue the conversation about what archive node operators could use best. #10036

@mhofman (Member, Author) commented Sep 6, 2024

Now the lifetime of the snapshot is exactly equal to the lifetime of the span that needs it.

That's independent, and it's because we don't attempt to deduplicate by snapshot content anymore ;) We only manage snapshots based on vatId and position, regardless of their hash.

@warner (Member) commented Sep 6, 2024

Now the lifetime of the snapshot is exactly equal to the lifetime of the span that needs it.

That's independent, and it's because we don't attempt to deduplicate by snapshot content anymore ;) We only manage snapshots based on vatId and position, regardless of their hash.

But, just within a single vat, if we're not retaining all snapshots, we still have to manage the transition from vat1-span2 to vat1-span3:

 SNAP1 -> span1.. -> SNAP2 -> span2.. -> SNAP3

When we finish with span2 and call rolloverSpan, we write SNAP3 to the snapStore, and we delete SNAP2. With a DB, the txn that deletes SNAP2 will atomically include the creation of SNAP3, so there is no danger of winding up with zero snapshots. With discrete files, there's a point where we write (and flush) SNAP3, and then there's a different point where we delete SNAP2.
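For contrast, a rough sketch of the in-DB case, assuming a better-sqlite3 handle and illustrative table/column names; the point is only that inserting SNAP3 and deleting SNAP2 land in the same transaction:

```js
// Sketch only: with everything in SQLite, adding the new snapshot and
// removing the old one commit (or abort) together, so a crash can never
// leave the vat with zero snapshots. Names are illustrative.
function rolloverSnapshot(db, vatID, newPos, newSnapBlob, oldPos) {
  const txn = db.transaction(() => {
    db.prepare(
      'INSERT INTO snapshots (vatID, snapPos, compressedSnapshot) VALUES (?, ?, ?)',
    ).run(vatID, newPos, newSnapBlob);
    db.prepare('DELETE FROM snapshots WHERE vatID = ? AND snapPos = ?').run(
      vatID,
      oldPos,
    );
  });
  txn(); // both statements run in one atomic transaction
}
```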

Before we moved snapStore into SQLite, yes we were also sharing snapshots between vats, but even without that we had to be careful that the code which reacted to refCount===0 would only delete things in the next block. I think we had it delete immediately after the commit, but of course we might crash just after the commit but before we deleted the file. So there was a funky interlock, and delayed deletion, and all of that just went away when we kept everything in a real DB, and had a single commit point for everything.

@warner (Member) commented Sep 6, 2024

Did some digging: `function prepareToDelete(hash)` is involved in the old (January 2023) discrete-file snapshot deletion path. I see a `toDelete` Set that remembers things we might be able to delete; the fact that it's in RAM means we might forget it if we crashed. `snapStore.prepareToDelete(old.snapshotID)` appears to be where we mark it for deletion. Anyways, yeah, this got a lot better when we leaned into the DB and used its atomicity.

@mhofman (Member, Author) commented Sep 6, 2024

With discrete files, there's a point where we write (and flush) SNAP3, and then there's a different point where we delete SNAP2.

Right, with deletion we'd have to be careful to delete after commit to handle abort.

the fact that it's in RAM means we might forget it if we crashed.

Yeah there's that too.

Thankfully we'd be append-only in this case. I don't think we'd want a mode where historical entries ever need to be deleted.

mergify bot added a commit that referenced this issue Sep 6, 2024
Ref #9174
Fixes #9387
Fixes #9386

TODO:
- [ ] #9389

## Description
Adds consensus-independent `vat-snapshot-retention` ("debug" vs. "operational") and `vat-transcript-retention` ("archival" vs. "operational" vs. "default") cosmos-sdk swingset configuration (values chosen to correspond with [`artifactMode`](https://github.com/Agoric/agoric-sdk/blob/master/packages/swing-store/docs/data-export.md#optional--historical-data)) for propagation in AG_COSMOS_INIT. The former defaults to "operational" and the latter defaults to "default", which infers a value from cosmos-sdk `pruning` to allow simple configuration of archiving nodes.

It also updates the semantics of TranscriptStore `keepTranscripts: false` configuration to remove items from only the previously-current span rather than from all previous spans when rolling over (to avoid expensive database churn). Removal of older items can be accomplished by reloading from an export that does not include them.

### Security Considerations
I don't think this changes any relevant security posture.

### Scaling Considerations
This will reduce the SQLite disk usage for any node that is not explicitly configured to retain snapshots and/or transcripts. The latter in particular is expected to have significant benefits for mainnet (as noted in #9174, about 116 GB ÷ 147 GB ≈ 79% of the database on 2024-03-29 was vat transcript items).

### Documentation Considerations
The new fields are documented in our default TOML template, and captured in a JSDoc type on the JavaScript side.

### Testing Considerations
This PR extends TranscriptStore coverage to include `keepTranscripts` true vs. false, but I don't see a good way to cover Go→JS propagation other than manually (which I have done). It should be possible to add testing for the use and validation of `resolvedConfig` in AG_COSMOS_INIT handling, but IMO that is best saved for after completion of split-brain (to avoid issues with same-process Go–JS entanglement).

### Upgrade Considerations
This is all kernel code that can be used at any node restart (i.e., because the configuration is consensus-independent, it doesn't even need to wait for a chain software upgrade). But we should mention the new cosmos-sdk configuration in release notes, because it won't be added to existing app.toml files already in use.