explore sqlite as kernelDB #3087

Closed

warner opened this issue May 12, 2021 · 10 comments

Labels: enhancement (New feature or request), SwingSet (package: SwingSet)


warner commented May 12, 2021

What is the Problem Being Solved?

The swingset kernel currently encodes all of its state in a simple key-value format, with a "schema" (i.e. what we use each key for) informally defined in a large comment in kernelKeeper.js. The kernel requires synchronous access to the DB.

We use LMDB to hold this "kernelDB" by implementing and using the swing-store-lmdb package. We selected LMDB because we found mature Node.js bindings which provided synchronous access. In contrast, the much richer "levelup/leveldown" ecosystem has bindings which are either async, or immature/unsupported.

@dckc and I are both fans of SQLite, and think it could be a good idea to move the kernelDB to it:

  • SQLite provides actual configurable INDEXes, which could help a lot with the "virtual collections" work (range queries, sort options, indices, #2004) that @erights is working on
  • the DB is stored in a single file (vs the extra lock.mdb that LMDB uses)
  • the size of that single file would grow only as new data is added (vs LMDB using a large-but-sparse file, currently 2GiB)
  • I suspect SQLite would have less overhead, and thus would consume less space (I estimate that 60% of the LMDB data.mdb is overhead)
  • I can write tools to read SQLite files in other languages, whereas I've struggled to get e.g. the Python LMDB bindings to read our kernelDB files

However, we haven't yet looked for synchronous SQLite Node.js bindings, so we don't know whether it's really an option or not.

Description of the Design

If we go this direction, we'd start with a swing-store-sqlite that has the same API as swing-store-lmdb, and only offers the same key-value features. Then, if it goes well, we'd abandon swing-store-lmdb, and start adding new features that depend upon SQLite, such as whatever indexing support #2004 needs.
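
To make the shape concrete, here's a minimal sketch of what a key-value-only swing-store-sqlite might look like, assuming a synchronous binding such as better-sqlite3 (identified in a comment below); the table layout and the makeSqliteKVStore name are illustrative, not a settled API:

// Illustrative only: a key-value store with roughly the same shape as
// swing-store-lmdb's kvStore, backed by a single SQLite table.
import Database from 'better-sqlite3';

export function makeSqliteKVStore(dbPath) {
  const db = new Database(dbPath);
  db.exec('CREATE TABLE IF NOT EXISTS kvstore (key TEXT PRIMARY KEY, value TEXT)');
  const sqlGet = db.prepare('SELECT value FROM kvstore WHERE key = ?').pluck();
  const sqlSet = db.prepare(
    'INSERT INTO kvstore (key, value) VALUES (?, ?) ON CONFLICT(key) DO UPDATE SET value = excluded.value',
  );
  const sqlDelete = db.prepare('DELETE FROM kvstore WHERE key = ?');
  return {
    has: key => sqlGet.get(key) !== undefined,
    get: key => sqlGet.get(key),
    set: (key, value) => { sqlSet.run(key, value); },
    delete: key => { sqlDelete.run(key); },
    // commit/close and the getKeys iteration API are omitted from this sketch
  };
}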

Security Considerations

Both SQLite and LMDB are pretty battle-tested libraries, at least in their core C/C++ implementations. The maturity of the bindings will make a big difference, though.

warner added the enhancement and SwingSet labels May 12, 2021

dckc commented Jun 21, 2021

https://www.npmjs.com/package/better-sqlite3 provides synchronous bindings.
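
For reference, a minimal sketch of the synchronous style it offers: no Promises anywhere, and db.transaction wraps an ordinary function in BEGIN/COMMIT (the table and keys below are just for illustration).

import Database from 'better-sqlite3';

const db = new Database('kernel.sqlite');
db.exec('CREATE TABLE IF NOT EXISTS kvstore (key TEXT PRIMARY KEY, value TEXT)');

const set = db.prepare('INSERT OR REPLACE INTO kvstore (key, value) VALUES (?, ?)');
const get = db.prepare('SELECT value FROM kvstore WHERE key = ?');

// Every call below returns directly; nothing is awaited.
const setMany = db.transaction(entries => {
  for (const [k, v] of entries) set.run(k, v);
});
setMany([['v1.vs.counter', '42'], ['v1.vs.limit', '100']]);
console.log(get.get('v1.vs.counter').value); // '42'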


dckc commented Jun 23, 2021

@warner @FUDCo is the "already in use" aspect of a streamStore a required API feature or just a limitation? sqlite3 provides isolation, so concurrent queries are not a problem.

As of 8def34b, sqlStreamStore passes the basic streamStore read/write test, but when I tried the streamStore mode interlock test, it failed to throw. Should I teach it to throw, even though it's capable of concurrent reads?


warner commented Aug 26, 2022

After today's dive (#6056) into streamStore, I've got a renewed interest in merging all aspects of swingstore (kvStore, streamStore, snapStore) into a single SQLite DB. Maybe we can make this happen in the MN-1a timeframe (the update that discards earlier JS state, and switches to a mode where we're relying on retaining the JS state into the future).

The "already in use" checks are not a requirement: they were a holdover from the old file-based streamStore. We'll still access streams with the same modality (read a suffix during transcript replay, then switch into append-only mode as we make new deliveries, then switch back to reading a suffix if/when a worker is evicted and then reloaded).

We can get rid of the awkward startPos/endPos interlock once we have a single commit point. We'll still need to record startPos so a vat upgrade can record the "truncation" of a transcript, but we can use the DB itself to remember the end position (since we no longer have to worry about a crash happening between the streamStore commit and the kvStore commit).

As recorded in #6056 (comment), we'll need to rewrite the CrankBuffer in terms of a SQLite "savepoint", so it includes all the non-kvStore changes.
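
A rough sketch of what a savepoint-based crank buffer could look like with better-sqlite3 (the names here are illustrative, not a settled API):

// Hypothetical crank buffer: each crank runs inside a savepoint nested in the
// block-level transaction, so abandoning a crank discards streamStore and
// snapStore writes along with kvStore writes.
function makeCrankBuffer(db) {
  return {
    startCrank: () => db.exec('SAVEPOINT crank'),
    // keep the crank's writes; they stay uncommitted until the block commit
    commitCrank: () => db.exec('RELEASE SAVEPOINT crank'),
    // discard everything written since startCrank; ROLLBACK TO leaves the
    // savepoint on the stack, so release it afterwards
    abandonCrank: () => {
      db.exec('ROLLBACK TO SAVEPOINT crank');
      db.exec('RELEASE SAVEPOINT crank');
    },
  };
}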


warner commented Aug 26, 2022

Oh, and I'm thinking that we continue to expose only kvStore/streamStore/snapStore to the kernel. We use SQLite to build smarter implementations of these APIs, but we don't start giving full SQL queries to the kernel. We can then start to build more specialized subsets, like maybe queueStore, and a CREATE TABLE vatstore (vatID INTEGER, key VARCHAR, value VARCHAR) to provide a vatStore, so we can stop messing around with manually-assembled kvStore keys.
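
A hedged sketch of that vatstore table and a wrapper around it (the column types and function names are made up for illustration; I've used TEXT vatIDs rather than INTEGER):

// Hypothetical vatStore facet keyed by (vatID, key), replacing the
// manually-assembled `${vatID}.vs.${key}` kvStore keys.
function makeVatStore(db) {
  db.exec(`CREATE TABLE IF NOT EXISTS vatstore (
    vatID TEXT, key TEXT, value TEXT,
    PRIMARY KEY (vatID, key)
  )`);
  const get = db.prepare('SELECT value FROM vatstore WHERE vatID = ? AND key = ?').pluck();
  const set = db.prepare(
    'INSERT INTO vatstore (vatID, key, value) VALUES (?, ?, ?) ON CONFLICT(vatID, key) DO UPDATE SET value = excluded.value',
  );
  const deleteAll = db.prepare('DELETE FROM vatstore WHERE vatID = ?');
  return {
    get: (vatID, key) => get.get(vatID, key),
    set: (vatID, key, value) => { set.run(vatID, key, value); },
    deleteVatStore: vatID => { deleteAll.run(vatID); }, // e.g. when a vat is terminated
  };
}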

And what would be really fun would be if there were some way to safely give each vat worker direct access to their own vatstore: that would remove all of the cross-process overhead of accessing virtualized/durablized data. The kernel never really needs to read from it, and no other vats ever touch it, so the worker has exclusive access. The tricky part is how to unwind any changes the worker makes if the kernel decides to unwind the crank. One expensive option would be to have the kernel hold a read transaction open on the DB, allow the worker to do its thing, then if it unwinds, the kernel uses that pre-modification txn to read out the full contents of the earlier state, and writes it all into a brand new database. (That'd be free for the usual commit case, but really expensive for the unwind case). Oh, but also, we must ensure that a crash before the block commit point does not allow the individual crank vatstore changes to survive.

SQLite has a nifty feature called Attached Databases that allows multiple files to be updated in an atomic fashion (I have no idea how they pull that off). If that could conceivably allow the kernel to open a transaction on the per-vat vatstore.sqlite file, then let the worker do writes that don't get committed until the kernel commits, then that might be a solution. I doubt it works that way, though.
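
For reference, attaching looks like the sketch below (file names hypothetical). My reading of the SQLite docs is that cross-file commits are made atomic with a super-journal when using rollback journals, but are not guaranteed atomic across files in WAL mode, which is probably the deciding caveat for this idea.

// Illustrative only: attach a hypothetical per-vat vatstore file to the main
// swingstore DB and update both inside one transaction.
import Database from 'better-sqlite3';

const db = new Database('swingstore.sqlite');
db.exec('CREATE TABLE IF NOT EXISTS kvstore (key TEXT PRIMARY KEY, value TEXT)');
db.exec(`ATTACH DATABASE 'v1-vatstore.sqlite' AS vat_v1`);
db.exec('CREATE TABLE IF NOT EXISTS vat_v1.vatstore (key TEXT PRIMARY KEY, value TEXT)');

const writeBoth = db.transaction(() => {
  db.prepare('INSERT OR REPLACE INTO kvstore (key, value) VALUES (?, ?)')
    .run('v1.lastDelivery', '17');
  db.prepare('INSERT OR REPLACE INTO vat_v1.vatstore (key, value) VALUES (?, ?)')
    .run('vc.1.counter', '42');
});
writeBoth(); // atomic across both files only in rollback-journal mode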


warner commented Aug 28, 2022

More thoughts:

  • it's probably time to allow swing-store (specifically the SwingStore / HostStorage API) to expose the notion of a crank-buffer, rather than having the kernel implement one by holding back writes and deletes in a RAM buffer.
    • that would add startCrankBuffer, commitCrankBuffer, abandonCrankBuffer to the API
    • the crank-buffer becomes swing-store-wide, not limited to just the kvStore component: when we're done, snapStore and streamStore writes will be buffered in the crank-buffer too
    • the swing-store-provided crank-buffer becomes responsible for computing the crankhash and returning it from commitCrankBuffer, replacing the current practice wherein the kernel's storage wrapper computes it
    • the kernel remains responsible for computing the activityhash
    • kernel DB writes that happen outside of a startCrankBuffer/{commit,abandon}CrankBuffer pair will go directly to the block buffer, so the kernel can write the activityhash. swing-store will hold those writes (and all writes from a committed crank buffer) until the host application commits the block buffer with commit, as usual
    • swing-store can start by implementing the crank-buffer in RAM, just like the kernel's storageWrapper.js does now, but as soon as we have the tools for it, that should change to use a transaction or savepoint, so the data is held on disk instead of RAM (and we trust SQLite to make a sensible choice between the two)
  • swing-store's kvStore provides a getKeys(start, end) iterable. The kernel's wrapper has a gnarly iterator-merger to ensure the crank-buffer's stashed writes are observed in the iteration.
    • this provides the backing for syscall.vatstoreGetAfter(priorKey, lowerBound, upperBound)
    • we do not reveal the iterator across the syscall boundary: vatstoreGetAfter returns just a single entry (or undefined)
    • but vatstoreGetAfter is currently implemented around a retained internal getKeys iterator, along with a huge pile of code to carefully discard that iterator if the vat switches to a different section of the kvstore, or if the next invocation comes from a different vat
    • I'd like all that to go away:
      • swing-store should provide getNext(previousKey, maxKey=undefined) -> undefined | [key, value], instead of getKeys()
      • the SQLite backend does SELECT * FROM kvstore WHERE key > ? AND key < ? ORDER BY key ASC LIMIT 1 with the two key bounds (a sketch of this appears after this list)
      • the LMDB backend (until we finish replacing it) would do something like getKeys(previousKey, maxKey)[Symbol.iterator]().next().value, and does not attempt to retain the iterator
      • a savepoint-based crank buffer does not need to do any special merging of iterators
      • given how far away the consumer of this iterator is, in a vat worker, I doubt the SQLite query execution time will be significant
      • so let's start with the simplest implementation
      • but later, we can do performance experiments to see if retaining the SELECT query statement or caching a batch of responses (and then carefully invalidating the cache if a write or delete happens) is faster
  • the snapStore should retain the current add-snapshot API ("here's a function, please call it with a temporary filename, when the Promise it returns fires, please hash the contents, compress them, store them into the heap_snapshots table under the hex hash key, then delete the temporaries, then fire your own result promise")
    • but the writes go into the crank buffer, and are discarded upon abandonCrankBuffer
    • remember, it's ok to put large binary blobs in SQLite
    • the refcounts are in kvstore, but maybe we should move them into their own table, and change the API to be more like "I'm saving a snapshot for vatID X"
    • then the snapStore would be responsible for tracking the refcounts and deleting unused snapshots
    • this would remove the part where we defer deletions until a subsequent commit (to tolerate the non-atomic snapshots-written vs kvstore-written commit points)
  • the streamStore is already in SQLite, but once we've got the kvstore there too, we can remove the kernel-side endPos stuff (the startPos part remains)
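
A sketch of that simplest getNext implementation, assuming better-sqlite3 (the exact signature and names are still up for discussion):

// Return the first [key, value] with key strictly greater than previousKey
// (and strictly less than maxKey, if one is given), or undefined when done.
function makeGetNext(db) {
  const bounded = db.prepare(
    'SELECT key, value FROM kvstore WHERE key > ? AND key < ? ORDER BY key ASC LIMIT 1',
  );
  const unbounded = db.prepare(
    'SELECT key, value FROM kvstore WHERE key > ? ORDER BY key ASC LIMIT 1',
  );
  return function getNext(previousKey, maxKey = undefined) {
    const row =
      maxKey === undefined ? unbounded.get(previousKey) : bounded.get(previousKey, maxKey);
    return row === undefined ? undefined : [row.key, row.value];
  };
}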


warner commented Sep 5, 2022

Oh, and it might be a good idea to make the kvStore use type BLOB instead of TEXT for the values, so that (1) the values can be arbitrary bytes, and (2) there's no DB-side uncertainty about encoding.
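
Sketch of what that looks like with better-sqlite3, which binds Node Buffers to BLOB values and returns BLOBs as Buffers (illustrative only):

import Database from 'better-sqlite3';

const db = new Database('kernel.sqlite');
// values are raw bytes: no text-encoding decisions made by the DB layer
db.exec('CREATE TABLE IF NOT EXISTS kvstore (key TEXT PRIMARY KEY, value BLOB)');
db.prepare('INSERT OR REPLACE INTO kvstore (key, value) VALUES (?, ?)').run(
  'v1.vs.thing',
  Buffer.from([0x01, 0xff, 0x00, 0x7f]), // arbitrary bytes, not valid UTF-8
);
const buf = db.prepare('SELECT value FROM kvstore WHERE key = ?').pluck().get('v1.vs.thing');
// buf is a Node Buffer containing exactly those four bytes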


warner commented Oct 19, 2022

@FUDCo and I talked through this a bit today. We're exploring the idea of doing this in a couple of steps. For the first, #3087 would be complete when we merge the various swing-store DBs (kvStore, streamStore, snapStore) into a unified SQLite DB, but we'd still have only one such DB, not one per vat. The per-vat vatstore would continue to be managed as partitioned/prefixed keys within the single shared kvstore table (with keys like ${vatID}.vs.${vsKey}). We'd have one kvStore table, one streamStore table, and one snapStore table.

Then, under a separate ticket, the second step would be to break out the vatstore into a separate table, which would include a vatID column, and would no longer insert a ${vatID}.vs. prefix on each key. This would enhance the swingStore API to maybe have a vatStore next to kvStore, whose functions would each take an additional vatID argument. And maybe a vatStore.deleteVatStore(vatID) to delete all the keys when a vat is terminated.

Later, we'd look into enhancing the vatstore API to support refcounts more directly, which would help the kernel (or the vat-worker in #6447) be more involved in efficient cleanup of old data after a vat upgrade. We're thinking one table for all virtual/durable objects (with columns for virtual-vs-durable status, kindID, serialized capdata body, serialized capdata slots, refcounts from ephemeral/RAM objects, refcounts from virtual objects, refcounts from durable objects). That way, after an upgrade, we first DELETE FROM vobjs WHERE ephemeralRefCount=0 AND virtualRefCount=0, and boom there goes the entire first wave (the "burn front") of objects that don't survive the upgrade. Except that first we do a SELECT slots FROM the same set, to get the list of references about to be deleted, so we can walk the list and decref them and iterate the burn front some more. We'd probably have a second table for virtual collections.
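
A speculative sketch of that burn front (the table and column names below just mirror the description above; this is not a worked-out schema):

// Hypothetical virtual-object table plus one pass of the "burn front": grab
// the outgoing slots of the about-to-die objects, delete them, and hand the
// slots back so the caller can decref them and iterate.
import Database from 'better-sqlite3';

const db = new Database('swingstore.sqlite');
db.exec(`CREATE TABLE IF NOT EXISTS vobjs (
  vatID TEXT, vref TEXT, durable INTEGER, kindID TEXT,
  body TEXT, slots TEXT,
  ephemeralRefCount INTEGER, virtualRefCount INTEGER, durableRefCount INTEGER,
  PRIMARY KEY (vatID, vref)
)`);

function burnFrontOnce(vatID) {
  const doomed = 'vatID = ? AND ephemeralRefCount = 0 AND virtualRefCount = 0';
  const slots = db
    .prepare(`SELECT slots FROM vobjs WHERE ${doomed}`)
    .pluck()
    .all(vatID);
  db.prepare(`DELETE FROM vobjs WHERE ${doomed}`).run(vatID);
  return slots; // caller decrefs these, then calls burnFrontOnce again
}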

Later still, #6254 would move the kernel-side shared vatstore tables (with their vatID column) into a separate DB-per-vat (without a vatID column), about the same time we do #6447 (which really kinda requires each worker to have exclusive access to its own DB).

The issues I can think of that need to be solved to make the first step are:

  • remove the LMDB-specific kvStore.getKeys() API, in favor of a non-iterating getNext() one ("swingstore: replace kvStore.getKeys with getNext", #6468)
  • move the crankBuffer into swing-store (instead of the kernel-side kvStore wrapper), use SQLite "savepoints"
    • so something like swingStore.startCrankBuffer() / .abortCrankBuffer() / .commitCrankBuffer()
  • remove the snapStore dance with pendingDeletes, since we can precisely remove unused snapshots in the same commit that changes the list of vats which reference them
  • choosing an appropriate durability mode (WAL/etc) and commit cadence
    • for the first step, I think we should use WAL mode with synchronous=FULL, which provides durability against power failures, and seems to achieve nearly the same rate of commits (500/s) as LMDB did (500-700/s) on the network-based filesystem that Google Cloud instances offer by default (a configuration sketch follows this list)
    • the host (cosmic-swingset) does the single real COMMIT, and the kernel gets a crankBuffer API which uses savepoints (which do not, and must not, do a real commit)
    • which means we're only really doing one SQLite/kernelDB commit per block, so like once every 5 seconds
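
A sketch of that configuration with better-sqlite3 (the pragmas are standard SQLite; the surrounding function names are illustrative):

import Database from 'better-sqlite3';

const db = new Database('swingstore.sqlite');
// durability settings discussed above: write-ahead log, full fsync on commit
db.pragma('journal_mode = WAL');
db.pragma('synchronous = FULL');

// the host does the one real commit per block (roughly every 5 seconds);
// crank-level buffering uses savepoints (see the earlier sketch) and never
// issues a real COMMIT
function startBlock() { db.exec('BEGIN'); }
function commitBlock() { db.exec('COMMIT'); }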


FUDCo commented Oct 21, 2022

In the course of its evolution, the vatstore has morphed from a general purpose store available to vat code for whatever purposes it chooses into a specialized store accessible only to liveslots. However, the vatstore syscall API still presents it as a general purpose string-to-string key-value store, in line with the original concept for its use. In practice, aside from a small number of simple bookkeeping keys, liveslots' use of the vatstore is for the implementation of more complex persistent abstractions such as virtual objects, collections, and watched promises, for which a simple key-value store is not ideal. Liveslots does various key string encoding tricks to shoehorn the storage for these abstractions and their associated metadata into the key-value patterns that the current syscall API supports.

If, as anticipated in the comment above, we wish to exploit the more sophisticated storage affordances of Sqlite or another relational store in order to more directly implement higher level persistence abstractions, then we must either (a) relocate the vatstore into a database accessible directly by the worker process as anticipated in #6254, in which case we can provide a local, liveslots-specific store API whose implementation makes use of the native Sqlite (or whatever) database API, or (b) augment the syscall vatstore interface with additional calls that reflect the liveslots-specific API affordances that would be added in (a).

My take is that course (b) is undesirable for several reasons:

  • it would entail the incorporation of liveslots' persistence abstractions into the kernel
  • adding to the syscall API is tedious and error prone since it requires making a bunch of coordinated changes to several different places in both the kernel and worker management code
  • assuming we expect to eventually proceed with some variant of #6254 ("idea for giving each worker its own (local) SQLite vatStore DB"), once the database is relocated we'd then have to reimplement the liveslots store API on the worker side and discard the augmented syscall functionality that we'd just added

This suggests that the sequence of steps is:

  1. Replace the LMDB kvStore implementation with a comparable Sqlite implementation that does the same thing (i.e., implements a simple string-to-string key-value store) and retains the same interface (modulo possible tweaks we might have to make to the iteration API (i.e., vatstore.getAfter) to work well with a SQL database), as described above, along with all the other changes that Brian describes.
  2. Change the vatstore API implementation to use its own table that encodes the vatID as its own column instead of using key prefixing, also as described above.
  3. Give each vat worker process its own Sqlite instance and move the vatstore table into it (at this point you can get rid of the vatID column) and remove the vatstore calls from the syscall API and hence from the transcript (the big win for this step is that it removes the major remaining way for GC non-determinism to leak into the consensus state). We'd also want to relocate the heap snapshots, transcript store, and vat-related bundles from the kernel db into the vat's db at this time. This would also entail whatever (possibly hairy) changes we need to make to the crank and block commit machinery to ensure consistency is maintained even though we've now got multiple databases.
  4. Incrementally refactor virtual objects, collections, and promise watchers to all exploit the affordances of the SQL data store. Mostly this consists of migrating an object's metadata into the database row that describes the object rather than storing it as a series of separate KV entries using key prefix trickery. We should also be able to track the order-of-insertion ordinals for collection entries as part of the entries themselves rather than requiring a two-step lookup to iterate collections. However, the main win here will be to enable practical GC of persistent ephemeral data objects when a vat is upgraded. It may also be possible to index collections more efficiently than the current rank order gimmickry allows for. (A speculative table sketch appears at the end of this comment.)

Note that relocating vat-specific storage into the worker processes' own databases should be able to be done without committing to any of the strategies for concurrent vat execution that we've been contemplating, though it is likely to be a prerequisite to being able to implement any of them.
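
To illustrate the step 4 idea about collection entries, here is a purely speculative sketch of a table that carries its own insertion ordinal, so ordered iteration becomes a single indexed query (all names invented for illustration):

// Speculative collection-entry table: the order-of-insertion ordinal lives in
// the row itself rather than behind a separate ordinal-lookup key.
import Database from 'better-sqlite3';

const db = new Database('v1-vatstore.sqlite');
db.exec(`CREATE TABLE IF NOT EXISTS collectionEntries (
  collectionID TEXT,
  keyRank BLOB,        -- encoded key (or real columns per key type)
  ordinal INTEGER,     -- assigned at insertion time
  body TEXT,           -- serialized capdata body
  slots TEXT,          -- serialized capdata slots
  PRIMARY KEY (collectionID, keyRank)
)`);
db.exec(`CREATE INDEX IF NOT EXISTS collectionByOrdinal
  ON collectionEntries (collectionID, ordinal)`);

// iterate one collection in insertion order with a single indexed scan
const byInsertion = db.prepare(
  'SELECT keyRank, body, slots FROM collectionEntries WHERE collectionID = ? ORDER BY ordinal',
);
for (const entry of byInsertion.iterate('vc.1')) {
  // entry.keyRank, entry.body, entry.slots
}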


warner commented Jan 24, 2023

I think we can close this, now that #6561 and #6742 / #6755 have landed. Further explorations of the swing-store are now part of #6677.

warner closed this as completed Jan 24, 2023