explore sqlite as kernelDB #3087

Closed

warner opened this issue May 12, 2021 · 10 comments

Labels: enhancement (New feature or request), SwingSet (package: SwingSet)


warner commented May 12, 2021

What is the Problem Being Solved?

The swingset kernel currently encodes all of its state in a simple key-value format, with a "schema" (i.e. what we use each key for) informally defined in a large comment in kernelKeeper.js. The kernel requires synchronous access to the DB.

We use LMDB to hold this "kernelDB" by implementing and using the swing-store-lmdb package. We selected LMDB because we found mature Node.js bindings which provided synchronous access. In contrast, the much richer "levelup/leveldown" ecosystem has bindings which are either async, or immature/unsupported.

@dckc and I are both fans of SQLite, and think it could be a good idea to move the kernelDB to it:

  • SQLite provides actual configurable INDEXes, which could help a lot with the "virtual collections" work (range queries, sort options, indices, #2004) that @erights is working on
  • the DB is stored in a single file (vs the extra lock.mdb that LMDB uses)
  • the size of that single file would grow only as new data is added (vs LMDB using a large-but-sparse file, currently 2GiB)
  • I suspect SQLite would have less overhead, and thus would consume less space (I estimate that 60% of the LMDB data.mdb is overhead)
  • I can write tools to read SQLite files in other languages, whereas I've struggled to get e.g. the Python LMDB bindings to read our kernelDB files

However, we haven't yet looked for synchronous SQLite Node.js bindings, so we don't know whether it's really an option or not.

Description of the Design

If we go this direction, we'd start with a swing-store-sqlite that has the same API as swing-store-lmdb, and only offers the same key-value features. Then, if it goes well, we'd abandon swing-store-lmdb, and start adding new features that depend upon SQLite, such as whatever indexing support #2004 needs.
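
To make the shape concrete, here's a minimal sketch of what a key-value-only swing-store-sqlite might look like, assuming a synchronous binding such as better-sqlite3 (identified in a comment below); the table layout and the makeSqliteKVStore name are illustrative, not a settled API:

// Illustrative only: a key-value store with roughly the same shape as
// swing-store-lmdb's kvStore, backed by a single SQLite table.
import Database from 'better-sqlite3';

export function makeSqliteKVStore(dbPath) {
  const db = new Database(dbPath);
  db.exec('CREATE TABLE IF NOT EXISTS kvstore (key TEXT PRIMARY KEY, value TEXT)');
  const sqlGet = db.prepare('SELECT value FROM kvstore WHERE key = ?').pluck();
  const sqlSet = db.prepare(
    'INSERT INTO kvstore (key, value) VALUES (?, ?) ON CONFLICT(key) DO UPDATE SET value = excluded.value',
  );
  const sqlDelete = db.prepare('DELETE FROM kvstore WHERE key = ?');
  return {
    has: key => sqlGet.get(key) !== undefined,
    get: key => sqlGet.get(key),
    set: (key, value) => { sqlSet.run(key, value); },
    delete: key => { sqlDelete.run(key); },
    // commit/close and the getKeys iteration API are omitted from this sketch
  };
}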

Security Considerations

Both SQLite and LMDB are pretty battle-tested libraries, at least in their core C/C++ implementations. The maturity of the bindings will make a big difference, though.

warner added the enhancement and SwingSet labels May 12, 2021

dckc commented Jun 21, 2021

https://www.npmjs.com/package/better-sqlite3 provides synchronous bindings.
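
For reference, a minimal sketch of the synchronous style it offers: no Promises anywhere, and db.transaction wraps an ordinary function in BEGIN/COMMIT (the table and keys below are just for illustration).

import Database from 'better-sqlite3';

const db = new Database('kernel.sqlite');
db.exec('CREATE TABLE IF NOT EXISTS kvstore (key TEXT PRIMARY KEY, value TEXT)');

const set = db.prepare('INSERT OR REPLACE INTO kvstore (key, value) VALUES (?, ?)');
const get = db.prepare('SELECT value FROM kvstore WHERE key = ?');

// Every call below returns directly; nothing is awaited.
const setMany = db.transaction(entries => {
  for (const [k, v] of entries) set.run(k, v);
});
setMany([['v1.vs.counter', '42'], ['v1.vs.limit', '100']]);
console.log(get.get('v1.vs.counter').value); // '42'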


dckc commented Jun 23, 2021

@warner @FUDCo is the "already in use" aspect of a streamStore a required API feature or just a limitation? sqlite3 provides isolation, so concurrent queries are not a problem.

As of 8def34b, sqlStreamStore passes the basic streamStore read/write test, but when I tried the streamStore mode interlock test, it failed to throw. Should I teach it to throw, even though it's capable of concurrent reads?


warner commented Aug 26, 2022

After today's dive (#6056) into streamStore, I've got a renewed interest in merging all aspects of swingstore (kvStore, streamStore, snapStore) into a single SQLite DB. Maybe we can make this happen in the MN-1a timeframe (the update that discards earlier JS state, and switches to a mode where we're relying on retaining the JS state into the future).

The "already in use" checks are not a requirement: they were a holdover from the old file-based streamStore. We'll still access streams with the same modality (read a suffix during transcript replay, then switch into append-only mode as we make new deliveries, then switch back to reading a suffix if/when a worker is evicted and then reloaded).

We can get rid of the awkward startPos/endPos interlock once we have a single commit point. We'll still need to record startPos so a vat upgrade can record the "truncation" of a transcript, but we can use the DB itself to remember the end position (since we no longer have to worry about a crash happening between the streamStore commit and the kvStore commit).

As recorded in #6056 (comment), we'll need to rewrite the CrankBuffer in terms of a SQLite "savepoint", so it includes all the non-kvStore changes.
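
A rough sketch of what a savepoint-based crank buffer could look like with better-sqlite3 (the names here are illustrative, not a settled API):

// Hypothetical crank buffer: each crank runs inside a savepoint nested in the
// block-level transaction, so abandoning a crank discards streamStore and
// snapStore writes along with kvStore writes.
function makeCrankBuffer(db) {
  return {
    startCrank: () => db.exec('SAVEPOINT crank'),
    // keep the crank's writes; they stay uncommitted until the block commit
    commitCrank: () => db.exec('RELEASE SAVEPOINT crank'),
    // discard everything written since startCrank; ROLLBACK TO leaves the
    // savepoint on the stack, so release it afterwards
    abandonCrank: () => {
      db.exec('ROLLBACK TO SAVEPOINT crank');
      db.exec('RELEASE SAVEPOINT crank');
    },
  };
}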


warner commented Aug 26, 2022

Oh, and I'm thinking that we continue to expose only kvStore/streamStore/snapStore to the kernel. We use SQLite to build smarter implementations of these APIs, but we don't start giving full SQL queries to the kernel. We can then start to build more specialized subsets, like maybe queueStore, and a CREATE TABLE vatstore (vatID INTEGER, key VARCHAR, value VARCHAR) to provide a vatStore, so we can stop messing around with manually-assembled kvStore keys.
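
A hedged sketch of that vatstore table and a wrapper around it (the column types and function names are made up for illustration; I've used TEXT vatIDs rather than INTEGER):

// Hypothetical vatStore facet keyed by (vatID, key), replacing the
// manually-assembled `${vatID}.vs.${key}` kvStore keys.
function makeVatStore(db) {
  db.exec(`CREATE TABLE IF NOT EXISTS vatstore (
    vatID TEXT, key TEXT, value TEXT,
    PRIMARY KEY (vatID, key)
  )`);
  const get = db.prepare('SELECT value FROM vatstore WHERE vatID = ? AND key = ?').pluck();
  const set = db.prepare(
    'INSERT INTO vatstore (vatID, key, value) VALUES (?, ?, ?) ON CONFLICT(vatID, key) DO UPDATE SET value = excluded.value',
  );
  const deleteAll = db.prepare('DELETE FROM vatstore WHERE vatID = ?');
  return {
    get: (vatID, key) => get.get(vatID, key),
    set: (vatID, key, value) => { set.run(vatID, key, value); },
    deleteVatStore: vatID => { deleteAll.run(vatID); }, // e.g. when a vat is terminated
  };
}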

And what would be really fun would be if there were some way to safely give each vat worker direct access to their own vatstore: that would remove all of the cross-process overhead of accessing virtualized/durablized data. The kernel never really needs to read from it, and no other vats ever touch it, so the worker has exclusive access. The tricky part is how to unwind any changes the worker makes if the kernel decides to unwind the crank. One expensive option would be to have the kernel hold a read transaction open on the DB, allow the worker to do its thing, then if it unwinds, the kernel uses that pre-modification txn to read out the full contents of the earlier state, and writes it all into a brand new database. (That'd be free for the usual commit case, but really expensive for the unwind case). Oh, but also, we must ensure that a crash before the block commit point does not allow the individual crank vatstore changes to survive.

SQLite has a nifty feature called Attached Databases that allows multiple files to be updated in an atomic fashion (I have no idea how they pull that off). If that could conceivably allow the kernel to open a transaction on the per-vat vatstore.sqlite file, then let the worker do writes that don't get committed until the kernel commits, then that might be a solution. I doubt it works that way, though.
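
For reference, attaching looks like the sketch below (file names hypothetical). My reading of the SQLite docs is that cross-file commits are made atomic with a super-journal when using rollback journals, but are not guaranteed atomic across files in WAL mode, which is probably the deciding caveat for this idea.

// Illustrative only: attach a hypothetical per-vat vatstore file to the main
// swingstore DB and update both inside one transaction.
import Database from 'better-sqlite3';

const db = new Database('swingstore.sqlite');
db.exec('CREATE TABLE IF NOT EXISTS kvstore (key TEXT PRIMARY KEY, value TEXT)');
db.exec(`ATTACH DATABASE 'v1-vatstore.sqlite' AS vat_v1`);
db.exec('CREATE TABLE IF NOT EXISTS vat_v1.vatstore (key TEXT PRIMARY KEY, value TEXT)');

const writeBoth = db.transaction(() => {
  db.prepare('INSERT OR REPLACE INTO kvstore (key, value) VALUES (?, ?)')
    .run('v1.lastDelivery', '17');
  db.prepare('INSERT OR REPLACE INTO vat_v1.vatstore (key, value) VALUES (?, ?)')
    .run('vc.1.counter', '42');
});
writeBoth(); // atomic across both files only in rollback-journal mode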


warner commented Aug 28, 2022

More thoughts:

  • it's probably time to allow swing-store (specifically the SwingStore / HostStorage API) to expose the notion of a crank-buffer, rather than having the kernel implement one by holding back writes and deletes in a RAM buffer.
    • that would add startCrankBuffer, commitCrankBuffer, abandonCrankBuffer to the API
    • the crank-buffer becomes swing-store-wide, not limited to just the kvStore component: when we're done, snapStore and streamStore writes will be buffered in the crank-buffer too
    • the swing-store-provided crank-buffer becomes responsible for computing the crankhash and returning it from commitCrankBuffer, replacing the current practice wherein the kernel's storage wrapper computes it
    • the kernel remains responsible for computing the activityhash
    • kernel DB writes that happen outside of a startCrankBuffer/{commit,abandon}CrankBuffer pair will go directly to the block buffer, so the kernel can write the activityhash. swing-store will hold those writes (and all writes from a committed crank buffer) until the host application commits the block buffer with commit, as usual
    • swing-store can start by implementing the crank-buffer in RAM, just like the kernel's storageWrapper.js does now, but as soon as we have the tools for it, that should change to use a transaction or savepoint, so the data is held on disk instead of RAM (and we trust SQLite to make a sensible choice between the two)
  • swing-store's kvStore provides a getKeys(start, end) iterable. The kernel's wrapper has a gnarly iterator-merger to ensure the crank-buffer's stashed writes are observed in the iteration.
    • this provides the backing for syscall.vatstoreGetAfter(priorKey, lowerBound, upperBound)
    • we do not reveal the iterator across the syscall boundary: vatstoreGetAfter returns just a single entry (or undefined)
    • but vatstoreGetAfter is currently implemented around a retained internal getKeys iterator, along with a huge pile of code to carefully discard that iterator if the vat switches to a different section of the kvstore, or if the next invocation comes from a different vat
    • I'd like all that to go away:
      • swing-store should provide getNext(previousKey, maxKey=undefined) -> undefined | [key, value], instead of getKeys()
      • the SQLite backend does SELECT * FROM kvstore WHERE key > ? AND key < ? ORDER BY key ASC LIMIT 1 with the two key bounds (a sketch of this appears after this list)
      • the LMDB backend (until we finish replacing it) would do something like getKeys(previousKey, maxKey)[Symbol.iterator]().next().value, and does not attempt to retain the iterator
      • a savepoint-based crank buffer does not need to do any special merging of iterators
      • given how far away the consumer of this iterator is, in a vat worker, I doubt the SQLite query execution time will be significant
      • so let's start with the simplest implementation
      • but later, we can do performance experiments to see if retaining the SELECT query statement or caching a batch of responses (and then carefully invalidating the cache if a write or delete happens) is faster
  • the snapStore should retain the current add-snapshot API ("here's a function, please call it with a temporary filename, when the Promise it returns fires, please hash the contents, compress them, store them into the heap_snapshots table under the hex hash key, then delete the temporaries, then fire your own result promise")
    • but the writes go into the crank buffer, and are discarded upon abandonCrankBuffer
    • remember, it's ok to put large binary blobs in SQLite
    • the refcounts are in kvstore, but maybe we should move them into their own table, and change the API to be more like "I'm saving a snapshot for vatID X"
    • then the snapStore would be responsible for tracking the refcounts and deleting unused snapshots
    • this would remove the part where we defer deletions until a subsequent commit (to tolerate the non-atomic snapshots-written vs kvstore-written commit points)
  • the streamStore is already in SQLite, but once we've got the kvstore there too, we can remove the kernel-side endPos stuff (the startPos part remains)
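
A sketch of that simplest getNext implementation, assuming better-sqlite3 (the exact signature and names are still up for discussion):

// Return the first [key, value] with key strictly greater than previousKey
// (and strictly less than maxKey, if one is given), or undefined when done.
function makeGetNext(db) {
  const bounded = db.prepare(
    'SELECT key, value FROM kvstore WHERE key > ? AND key < ? ORDER BY key ASC LIMIT 1',
  );
  const unbounded = db.prepare(
    'SELECT key, value FROM kvstore WHERE key > ? ORDER BY key ASC LIMIT 1',
  );
  return function getNext(previousKey, maxKey = undefined) {
    const row =
      maxKey === undefined ? unbounded.get(previousKey) : bounded.get(previousKey, maxKey);
    return row === undefined ? undefined : [row.key, row.value];
  };
}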


warner commented Sep 5, 2022

Oh, and it might be a good idea to make the kvStore use type BLOB instead of TEXT for the values, so that (1) the values can be arbitrary bytes, and (2) there's no DB-side uncertainty about encoding.
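
Sketch of what that looks like with better-sqlite3, which binds Node Buffers to BLOB values and returns BLOBs as Buffers (illustrative only):

import Database from 'better-sqlite3';

const db = new Database('kernel.sqlite');
// values are raw bytes: no text-encoding decisions made by the DB layer
db.exec('CREATE TABLE IF NOT EXISTS kvstore (key TEXT PRIMARY KEY, value BLOB)');
db.prepare('INSERT OR REPLACE INTO kvstore (key, value) VALUES (?, ?)').run(
  'v1.vs.thing',
  Buffer.from([0x01, 0xff, 0x00, 0x7f]), // arbitrary bytes, not valid UTF-8
);
const buf = db.prepare('SELECT value FROM kvstore WHERE key = ?').pluck().get('v1.vs.thing');
// buf is a Node Buffer containing exactly those four bytes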


warner commented Oct 19, 2022

@FUDCo and I talked through this a bit today. We're exploring the idea of doing this in a couple of steps. For the first, #3087 would be complete when we merge the various swing-store DBs (kvStore, streamStore, snapStore) into a unified SQLite DB, but we'd still have only one such DB, not one per vat. The per-vat vatstore would continue to be managed as partitioned/prefixed keys within the single shared kvstore table (with keys like ${vatID}.vs.${vsKey}). We'd have one kvStore table, one streamStore table, and one snapStore table.

Then, under a separate ticket, the second step would be to break out the vatstore into a separate table, which would include a vatID column, and would no longer insert a ${vatID}.vs. prefix on each key. This would enhance the swingStore API to maybe have a vatStore next to kvStore, whose functions would each take an additional vatID argument. And maybe a vatStore.deleteVatStore(vatID) to delete all the keys when a vat is terminated.

Later, we'd look into enhancing the vatstore API to support refcounts more directly, which would help the kernel (or the vat-worker in #6447) be more involved in efficient cleanup of old data after a vat upgrade. We're thinking one table for all virtual/durable objects (with columns for virtual-vs-durable status, kindID, serialized capdata body, serialized capdata slots, refcounts from ephemeral/RAM objects, refcounts from virtual objects, refcounts from durable objects). That way, after an upgrade, we first DELETE FROM vobjs WHERE ephemeralRefCount=0 AND virtualRefCount=0, and boom there goes the entire first wave (the "burn front") of objects that don't survive the upgrade. Except that first we do a SELECT slots FROM the same set, to get the list of references about to be deleted, so we can walk the list and decref them and iterate the burn front some more. We'd probably have a second table for virtual collections.
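
A speculative sketch of that burn front (the table and column names below just mirror the description above; this is not a worked-out schema):

// Hypothetical virtual-object table plus one pass of the "burn front": grab
// the outgoing slots of the about-to-die objects, delete them, and hand the
// slots back so the caller can decref them and iterate.
import Database from 'better-sqlite3';

const db = new Database('swingstore.sqlite');
db.exec(`CREATE TABLE IF NOT EXISTS vobjs (
  vatID TEXT, vref TEXT, durable INTEGER, kindID TEXT,
  body TEXT, slots TEXT,
  ephemeralRefCount INTEGER, virtualRefCount INTEGER, durableRefCount INTEGER,
  PRIMARY KEY (vatID, vref)
)`);

function burnFrontOnce(vatID) {
  const doomed = 'vatID = ? AND ephemeralRefCount = 0 AND virtualRefCount = 0';
  const slots = db
    .prepare(`SELECT slots FROM vobjs WHERE ${doomed}`)
    .pluck()
    .all(vatID);
  db.prepare(`DELETE FROM vobjs WHERE ${doomed}`).run(vatID);
  return slots; // caller decrefs these, then calls burnFrontOnce again
}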

Later still, #6254 would move the kernel-side shared vatstore tables (with their vatID column) into a separate DB-per-vat (without a vatID column), about the same time we do #6447 (which really kinda requires each worker to have exclusive access to its own DB).

The issues I can think of that need to be solved to make the first step are:

  • remove the LMDB-specific kvStore.getKeys() API, in favor of a non-iterating getNext() one ("swingstore: replace kvStore.getKeys with getNext", #6468)
  • move the crankBuffer into swing-store (instead of the kernel-side kvStore wrapper), use SQLite "savepoints"
    • so something like swingStore.startCrankBuffer() / .abortCrankBuffer() / .commitCrankBuffer()
  • remove the snapStore dance with pendingDeletes, since we can precisely remove unused snapshots in the same commit that changes the list of vats which reference them
  • choosing an appropriate durability mode (WAL/etc) and commit cadence
    • for the first step, I think we should use WAL mode with synchronous=FULL, which provides durability against power failures, and seems to achieve nearly the same rate of commits (500/s) as LMDB did (500-700/s) on the network-based filesystem that Google Cloud instances offer by default (a configuration sketch follows this list)
    • the host (cosmic-swingset) does the single real COMMIT, and the kernel gets a crankBuffer API which uses savepoints (which do not, and must not, do a real commit)
    • which means we're only really doing one SQLite/kernelDB commit per block, so like once every 5 seconds
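
A sketch of that configuration with better-sqlite3 (the pragmas are standard SQLite; the surrounding function names are illustrative):

import Database from 'better-sqlite3';

const db = new Database('swingstore.sqlite');
// durability settings discussed above: write-ahead log, full fsync on commit
db.pragma('journal_mode = WAL');
db.pragma('synchronous = FULL');

// the host does the one real commit per block (roughly every 5 seconds);
// crank-level buffering uses savepoints (see the earlier sketch) and never
// issues a real COMMIT
function startBlock() { db.exec('BEGIN'); }
function commitBlock() { db.exec('COMMIT'); }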


FUDCo commented Oct 21, 2022

In the course of its evolution, the vatstore has morphed from a general purpose store available to vat code for whatever purposes it chooses into a specialized store accessible only to liveslots. However, the vatstore syscall API still presents it as a general purpose string-to-string key-value store, in line with the original concept for its use. In practice, aside from a small number of simple bookkeeping keys, liveslots' use of the vatstore is for the implementation of more complex persistent abstractions such as virtual objects, collections, and watched promises, for which a simple key-value store is not ideal. Liveslots does various key string encoding tricks to shoehorn the storage for these abstractions and their associated metadata into the key-value patterns that the current syscall API supports.

If, as anticipated in the comment above, we wish to exploit the more sophisticated storage affordances of Sqlite or another relational store in order to more directly implement higher level persistence abstractions, then we must either (a) relocate the vatstore into a database accessible directly by the worker process as anticipated in #6254, in which case we can provide a local, liveslots-specific store API whose implementation makes use of the native Sqlite (or whatever) database API, or (b) augment the syscall vatstore interface with additional calls that reflect the liveslots-specific API affordances that would be added in (a).

My take is that course (b) is undesirable for several reasons:

  • it would entail the incorporation of liveslots' persistence abstractions into the kernel
  • adding to the syscall API is tedious and error prone since it requires making a bunch of coordinated changes to several different places in both the kernel and worker management code
  • assuming we expect to eventually proceed with some variant of #6254 ("idea for giving each worker its own (local) SQLite vatStore DB"), once the database is relocated we'd then have to reimplement the liveslots store API on the worker side and discard the augmented syscall functionality that we'd just added

This suggests that the sequence of steps is:

  1. Replace the LMDB kvStore implementation with a comparable Sqlite implementation that does the same thing (i.e., implements a simple string-to-string key-value store) and retains the same interface (modulo possible tweaks we might have to make to the iteration API (i.e., vatstore.getAfter) to work well with a SQL database), as described above, along with all the other changes that Brian describes.
  2. Change the vatstore API implementation to use its own table that encodes the vatID as its own column instead of using key prefixing, also as described above.
  3. Give each vat worker process its own Sqlite instance and move the vatstore table into it (at this point you can get rid of the vatID column) and remove the vatstore calls from the syscall API and hence from the transcript (the big win for this step is that it removes the major remaining way for GC non-determinism to leak into the consensus state). We'd also want to relocate the heap snapshots, transcript store, and vat-related bundles from the kernel db into the vat's db at this time. This would also entail whatever (possibly hairy) changes we need to make to the crank and block commit machinery to ensure consistency is maintained even though we've now got multiple databases.
  4. Incrementally refactor virtual objects, collections, and promise watchers to all exploit the affordances of the SQL data store. Mostly this consists of migrating an object's metadata into the database row that describes the object rather than storing it as a series of separate KV entries using key prefix trickery. We should also be able to track the order-of-insertion ordinals for collection entries as part of the entries themselves rather than requiring a two-step lookup to iterate collections. However, the main win here will be to enable practical GC of persistent ephemeral data objects when a vat is upgraded. It may also be possible to index collections more efficiently than the current rank order gimmickry allows for. (A speculative table sketch appears at the end of this comment.)

Note that relocating vat-specific storage into the worker processes' own databases should be able to be done without committing to any of the strategies for concurrent vat execution that we've been contemplating, though it is likely to be a prerequisite to being able to implement any of them.
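
To illustrate the step 4 idea about collection entries, here is a purely speculative sketch of a table that carries its own insertion ordinal, so ordered iteration becomes a single indexed query (all names invented for illustration):

// Speculative collection-entry table: the order-of-insertion ordinal lives in
// the row itself rather than behind a separate ordinal-lookup key.
import Database from 'better-sqlite3';

const db = new Database('v1-vatstore.sqlite');
db.exec(`CREATE TABLE IF NOT EXISTS collectionEntries (
  collectionID TEXT,
  keyRank BLOB,        -- encoded key (or real columns per key type)
  ordinal INTEGER,     -- assigned at insertion time
  body TEXT,           -- serialized capdata body
  slots TEXT,          -- serialized capdata slots
  PRIMARY KEY (collectionID, keyRank)
)`);
db.exec(`CREATE INDEX IF NOT EXISTS collectionByOrdinal
  ON collectionEntries (collectionID, ordinal)`);

// iterate one collection in insertion order with a single indexed scan
const byInsertion = db.prepare(
  'SELECT keyRank, body, slots FROM collectionEntries WHERE collectionID = ? ORDER BY ordinal',
);
for (const entry of byInsertion.iterate('vc.1')) {
  // entry.keyRank, entry.body, entry.slots
}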


warner commented Jan 24, 2023

I think we can close this, now that #6561 and #6742 / #6755 have landed. Further explorations of the swing-store are now part of #6677.

warner closed this as completed Jan 24, 2023