idea for giving each worker its own (local) SQLite vatStore DB #6254
Comments
One potential downside would be an expansion of DB state for the …

When we implement zygotes, we'll want to think about how these snapshots are stored. I've been thinking that zygotes should be regular vats, to which we just agree to not send any messages (thus preserving their state, unmodified from one clone to the next). But I think @dtribble has argued compellingly that we should have a special "freeze" or "zygote-ify" operation that effectively terminates the original instance, and simultaneously creates an artifact that can be used for subsequent clones. There are details to manage w.r.t. the c-list (the now-frozen zygote's imports must still be kept alive, so the zygote gets refcounts even though you can't send messages to it anymore). But if we have an explicit operation for this, then it could also trigger a cross-DB copy. The old parent-vat's worker is told to build a heap snapshot (along with everything else it needs, like an initialized vatstore) and then just …
If the underlying OS supports copy-on-write (which I think the ones we care about do) then we can just use the database file itself. No need to mess with …
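For what it's worth, on filesystems with copy-on-write support, Node can request exactly that kind of clone of the database file; a tiny hypothetical sketch (the paths are invented):

```js
// Sketch: clone the parent vat's SQLite file as a copy-on-write reflink.
// COPYFILE_FICLONE_FORCE fails loudly on filesystems without CoW support.
import fs from 'node:fs';

const src = '/swingset/vats/v5-parent/vatstore.sqlite';
const dst = '/swingset/vats/v9-clone/vatstore.sqlite';
fs.copyFileSync(src, dst, fs.constants.COPYFILE_FICLONE_FORCE);
```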
Are you thinking about the …
I just noticed SQLite has a proper "safely clone a DB" operation, and it can run in the background, which might help (although that introduces some uncertainty about when the zygote could be used; we might need to finesse that somehow). It probably doesn't use any kernel-based speedup tricks, though. And the output is not going to be as deterministic as a …
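For concreteness, a minimal sketch of driving that kind of clone from the worker side, assuming better-sqlite3's wrapper around SQLite's online-backup mechanism (names and paths invented):

```js
// Sketch: clone a live vatstore DB using SQLite's online-backup mechanism,
// exposed by better-sqlite3 as db.backup(), which runs incrementally in the
// background.
import Database from 'better-sqlite3';

const parentDB = new Database('/swingset/vats/v5-parent/vatstore.sqlite');

async function makeZygoteClone(clonePath) {
  // Other readers/writers may proceed while this runs, which is exactly why
  // the moment the clone becomes usable is not deterministic.
  await parentDB.backup(clonePath);
  return new Database(clonePath);
}
```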
I think most DBs have a notion of starting a full read operation at T1 while other write transactions proceed, and getting a consistent view of only the data that was there at T1. LMDB sure does. The only thing we need to figure out is how to account for the time it might take for a full dump using this kind of mechanism to complete. I was thinking we could trigger a dump at a deterministic time, at which point all vats and the kernel are in charge of generating a state capture at that time and keeping a full transcript to replay from that time. At a later deterministic time, this data would be atomically collected, hashed, and stored. We could likely arrange the data and hashes in such a way that they're incrementally generated, so that this atomic operation doesn't stop the world for a significant amount of time.
@mhofman noticed SQLite's snapshot feature. However, I think the limitations make it unlikely to work for us. From the man pages, it seems like the basic API is to call …
There is a note that the snapshot feature is normally disabled, and requires a compile-time …
Looks like I wrote a similar analysis (but with slightly different conclusions) simultaneously (we didn't open a write lock): #5542 (comment)
Regarding the sqlite_snapshot struct, it seems to be independent of heap state, and is internally copied in the first place. The tests seem to exercise this over multiple DB connections. Regarding checkpoints, I'm dubious they're needed for most shutdown recovery. My understanding is that they're mostly a form of "compaction", and that if we really want power-loss resilience, WAL writes can be …
I posted the question regarding saving of the snapshot struct on the SQLite forums: https://sqlite.org/forum/forumpost/f7a325b6d3
In #6447 (comment), @warner discusses the idea @FUDCo came up with: the vatstore DB is just committed when taking a snapshot, and unlike other syscalls that are just checked against the transcript, vatstore operations are instead performed again on replay. If the vatstore operations stay within the worker, they don't cross the boundary with the kernel, and are not recorded in the transcript in the first place. However we still have the problem introduced by multiple commit points: if we commit the vatstore DB when making a worker heap snapshot, and the kernel exits before committing the block where that snapshot occurred, then the next time around the worker will be started with the previous heap snapshot and a vatstore that no longer corresponds to it (it is from the future). We also have the problem of capturing the content of the vatstore in a verifiable way for state-sync (#3769). In #6773 we introduce an "export stream" of the kvStore into vstorage. If the vatstore is removed from the kvStore, we lose that mechanism. Both issues can be solved in tandem: a main vatstore DB which commits after the kernel DB has committed, and a "journal" DB recording vatstore changes (updates/deletions) since the last main DB commit, which is itself committed when performing a heap snapshot. The "journal" DB can be used to populate the kernel's "export stream" when a snapshot is taken, ensuring that the vatstore is part of state-sync. The flow would be as follows:
This is a little heavy, but uses standard SQLite mechanisms. If we want to get a little more fringe, the following optimizations are possible:
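As a purely hypothetical sketch of the dual-write part of that scheme (table and file names invented, assuming better-sqlite3), the worker-side write path might look like:

```js
// Sketch: each vatstore write lands in two places: the main DB (to be
// committed only after the kernel DB has committed) and a small journal DB
// of changes since the last main commit (to be committed at heap-snapshot
// time and used to feed the kernel's export stream for state-sync).
import Database from 'better-sqlite3';

const mainDB = new Database('vatstore-main.sqlite');
const journalDB = new Database('vatstore-journal.sqlite');
for (const db of [mainDB, journalDB]) {
  db.exec('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)');
}

const setMain = mainDB.prepare('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)');
const setJournal = journalDB.prepare('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)');
const delMain = mainDB.prepare('DELETE FROM kv WHERE key = ?');
const tombstone = journalDB.prepare('INSERT OR REPLACE INTO kv (key, value) VALUES (?, NULL)');

const vatstoreSet = (key, value) => {
  setMain.run(key, value);
  setJournal.run(key, value);
};
const vatstoreDelete = key => {
  delMain.run(key);
  tombstone.run(key); // deletions must also appear in the journal
};
```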
If we remove the vatstore syscalls from the transcript, we'll also have to capture the state, or at least a hash of the state, of the vatStore as it was at the time of a vat upgrade in order to support Manchurian-style upgrades (#1691). Without that, we wouldn't be able to start a replay from baggage, as we'd have lost the baggage.
Maybe we can take a slightly different approach and automate the creation of this journal using a SQLite TRIGGER. The docs have an example of using that mechanism for an application undo/redo …
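A hypothetical sketch of what such triggers could look like for a simple key/value vatstore table (names invented):

```js
// Sketch: let SQLite maintain the journal itself, via triggers on the main
// key/value table, instead of the vatstore layer writing to both by hand.
import Database from 'better-sqlite3';

const db = new Database('vatstore.sqlite');
db.exec(`
  CREATE TABLE IF NOT EXISTS kv      (key TEXT PRIMARY KEY, value TEXT);
  CREATE TABLE IF NOT EXISTS journal (key TEXT PRIMARY KEY, value TEXT);

  CREATE TRIGGER IF NOT EXISTS kv_ins AFTER INSERT ON kv BEGIN
    INSERT OR REPLACE INTO journal (key, value) VALUES (NEW.key, NEW.value);
  END;
  CREATE TRIGGER IF NOT EXISTS kv_upd AFTER UPDATE ON kv BEGIN
    INSERT OR REPLACE INTO journal (key, value) VALUES (NEW.key, NEW.value);
  END;
  CREATE TRIGGER IF NOT EXISTS kv_del AFTER DELETE ON kv BEGIN
    INSERT OR REPLACE INTO journal (key, value) VALUES (OLD.key, NULL);
  END;
`);
```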
Triggers terrify me. Triggers are the relational database version of shared-state concurrency: a swamp of chaos and confusion that people naively imagine they understand and can master, but they are almost always wrong.
I was thinking more about @FUDCo's "don't commit vatstore DB until heap snapshot" idea, and it occurred to me that making a full copy of the table might not be too expensive. Each vat would have one DB with the transcript (deliveries, results, which ones were collected by the kernel so far, retirements, etc.), and a second DB with the heap snapshots and the vatstore data. But the second DB would have two tables for vatstore: an open one with the latest contents (committed after every delivery), and a second one with a full copy of the vatstore contents as of the last heap snapshot point. The end-of-span sequence would be like:
Every time the kernel collects a delivery result, it also gets the hash of the most recent …

To make sure the vat doesn't run too far ahead and give up the previous contents too early, maybe the vat should stop doing execution once a heap snapshot is created, and not resume until the triggering delivery is collected. Or, maybe we could collect multiple sets of old contents (indexed by the deliveryNum that triggered the corresponding heap snapshot), and figure out what sort of retirement policy would work.
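A rough sketch of that copy step, assuming a `kv` table for the live contents and an `oldKV` table for the as-of-last-snapshot copy (names invented):

```js
// Sketch: at heap-snapshot time, replace the "as of the last snapshot" copy
// with the current vatstore contents, inside a single SQLite transaction.
import Database from 'better-sqlite3';

const db = new Database('vatstore.sqlite');
db.exec(`
  CREATE TABLE IF NOT EXISTS kv    (key TEXT PRIMARY KEY, value TEXT);
  CREATE TABLE IF NOT EXISTS oldKV (key TEXT PRIMARY KEY, value TEXT);
`);

const copyAtSnapshot = db.transaction(() => {
  db.exec('DELETE FROM oldKV');
  db.exec('INSERT INTO oldKV SELECT key, value FROM kv');
});

// invoked right after the worker writes a heap snapshot
copyAtSnapshot();
```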
A full copy feels like it would make the cost of snapshotting proportional to the vat store size, and the vat store is exactly what we're telling users to prefer over the heap. Also, keeping transcripts and heap snapshots in separate DBs makes me uneasy. Furthermore, generating a hash of the vat store content ourselves by reading through the whole thing seems wrong. I'd much prefer relying on IAVL shadowing to do that for us.
Thinking about this more, I don't see how @warner's proposal above solves the hangover problem. In particular, what happens if we fail before the …
The current implementation exports all vat store entries to the cosmos DB, which is unsustainable (it causes slowness in state-sync, pruning, etc.). We need to move to a model where the vat store is treated as an artifact verified by the cosmos DB, like transcripts and heap snapshots. Unlike those artifacts, the vatStore can replace existing entries, so generating a hash is more complex. I am becoming convinced that the solution should be:
After discussing this some more, the export-data motivation for making the vatStore an artifact is weaker: a similar order of magnitude of keys appears on the kernel side (c-list entries, for example), which means it wouldn't sufficiently reduce the number of entries in the cosmos DB (it would reduce the size, but that's less of a problem).
What is the Problem Being Solved?
One of our apparent performance bottlenecks is the multiplication of the kernel/worker RPC overhead times the large number of vatStore syscalls triggered by extensive use of virtual objects. Each VO that gets paged in will cause a bunch of `vatstoreGet` calls to provide the data. This is compounded by the large number of virtual reference counts that must be tracked to correctly know when a virtual object can be released/deleted. Each time a Representative is referenced (e.g. VO `foo.behavior()` does `state.bar = rep`, which must `serialize(state)`, which adds a virtual refcount from `foo` to `rep`) (also e.g. if `rep` is added to a virtual collection), we must increment a refcount that lives in the DB, so we need a `vatstoreGet()` plus a `vatstoreSet()`.

Each syscall requires a VatSyscallObject (e.g. `['vatstoreSet', 'vom.o+15/97', 'capdata..']`) to be serialized, encoded into a netstring, and written through a pipe to the kernel process. The kernel must receive the string, decode the netstring, parse the VSO, translate it into a KernelSyscallObject (trivial for `vatstore`, unlike `syscall.send` which requires vref-to-kref conversion/allocation), then execute it (`kvStore.set('v1.vs.vom.o+15/97', 'capdata')`). Then the kernel builds a KernelSyscallResult, translates it into a VatSyscallResult, serializes it, encodes that into a netstring, and writes it to the pipe. The worker then reads from the pipe, decodes the netstring, parses the result (which is always `['ok']`), and returns from `vatstoreSet()`.

This whole `vatstoreSet()` process was measured to take about 730us (each) on the "ollinet" testnet, which uses Google Cloud hardware and is considered to perform comparably to what typical validators might use. `vatstoreGet` took about 470us. This can consume considerable time during a delivery that creates/references/deletes a lot of virtual objects. Our current benchmark operation (a PSM trade) performed 439x `vatstoreGet`, 224x `vatstoreSet`, and 45x `vatstoreDelete` operations, summed across all vats. In contrast, it only did 46x `syscall.send`s and 49x `syscall.resolve`s. If we could magically reduce the cost of all vatstore operations to zero, the 761ms we spent doing syscalls could be reduced to 337ms, a 45% improvement.

To support parallelism (some day), we carefully designed the `vatstore` to be local to each vat: no other vat is allowed to read from or write to it. This indicates another opportunity for performance improvement: give each worker a private SQLite DB to back its vatStore, instead of using syscalls to access it. SQLite can hold everything in RAM (or somewhere appropriate) until commit, so I'd expect each vatstore operation to take less than a microsecond.

Synchronizing Multiple DBs
Our application consists of lots of separate components, each with their own state that needs to be durable. If all the components used the same DB instance, they could all commit together, atomically. But they don't. Currently, the swing-store "snapstore" component (SQLite) commits first, then the swing-store `kvStore` (LMDB) commits, then the host-application cosmos IAVL tree (LevelDB) commits. #3087 is about merging all the swing-store DBs into a single SQLite instance, to reduce the scope of this problem, but it does not attempt to address the IAVL-vs-swingstore gap.

Each new DB (specifically each new commit point) creates a new window of time during which the application might crash when one DB has committed but the next one has not. If this happens, when the application starts back up again, DB-1 (e.g. `snapstore`) will remember things that DB-2 (e.g. `kvStore`) does not.

We work around this by recording data in DB-2 that tells us what data to look for in DB-1. For example, we add a new heap snapshot into `snapstore` under a particular key (the "snapshotID"). Then we add a record to `kvStore` that says which snapshotID to use for a given vat. We do not delete the old snapshot until much later. When we restart, we read the ID out of `kvStore`, and then read that snapshot out of `snapstore`.

Now imagine a block which moves vatA from snapshot1 to snapshot2. If both `snapstore` and `kvStore` manage to commit before a crash, the restart will read "snapshot2" out of kvStore, we'll read `snapshot2` out of snapstore, and we'll launch the worker with snapshot2.

If we crash after `snapstore` commits but before `kvStore` commits, restart will read "snapshot1" out of kvStore, we'll read `snapshot1` out of snapstore, and we'll launch the worker with snapshot1. The snapstore will still contain snapshot2, but it won't be used. The vat will be sent the same deliveries as last time, causing it to advance, and we'll record a snapshot again, and it will create snapshot2 again (since everything is deterministic). As long as snapstore tolerates the insertion of an existing snapshot without complaint, the DB writes will look just the same as they did the previous time, and if both DBs manage to commit, we'll wind up with the same committed state as above.

And if neither DB manages to commit before the crash, we'll wake up with only `snapshot1` in snapstore, and "use `snapshot1`" in kvstore.

We do the opposite set of contortions to avoid deleting `snapshot1` from snapstore until we know that the kvstore record has been updated. This is trickier, and requires a "pending deletes" key to be tracked.

Commit/Rollback of Vat State
The durable state of each vat is captured by the initial vat bundle and the transcript (barring upgrades, which require a record of the new vat bundles, and a "deep transcript" that indicates where the upgrades occurred). This can be used to (slowly) reconstruct the worker at any point in its history. For restart efficiency, we added heap snapshots, making the state artifact (latest snapshot + post-snapshot transcript). To reduce memory usage (and make upgrades easier), we introduced virtual/durable objects and `vatstore`, making the state artifact (latest snapshot + post-snapshot transcript + vatstore contents).

The application's durable state advances by one block at a time, inside of which each vat advances by zero or more deliveries.
Within a block, the buffered state advances by one delivery at a time (to any vat). This state is not made durable (written to the DB) until the block is committed, which is controlled by the host application: to avoid hangover inconsistency, outbound IO is embargoed until everything is committed. But it only contains complete deliveries. The buffered state consists of the durable state plus a set of deltas held in the "block buffer", which receives updates from the "crank buffer" only after the kernel decides to commit to the delivery.
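As a toy illustration of that layering (names invented, not the real kernel code):

```js
// Toy sketch of the two-level buffering: deliveries write into the crank
// buffer, a committed crank folds into the block buffer, and only a block
// commit reaches the durable store. Aborting a crank just drops its buffer.
const durable = new Map();      // stands in for the real swing-store
const blockBuffer = new Map();  // deltas accumulated during this block
let crankBuffer = new Map();    // deltas from the delivery in progress

const get = key =>
  crankBuffer.has(key) ? crankBuffer.get(key)
  : blockBuffer.has(key) ? blockBuffer.get(key)
  : durable.get(key);
const set = (key, value) => crankBuffer.set(key, value);

const commitCrank = () => {
  for (const [k, v] of crankBuffer) blockBuffer.set(k, v);
  crankBuffer = new Map();
};
const abortCrank = () => { crankBuffer = new Map(); };
const commitBlock = () => {
  for (const [k, v] of blockBuffer) durable.set(k, v);
  blockBuffer.clear();
};
```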
The runtime state of each vat exists in a worker (`xsnap`) and the `vatstore`. Making a delivery to a worker irrevocably advances the worker state, and also makes kernel-side changes which are buffered in the crank buffer.

Certain vat errors will cause the delivery to be unwound. Most of these also terminate the vat, e.g. a vat-fatal syscall or a metering fault. This discards the crankBuffer, kills the worker, deletes all the kernel-side vat state, and rejects the vat's orphaned promises.
However some situations call for the vat to survive and be rewound to an earlier state. E.g. a failed upgrade wants to pretend that the upgrade was never requested. This means we kill the worker but do not terminate the vat. A replacement worker will be started when the next delivery is made, and it will start from the previous durable state (heap snapshot plus post-snapshot transcript, which will not include the discarded delivery).
The application might crash at any moment. The durable state must act as if it only advances one block at a time, even if it is recorded in multiple databases that don't all commit at the same time.
Commit/Rollback of per-vat Vatstore DB
If we give each vat a local SQLite DB for its vatstore, we must decide when this DB performs its commits. We want to simulate a global commit, even though the local DBs might commit much earlier than the kernel-side state.
The rough idea is:

- the worker keeps using syscalls for `syscall.send` and `syscall.resolve` (and the GC syscalls), but not `syscall.vatstoreSet/Get/Delete`, which go straight to its local DB
- the worker controls when its local DB commits (we may need to manage `fsync` and edit the WAL/journal files too)
- when a delivery is rewound, the local vatstore changes are rolled back, just as the kernel-side effects of `syscall.send`/etc are unwound

We need the worker to effectively manage two levels of commit/transactions. If a delivery is rewound, we want to roll back to the most recent committed delivery. But we don't want the effective durable state to include any deliveries at all until the block is committed. And the final/real DB commit necessarily happens somewhat before the kernel's DB is committed, so we must manage the gap correctly.

One category of approach is to build a SQLite table that includes generation numbers for each vatstore record. So when userspace/liveslots asks for `key=abc`, the layer below actually does `SELECT value WHERE key='abc' AND generation<=42`. The committed-but-unwound changes would be ignored because they'd have a higher generation number than what is considered "current". Deletes would require tombstones. When a version is really committed, we could delete the older versions (so maybe we'd want a `not_current` column, set to `true` if/when we write a newer version, so we can clean up all the old versions with a single `DELETE WHERE not_current=true` statement). There are a lot of fussy details to figure out.
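A hypothetical sketch of such a generation-number table (schema and bookkeeping invented for illustration, assuming better-sqlite3):

```js
// Sketch: every write creates a new row tagged with the generation of the
// delivery that produced it; reads only look at rows whose generation is at
// or below the one currently in effect, so committed-but-unwound rows are
// invisible and can be purged later. A NULL value is a delete tombstone.
import Database from 'better-sqlite3';

const db = new Database('vatstore.sqlite');
db.exec(`
  CREATE TABLE IF NOT EXISTS kv (
    key        TEXT,
    generation INTEGER,
    value      TEXT,
    PRIMARY KEY (key, generation)
  )
`);

let committedGeneration = 42; // highest generation the kernel has accepted
let workingGeneration = 43;   // generation of the delivery now in progress

const getStmt = db.prepare(`
  SELECT value FROM kv
   WHERE key = ? AND generation <= ?
   ORDER BY generation DESC LIMIT 1
`);
const putStmt = db.prepare(
  'INSERT OR REPLACE INTO kv (key, generation, value) VALUES (?, ?, ?)',
);
const purgeStmt = db.prepare('DELETE FROM kv WHERE generation > ?');

const vatstoreGet = key => {
  const row = getStmt.get(key, workingGeneration);
  return row && row.value !== null ? row.value : undefined;
};
const vatstoreSet = (key, value) => putStmt.run(key, workingGeneration, value);
const vatstoreDelete = key => putStmt.run(key, workingGeneration, null);

// if the kernel unwinds the delivery, discard everything newer than the
// last accepted generation
const rollback = () => purgeStmt.run(committedGeneration);
```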
Another category of approach is to use a similar trick to the `snapstore`/`kvStore` synchronization. In this scheme, the vat records the highest deliveryNum that it has processed. The worker commits its DB as soon as the kernel decides we aren't rewinding, and if the app crashes after that commit but before the block commit, when we restart, the kernel won't know that the worker remembers those deliveries. So, if the kernel tells the worker to execute a delivery that it has already performed, the worker pretends to execute but actually ignores it (because the vat's userspace state is already there).

But now the kernel needs to know about the syscalls that happened in the previous (lost) execution, so the worker must remember those, and re-deliver them to the kernel. As @FUDCo pointed out, this is interestingly parallel to the way that the kernel performs transcript replay, where the vat really does the work, and it's the kernel which merely pretends to execute the syscalls.
To pull that off, the worker would need to (at least) remember which deliveries it has executed and, for each delivery `N`, the syscalls it made, so they can be re-delivered to the kernel after a restart.
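A hypothetical sketch of the worker-side check (every name here is invented):

```js
// Sketch: the worker skips re-execution of deliveries it already ran before
// a crash, and instead replays the syscalls it recorded at the time, so the
// kernel still learns about them. The persistent pieces (delivery counter,
// recorded syscalls) would live in the worker's own DB.
function makeDeliveryHandler({
  lastExecutedDeliveryNum, // () => number, from the worker's DB
  recordedSyscalls,        // deliveryNum => array of VatSyscallObjects
  issueSyscallToKernel,    // vso => Promise<result>
  executeForReal,          // (deliveryNum, delivery) => Promise<result>
}) {
  return async (deliveryNum, delivery) => {
    if (deliveryNum <= lastExecutedDeliveryNum()) {
      for (const vso of recordedSyscalls(deliveryNum)) {
        await issueSyscallToKernel(vso); // kernel catches up on lost syscalls
      }
      return 'replayed';
    }
    return executeForReal(deliveryNum, delivery);
  };
}
```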
It also suggests that maybe the entire transcript should live in the vat's DB, since the deliveries being checked are basically the transcript. And that suggests that the supervisor could be responsible for doing transcript replay: the supervisor should wake up (from a heap snapshot), read the current transcript head from the DB, compare it to a value in RAM (from the snapshot), and load+replay deliveries until they match, and only then inform the kernel that it is open for business.
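A hypothetical sketch of that supervisor-side startup (again, all names invented):

```js
// Sketch: supervisor-driven replay at startup. The deliveryNum baked into
// the heap snapshot is compared with the transcript head in the worker's
// local DB, and the gap is replayed before the kernel is told we're ready.
async function resumeWorker({
  snapshotDeliveryNum, // from the heap snapshot's in-RAM state
  readTranscriptEntry, // deliveryNum => entry | undefined, from the local DB
  redeliver,           // entry => Promise<void>, replay without kernel help
  tellKernelReady,     // lastDeliveryNum => void
}) {
  let next = snapshotDeliveryNum + 1;
  for (;;) {
    const entry = readTranscriptEntry(next);
    if (entry === undefined) break; // caught up with the transcript head
    await redeliver(entry);
    next += 1;
  }
  tellKernelReady(next - 1);
}
```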
One tricky part here is the lack of engine separation between the supervisor (or whichever part is deciding to pretend-execute deliveries) and the liveslots+userspace code that actually receives the deliveries. By sharing an engine, activity on one side might influence GC and metering/gas usage on the other. Since we now require GC behavior to be part of consensus (so e.g. `reanimateCollection()` will perform schemata/label vatstore fetches in a deterministic fashion), we could not afford to have the validator which receives the delivery once ("for real") behave differently than the validator that crashes and replays/ignores it.

Description of the Design
Security Considerations
Test Plan