change virtual-object state LRU cache: write-through, write-back at end-of-crank, or remove entirely #6693
Very much in favor of this! Didn't realize we were caching writes.
I guess it depends on the granularity of batching we want. If you want to batch synchronous writes (i.e. all writes happening in a single turn), then a promise-based queue is appropriate. However, we may want to consider batching at the crank level, since liveslots is in a privileged position and is aware of crank boundaries.
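For concreteness, a minimal sketch of what a promise-based queue for batching same-turn writes could look like (the `makeTurnBatchedWriter` name and shape are hypothetical, not existing liveslots code):

```js
// Hypothetical sketch: batch all writes made within a single synchronous turn,
// then emit one vatstoreSet per key when the queued microtask runs.
function makeTurnBatchedWriter(vatstoreSet) {
  const pending = new Map(); // key -> latest value written this turn
  let flushScheduled = false;

  function flush() {
    flushScheduled = false;
    for (const [key, value] of pending) {
      vatstoreSet(key, value);
    }
    pending.clear();
  }

  return function write(key, value) {
    pending.set(key, value); // later writes to the same key overwrite earlier ones
    if (!flushScheduled) {
      flushScheduled = true;
      Promise.resolve().then(flush); // runs after the current turn completes
    }
  };
}
```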
I spent 15 minutes trying to reconstruct my understanding of the LRU cache and the way it interacts with GC.

And we need the sequence of transcript-recorded syscalls to be deterministic. The simplest way to achieve this is for the sequence of vatstore reads and writes to depend only on what userspace does, not on when the engine happens to collect garbage.
If a Representative is used, then dropped and collected in the middle of a crank, then reanimated and used again, all within that same crank, we'd keep using the state data fetched the first time, even though we dropped the Representative. That means liveslots would hold a reference to that data for the rest of the crank.

We (used to?) worry about deserialization being observable to userspace, and had protections in place to prevent this. The one I'm remembering retained the raw (still serialized) state data in the cache, but re-invoked the deserializer on each access.

Note that currently we store the entire state record under a single vatstore key.

Also note that this would be a virtual-object-specific concern.
I think that'll work. We currently use an end-of-crank hook:

agoric-sdk/packages/SwingSet/src/liveslots/liveslots.js Lines 1563 to 1565 in c6f074c

We do a flush of similar per-crank state there. So I think we could have the virtual-object state cache flushed in the same place. Then we get rid of the existing LRU cache and probably some of the machinery that supported it.
Setting aside for a moment issues of code complexity vs. correctness & maintainability that the implementation of any kind of caching scheme introduces, I think the issue of determinism vs. caching is a bit of a red herring. In particular, I'm unclear what motivates the concern for crank boundaries in the narrative here.

If we take the always-read-on-get, always-write-on-set idea as the baseline (which, now that you've articulated it, seems like a really good baseline for analysis, in the why-weren't-we-always-thinking-about-it-this-way? sense), then any amount of LRU-style caching that is driven solely by the vat code's own access patterns should be fine from a determinism perspective -- as long as the underlying sequence of get/set operations is deterministic, then the resulting sequence of reads and writes will also be deterministic, regardless of where the crank boundaries lie.

GC enters the analysis if there are additional reads or writes triggered by things being swapped in and out based on local-but-not-global reachability determinations, which come up if something can be local garbage but not global garbage. If we take as a policy position that we aren't going to have any of those, then I think we can be as sophisticated with our caching scheme as performance metrics push us to be (that said, I think the existing scheme is already too complicated for our own good).
Yeah, I agree that the baseline is:

- every read of `state.${propname}` does a `vatstoreGet`
- every write of `state.${propname}` does a `vatstoreSet`

which should be entirely insensitive to GC. Trying to avoid some of those reads depends upon holding the data in RAM, and since we must limit the amount of RAM usage, we must eventually remove things from RAM, which means re-fetching the data on the next read.

So yeah, I guess I'm jumping ahead to picking a compromise point between debuggability and performance.

I guess we first need some way to estimate how frequently we get a batch of reads/writes for the same vobj state in a single crank, to estimate what kind of performance difference we could anticipate.
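A minimal sketch of that no-cache baseline, assuming a hypothetical `defineStateAccessors` helper and a whole-record-per-key layout rather than the real VOM code:

```js
// Hypothetical sketch of the "no cache" baseline: every property read is a
// vatstoreGet, every property write is a vatstoreSet. Names are illustrative.
function defineStateAccessors(syscall, vomKey, propertyNames, { serialize, unserialize }) {
  const state = {};
  for (const prop of propertyNames) {
    Object.defineProperty(state, prop, {
      enumerable: true,
      get() {
        // read the whole serialized record from the DB on every access
        const record = JSON.parse(syscall.vatstoreGet(vomKey));
        return unserialize(record[prop]);
      },
      set(value) {
        // read-modify-write the record, writing through immediately
        const record = JSON.parse(syscall.vatstoreGet(vomKey));
        record[prop] = serialize(value);
        syscall.vatstoreSet(vomKey, JSON.stringify(record));
      },
    });
  }
  return state;
}
```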
Ugh? Why would userspace get to do something only in the case where the vobj is re-animated? Did I misunderstand something?
There is an alternative "simple" approach: always trigger a kind-metadata lookup, and do not hold anything in RAM.
I was very confused by this as well, but seeing the subsequent reply, I understand the benefits from a debuggability point of view. However it may be an expensive trade-off to make performance-wise.
That's what I tried to express the other day, but looks like I failed at explaining myself. Happy to see we're converging :)
So one way to mitigate any performance impact of extra syscalls, yet keep full debuggability of user code (not liveslots, but hey, it's not like we get much insight into its behavior from the outside even today), would be to pass through syscalls on all vatstore reads/writes, but have the xsnap process + liveslots be able to not block on the result of these and keep going if the result is in the cache.

Then no matter the caching strategy, the only observable impact would be on snapshot content, and we could even clear the cache on BOYD calls before snapshotting to mitigate that impact (the caching implementation would still be part of the heap snapshot, so that's probably a step too far).
No no, the "former/latter" was referencing the first split in the previous sentence (inbound delivery vs. pull vobj out of storage), not the second (discover existing vs. reanimation). Userspace doesn't get to sense whether it was re-animated or pre-existing. I'm just pointing out that sometimes methods get invoked, sometimes they don't, but either way, method invocation (as well as state access) behaves the same whether the Representative is pre-existing or re-animated.
Chip and I were talking about this, and worked out a bit of a spectrum. The baseline is for vobj state (and collection metadata) to be read from the vatstore on every access and written back on every change, with nothing held in RAM.

But of course we'd feel silly constantly re-reading a pair of metadata keys for every collection operation, especially when the metadata is immutable. We can't hold it in RAM forever because we might have a lot of collections (despite our mental image when designing them, we never instructed userspace authors that collections should be low-cardinality, and given how people are using them, we can't really add that constraint now). So some sort of compromise is needed.

We might start by changing liveslots/VOM/vc to use the "no cache" extreme, but I'm worried about the perf consequences.
Yeah, that's where I'm looking for a sensible compromise, and "write everything / remember nothing at end-of-crank" might be a workable one.
Eeeargh. I don't think I like that: syscalls that sometimes block and sometimes don't sound like a recipe for interleaving surprises, and for confusion about whether a syscall-response is for syscall-1 or syscall-2.
I think I didn't explain myself correctly. Maybe another mental model is needed here: assume syscalls were asynchronous. Currently, making the syscall would send the data over the wire, then await the result. What I'm suggesting is that it'd send the data over the wire, but drop the result promise, and instead resolve with the content of the cache. This could also be considered the equivalent of racing the wire result with a cache result. Do we agree that from that point of view, doing this is safe and doesn't result in interleaving surprises?

Now of course syscalls are not asynchronous, and the trick is how to make this work. One possibility would be for liveslots to include a flag in the syscall indicating that it doesn't need to wait for the result.
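A conceptual sketch of that mental model, pretending syscalls really were async (the `sendSyscall` function and the cache shape are hypothetical, not the actual xsnap/liveslots wire protocol):

```js
// Conceptual sketch: issue the real vatstoreGet so it still appears in the
// transcript, but resolve from the cache instead of awaiting the wire result.
async function cachedVatstoreGet(sendSyscall, cache, key) {
  const wireResult = sendSyscall('vatstoreGet', key); // always goes over the wire
  if (cache.has(key)) {
    void wireResult; // deliberately not awaited; the cache result "wins the race"
    return cache.get(key);
  }
  const value = await wireResult; // cache miss: we do have to block
  cache.set(key, value);
  return value;
}
```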
Yes, that is true, we'd have to block, but whether it negates the benefits would require testing. I personally believe it would still be sufficiently beneficial, and at the end of the day it depends heavily on the kind of caching. A poor caching strategy would be no worse than blocking on all syscalls, which we've agreed is the baseline. We could easily try this without a cache: mark all vatstore writes as not requiring a result, and keep blocking on reads.
I really don't follow. How would heap state be influenced in a non-deterministic way? The cache would still be deterministic, and it's the only thing that influences whether a given syscall needs to block.
I think we have consensus on the "populate an unlimited-size cache during the first userspace access of each delivery, flush all dirty entries and clear all entries at the end of the delivery" approach. I think we can investigate the async/non-blocking vatstore syscalls in a separate ticket. I've made a start on this in a branch.
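A minimal sketch of that scheme, with hypothetical names rather than the actual branch code:

```js
// Hypothetical sketch of the agreed-upon scheme: entries accumulate for the whole
// delivery, dirty entries are written back and everything is dropped at flush().
function makeDeliveryCache(vatstoreGet, vatstoreSet) {
  const entries = new Map(); // key -> { value, dirty }

  return {
    get(key) {
      let entry = entries.get(key);
      if (!entry) {
        // first userspace access in this delivery: one real read
        entry = { value: vatstoreGet(key), dirty: false };
        entries.set(key, entry);
      }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, dirty: true }); // no syscall until flush()
    },
    flush() {
      // called by liveslots at end-of-delivery: write back dirty entries, forget all
      for (const [key, entry] of entries) {
        if (entry.dirty) {
          vatstoreSet(key, entry.value);
        }
      }
      entries.clear();
    },
  };
}
```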
I'm thinking that the new cache can be shared by the collection manager and the VOM. The tricky part will be unwinding the "innerSelf" parts of the VOM.
Some more hacking on that branch: the collection manager is wired up, and I've made a really incomplete start on wiring up the VOM.
This cache accumulates all data until an explicit flush, at which point it writes back all dirty state, and forgets all state. This avoids the somewhat unpredictable behavior of an LRU cache, which caused us problems with GC sensitivity. It suffers from the potential for larger RAM usage between flushes, but is quite simple, and yields easy-to-debug syscall behavior. We have other caches in our platform, but liveslots needs something with these specific properties, and I don't know what the needs in the rest of the platform are, so I'm fine with not sharing this implementation with other packages, at least for now. refs #6693
Note to self: vat-ulrik-1.js can be changed to remove the dummy cache-flushing object once this is fixed.
Be aware of Draft #7286, which I think is a complementary optimization. But I'm not sure.
What is the Problem Being Solved?
As listed in #6650, one blocker for removing the `stopVat` from the vat upgrade process is the write-back nature of the virtual object manager's LRU cache. This cache holds the state of virtual objects, but does not write changes back to the DB immediately. Instead, it waits until the object is evicted from the cache (by the loading of some newer object). The cache is also flushed explicitly during a `dispatch.bringOutYourDead` operation. We (now, see #6604) flush this during `stopVat` too, since otherwise the last few virtual-object modifications by a vat would be lost during the upgrade process.

@FUDCo and I aren't sure that the write-back cache is worth 1: the risk of bugs like #6604, 2: the debugging/testing complication (the vatstore write you're looking for doesn't happen right away, and requires several dummy writes to elicit), and 3: the need for a flush before vat upgrade.
We figured that turning it into a write-through cache would be a decent compromise.
Description of the Design
When a virtual object is first referenced (specifically when something reads from `state.${propname}`), a `vatstoreGet` is used to read the data into the LRU cache, and it remains there until evicted. However, every time the state is modified, the VOM does an immediate `vatstoreSet` to write the changes to the DB (leaving the modified state in the cache, for later reading).

A performance optimization would be to queue the writeback for later in the delivery (with `Promise.resolve().then(doWriteback)`). That might allow multiple changes to the same `state` to be collapsed into a single `vatstoreSet`, for code that does something like:
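```js
// illustrative: two writes to the same virtual object's state in a single turn
state.foo = 1;
state.bar = 2;
```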
The `foo` write would modify the state in the LRU cache item, setting the `dirty` flag, and enqueue the first call to `doWriteback`. Then, in the same turn, the `bar` write would further modify the state, again set the `dirty` flag, and enqueue a second call to `doWriteback` (or maybe it checks the `dirty` flag first and doesn't enqueue additional calls). In a later turn, but still in the same crank/delivery, the `doWriteback` gets executed: it writes the data with `vatstoreSet`, and clears the `dirty` flag. If later calls to `doWriteback` happen, they notice the missing `dirty` flag and do no work.
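A minimal sketch of that deferred-writeback mechanism (the helper name and cache-entry shape are illustrative, not the actual VOM code):

```js
// Hypothetical sketch: the state setter calls noteMutation(), which marks the cache
// entry dirty and queues doWriteback() for a later turn in the same crank; repeated
// writes collapse into a single vatstoreSet because only the first one enqueues.
function makeDeferredWriteback(syscall, vomKey, entry) {
  // entry: { serializedState, dirty } -- the LRU cache item for this virtual object
  function doWriteback() {
    if (!entry.dirty) {
      return; // an earlier doWriteback already flushed this entry
    }
    syscall.vatstoreSet(vomKey, entry.serializedState);
    entry.dirty = false;
  }

  function noteMutation(newSerializedState) {
    entry.serializedState = newSerializedState;
    if (!entry.dirty) {
      entry.dirty = true;
      Promise.resolve().then(doWriteback); // deferred to a later turn, same crank/delivery
    }
  }

  return noteMutation;
}
```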
Security Considerations
No security consequences (userspace doesn't get to see the LRU cache), but this should increase the reliability of virtual-object changes, particularly in the face of a vat malfunction that requires a vat upgrade to recover from.
Test Plan
The existing unit tests will be modified to match the new behavior. They are probably already sufficient to check that a DB write happens right away.
Relation to stopVat changes
Having a write-through cache will mean `stopVat` no longer needs to rely upon the explicit cache flush performed in BOYD. It may then become safe to remove the BOYD call from `stopVat`. However, BOYD also does GC, which might release durable objects. If we remove the `stopVat` BOYD, we might find a few durable objects are not released before the upgrade, and they might not be releasable after the upgrade until we implement a full mark+sweep GC (which we've deferred until some future software upgrade). So in the short term, this might result in more objects being retained. Note that this would be visible to other vats: durable objects might be recognizable (used as WeakMap keys) in other vats, and without a final BOYD, those keys might stick around forever.