ExitObject/SeatHandle cross-vat reference cycle retains old objects #8401
Comments
Incidentally, it looks like some of these cycles are also keeping deposited IST
The first row is that vref's VOM state (an empty object, as expected). The second row is a zoeSeatAdmin object's state, which points at the Payment in I noticed that half of the 3,446 IST Payment objects in v9-zoe are kept alive by the recovery set (so they haven't been deposited), and their balance is 0.0 IST. I think these zoeSeatAdmin references explain the other half: if they've been deposited, they'll no longer be in the recovery set or the |
For reference, some idea for enabling cooperative distributed gc are detailed in #6793, and while it was explored with mutually suspicious vats in mind, it is definitely not fully fleshed out yet. |
@erights and I dug into this more today. We think the code isn't doing anything surprising or wrong, but it's relying upon distributed GC features that swingset doesn't have, and it would be appropriate for the code to cooperate in breaking the cycles to avoid that reliance. We identified three kinds of changes to make. The first is to delete state or WeakStore entries when the seat is exited. The second is to remediate (break) the existing cycles that involve fully-exited seats. The third is to change the data structures to avoid creating the cycles in the first place. Breaking Cycles During
BTW I walked through one instance of this cycle ( |
In my run-5 dataset, v9-zoe has 55,867 instances of the v9-zoe has a That would let us enumerate most of the live seat-admin objects. But I don't know how to turn that into a replace-the-weakmap remediation. We'd need two more things:
Replacing the price-feed vat is feasible / straightforward
Fortunately, most of our system deals only indirectly with the price-feed vat.
It's straightforward to start a new price feed (aggregator) for oracleBrand.ATOM and register it with v8. The biggest impact would be on the oracle operators, who would have to redo the "redeem invitation" step so that they can submit prices to the new price feed (aggregator). v29 would then no longer be needed. I'm reasonably confident that nothing in our system is holding on to quote payments from v29 and might want to cc @turadg @michaelfig in case I'm mistaken |
In #8402 I'm investigating to see if we can afford to remediate this in "one fell swoop". I don't yet know what we need to do in Zoe or the contracts to break the existing cycles, but assuming we find a way, we're going to be deleting 50k-100k objects all at once. If we're lucky, then we can perform the remediation during a chain-software upgrade (which would upgrade zoe, delete the old WeakMaps during vat upgrade, and perform a prompt BOYD to trigger the |
@warner wrote:
This seems straightforward to me. We have to continue to define the behavior of the deprecated kind, but don't use its maker for anything. There would be a new exo definition (which would get a new kindID), and its maker function would be used for any future creation of |
My current theory/plan:
FWIW, I think this cycle exists on all seats, not just price authorities (although those are far more numerous, so a price-authority-only fix would still solve our immediate sustainability problems). As mentioned in today's meeting, our most likely path to remediate this problem in the large price-feed vats will be to improve the kernel to the point where we can survive a "one-swell-foop" cleanup (#8644, #8402, maybe #8417 if we upgrade zoe to include it), and then terminate the price-feed vats. We can install replacement vats first (to fix the bug and stop the storage growth), then defer terminating the old vats until we can survive the GC. But, instead of terminating the old vats, we might look for a way to "upgrade" them to a special vat whose only job is to slowly delete the old objects, sort of like how a polluted gas station will be shut down and provisioned with special clean-the-dirt machines for years before it can be safely used for other purposes. Given that the cycles are held by WeakMaps, this might not be easy. One option to consider is a special raw vat (no liveslots, no ocap rules) which uses direct We've discussed having "caretaker" or "estate-winddown" bundles configured for each vat, so that instead of terminating a vat, we "upgrade" it to a version whose only job is an orderly disposal of its assets. It might be reasonable for such a wind-down vat to get more authority than usual, things like |
Note to self: to measure the size of this problem, use
Closing, as this shipped in upgrade-16. We need another ticket for not creating cycles.
So in upgrade-16, we deployed an upgrade to vat-zoe, which includes PR #8697. So as of 23-Jul-2024, any Seat that gets exited (or fails) will cut the cycle, and allow the objects to be dropped. That should remove most of the growth related to cycles. (TODO: look at the chain stats to verify this) The remaining issue is for Seats that don't exit. To make sure these don't leave cycles around, we need to avoid creating the cycles in the first place. We (@warner and @Chris-Hibbert) have discussed a very different design that would avoid the cycles, but transitioning from the old Kinds to the new ones might be tricky. I've filed #9922 to track that work.
While investigating mainnet storage usage, I discovered a cross-vat reference cycle between v9 (zoe) and v29 (ATOM-USD price feed). There are about 50k instances of this cycle, and each one is keeping enough objects alive to consume 37 kvstore entries. I estimate that this is currently costing about 310MB of combined SQLite+IAVL space. The cycles involve Seats for price-feed operations, so I suspect the quantity is growing over time. I'm not confident of the growth rate, but it might be about 3MB per day.
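To make the estimate above easier to sanity-check, here is a rough back-of-envelope calculation. It uses only the approximate figures quoted in this report (50k cycles, 37 entries each, 310MB total, 3MB/day); the derived per-cycle and per-day numbers are my own arithmetic, not measurements.

```javascript
// All inputs are the rough estimates quoted in the report, not measured values.
const cycles = 50_000; // "about 50k instances"
const entriesPerCycle = 37; // kvStore entries kept alive per cycle
const totalBytes = 310e6; // "about 310MB" of combined SQLite+IAVL space
const growthBytesPerDay = 3e6; // "might be about 3MB per day"

// Derived figures (assumption: space is spread evenly across cycles).
const totalEntries = cycles * entriesPerCycle;
const bytesPerCycle = totalBytes / cycles;
const bytesPerEntry = totalBytes / totalEntries;
const newCyclesPerDay = growthBytesPerDay / bytesPerCycle;

console.log({
  totalEntries,
  bytesPerCycle,
  bytesPerEntry: Math.round(bytesPerEntry),
  newCyclesPerDay: Math.round(newCyclesPerDay),
});
```

Under these assumptions each cycle costs roughly 6.2kB, and the suspected growth rate would correspond to a few hundred new cycles per day.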
Following the cycle upstream, starting from Zoe, we get:
- `zoeSeatAdmin` durable object
- `zoeSeatAdmin` is held as a value of the `seatHandleToZoeSeatAdmin` WeakMapStore, keyed by `SeatHandle`
- `SeatHandle` is a zoe-local durable object, which is also exported (strongly)

Now on the price-feed vat (v29):

- `SeatHandle` is imported and held in the `zcfSeatToSeatHandle` WeakMapStore, keyed by `zcfSeat`, a local durable object
- `zcfSeat` is held as part of the state of an `ExitObject`, a local durable object
- `ExitObject` is the one exported to zoe, completing the cycle

SwingSet cannot collect cross-vat cycles, even if they merely involved weak imports, not strong ones. (In fact, performing distributed GC among mutually-suspicious parties is a significant research topic, what we in the ocap community call a "purple box" for the PhD-thesis-level impact it has on project planning). So all the objects and collection entries involved are kept alive, costing about 37 kvStore entries for each instance.
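As an illustration of the shape described above, the cycle can be modelled as a small "holder keeps held alive" graph. This is a toy sketch, not SwingSet code: the node names are just labels, a WeakMapStore entry is modelled as an edge from its key to its value, and the `zoeSeatAdmin` to `ExitObject` edge is inferred from the `state.exitObject` property mentioned under Fixes.

```javascript
// Edges read "holder -> held". The vat prefix (v9/v29) is part of the label.
const holds = new Map([
  ['v9:SeatHandle', ['v9:zoeSeatAdmin']], // seatHandleToZoeSeatAdmin entry (key -> value)
  ['v9:zoeSeatAdmin', ['v29:ExitObject']], // state.exitObject (assumed), an import from v29
  ['v29:ExitObject', ['v29:zcfSeat']], // state.zcfSeat
  ['v29:zcfSeat', ['v9:SeatHandle']], // zcfSeatToSeatHandle entry (key -> value), an import from v9
]);

// Depth-first search that returns the first cycle reachable from `start`.
// Fine for a toy graph; a real tool would also track fully-visited nodes.
function findCycle(start) {
  const path = [];
  const onPath = new Set();
  const visit = node => {
    if (onPath.has(node)) {
      return [...path.slice(path.indexOf(node)), node];
    }
    onPath.add(node);
    path.push(node);
    for (const next of holds.get(node) ?? []) {
      const found = visit(next);
      if (found) return found;
    }
    path.pop();
    onPath.delete(node);
    return undefined;
  };
  return visit(start);
}

const vatOf = label => label.split(':')[0];
const cycle = findCycle('v9:zoeSeatAdmin');
// Count the edges in the cycle that cross a vat boundary.
const crossVatEdges = cycle
  .slice(1)
  .filter((node, i) => vatOf(node) !== vatOf(cycle[i])).length;

console.log(cycle.join(' -> '), { crossVatEdges });
```

The two cross-vat edges are the ones a per-vat garbage collector cannot see around: each vat only observes that the other side still holds its export.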
I'm still examining the chain state to confirm that there are no other references to the objects in this cycle, and the swingstore would not show whether there are ephemeral (RAM) references like a closed-over Presence. But I suspect that it is only the reference cycle that is keeping them alive.
Fixes
The next step of the investigation is to check the history of one instance, and determine whether the seat/offer involved has been exited or not. I also need to find out whether any other contracts are involved, or if all 50k instances are from the price-feed vat.
One outcome might be that exiting an offer allows everything to be cleaned up, and these cycles only exist for non-exited offers, and the core issue is that our price-feed vat is failing to exit the seats. If that's the case, then the flatten-the-slope fix is to change those vats to always exit their old offers/seats when they're done, and deploy a vat upgrade.
Another possibility is that seats have been exited, but the code is relying upon GC to collect everything, and the cycle is inhibiting that cleanup. In that case, we need to change something (either in v9-zoe, or in the other vats) to react to the exit by nulling out some state, to break the cycle:
state.zcfSeat = null
zcfSeatToSeatHandle.delete(seat)
seatHandleToZoeSeatAdmin.delete(seatHandle)
state.exitObject = null
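A minimal sketch of what "react to the exit by nulling out some state" could look like. Everything here is a stand-in: plain `Map`s replace the durable WeakMapStores, plain objects replace durable state, and `makeSeat` / `breakCycleOnExit` are hypothetical names. In the real system the two halves would run in different vats (zoe and the contract vat), not in one function.

```javascript
// Stand-ins for the two stores named above (real ones are durable WeakMapStores).
const seatHandleToZoeSeatAdmin = new Map(); // lives in v9-zoe
const zcfSeatToSeatHandle = new Map(); // lives in the contract vat (e.g. v29)

// Hypothetical helper that builds one instance of the cycle.
function makeSeat() {
  const seatHandle = { kind: 'SeatHandle' };
  const zcfSeat = { kind: 'zcfSeat' };
  const exitObject = { kind: 'ExitObject', state: { zcfSeat } };
  const zoeSeatAdmin = { kind: 'zoeSeatAdmin', state: { exitObject } };
  seatHandleToZoeSeatAdmin.set(seatHandle, zoeSeatAdmin);
  zcfSeatToSeatHandle.set(zcfSeat, seatHandle);
  return { seatHandle, zcfSeat, exitObject, zoeSeatAdmin };
}

// Hypothetical exit hook: explicitly drop each edge of the cycle instead of
// relying on (distributed) GC to notice that the seat is finished.
function breakCycleOnExit({ seatHandle, zcfSeat, exitObject, zoeSeatAdmin }) {
  exitObject.state.zcfSeat = null; // contract side
  zcfSeatToSeatHandle.delete(zcfSeat); // contract side
  seatHandleToZoeSeatAdmin.delete(seatHandle); // zoe side
  zoeSeatAdmin.state.exitObject = null; // zoe side
}

const seat = makeSeat();
breakCycleOnExit(seat);
console.log(seatHandleToZoeSeatAdmin.size, zcfSeatToSeatHandle.size);
```

Breaking any one edge is enough to open the cycle; dropping all four just avoids leaving stale entries behind.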
It might also be possible to rearchitect the data structures to avoid the cycle. Each state property and map is surely there for a reason, but perhaps there is a rearrangement that would lack a cycle. I've seen cases before where object X holds property Y and Z for convenience, but having both on the same object creates graph shapes with cycles, and introducing a separate WeakMap or moving a property to a different object could provide a fix.
Remediation
Given the durability of the objects/collections involved, upgrading the code to prevent new cycles from forming will not help remove the 50k existing ones.
More investigation might reveal that strong references are still being held to these objects (perhaps there's an in-RAM Set of un-exited seats?). If so, and if being un-exited is the problem, then it might be possible to explicitly exit each one. Given the sheer number (1.6M) of kvStore entries involved, we should rate-limit this process, to avoid interfering with normal operation or IAVL pruning (same issue as in #8400).
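If explicit exits do turn out to be possible, the rate-limiting could be as simple as a bounded batch per step. This is only a sketch: `makeRateLimitedCleanup`, `exitOne`, and the idea of calling `step()` once per block or timer wakeup are all assumptions for illustration, not existing APIs.

```javascript
// Returns a step() function that exits at most `maxPerStep` seats per call
// and reports how many are still pending.
function makeRateLimitedCleanup(pending, exitOne, maxPerStep) {
  const queue = [...pending];
  return function step() {
    const batch = queue.splice(0, maxPerStep);
    for (const seat of batch) {
      exitOne(seat);
    }
    return queue.length;
  };
}

// Usage: five placeholder seats, at most two exits per step.
const exited = [];
const step = makeRateLimitedCleanup(
  ['s1', 's2', 's3', 's4', 's5'],
  seat => exited.push(seat),
  2,
);
let steps = 0;
let remaining;
do {
  remaining = step();
  steps += 1;
} while (remaining > 0);
console.log({ steps, exited });
```

The batch size would need tuning against how much kvStore churn each exit causes, which this sketch does not attempt to model.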
However, if the WeakMapStore cycle is really the only thing keeping these alive, then none of the objects are reachable from JS code. One possible path would be for the upgraded code to replace its `zcfSeatToSeatHandle` WeakMapStore entirely; however, that might cause problems for real (live) seats.
In this case, our best idea so far is to entirely terminate the price feed vats.
This would trigger the kernel's cleanup code (`terminateVat`, `cleanupAfterTerminatedVat`, and `deleteCListEntry`), which should decrement the refcounts of the imported SeatHandle objects, allowing v9-zoe to drop its WeakMapStore entries, dropping the (now-abandoned) ExitObjects, and deleting all parts of the cycle. (I'm 90% sure the kernel will do this correctly; however, I need to add an explicit test that uses a real cross-vat cycle to be sure.)
Terminating the price-feed vat will also abandon its Issuer: the identity remains the same, however it no longer accepts messages, so `getAmountOf()` will never again succeed. This is a visible trauma (worse than a mere vat upgrade would cause), equivalent to destroying the feed, so downstream vats will need to somehow be told to start using a new one, and I don't know what that would take to implement.
cc @Chris-Hibbert @erights