retire abandoned unreachables #8695
### Security Considerations

none

### Scaling Considerations

none

### Documentation Considerations

None needed, this is internal to the kernel. No userspace-reliant behavior is changing (userspace vats cannot sense GC anyways).

### Testing Considerations

The swingset unit tests included in the PR should be sufficient.

### Upgrade Considerations

This changes kernel behavior, to avoid leaking object state. Deployed kernels may have encountered this situation already, in which case they will still have non-retired abandoned non-reachable objects. This PR makes no attempt to remediate old state like that: those objects will continue to exist until the recognizing vat is terminated or chooses to stop recognizing the object (with syscall.retireImports, which is unlikely).

This is merely a storage consumption concern, not a correctness concern. The importing (recognizing) vat doesn't know or care why the object remains unretired: perhaps some other vat can still reach it, or the exporting vat itself. And the previously-exporting vat is either dead or knows nothing about the object, so it doesn't care either. The kernel is the only one in a position to retire the object. The consequence is a few extra DB entries, and whatever is being retained by the WeakMap or WeakMapStore in the importing vat.

According to the notes in the original issue (#7212), the mainnet kernel has only one object in this state (as of six months ago), and the state it keeps alive is not significant. However, deleting the old price-feed vats will abandon many hundreds of thousands of non-reachable objects, so it is important that we land this fix before deleting those vats.
Best reviewed one commit at a time; the first several are small refactorings.
This looks reasonable, but I don't think I'm sufficiently comfortable with the intricacies to sign off without a walkthrough.
Ugh, I submitted the review comment without the inline comments and thereby lost them. This is an attempt at reconstruction.
}
if (!ownerVatID) {
packages/SwingSet/test/gc-kernel-orphan.test.js is difficult to follow, but on close inspection does look right. 👍
Uncomment a test case that was disabled because of v8/xsnap weirdness. Seems to work now. Unrelated to the 7212 work
the "TODO: decref #2069 auxdata" comment was removed, because that will be the responsibility of deleteKernelObject()
the new function is responsible for deleting the c-list entries, so callers can stop doing that
Previously, processRefcounts() populated a local RAM cache upon entry with getGCActions(), added things to it during operation, then saved it with setGCActions() at the end. The intention was to avoid lots of redundant kvStore gets/sets as we loop over a (potentially large) number of changes. But, this approach would lose actions if we called other functions in the middle which did their own additions (eg with addGCActions). Since we only add actions, never remove them, it's just as fast (and infinitely less buggy) to accumulate a RAM cache of *new* actions, and add them all at the end, with addGCActions(). We don't make calls in the middle yet, but upcoming changes might start doing that, so best to fix this bug now. closes #5054
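The lost-update hazard described in this commit message can be shown with a minimal sketch. It assumes a plain `Map` standing in for the kvStore and a JSON-encoded action list; `getGCActions`, `setGCActions`, and `addGCActions` are named in the commit message, but the bodies below, the `processOld`/`processNew` helpers, and the action strings are hypothetical illustrations, not the kernel's actual code.

```javascript
// Hypothetical stand-in for the kernel kvStore (assumption: one JSON-encoded key).
const kvStore = new Map([['gcActions', '[]']]);

const getGCActions = () => new Set(JSON.parse(kvStore.get('gcActions')));
const setGCActions = actions =>
  kvStore.set('gcActions', JSON.stringify([...actions].sort()));
const addGCActions = newActions => {
  const actions = getGCActions();
  for (const action of newActions) {
    actions.add(action);
  }
  setGCActions(actions);
};

// Old pattern: snapshot the stored set, mutate the snapshot, overwrite at the end.
// Anything a helper adds to the store in the middle is silently discarded.
function processOld(changes, helper) {
  const actions = getGCActions();
  for (const change of changes) {
    actions.add(change);
    helper(change);
  }
  setGCActions(actions);
}

// New pattern: accumulate only the *new* actions in RAM, then merge them
// into whatever the store holds at the end.
function processNew(changes, helper) {
  const newActions = new Set();
  for (const change of changes) {
    newActions.add(change);
    helper(change);
  }
  addGCActions(newActions);
}
```

Because actions are only ever added in this flow, merging at the end costs one read and one write, the same as the old snapshot approach, while staying correct when a callee also calls `addGCActions`.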
This adds a new (failing) test of #7212, enhances some other tests to cover the same thing, and uncomments a portion of upgrade.test.js which was commented out when we discovered the bug. These will only pass when the kernel properly retires unreachable objects that have just been abandoned by their owning vat. The new test (gc-kernel-orphan.test.js) also checks that vat termination on the same crank that retires an object will not cause a panic.
If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat is dead, it can't retire the koid by itself, so the kernel needs to do the retirement on the vat's behalf.

We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use abandonKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's abandonKernelObjects() will get processed promptly.

This also changes getObjectRefCount() to tolerate deleted krefs (i.e. missing `koNN.refCount`) by just returning 0,0. This fixes a potential kernel panic in the new approach, when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), and the target vat does syscall.exit during the delivery. The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which it didn't.

This occurred in the "zoe - secondPriceAuction -- valid input" unit test, where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj.

Also rename abandonKernelObject back to orphanKernelObject; the name fits better now.

closes #7212
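The flow in this commit message can be sketched as a toy model. It assumes in-memory `Map`s in place of the kernel's kvStore records, and the record shape, the `importers` table, and the action strings are all hypothetical; only the names `getObjectRefCount`, `orphanKernelObject`, `processRefcounts`, and `maybeFreeKrefs` come from the text above.

```javascript
// Toy kernel state; every shape here is an illustrative assumption.
const kernelObjects = new Map(); // kref -> { owner, reachable, recognizable }
const importers = new Map(); // kref -> Set of vatIDs that can recognize it
const maybeFreeKrefs = new Set();
const gcActions = new Set();

// Tolerate deleted krefs: a missing record reads as 0,0 instead of throwing.
function getObjectRefCount(kref) {
  const ko = kernelObjects.get(kref);
  if (!ko) {
    return { reachable: 0, recognizable: 0 };
  }
  return { reachable: ko.reachable, recognizable: ko.recognizable };
}

// Mark the kref as ownerless and queue it for re-examination.
function orphanKernelObject(kref) {
  const ko = kernelObjects.get(kref);
  if (!ko) {
    return;
  }
  ko.owner = undefined;
  maybeFreeKrefs.add(kref);
}

// Retire anything that is both orphaned and unreachable, notifying importers.
function processRefcounts() {
  for (const kref of maybeFreeKrefs) {
    const { reachable } = getObjectRefCount(kref);
    const ko = kernelObjects.get(kref);
    if (ko && !ko.owner && reachable === 0) {
      for (const vatID of importers.get(kref) || []) {
        gcActions.add(`${vatID} retireImport ${kref}`);
      }
      importers.delete(kref);
      kernelObjects.delete(kref);
    }
  }
  maybeFreeKrefs.clear();
}
```

In this sketch a kref that was already retired and deleted earlier in the same crank can still sit in `maybeFreeKrefs`; because `getObjectRefCount` returns 0,0 for it, the second pass is a harmless no-op rather than a panic.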
closes: #10071

## Description

I believe #8695 introduced a dependency of the bootstrap vat of this test on the GC behavior of the engine, which makes it flaky. This applies the common fix of using an xs-worker instead.

### Security Considerations

None

### Scaling Considerations

None

### Documentation Considerations

None

### Testing Considerations

Hopefully the flake is gone

### Upgrade Considerations

None
fix(swingset): retire unreachable orphans
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.
The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat can't retire it by itself, the kernel needs to
do the retirement on its behalf.
We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.
I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.
Changes getObjectRefCount to tolerate deleted krefs (missing
`koNN.refCount`) by just returning 0,0. This fixes a potential kernel
panic in the new approach, when a kref is recognizable by one vat but
only reachable by a send-message on the run-queue, then becomes
unreachable as that message is delivered (the run-queue held the last
strong reference), if the target vat does syscall.exit during the
delivery. The decref pushes the kref onto maybeFreeKrefs, the
terminateVat retires the merely-recognizable now-orphaned kref, then
processRefcounts used getObjectRefCount() to grab the refcount for the
now-retired (and deleted) kref, which asserted that the koNN.refCount
key still existed, which it didn't.
This occurred in "zoe - secondPriceAuction -- valid input", where the
contract did syscall.exit in response to a timer wake() message sent
to a single-use wakeObj.
closes #7212