Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

retire abandoned unreachables #8695

Merged
merged 7 commits into from
Aug 14, 2024

Conversation

warner
Copy link
Member

@warner warner commented Dec 24, 2023

fix(swingset): retire unreachable orphans

If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called syscall.abandonExports, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat can't retire it by itself, the kernel needs to
do the retirement on its behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

Changes getObjectRefCount to tolerate deleted krefs (missing
koNN.refCount) by just returning 0,0. This fixes a potential kernel
panic in the new approach, when a kref is recognizable by one vat but
only reachable by a send-message on the run-queue, then becomes
unreachable as that message is delivered (the run-queue held the last
strong reference), if the target vat does syscall.exit during the
delivery. The decref pushes the kref onto maybeFreeKrefs, the
terminateVat retires the merely-recognizable now-orphaned kref, then
processRefcounts used getObjectRefCount() to grab the refcount for the
now-retired (and deleted) kref, which asserted that the koNN.refCount
key still existed, which didn't.

This occured in zoe - secondPriceAuction -- valid input , where the
contract did syscall.exit in response to a timer wake() message sent
to a single-use wakeObj.

closes #7212

@warner warner added the SwingSet package: SwingSet label Dec 24, 2023
@warner warner self-assigned this Dec 24, 2023
@warner warner force-pushed the warner/7212-retire-abandoned-unreachables branch from 44a5aae to 3a51359 Compare June 25, 2024 08:15
Copy link

cloudflare-workers-and-pages bot commented Jun 25, 2024

Deploying agoric-sdk with  Cloudflare Pages  Cloudflare Pages

Latest commit: 9dcdabb
Status: ✅  Deploy successful!
Preview URL: https://b24e341d.agoric-sdk.pages.dev
Branch Preview URL: https://warner-7212-retire-abandoned.agoric-sdk.pages.dev

View logs

@warner warner force-pushed the warner/7212-retire-abandoned-unreachables branch 3 times, most recently from 219a897 to 5e34f0f Compare June 27, 2024 21:17
Copy link
Contributor

mergify bot commented Jun 27, 2024

⚠️ The sha of the head commit of this PR conflicts with #9604. Mergify cannot evaluate rules on this PR. ⚠️

@warner
Copy link
Member Author

warner commented Jun 27, 2024

Security Considerations

none

Scaling Considerations

none

Documentation Considerations

None needed, this is internal to the kernel. No userspace-reliant behavior is changing (userspace vats cannot sense GC anyways).

Testing Considerations

The swingset unit tests included in the PR should be sufficient.

Upgrade Considerations

This changes kernel behavior, to avoid leaking object state. Deployed kernels may have encountered this situation already, in which case they will still have non-retired abandoned non-reachable objects. This PR makes no attempt to remediate old state like that: those objects will continue to exist until the recognizing vat is terminated or chooses to stop recognizing the object (with syscall.retireImports, which is unlikely).

This is merely a storage consumption concern, not a correctness concern. The importing (recognizing) vat doesn't know or care why the object remains unretired: perhaps some other vat can still reach it, or the exporting vat itself. And the previously-exporting vat is either dead or knows nothing about the object, so it doesn't care either. The kernel is the only one in a position to retire the object. The consequence is a few extra DB entries, and whatever is being retained by the WeakMap or WeakMapStore in the importing vat.

According to the notes in the original issue (#7212), the mainnet kernel has only one object this this state (as of six months ago), and the state it keeps alive is not significant. However, when we delete the old price-feed vats, it will abandon many hundreds of thousands of non-reachable objects, so it is important that we land this fix before deleting those vats.

@warner
Copy link
Member Author

warner commented Jun 27, 2024

best reviewed one commit at a time, the first several are small refactorings

@warner warner marked this pull request as ready for review June 27, 2024 21:19
@warner warner changed the base branch from master to warner/8687-controller-terminateVat June 27, 2024 21:20
@warner warner requested a review from mhofman June 27, 2024 21:27
@warner warner force-pushed the warner/8687-controller-terminateVat branch from 1a38cf2 to e05394d Compare June 27, 2024 21:39
@warner warner force-pushed the warner/7212-retire-abandoned-unreachables branch from 5e34f0f to 0285d68 Compare June 27, 2024 21:39
@warner warner force-pushed the warner/7212-retire-abandoned-unreachables branch 2 times, most recently from 081cfba to afae8a6 Compare July 2, 2024 06:03
@warner warner force-pushed the warner/8687-controller-terminateVat branch from e05394d to 4fdb16e Compare July 2, 2024 06:03
@warner warner force-pushed the warner/8687-controller-terminateVat branch from 4fdb16e to 6c53a99 Compare July 10, 2024 17:05
@warner warner force-pushed the warner/7212-retire-abandoned-unreachables branch from afae8a6 to 8f0c328 Compare July 10, 2024 17:05
@warner warner requested a review from gibson042 July 10, 2024 17:11
@warner warner force-pushed the warner/8687-controller-terminateVat branch from 6c53a99 to 794a9c2 Compare July 11, 2024 04:38
@warner warner force-pushed the warner/7212-retire-abandoned-unreachables branch 2 times, most recently from c31a1a8 to d5c5c13 Compare July 11, 2024 23:51
@warner warner force-pushed the warner/8687-controller-terminateVat branch from 794a9c2 to 2990dfb Compare July 11, 2024 23:51
Copy link
Member

@gibson042 gibson042 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable, but I don't think I'm sufficiently comfortable with the intricacies to sign off without a walkthrough.

Copy link
Member

@gibson042 gibson042 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, I submitted the review comment without the inline comments and thereby lost them. This is an attempt at reconstruction.

packages/SwingSet/src/kernel/state/kernelKeeper.js Outdated Show resolved Hide resolved
Comment on lines 1544 to 1612
}
if (!ownerVatID) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
if (!ownerVatID) {
}
if (!ownerVatID) {

packages/SwingSet/test/abandon-export.test.js Outdated Show resolved Hide resolved
@warner warner force-pushed the warner/7212-retire-abandoned-unreachables branch from 75d1314 to c528ec2 Compare August 11, 2024 20:00
@warner warner force-pushed the warner/8687-controller-terminateVat branch from 52e0840 to c9b014a Compare August 12, 2024 22:01
@warner warner requested a review from gibson042 August 12, 2024 22:02
@warner warner force-pushed the warner/7212-retire-abandoned-unreachables branch from c528ec2 to 19a5809 Compare August 12, 2024 22:02
@warner warner force-pushed the warner/8687-controller-terminateVat branch from c9b014a to 5b67afe Compare August 12, 2024 22:54
@warner warner force-pushed the warner/7212-retire-abandoned-unreachables branch 2 times, most recently from 6956b7d to 6813383 Compare August 12, 2024 23:51
@warner warner force-pushed the warner/8687-controller-terminateVat branch from 5b67afe to fba6e3a Compare August 12, 2024 23:51
@warner warner force-pushed the warner/7212-retire-abandoned-unreachables branch from 6813383 to aa004f9 Compare August 13, 2024 00:50
@warner warner force-pushed the warner/8687-controller-terminateVat branch from fba6e3a to 693b367 Compare August 13, 2024 00:50
Base automatically changed from warner/8687-controller-terminateVat to master August 13, 2024 01:41
@warner warner force-pushed the warner/7212-retire-abandoned-unreachables branch from aa004f9 to 2a10c9d Compare August 13, 2024 01:49
Copy link
Member

@gibson042 gibson042 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

packages/SwingSet/test/gc-kernel-orphan.test.js is difficult to follow, but on close inspection does look right. 👍

packages/SwingSet/src/kernel/state/kernelKeeper.js Outdated Show resolved Hide resolved
packages/SwingSet/src/kernel/kernelSyscall.js Outdated Show resolved Hide resolved
packages/SwingSet/src/kernel/state/kernelKeeper.js Outdated Show resolved Hide resolved
packages/SwingSet/src/kernel/state/kernelKeeper.js Outdated Show resolved Hide resolved
packages/SwingSet/src/kernel/state/kernelKeeper.js Outdated Show resolved Hide resolved
packages/SwingSet/test/abandon-export.test.js Outdated Show resolved Hide resolved
packages/SwingSet/test/abandon-export.test.js Outdated Show resolved Hide resolved
packages/SwingSet/test/abandon-export.test.js Outdated Show resolved Hide resolved
packages/SwingSet/test/abandon-export.test.js Outdated Show resolved Hide resolved
warner added 7 commits August 14, 2024 12:54
Uncomment a test case that was disabled because of v8/xsnap
weirdness. Seems to work now. Unrelated to the 7212 work
the "TODO: decref #2069 auxdata" comment was removed, because that will
be the responsibility of deleteKernelObject()
the new function is responsible for deleting the c-list entries, so
callers can stop doing that
Previously, processRefcounts() populated a local RAM cache upon entry
with getGCActions(), added things to it during operation, then saved
it with setGCActions() at the end. The intention was to avoid lots of
redundant kvStore gets/sets as we loop over a (potentially large)
number of changes. But, this approach would lose actions if we called
other functions in the middle which did their own additions (eg with
addGCActions).

Since we only add actions, never remove them, it's just as fast (and
infinitely less buggy) to accumulate a RAM cache of *new* actions, and
add them all at the end, with addGCActions().

We don't make calls in the middle yet, but upcoming changes might
start doing that, so best to fix this bug now.

closes #5054
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat is dead, it can't retire the koid by itself,
the kernel needs to do the retirement on the vat's behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

This also changes getObjectRefCount() to tolerate deleted
krefs (i.e. missing `koNN.refCount`) by just returning 0,0. This fixes
a potential kernel panic in the new approach, when a kref is
recognizable by one vat but only reachable by a send-message on the
run-queue, then becomes unreachable as that message is delivered (the
run-queue held the last strong reference), and if the target vat does
syscall.exit during the delivery. The decref pushes the kref onto
maybeFreeKrefs, the terminateVat retires the merely-recognizable
now-orphaned kref, then processRefcounts used getObjectRefCount() to
grab the refcount for the now-retired (and deleted) kref, which
asserted that the koNN.refCount key still existed, which didn't.

This occured in "zoe - secondPriceAuction -- valid input" unit test ,
where the contract did syscall.exit in response to a timer wake()
message sent to a single-use wakeObj.

Also rename abandonKernelObject back to orphanKernelObject, the name
fits better now.

closes #7212
@warner warner force-pushed the warner/7212-retire-abandoned-unreachables branch from 2a10c9d to 9dcdabb Compare August 14, 2024 17:55
@warner warner added the automerge:rebase Automatically rebase updates, then merge label Aug 14, 2024
@mergify mergify bot merged commit 7f12e80 into master Aug 14, 2024
90 checks passed
@mergify mergify bot deleted the warner/7212-retire-abandoned-unreachables branch August 14, 2024 18:43
mergify bot pushed a commit that referenced this pull request Sep 16, 2024
closes: #10071

## Description

I believe #8695 introduced a dependency of the boostrap vat of this test on the gc behavior of the engine, which makes it flaky. Applies the common fix to use an xs-worker instead.

### Security Considerations
None

### Scaling Considerations
None

### Documentation Considerations
None

### Testing Considerations
Hopefully the flake is gone

### Upgrade Considerations
None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge:rebase Automatically rebase updates, then merge SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kernel should retire abandoned non-reachable objects
2 participants