Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(swingset): workaround XS garbage collection bugs #6664

Merged
merged 1 commit into from
Dec 14, 2022

Conversation

mhofman
Copy link
Member

@mhofman mhofman commented Dec 13, 2022

refs: #6588

Description

This changes how the transcript replay logic handles syscalls so that virtual collection metadata vatstoreGet syscalls can happen at different times than when it was originally recorded.

Security Considerations

This ability to skip, and more importantly, fall-through syscalls should be limited to idempotent syscalls. In this case vatstoreGet of vc metadata will always return the same result as the data never changes, and a "get" does not perform any changes. By collocating the compare function definition and the usage, we avoid this capability being exposed inadvertently to the embedder.

Documentation Considerations

How to apply this change:
In agoric-sdk:

curl https://patch-diff.githubusercontent.com/raw/Agoric/agoric-sdk/pull/6664.diff | patch -p1

Testing Considerations

Tested against stake2me's agoric_2012-12-11.tar snapshot which contains an arachnophobia issue for vat6:

2022-12-13T00:07:49.608Z launch-chain: Launching SwingSet kernel
2022-12-13T00:07:59.792Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:07:59.793Z SwingSet: kernel: expected: {"0":"vatstoreGet","1":"vc.12.|schemata","length":2}
2022-12-13T00:07:59.793Z SwingSet: kernel: got     : {"0":"vatstoreSet","1":"vom.dkind.11","2":"{\"kindID\":\"11\",\"tag\":\"IST payment\",\"nextInstanceID\":5909,\"unfaceted\":true}","length":3}
2022-12-13T00:07:59.794Z SwingSet: kernel:   mitigation: ignoring extra vc syscall
2022-12-13T00:07:59.794Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:07:59.794Z SwingSet: kernel: expected: {"0":"vatstoreGet","1":"vc.12.|label","length":2}
2022-12-13T00:07:59.794Z SwingSet: kernel: got     : {"0":"vatstoreSet","1":"vom.dkind.11","2":"{\"kindID\":\"11\",\"tag\":\"IST payment\",\"nextInstanceID\":5909,\"unfaceted\":true}","length":3}
2022-12-13T00:07:59.794Z SwingSet: kernel:   mitigation: ignoring extra vc syscall
2022-12-13T00:07:59.819Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:07:59.819Z SwingSet: kernel: expected: {"0":"vatstoreGet","1":"vc.12.|schemata","length":2}
2022-12-13T00:07:59.819Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.7.|o+11/5908","length":2}
2022-12-13T00:07:59.819Z SwingSet: kernel:   mitigation: ignoring extra vc syscall
2022-12-13T00:07:59.819Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:07:59.819Z SwingSet: kernel: expected: {"0":"vatstoreGet","1":"vc.12.|label","length":2}
2022-12-13T00:07:59.819Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.7.|o+11/5908","length":2}
2022-12-13T00:07:59.820Z SwingSet: kernel:   mitigation: ignoring extra vc syscall
2022-12-13T00:08:00.382Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:08:00.382Z SwingSet: kernel: expected: {"0":"vatstoreGet","1":"vc.29.|schemata","length":2}
2022-12-13T00:08:00.382Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.7.|o+11/5915","length":2}
2022-12-13T00:08:00.382Z SwingSet: kernel:   mitigation: ignoring extra vc syscall
2022-12-13T00:08:00.382Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:08:00.383Z SwingSet: kernel: expected: {"0":"vatstoreGet","1":"vc.29.|label","length":2}
2022-12-13T00:08:00.383Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.7.|o+11/5915","length":2}
2022-12-13T00:08:00.383Z SwingSet: kernel:   mitigation: ignoring extra vc syscall
2022-12-13T00:08:00.460Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:08:00.460Z SwingSet: kernel: expected: {"0":"vatstoreSet","1":"vom.dkind.11","2":"{\"kindID\":\"11\",\"tag\":\"IST payment\",\"nextInstanceID\":5918,\"unfaceted\":true}","length":3}
2022-12-13T00:08:00.460Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.12.|schemata","length":2}
2022-12-13T00:08:00.460Z SwingSet: kernel:   mitigation: falling through to syscall handler
2022-12-13T00:08:00.462Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:08:00.462Z SwingSet: kernel: expected: {"0":"vatstoreSet","1":"vom.dkind.11","2":"{\"kindID\":\"11\",\"tag\":\"IST payment\",\"nextInstanceID\":5918,\"unfaceted\":true}","length":3}
2022-12-13T00:08:00.462Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.12.|label","length":2}
2022-12-13T00:08:00.462Z SwingSet: kernel:   mitigation: falling through to syscall handler
2022-12-13T00:08:00.782Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:08:00.783Z SwingSet: kernel: expected: {"0":"vatstoreSet","1":"vom.dkind.11","2":"{\"kindID\":\"11\",\"tag\":\"IST payment\",\"nextInstanceID\":5923,\"unfaceted\":true}","length":3}
2022-12-13T00:08:00.783Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.393.|schemata","length":2}
2022-12-13T00:08:00.783Z SwingSet: kernel:   mitigation: falling through to syscall handler
2022-12-13T00:08:00.785Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:08:00.785Z SwingSet: kernel: expected: {"0":"vatstoreSet","1":"vom.dkind.11","2":"{\"kindID\":\"11\",\"tag\":\"IST payment\",\"nextInstanceID\":5923,\"unfaceted\":true}","length":3}
2022-12-13T00:08:00.785Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.393.|label","length":2}
2022-12-13T00:08:00.786Z SwingSet: kernel:   mitigation: falling through to syscall handler
2022-12-13T00:08:00.897Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:08:00.897Z SwingSet: kernel: expected: {"0":"vatstoreGet","1":"vc.7.|o+11/5924","length":2}
2022-12-13T00:08:00.897Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.12.|schemata","length":2}
2022-12-13T00:08:00.897Z SwingSet: kernel:   mitigation: falling through to syscall handler
2022-12-13T00:08:00.898Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:08:00.898Z SwingSet: kernel: expected: {"0":"vatstoreGet","1":"vc.7.|o+11/5924","length":2}
2022-12-13T00:08:00.898Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.12.|label","length":2}
2022-12-13T00:08:00.898Z SwingSet: kernel:   mitigation: falling through to syscall handler
2022-12-13T00:08:02.148Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:08:02.148Z SwingSet: kernel: expected: {"0":"vatstoreGet","1":"vc.7.|o+11/5940","length":2}
2022-12-13T00:08:02.148Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.12.|schemata","length":2}
2022-12-13T00:08:02.148Z SwingSet: kernel:   mitigation: falling through to syscall handler
2022-12-13T00:08:02.149Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:08:02.149Z SwingSet: kernel: expected: {"0":"vatstoreGet","1":"vc.7.|o+11/5940","length":2}
2022-12-13T00:08:02.149Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.12.|label","length":2}
2022-12-13T00:08:02.149Z SwingSet: kernel:   mitigation: falling through to syscall handler
2022-12-13T00:08:02.522Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:08:02.523Z SwingSet: kernel: expected: {"0":"vatstoreGet","1":"vc.12.|schemata","length":2}
2022-12-13T00:08:02.523Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.7.|o+11/5946","length":2}
2022-12-13T00:08:02.523Z SwingSet: kernel:   mitigation: ignoring extra vc syscall
2022-12-13T00:08:02.523Z SwingSet: kernel: anachrophobia strikes vat v6
2022-12-13T00:08:02.523Z SwingSet: kernel: expected: {"0":"vatstoreGet","1":"vc.12.|label","length":2}
2022-12-13T00:08:02.523Z SwingSet: kernel: got     : {"0":"vatstoreGet","1":"vc.7.|o+11/5946","length":2}
2022-12-13T00:08:02.523Z SwingSet: kernel:   mitigation: ignoring extra vc syscall

Confirmed working on our own archival node.

@mhofman mhofman requested review from warner and FUDCo December 13, 2022 00:17
@mhofman mhofman changed the base branch from master to release-pismo December 13, 2022 00:35
@mhofman mhofman force-pushed the mhofman/6588-anachrophobia-hotfix branch from 087bd09 to 79cc13c Compare December 13, 2022 00:56
Copy link
Contributor

@FUDCo FUDCo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is fine (well, "fine"), but this kind of hackery should be documented for what it is.

Comment on lines 32 to 88
export function requireIdenticalExceptStableVCSyscalls(
vatID,
originalSyscall,
newSyscall,
) {
const error = requireIdentical(vatID, originalSyscall, newSyscall);

if (error) {
if (
originalSyscall[0] === 'vatstoreGet' &&
vcSyscallRE.test(originalSyscall[1])
) {
console.warn(` mitigation: ignoring extra vc syscall`);
return extraSyscall;
}

if (newSyscall[0] === 'vatstoreGet' && vcSyscallRE.test(newSyscall[1])) {
console.warn(` mitigation: falling through to syscall handler`);
return missingSyscall;
}
}

return error;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a detailed comment explaining the whys and whats of this.

@mhofman mhofman requested a review from FUDCo December 13, 2022 12:16
Copy link
Contributor

@FUDCo FUDCo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@warner warner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will do the kind of hack we want out of it.

Let's make the commit comment clear that we're talking about a hack to workaround a bug in the deployed XS, and that we don't want this change to land on trunk (we want to continue to be sensitive to XS bugs on trunk, so we can find and fix them before the bulldozer upgrade). The existing PR title (which I think Github will copy into the headline of the merge commit) is insufficiently alarming. It should say something about a bug workaround.

And the comment body should mention that this should not be applied to trunk. We have a general policy of merging branches back into trunk (to record the fact that we've incorporated all such changes into trunk: i.e. we aren't regressing in some way). If we continue that policy, then when we merge this particular part, we should use --strategy=ours or equivalent, to get a git commit parent relationship, but not actually incorporate this code into trunk.

// we reach the end of the transcript, we would error instead of
// falling through and performing the syscall. However liveslots does not
// perform vc metadata get syscalls unless it needs to provide an entry
// to the program, which always results in subsequent syscalls.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that's strictly true, but the chances are low, and I can't think of a clean way to deal with it, so I'm ok with this approach. The case would be something like delivery 1 doing:

  baggage.set('key', makeBigScalarMapStore(..))

That creates the new collection, creates its initial Representative, uses that to look up the vref and stash it in baggage, then drops the Representative.

Then delivery 2 does:

  const p = E(presence).foo();
  const c = baggage.get('key');
  // let c fall out of scope without actually using it
  return p;

Delivery 2 will send foo, then do the vref lookup in baggage, determine that it needs a Representative for the collection, either finds it in slotToVal or doesn't, if it doesn't then it resurrects one (which causes the two vatstoreGet syscalls). But, if it really doesn't do anything else with the collection, and it doesn't do anything else syscall-worthy (i.e p doesn't resolve until later), then it's not going to do any other syscalls in that delivery.

If the original execution still had the slotToVal entry, the recorded transcript will contain the foo send and nothing else. If the replay finds that it doesn't have the slotToVal entry, the replay will do foo followed by the two vatstoreGets. When the first vatstoreGet arrives at simulateSyscall, it will see playbackSyscalls.length === 0, and it will bypass the whole while loop and go directly to the final throw replayError.

I don't think that's going to be a common pattern (and I guess we could scan the existing transcripts for it, and convince ourselves that pismoA doesn't ever create it). But if you wanted to fix it, you might add an extra check after the while loop but before the if (!replayError), so if we get past the while loop, but the syscall is the vatstoreGet kind, then it returns undefined instead of throwing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do that, it'd be worth refactoring the "is this the right kind of syscall" check into a single function, called twice from requireIdenticalExceptStableVCSyscalls and once from here.

... oh, except, right, that's not particularly easy to configure/activate as just replacing the compareSyscalls function. Hm.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that edge case is not worth fixing if we're not triggering it. Worst case, it gets triggered, some validators hit a new anachrophobia, and we provide a new patch that does the more complicated check.

if (!compareError) {
return s.response;
} else if (compareError !== extraSyscall) {
replayError = compareError;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've walked through the cases and convinced myself this will do the right thing, but it'd be lovely to have a short unit test that demonstrates the different permutations this is prepared to handle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I've thought about it, but given it's not something I plan on maintaining long term, I didn't. Happy to think about what some unit tests would look like if you feel strongly about them here.

@@ -1,5 +1,21 @@
import djson from '../../lib/djson.js';

// Indicate that a syscall is missing from the transcript but is safe to
// perform during replay
const missingSyscall = Symbol('missing transcript syscall');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason to use a Symbol rather than just a string? Or an empty object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strings are more forgeable and I have a habit of using unique symbol for sentinel values.

DO NOT MERGE INTO MAIN BRANCH
Workaround for #6588

During transcript replay, handle divergent syscalls which retrieve
stable Virtual Collection metadata. These can happen at different times
than recorded during the transcript because of some bugs in the XS
engine making GC sensitive to reload from heap snapshots.
@mhofman mhofman force-pushed the mhofman/6588-anachrophobia-hotfix branch from 5d3f248 to b2b336f Compare December 13, 2022 23:42
@mhofman mhofman changed the title feat(swingset): handle vc syscalls during transcript replay fix(swingset): workaround XS garbage collection bugs Dec 13, 2022
@mhofman mhofman added the force:integration Force integration tests to run on PR label Dec 13, 2022
@mhofman
Copy link
Member Author

mhofman commented Dec 13, 2022

Once all tests have run again, I will manually "merge" using GH to make sure the commit message is kept (I believe all mergify options will either create a merge commit or drop the commit messages)

@mhofman mhofman merged commit 12e97f1 into release-pismo Dec 14, 2022
@mhofman mhofman deleted the mhofman/6588-anachrophobia-hotfix branch December 14, 2022 01:44
mhofman added a commit that referenced this pull request Jan 25, 2023
Workaround for #6588

During transcript replay, handle divergent syscalls which retrieve
stable Virtual Collection metadata. These can happen at different times
than recorded during the transcript because of some bugs in the XS
engine making GC sensitive to reload from heap snapshots.

cherry picked from commit 12e97f1
Merging into master after further consideration
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
force:integration Force integration tests to run on PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants