feat(swingset-tools): Expand replay tool for anachrophobia diagnosis #6723

@mhofman mhofman commented Dec 27, 2022

refs: #6588

This PR is targeting release-pismo as this is where I've done most of the work, but it should be safe to merge into master, which I plan on doing after approval.

I spent a lot of time cleaning up the changes to make the implementation cleaner and the individual commits relevant. At this point it's almost a rewrite of the replay-transcript.js tool, so reviewing commit-by-commit may be easier, though it is not necessary.

Description

In the past month I've massively expanded the capabilities of the replay tool to support my investigation of the anachrophobia issue experienced by validators on mainnet (#6588).

This PR adds the following features:

  • Extract the bundles from swingstore using the same tool used to extract transcripts.
  • Add a kind of "slog" file for the replay tool to track what it does regarding snapshots in a structured way
  • Switch from inline const-based config to command-line options, including the ability to load options from a config file
  • Fix errors not exiting the process
  • Handle virtual collection syscall divergence based on the fix from #6664 (fix(swingset): workaround XS garbage collection bugs)
    • that change was first implemented as part of the replay tool, and then moved to Swingset. The replay tool now uses the Swingset implementation with a couple of tweaks
  • Force snapshots on interval to replicate snapshot schedule of Swingset (snapshot activity is currently not recorded in the transcript)
  • Execute multiple workers concurrently: when a snapshot is loaded, keep previous workers around according to configuration, and send deliveries to all workers concurrently.
    • Option to force load snapshots that are taken. Used to make divergence bisection easier
    • Option to load specific snapshots from config at given deliveryNums without requiring the load commands to be included in the transcript (used to test version compatibility)
    • Option to keep workers loaded from snapshots matching specific constraints:
      • first N, last M, or at given intervals
      • at specific deliveryNum
      • if explicitly loaded by config/transcript
      • when different snapshots exist for a given deliveryNum
    • Report (on console output) when snapshots or syscalls diverge between workers
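
As a rough illustration of the multi-worker behavior listed above, here is a minimal sketch of a retention policy plus concurrent delivery with divergence reporting. It is not the code from replay-transcript.js: `shouldKeepWorker`, `deliverToAll`, the policy fields, and the worker shape (`name`, `loadedAt`, `explicit`, `deliver`) are all hypothetical names chosen for this example. The sketch treats the first worker as the reference, which is an assumption. The "different snapshots for a given deliveryNum" constraint is not modeled.

```javascript
// Hypothetical sketch: none of these names come from replay-transcript.js.

// Decide whether a worker loaded from a snapshot should be kept alive.
// `index` is the worker's position among the `total` workers loaded so far.
function shouldKeepWorker(worker, index, total, policy) {
  const { firstN = 0, lastM = 0, interval = 0, deliveryNums = [] } = policy;
  if (index < firstN) return true; // one of the first N workers
  if (index >= total - lastM) return true; // one of the last M workers
  if (interval > 0 && index % interval === 0) return true; // at a regular interval
  if (deliveryNums.includes(worker.loadedAt)) return true; // specific deliveryNum
  if (worker.explicit) return true; // explicitly loaded by config or transcript
  return false;
}

// Send one delivery to every worker concurrently, then compare the syscalls
// each worker made against those of the first (reference) worker.
async function deliverToAll(workers, delivery, report = console.log) {
  const results = await Promise.all(workers.map(w => w.deliver(delivery)));
  const reference = JSON.stringify(results[0]);
  const diverged = [];
  results.forEach((syscalls, i) => {
    if (JSON.stringify(syscalls) !== reference) {
      diverged.push(workers[i].name);
      report(
        `delivery ${delivery.deliveryNum}: ${workers[i].name} diverged from ${workers[0].name}`,
      );
    }
  });
  return diverged;
}
```

Comparing serialized syscall lists is the simplest possible check. The PR itself relies on Swingset's `compareSyscalls`, which tolerates the known virtual collection differences; this sketch does not attempt that.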

Security Considerations

None; this is debug tooling.

Documentation Considerations

The new command-line options could be better documented (with CLI-based documentation), but I didn't feel like dealing with the quirks of the full yargs at this point.
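
For readers unfamiliar with the pattern, the following sketch shows one way options from a config file can be layered under command-line flags. It deliberately avoids yargs and uses a small hand-rolled parser, so it does not reflect how the tool actually parses arguments. Apart from `useCustomSnapStore`, which appears in the reviewed diff (its boolean type is assumed here), every option name is hypothetical.

```javascript
// Hypothetical sketch; apart from `useCustomSnapStore`, every option name
// below is invented for illustration.
const DEFAULTS = {
  useCustomSnapStore: false,
  snapshotInterval: 0,
  keepWorkerLast: 1,
};

// Parse `--name value`, `--flag`, and `--no-flag` without any library.
function parseFlags(args) {
  const flags = {};
  for (let i = 0; i < args.length; i += 1) {
    const arg = args[i];
    if (!arg.startsWith('--')) throw new Error(`unexpected argument: ${arg}`);
    const name = arg.slice(2);
    const next = args[i + 1];
    if (name.startsWith('no-')) {
      flags[name.slice(3)] = false; // --no-flag turns a boolean off
    } else if (next === undefined || next.startsWith('--')) {
      flags[name] = true; // bare --flag turns a boolean on
    } else {
      // Integer-looking values become numbers, everything else stays a string.
      flags[name] = /^-?\d+$/.test(next) ? Number(next) : next;
      i += 1; // skip the value we just consumed
    }
  }
  return flags;
}

// Precedence: defaults < config file contents < command-line flags.
function resolveOptions(args, configJson = '{}') {
  const options = {
    ...DEFAULTS,
    ...JSON.parse(configJson),
    ...parseFlags(args),
  };
  for (const key of Object.keys(options)) {
    if (!(key in DEFAULTS)) throw new Error(`unknown option: ${key}`);
  }
  return options;
}
```

Here the config is passed as a JSON string to keep the example self-contained; a real tool would read it from the file named on the command line. For example, `resolveOptions(['--snapshotInterval', '200'], '{"snapshotInterval":1000}')` yields a `snapshotInterval` of 200, because the flag overrides the file.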

Testing Considerations

The main features were used extensively over the last month. The refactors and rebase done for cleanup were quickly tested again.

@warner warner left a comment

Modulo those two questions, this looks good. Just make sure it doesn't break local replay on the non-testing path (if we temporarily break debugging replay for local workers, that's not a big deal).

I'm sure you do, but please make sure you've got a plan in mind for getting this landed on trunk sooner or later.

packages/SwingSet/src/types-external.js (outdated, resolved)
})
);

const snapStore = argv.useCustomSnapStore

Keep an eye on #6755, it might change the snapStore API.

@mhofman mhofman left a comment

just make sure it doesn't break local replay on the non-testing path (if we temporarily break debugging replay for local workers, that's not a big deal)

I know xs-worker definitely works for normal execution as I've been running multiple following nodes including this change and the ones in #6724 on mainnet. I don't believe we have any local manager type in production use. Also I don't think this PR changes anything in the production paths besides types, adding a couple exports, and providing an extra argument to compareSyscalls.

packages/SwingSet/src/types-external.js (outdated, resolved)

mhofman commented Jan 12, 2023

I'm sure you do, but please make sure you've got a plan in mind for getting this landed on trunk sooner or later.

I definitely want to merge this in trunk as well once this is approved, but we still need to be ok with the mitigation in #6664 landing as well. It's possible to merge the tooling changes without the replay behavioral change, but we'd lose a lot of the functionality.

@mhofman mhofman force-pushed the mhofman/6588-expand-replay-tool-for-anachrophobia-diagnosis branch from c112b39 to cafe0e8 Compare January 12, 2023 15:17
@mhofman mhofman force-pushed the mhofman/6588-expand-replay-tool-for-anachrophobia-diagnosis branch from a163f96 to c068619 Compare January 12, 2023 17:00
@mhofman mhofman force-pushed the mhofman/6588-expand-replay-tool-for-anachrophobia-diagnosis branch from c068619 to 9a9ea0b Compare January 12, 2023 17:03
@mhofman mhofman added the automerge:rebase Automatically rebase updates, then merge label Jan 12, 2023
@mergify mergify bot merged commit a756bb6 into release-pismo Jan 12, 2023
@mergify mergify bot deleted the mhofman/6588-expand-replay-tool-for-anachrophobia-diagnosis branch January 12, 2023 17:49