Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more backpressure from full validation to syncing #4715

Closed
teor2345 opened this issue Jun 28, 2022 · 5 comments · Fixed by #4726
Closed

Add more backpressure from full validation to syncing #4715

teor2345 opened this issue Jun 28, 2022 · 5 comments · Fixed by #4726
Assignees
Labels
A-network Area: Network protocol updates or fixes C-bug Category: This is a bug C-security Category: Security issues I-hang A Zebra component stops responding to requests I-heavy Problems with excessive memory, disk, or CPU usage I-slow Problems with performance or responsiveness

Comments

@teor2345
Copy link
Contributor

teor2345 commented Jun 28, 2022

Motivation

When Zebra's full validation slows down:

  • it uses a lot of RAM and CPU verifying blocks
  • any extra verifications are slower because RAM or CPU is fully used
  • the syncer keeps downloading and queueing large numbers of blocks for verification, taking more RAM
  • when blocks timeout, any uncommitted blocks get cancelled, and they have to be downloaded and verified again

We've partially fixed this issue by decreasing the syncer lookahead in PRs #4662, #4670, #4679.
We also improved halo2 verification speed in PR #4699.

Designs

When we impose backpressure on the syncer, we need to verify the lowest blocks first.
This means that the backpressure has to be implemented in the syncer task.
(After that, the individual block download and verify tasks can run out of order.)

Here is one possible fix:

  • when doing checkpoint verification, let the syncer look ahead 400-1200 blocks
  • when doing full verification, let the syncer look ahead 1-max_concurrent_block_requests blocks

400 is the minimum for checkpointing, because the maximum checkpoint gap is 400 blocks.
The max_concurrent_block_requests default might need to be tuned for good full verification performance.

Here is where the syncer limit is implemented:

while self.downloads.in_flight() > self.lookahead_limit {
trace!(
tips.len = self.prospective_tips.len(),
in_flight = self.downloads.in_flight(),
lookahead_limit = self.lookahead_limit,
state_tip = ?self.latest_chain_tip.best_tip_height(),
"waiting for pending blocks",
);
let response = self.downloads.next().await.expect("downloads is nonempty");
Self::handle_block_response(response)?;

@teor2345 teor2345 added C-bug Category: This is a bug S-needs-triage Status: A bug report needs triage P-Low ❄️ C-security Category: Security issues I-hang A Zebra component stops responding to requests I-heavy Problems with excessive memory, disk, or CPU usage I-slow Problems with performance or responsiveness A-network Area: Network protocol updates or fixes and removed P-Low ❄️ labels Jun 28, 2022
@teor2345
Copy link
Contributor Author

teor2345 commented Jun 29, 2022

The halo2 verification performance improvements in PR #4699, and the updated checkpoints in PR #4708 really helped.

CPU and RAM usage are mostly normal, but they occasionally double. There was one block that took a minute to verify, but most of the time verification happens smoothly.

So this change is optional at the moment.

@teor2345
Copy link
Contributor Author

Actually, I just saw a pretty bad stall.

There were multiple sync failures, with 5 minute gaps between verified blocks, using 10.3 GB RAM and 15 full CPUs.

So I think this might actually be a high priority.

@teor2345 teor2345 self-assigned this Jun 29, 2022
@teor2345
Copy link
Contributor Author

This bug was blocking my other tasks, so I ended up doing a rough fix.

@conradoplg
Copy link
Collaborator

The halo2 verification performance improvements in PR #4699, and the updated checkpoints in PR #4708 really helped.

To be clear AFAIK #4699 doesn't directly speed up halo2 verification, we now need to actually support batched verification

@conradoplg
Copy link
Collaborator

The halo2 verification performance improvements in PR #4699, and the updated checkpoints in PR #4708 really helped.

To be clear AFAIK #4699 doesn't directly speed up halo2 verification, we now need to actually support batched verification

Ah nevermind, there was indeed a ~15% improvement in single verification (though we can still improve it a lot by supporting batching)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-network Area: Network protocol updates or fixes C-bug Category: This is a bug C-security Category: Security issues I-hang A Zebra component stops responding to requests I-heavy Problems with excessive memory, disk, or CPU usage I-slow Problems with performance or responsiveness
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants