Fix a deadlock between the crawler and dialer, and other hangs #1950

Merged: 7 commits into ZcashFoundation:main on Apr 7, 2021

Conversation

teor2345
Contributor

@teor2345 teor2345 commented Mar 26, 2021

Motivation

In #1905, Zebra gradually loses peer connections, until there are none left. Then it stops making sync progress.

When this issue happens, a large number of peers are stuck in the attempt_pending state, despite the handshake timeout.

This issue also impacted the peer::Connection refactor in #1817, and a bunch of upcoming zebra-network security fixes.

Solution

  • Stop ignoring inbound message errors
  • Add timeouts during handshake setup
  • Add a timeout to the candidate set update
  • Use the select! macro in the crawler to avoid starvation
  • Spawn a separate task for each handshake
  • Increase the crawl fanout to improve address diversity
  • Avoid starvation when using the select function
  • Fix priority of multiple ready futures in the select function
  • Add correctness documentation
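The timeout items above can be sketched as follows. This is a minimal, std-only illustration and not the code in this PR: it uses a blocking channel and `recv_timeout` as a stand-in for an async handshake future wrapped in a timeout, and `HandshakeOutcome` and `await_handshake` are hypothetical names made up for the example. The point is only that a silent peer produces an explicit timeout result instead of a pending attempt that never resolves.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical outcome type for this sketch; it is not a Zebra type.
#[derive(Debug, PartialEq)]
pub enum HandshakeOutcome {
    /// The peer answered within the limit; the payload stands in for handshake data.
    Connected(String),
    /// The peer stayed silent for the whole limit.
    TimedOut,
    /// The peer side of the channel was dropped before answering.
    PeerGone,
}

/// Wait for a handshake result, but never longer than `limit`.
///
/// Assumption: the receiver stands in for whatever future the real
/// handshake code awaits.
pub fn await_handshake(rx: &mpsc::Receiver<String>, limit: Duration) -> HandshakeOutcome {
    match rx.recv_timeout(limit) {
        Ok(version) => HandshakeOutcome::Connected(version),
        // Turning the expiry into a value lets the caller mark the attempt
        // as failed, rather than leaving it pending.
        Err(RecvTimeoutError::Timeout) => HandshakeOutcome::TimedOut,
        Err(RecvTimeoutError::Disconnected) => HandshakeOutcome::PeerGone,
    }
}
```

A caller would map `TimedOut` and `PeerGone` to a failed attempt, so the address leaves the pending state either way.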

The code in this pull request has:

  • Documentation Comments
  • Existing Unit Tests and Integration Tests

Review

@dconnolly or @oxarbitrage can review - whoever is available.

I'd like to get this merged in the next few days. It doesn't directly conflict with the other security fixes, but it will make them harder to test.

Related Issues

Closes #1905 (based on 3 mainnet and 3 testnet nodes x 3 days of testing).
Closes #1941 (based on 3 testnet nodes x 3 days of testing).
Possibly closes some other related hang or reliability issues.

Deals with part of the #1892 and #1657 refactors.

@teor2345 teor2345 added C-bug Category: This is a bug A-rust Area: Updates to Rust code C-enhancement Category: This is an improvement P-Medium C-security Category: Security issues I-hang A Zebra component stops responding to requests I-heavy Problems with excessive memory, disk, or CPU usage I-slow Problems with performance or responsiveness I-integration-fail Continuous integration fails, including build and test failures I-usability Zebra is hard to understand or use labels Mar 26, 2021
@teor2345 teor2345 added this to the 2021 Sprint 6 milestone Mar 26, 2021
@teor2345 teor2345 self-assigned this Mar 26, 2021
@teor2345
Contributor Author

teor2345 commented Mar 26, 2021

I'm going to put this PR in draft until I've tested a few full syncs over the weekend.

Feel free to review it or run it yourselves - I don't expect to be changing it much.

@teor2345 teor2345 marked this pull request as draft March 26, 2021 11:09
@teor2345
Contributor Author

Coverage is failing due to type errors in a dependency. Maybe the most recent nightly has a bug?

I'll open a PR on Monday if no-one else gets to it first.

@teor2345
Contributor Author

I added a small change that avoids starvation when using the future::select function, and prioritises multiple ready futures correctly.
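To illustrate why selection order matters, here is a small synchronous model of the starvation problem. It is a sketch under stated assumptions, not the change in this PR: the "sources" are plain booleans rather than futures, all function names are hypothetical, and the rotating start index is just one deterministic way to get fairness (as far as I know, the `select!` macro in the `futures` crate picks pseudo-randomly among ready futures instead).

```rust
/// Return the index of the first ready source, always scanning from index 0.
/// This models a selection that is biased towards its first argument.
pub fn pick_biased(ready: &[bool]) -> Option<usize> {
    ready.iter().position(|&r| r)
}

/// Return the index of the first ready source, scanning from `start`
/// and wrapping around, so no position is permanently preferred.
pub fn pick_rotating(ready: &[bool], start: usize) -> Option<usize> {
    let n = ready.len();
    (0..n).map(|offset| (start + offset) % n).find(|&i| ready[i])
}

/// Count how often each source is served over `rounds` rounds.
/// `ready[i]` says whether source `i` is ready in every round.
pub fn serve_counts(ready: &[bool], rounds: usize, rotate: bool) -> Vec<usize> {
    let mut counts = vec![0; ready.len()];
    let mut start = 0;
    for _ in 0..rounds {
        let picked = if rotate {
            pick_rotating(ready, start)
        } else {
            pick_biased(ready)
        };
        if let Some(i) = picked {
            counts[i] += 1;
            // The next scan begins just after the source that was served.
            start = (i + 1) % ready.len();
        }
    }
    counts
}
```

With three sources that are always ready, the biased picker serves only the first one, while the rotating picker serves each of them equally often.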

@teor2345 teor2345 marked this pull request as ready for review March 28, 2021 22:44
@teor2345
Contributor Author

@dconnolly or @oxarbitrage this fix is now ready for review.

Let me know if you want to do a walkthrough.

(I tried to split this PR, but unfortunately these changes all depend on each other and modify similar files.)

oxarbitrage
oxarbitrage previously approved these changes Apr 6, 2021
Contributor

@oxarbitrage oxarbitrage left a comment

I think it looks great. I made a few minor suggestions.

The `select` function is biased towards its first argument, risking
starvation.

As a side-benefit, this change also makes the code a lot easier to read
and maintain.

This refactor makes the code a bit easier to read, at the cost of
sometimes blocking the crawler on `candidates.next()`.

That's ok, because `next` only has a short (< 100 ms) delay. And we're
just about to spawn a separate task for each handshake.

This change avoids deadlocks by letting each handshake make progress
independently.

This refactor improves readability.

And document the correctness of the new code.
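The "separate task for each handshake" idea in the commit messages above can be sketched like this. It is a std-only stand-in that uses OS threads and channels in place of async tasks, and it is not the code in this PR: `run_handshakes`, `stalled_peer`, and the gate channel are all hypothetical names for the example. One simulated handshake is deliberately held back, and the others still complete because nothing waits on it.

```rust
use std::sync::mpsc;
use std::thread;

/// Run one simulated handshake per peer, each on its own thread.
///
/// The handshake for `stalled_peer` blocks until the collector releases it,
/// which happens only after every other handshake has reported completion.
/// Returns the peers in the order their handshakes completed.
pub fn run_handshakes(peers: &[u32], stalled_peer: u32) -> Vec<u32> {
    let (done_tx, done_rx) = mpsc::channel();
    let (release_tx, release_rx) = mpsc::channel::<()>();
    let mut release_rx = Some(release_rx);
    let mut handles = Vec::new();

    for &peer in peers {
        let done_tx = done_tx.clone();
        // Only the first occurrence of the stalled peer gets the gate.
        let gate = if peer == stalled_peer { release_rx.take() } else { None };
        handles.push(thread::spawn(move || {
            if let Some(gate) = gate {
                // Simulates a handshake that hangs until something external happens.
                let _ = gate.recv();
            }
            done_tx.send(peer).expect("collector outlives the workers");
        }));
    }
    // Drop the original sender so `done_rx.iter()` ends once all workers finish.
    drop(done_tx);

    let healthy = peers.iter().filter(|&&p| p != stalled_peer).count();
    let mut order = Vec::new();
    for _ in 0..healthy {
        order.push(done_rx.recv().expect("healthy handshakes complete"));
    }
    // Every healthy handshake finished while the stalled one was still blocked.
    let _ = release_tx.send(());
    order.extend(done_rx.iter());

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    order
}
```

If the same handshakes ran one after another in a single loop, a stalled peer placed early in the list would block every peer after it.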
@teor2345
Copy link
Contributor Author

teor2345 commented Apr 7, 2021

Squashed fixup commits to make the review easier.
Rebased to fix build errors.

@oxarbitrage oxarbitrage merged commit 375c8d8 into ZcashFoundation:main Apr 7, 2021
Comment on lines 343 to 345
a = handshakes.next() => a.expect("handshakes never terminates, because it contains a future that never resolves"),
a = crawl_timer.next() => a.expect("crawl_timer never terminates"),
a = demand_rx.next() => a.expect("demand_rx never fails, because we hold demand_tx"),
Contributor Author

Rename `a` to represent what is actually being returned.

Labels
  • A-rust (Area: Updates to Rust code)
  • C-bug (Category: This is a bug)
  • C-enhancement (Category: This is an improvement)
  • C-security (Category: Security issues)
  • I-hang (A Zebra component stops responding to requests)
  • I-heavy (Problems with excessive memory, disk, or CPU usage)
  • I-integration-fail (Continuous integration fails, including build and test failures)
  • I-slow (Problems with performance or responsiveness)
  • I-usability (Zebra is hard to understand or use)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

  • Zebra lags behind the best tip on testnet
  • Zebra gradually loses peer connections, until it has none left
2 participants