-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a deadlock between the crawler and dialer, and other hangs #1950
Conversation
I'm going to put this PR in draft until I've tested a few full syncs over the weekend. Feel free to review it or run it yourselves - I don't expect to be changing it much. |
Coverage is failing due to type errors in a dependency, maybe the most recent nightly has a bug? I'll open a PR on Monday if no-one else gets to it first. |
I added a small change that avoids starvation when using the |
@dconnolly or @oxarbitrage this fix is now ready for review. Let me know if you want to do a walkthrough. (I tried to split this PR, but unfortunately these changes all depend on each other and modify similar files.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it looks great, i made a few minor suggestions.
The `select` function is biased towards its first argument, risking starvation. As a side-benefit, this change also makes the code a lot easier to read and maintain.
This refactor makes the code a bit easier to read, at the cost of sometimes blocking the crawler on `candidates.next()`. That's ok, because `next` only has a short (< 100 ms) delay. And we're just about to spawn a separate task for each handshake.
This change avoids deadlocks by letting each handshake make progress independently.
This refactor improves readability.
And document the correctness of the new code.
b6c0a43
to
b51070e
Compare
Squashed fixup commits to make the review easier. |
a = handshakes.next() => a.expect("handshakes never terminates, because it contains a future that never resolves"), | ||
a = crawl_timer.next() => a.expect("crawl_timer never terminates"), | ||
a = demand_rx.next() => a.expect("demand_rx never fails, because we hold demand_tx"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename a
to represent what is actually being returned
Motivation
In #1905, Zebra gradually loses peer connections, until there are none left. Then it stops making sync progress.
When this issue happens, a large number of peers are stuck in the
attempt_pending
state, despite the handshake timeout.This issue also impacted the
peer::Connection
refactor in #1817, and a bunch of upcoming zebra-network security fixes.Solution
select!
macro in the crawler to avoid starvationselect
functionselect
functionThe code in this pull request has:
Review
@dconnolly or @oxarbitrage can review - whoever is available.
I'd like to get this merged in the next few days. It doesn't directly conflict with the other security fixes, but it will make them harder to test.
Related Issues
Closes #1905 (based on 3 mainnet and 3 testnet nodes x 3 days of testing).
Closes #1941 (based on 3 testnet nodes x 3 days of testing).
Possibly closes some other related hang or reliability issues.
Deals with part of the #1892 and #1657 refactors.