fix(net): Fix potential network hangs, and reduce code complexity #7859

Merged
mergify bot merged 27 commits into main from peerset-poll on Nov 16, 2023

teor2345
Contributor

@teor2345 teor2345 commented Oct 27, 2023

Motivation

Zebra's peer handling code is complex, and it contains some potential hangs. This PR fixes some hangs, and simplifies the code.

Follow up to #7772
Close #7858

Specifications

Rust futures wake a task when any waker cloned from the most recently passed Context is woken:

When a future is not ready yet, poll returns Poll::Pending and stores a clone of the Waker copied from the current Context. This Waker is then woken once the future can make progress.
...
Note that on multiple calls to poll, only the Waker from the Context passed to the most recent call should be scheduled to receive a wakeup.

https://doc.rust-lang.org/std/future/trait.Future.html#tymethod.poll
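
The waker rule quoted above can be illustrated with a minimal sketch. It is not Zebra code: `Shared`, `Flag`, `complete`, `CountingWaker`, and `wake_counts` are hypothetical names, and only the standard library is assumed. The future replaces its stored waker on every `poll`, so only the waker from the most recent `Context` is woken.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical state shared between a future and whatever completes it.
#[derive(Default)]
struct Shared {
    done: bool,
    waker: Option<Waker>,
}

/// Hypothetical future that becomes ready once `Shared::done` is set.
struct Flag(Arc<Mutex<Shared>>);

impl Future for Flag {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut shared = self.0.lock().unwrap();
        if shared.done {
            return Poll::Ready(());
        }
        // Always overwrite the stored waker: only the waker from the most
        // recent `poll` call should receive the wakeup.
        shared.waker = Some(cx.waker().clone());
        Poll::Pending
    }
}

/// Marks the flag as done and wakes the most recently stored waker, if any.
fn complete(shared: &Arc<Mutex<Shared>>) {
    let waker = {
        let mut state = shared.lock().unwrap();
        state.done = true;
        state.waker.take()
    };
    if let Some(waker) = waker {
        waker.wake();
    }
}

/// Stand-in for a task: counts how many times it was woken.
#[derive(Default)]
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Polls the future with two different wakers, completes it, and returns
/// the number of wakeups seen by the (first, second) waker.
fn wake_counts() -> (usize, usize) {
    let shared = Arc::new(Mutex::new(Shared::default()));
    let mut fut = Flag(shared.clone());
    let first = Arc::new(CountingWaker::default());
    let second = Arc::new(CountingWaker::default());

    for counter in [&first, &second] {
        let waker = Waker::from(counter.clone());
        let mut cx = Context::from_waker(&waker);
        assert!(Pin::new(&mut fut).poll(&mut cx).is_pending());
    }

    complete(&shared);
    (first.0.load(Ordering::SeqCst), second.0.load(Ordering::SeqCst))
}
```

In this sketch only the second waker is woken. A future that kept the first waker instead would leave the task that polled it last asleep, which is one way a hang can arise.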

Complex Code or Requirements

This PR contains a lot of refactors to make future changes easier.
It also eliminates some wakers entirely to simplify the code and waking logic.

Solution

Bug fixes:

  • Pass context correctly to the peer set background task oneshot. (We're not waking on it now, but we might want to in future.)
  • Pass context correctly to the peer client task, and the peer client request sender, so they always wake the client task. (The client task is the task that is spawned by the peer set buffer.)
  • Check all ready peers for errors before sending requests to them.
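
As a rough illustration of what "passing context correctly" means in general (not the actual Zebra change), the sketch below compares forwarding the caller's `Context` to an inner future with polling it under an unrelated one. All names (`YieldOnce`, `poll_forwarding`, `poll_detached`, `caller_wakeups`) are hypothetical, and only the standard library is assumed.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical inner future: asks to be polled again once, then completes.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            return Poll::Ready(());
        }
        self.yielded = true;
        // Request another poll from whichever task owns this context.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Stand-in for the calling task: counts the wakeups it receives.
#[derive(Default)]
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Waker that silently drops every wakeup.
struct Discard;

impl Wake for Discard {
    fn wake(self: Arc<Self>) {}
}

/// Forwards the caller's context, so the inner future wakes the caller.
fn poll_forwarding(inner: &mut YieldOnce, cx: &mut Context<'_>) -> Poll<()> {
    Pin::new(inner).poll(cx)
}

/// Polls with an unrelated context: the caller is never woken and can hang.
fn poll_detached(inner: &mut YieldOnce, _cx: &mut Context<'_>) -> Poll<()> {
    let unrelated = Waker::from(Arc::new(Discard));
    let mut other_cx = Context::from_waker(&unrelated);
    Pin::new(inner).poll(&mut other_cx)
}

/// Returns how many wakeups the caller's waker saw after one pending poll.
fn caller_wakeups(
    poll_fn: impl Fn(&mut YieldOnce, &mut Context<'_>) -> Poll<()>,
) -> usize {
    let counter = Arc::new(CountingWaker::default());
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut inner = YieldOnce { yielded: false };
    assert!(poll_fn(&mut inner, &mut cx).is_pending());
    counter.0.load(Ordering::SeqCst)
}
```

Here `caller_wakeups(poll_forwarding)` observes one wakeup and `caller_wakeups(poll_detached)` observes none, even though both return `Poll::Pending` to the caller.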

Code simplification:

  • Remove preselected peers, just select the peer when we have a new request. This makes waking much less complex.
  • Use futures::ready! (the macro) and ? to simplify polling
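
The `ready!` plus `?` pattern can be sketched as follows. To stay dependency-free, this sketch assumes `std::task::ready!`, the standard library's equivalent of `futures::ready!`; `PeerError`, `Source`, `poll_doubled`, and `run` are hypothetical names, not Zebra APIs.

```rust
use std::collections::VecDeque;
use std::sync::Arc;
use std::task::{ready, Context, Poll, Wake, Waker};

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
struct PeerError(&'static str);

type Step = Poll<Result<u32, PeerError>>;

/// Hypothetical readiness source that replays a scripted list of results.
struct Source {
    script: VecDeque<Step>,
}

impl Source {
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Step {
        let step = self.script.pop_front().unwrap_or(Poll::Pending);
        if step.is_pending() {
            // Honour the poll contract: arrange a wakeup before returning Pending.
            cx.waker().wake_by_ref();
        }
        step
    }
}

/// Hand-written matching on every poll outcome.
fn poll_doubled_manual(src: &mut Source, cx: &mut Context<'_>) -> Step {
    match src.poll_next(cx) {
        Poll::Pending => Poll::Pending,
        Poll::Ready(Err(e)) => Poll::Ready(Err(e)),
        Poll::Ready(Ok(v)) => Poll::Ready(Ok(v * 2)),
    }
}

/// Same behaviour: `ready!` returns early on Pending, `?` returns early on Err.
fn poll_doubled(src: &mut Source, cx: &mut Context<'_>) -> Step {
    let v = ready!(src.poll_next(cx))?;
    Poll::Ready(Ok(v * 2))
}

/// Waker that ignores wakeups; enough to drive the sketch by hand.
struct Discard;

impl Wake for Discard {
    fn wake(self: Arc<Self>) {}
}

/// Polls once per scripted step and collects the results.
fn run(
    poll_fn: impl Fn(&mut Source, &mut Context<'_>) -> Step,
    script: Vec<Step>,
) -> Vec<Step> {
    let waker = Waker::from(Arc::new(Discard));
    let mut cx = Context::from_waker(&waker);
    let steps = script.len();
    let mut src = Source { script: script.into() };
    (0..steps).map(|_| poll_fn(&mut src, &mut cx)).collect()
}
```

Both functions produce the same results for the same script; the second simply has fewer branches to get wrong.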

Refactors:

  • Refactor polling methods to return Poll, and to return Ok when they do some work
  • Make panic checking method names clearer (this can be split into its own PR)

Testing

The existing tests cover this behaviour fairly well. It is hard to check for hangs in tests.

Some tests have been updated for the new polling behaviour.

Review

This might still need some test fixes.

Reviewer Checklist

  • Will the PR name make sense to users?
    • Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • Are the PR labels correct?
  • Does the code do what the ticket and PR says?
    • Does it change concurrent code, unsafe code, or consensus rules?
  • How do you know it works? Does it have tests?

Follow Up Work

@teor2345 teor2345 added the C-bug (Category: This is a bug), P-Medium ⚡, I-hang (A Zebra component stops responding to requests), A-network (Area: Network protocol updates or fixes), and I-remote-trigger (Remote nodes can make Zebra do something bad) labels Oct 27, 2023
@teor2345 teor2345 self-assigned this Oct 27, 2023
@teor2345 teor2345 changed the title from "fix(network): Wake network tasks that are waiting on an empty peerset to get more peers" to "fix(network): When there are no peer connections, and new peers arrive, wake tasks that are waiting for new peers" Oct 27, 2023
@arya2 arya2 requested review from arya2, oxarbitrage and upbqdn and removed request for upbqdn and oxarbitrage October 27, 2023 18:03
@teor2345 teor2345 added the do-not-merge Tells Mergify not to merge this PR label Oct 30, 2023
@teor2345 teor2345 changed the title from "fix(network): When there are no peer connections, and new peers arrive, wake tasks that are waiting for new peers" to "fix(network): Fix potential network hangs, and reduce code complexity" Nov 6, 2023
@teor2345 teor2345 force-pushed the peerset-poll branch 2 times, most recently from 2a92792 to 21b5c39 November 14, 2023 02:42
@teor2345
Contributor Author

This should be ready to go now. There was a minor issue with the Update future for the inventory registry, but it seemed like a good idea to make all the poll methods consistent.

@teor2345
Contributor Author

Does this need a full sync before it merges, or can we wait until Friday's scheduled full sync?

If we do one now, the previous full sync took 2d 3h 15m:
https://github.com/ZcashFoundation/zebra/actions/runs/6824538443/job/18561008809

arya2
arya2 previously approved these changes Nov 16, 2023
Contributor

@arya2 arya2 left a comment

Does this need a full sync before it merges, or can we wait until Friday's scheduled full sync?

I think we can wait until the next scheduled full sync.

Three resolved review threads on zebra-network/src/peer_set/set.rs (one outdated)
Co-authored-by: Arya <aryasolhi@gmail.com>
@mergify mergify bot merged commit d689e73 into main Nov 16, 2023
104 checks passed
@mergify mergify bot deleted the peerset-poll branch November 16, 2023 19:53
Labels
A-network (Area: Network protocol updates or fixes), C-bug (Category: This is a bug), I-hang (A Zebra component stops responding to requests), I-remote-trigger (Remote nodes can make Zebra do something bad)
Development

Successfully merging this pull request may close these issues.

security: avoid hangs in the peer set and related code
2 participants