Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop ignoring some connection errors that could make the peer set hang #3200

Merged
merged 18 commits into from
Dec 15, 2021

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented Dec 13, 2021

Motivation

When I was testing PR #3167, the peer set would hang during the initial block download. This is a bug is in the main branch, but it's triggered more by some PRs (and on some machines).

So I fixed some known peer set bugs, and wrote some basic peer set tests.

Solution

Peer Set service disconnection:

Heartbeat task errors:

  • Exit the client task if the heartbeat task exits
  • Return a Client error if the error slot has an error
  • Close all peer senders and stop the heartbeat task, when the Client errors or is dropped
  • Added tests for these changes, and other Client errors

Client errors:

Connection errors:

  • Run the same shutdown code regardless of how the Connection was errored, closed, or dropped
  • Always close the request channel before draining it
  • Added tests for some of these changes (and opened tickets for more tests)

Connection state:

  • Drop the request timer on error
  • Didn't add tests: trivial changes

Review

Anyone can review this PR, but @jvff knows about ticket #3130.

This PR is based on PR #3203.

Reviewer Checklist

  • Code implements Specs and Designs
  • Tests for Expected Behaviour
  • Tests for Errors

Follow Up Work

Connection:

Peer set:

Testing:

Sorry, something went wrong.

@teor2345 teor2345 added C-bug Category: This is a bug P-Medium I-hang A Zebra component stops responding to requests I-slow Problems with performance or responsiveness I-usability Zebra is hard to understand or use A-network Area: Network protocol updates or fixes A-diagnostics Area: Diagnosing issues or monitoring performance labels Dec 13, 2021
@teor2345 teor2345 added this to the 2021 Sprint 24 milestone Dec 13, 2021
@teor2345 teor2345 requested a review from jvff December 13, 2021 02:31
@teor2345 teor2345 self-assigned this Dec 13, 2021
jvff
jvff previously approved these changes Dec 13, 2021
Copy link
Contributor

@jvff jvff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I've only suggested two ideas, but they are optional.

@mpguerra mpguerra removed this from the 2021 Sprint 24 milestone Dec 13, 2021
@teor2345 teor2345 marked this pull request as draft December 13, 2021 22:51
@teor2345

This comment has been minimized.

@teor2345 teor2345 mentioned this pull request Dec 14, 2021
2 tasks
@teor2345 teor2345 changed the title Fix and diagnose peer set hangs Stop ignoring some connection errors that could make the peer set hang Dec 14, 2021
@teor2345 teor2345 removed A-diagnostics Area: Diagnosing issues or monitoring performance I-usability Zebra is hard to understand or use labels Dec 14, 2021
@teor2345 teor2345 changed the base branch from main to connection-diagnostics December 14, 2021 02:20
@teor2345 teor2345 marked this pull request as ready for review December 14, 2021 02:21
@teor2345 teor2345 marked this pull request as draft December 14, 2021 08:40
@teor2345 teor2345 added P-High and removed P-Medium labels Dec 14, 2021
@teor2345 teor2345 mentioned this pull request Dec 14, 2021
3 tasks
Base automatically changed from connection-diagnostics to main December 14, 2021 21:11
@teor2345 teor2345 added the I-integration-fail Continuous integration fails, including build and test failures label Dec 15, 2021
@teor2345 teor2345 marked this pull request as ready for review December 15, 2021 06:17
@teor2345 teor2345 requested review from conradoplg and oxarbitrage and removed request for conradoplg December 15, 2021 06:18
@teor2345
Copy link
Contributor Author

There are a lot of code changes here, so could you take a look as well @oxarbitrage ?

Copy link
Collaborator

@conradoplg conradoplg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really familiar with this part of the code but it seems good, I just found a small typo. But I'll defer the approval to @oxarbitrage

conradoplg and others added 2 commits December 15, 2021 11:20
Co-authored-by: Conrado Gouvea <conrado@zfnd.org>
Copy link
Contributor

@oxarbitrage oxarbitrage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks good and has a lot of coverage. Could be separated in 2 or 3 PRs but we want to place this changes in main fast to speed up the sync and i think is ok to do so in this case.

@oxarbitrage oxarbitrage enabled auto-merge (squash) December 15, 2021 14:35
@oxarbitrage oxarbitrage merged commit f176bb5 into main Dec 15, 2021
@oxarbitrage oxarbitrage deleted the peer-set-hangs branch December 15, 2021 14:52
@teor2345 teor2345 mentioned this pull request Dec 21, 2021
26 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-network Area: Network protocol updates or fixes C-bug Category: This is a bug I-hang A Zebra component stops responding to requests I-integration-fail Continuous integration fails, including build and test failures I-slow Problems with performance or responsiveness
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UnreadyService ignores dropped cancel handles Stop panicking if a connection has multiple failures
5 participants