Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix candidate set address state handling #1709

Merged
merged 2 commits into from
Feb 18, 2021

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented Feb 8, 2021

Motivation

The CandidateSet address books: disconnected, gossiped, and failed, are not disjoint. But the module documentation says they should be.

The overall peer state handling is also a bit broken and inconsistent - it looks like peers can get stuck in particular states (perhaps related to #1633 or #1435).

Solution

Design:

  • Redesign AddressBook by deleting duplicate indexes, rather than maintaining complex runtime invariants
  • Add a PeerConnectionState to each MetaAddr
  • Use a single peer set for all peers, regardless of state
  • Implement time-based liveness as an AddressBook method, rather than a PeerConnectionState variant

Implementation:

  • Simplify the AddressBook iterator implementation, replacing it with methods that are more obviously correct
  • Simplify AddressBook changes using update and take modifier methods
  • Consistently collect peer metrics

Documentation:

  • Expand and update the peer set documentation

We can optimise later, but for now we want simple code that is more obviously correct.

The code in this pull request has:

  • Documentation Comments
  • Integration Tests

Review

I don't know if @yaahc or @dconnolly is more familiar with this code, but it's a big enough change that everyone should probably take a look.

This rewrite might solve a whole bunch of our network problems, so it's a high priority. But the review is not urgent.

Related Issues

Closes #1707.
Closes #1435, according to our testing - Zebra hasn't hung over several weeks of cumulative node runtime.

Follow Up Work

Fix the Client state handling bug (#1599)

Performance/Security:

  • limit the total address book size to maintain performance, discarding the oldest addresses once the limit is reached
    • 2x the peer set target size should be more than sufficient for our needs - this is close to the current number of active mainnet peers, so the code actually gets exercised on a regular basis

Security:

  • handle future peer-provided last_seen times by subtracting the local/remote time delta from all the times in that peer's response
    • use relative skew to avoid assuming anything about the correctness of the local or remote clocks (the security analysis and testing done absolute times can get quite complex)
    • check for anywhere else in the protocol that uses peer-provided times or other untrusted/unverified data
  • filter out known addresses, then limit the number of new addresses used from each request to target peer size/3
    • with a minimum of 3, so we almost always get addresses from multiple nodes, because some nodes are dual stack
    • this avoids having a single peer provide our entire set of peer connections (or our entire address book)

Network health:

  • in response to peer requests, only provide peers we've actually connected to, to avoid gossiping unverified peer addresses, last_seen times, or services
    • decide if we want to send live or Responded peers
    • think about what happens for IPv4-only or IPv6-only peers
  • don't gossip peers that are older than a day or so - some NAT / dynamic DNS networks change addresses every day, or even multiple times a day (laptops/mobiles)

Correctness:

  • reject updates that move existing peers into the AssumedDead and NeverAttempted states. (Updates can only be AttemptPending, AssumedAlive, or Failed.)
  • ensure that all inserts of new peers are in the NeverAttempted state.

Refactors:

  • Implement time-based liveness as an AssumedDead state, via update and take inner type methods
    • make the MetaAddr state into a template parameter, which defaults to the outer state enum
  • make future peer-provided last_seen times into a parse error, if possible, so API users have to handle them

Features:

  • Open a follow-up ticket to set up a persistent peer address book cache using RocksDB?

@teor2345 teor2345 added C-bug Category: This is a bug A-docs Area: Documentation C-design Category: Software design work A-rust Area: Updates to Rust code C-cleanup Category: This is a cleanup P-High labels Feb 8, 2021
@teor2345 teor2345 added this to the 2021 Sprint 3 milestone Feb 8, 2021
@teor2345 teor2345 self-assigned this Feb 8, 2021
@teor2345 teor2345 changed the title Candidate set disjoint Fix candidate set peer reconnection state handling Feb 8, 2021
@teor2345

This comment has been minimized.

@mpguerra

This comment has been minimized.

@mpguerra mpguerra modified the milestones: 2021 Sprint 3, 2021 Sprint 2 Feb 8, 2021
zebra-network/src/address_book.rs Outdated Show resolved Hide resolved
zebra-network/src/address_book.rs Outdated Show resolved Hide resolved
@teor2345 teor2345 modified the milestones: 2021 Sprint 2, 2021 Sprint 3 Feb 9, 2021
@teor2345 teor2345 force-pushed the candidate-set-disjoint branch 3 times, most recently from 76fe01d to 40ae4e5 Compare February 12, 2021 03:37
@teor2345
Copy link
Contributor Author

@dconnolly @yaahc I've made the required design changes from the first review.

I'd like to get this reviewed and merged soon, so we can get it into the next alpha.

There's a bunch of other refactors, performance improvements, and security fixes we could do. But they're not required to fix the hangs. So I think we should open separate tickets for them, and schedule them in future sprints.

@teor2345
Copy link
Contributor Author

teor2345 commented Feb 16, 2021

I've rebased on #1750 to silence a bunch of clippy lints.

@teor2345 teor2345 force-pushed the candidate-set-disjoint branch 2 times, most recently from daaafcb to cebbe3f Compare February 16, 2021 07:49
@teor2345 teor2345 changed the title Fix candidate set peer reconnection state handling Fix candidate set address state handling Feb 16, 2021
@teor2345
Copy link
Contributor Author

The CI failure is #1730.

@teor2345 teor2345 added the I-hang A Zebra component stops responding to requests label Feb 17, 2021
dconnolly
dconnolly previously approved these changes Feb 17, 2021
Design:
- Add a `PeerConnectionState` to each `MetaAddr`
- Use a single peer set for all peers, regardless of state
- Implement time-based liveness as an `AddressBook` method, rather than
  a `PeerConnectionState` variant
- Delete AddressBook.by_state

Implementation:
- Simplify `AddressBook` changes using `update` and `take` modifier
  methods
- Simplify the `AddressBook` iterator implementation, replacing it with
  methods that are more obviously correct
- Consistently collect peer set metrics

Documentation:
- Expand and update the peer set documentation

We can optimise later, but for now we want simple code that is more
obviously correct.
@teor2345
Copy link
Contributor Author

Testnet is unstable, we'll fix that in #1222

@teor2345
Copy link
Contributor Author

Merging because @dconnolly approved and asked for a rebase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-docs Area: Documentation A-rust Area: Updates to Rust code C-bug Category: This is a bug C-cleanup Category: This is a cleanup C-design Category: Software design work I-hang A Zebra component stops responding to requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CandidateSet address books are not disjoint Peer connections hang temporarily or fail permanently
4 participants