
Justify our alternative to "evicting pre-upgrade peers from the peer set across a network upgrade" #706

Closed
1 of 4 tasks
hdevalence opened this issue Jul 21, 2020 · 7 comments · Fixed by #3181
Labels
A-rust Area: Updates to Rust code C-enhancement Category: This is an improvement C-security Category: Security issues I-heavy Problems with excessive memory, disk, or CPU usage I-slow Problems with performance or responsiveness S-needs-investigation Status: Needs further investigation

Comments


hdevalence commented Jul 21, 2020

TODO

Write tests or documentation for:

Motivation

Originally posted by @hdevalence in #702

#702 added a comment that we should attempt to evict pre-upgrade peers from the peer set and prefer choosing post-upgrade ready peers for queries.

We should be careful to handle everything a network upgrade requires, and eviction is worth considering (I presume this matches zcashd behaviour?). On the other hand, there are a few reasons to prefer not to do this:

  • in general, it's some additional code for us to write;
  • we would have to build additional plumbing to connect information about peer versions to the eviction logic;
  • we would have to change the details of the load balancing mechanism and build plumbing to connect peer versions to the load balancer;

I think that if we do nothing, we'll probably be fine:

  • we can close connections to peers that don't upgrade;
  • we connect to a very large number of peers, so we can afford to close many connections to peers that don't upgrade;
  • if all of the peers we were connected to didn't upgrade, we're in a network full of bad peers anyway, so trying to filter them would just bring the failure forward.

Does this analysis seem right, or am I missing some important details?

Related Issues

#1334 disconnecting old peers after network upgrade activation
#337 ZIP-200: Network Upgrade Mechanism

@hdevalence

cc @teor2345


str4d commented Jul 21, 2020

The zcashd behaviour is specified in ZIP 201.

@hdevalence

Thanks for the pointer! It looks like that spec is tied closely to the details of the zcashd network stack, which we don't replicate in Zebra, so I think we still need to work out a good answer to these questions.


teor2345 commented Jul 22, 2020

I think there are a few different questions here:

  1. How can Zebra prefer new peers before the activation height?
  2. How can Zebra disconnect old peers with existing connections at the activation height?
  3. Do we need to prefer new peers before the activation height?
  4. Can we guarantee that Zebra nodes always have a large number of peers?

Q 1. Here's a quick design sketch for preferring new peers:

  • for every peer that connects, if it is running an older version, reject the connection with some probability P. For example, we could set P to 0.5 just before the activation height (starting ~1.5 days before), and 1.0 at the activation height.

We might need to artificially increase the peer set churn, if long-lived connections to peers can last longer than 1.5 days. (We could lower the maximum peer connection time, or randomly disconnect peers.)
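As a sketch of this Q1 ramp, a hypothetical rejection-probability function might look like the following; the block counts, target spacing, and function name are illustrative assumptions, not Zebra's actual code:

```rust
/// Hypothetical probability of rejecting a connection from a pre-upgrade
/// peer, given the current and activation heights. The ramp (0.5 within
/// ~1.5 days of activation, 1.0 at activation) follows the sketch above.
fn old_peer_rejection_probability(current_height: u32, activation_height: u32) -> f64 {
    // ~1.5 days of blocks at an assumed 75-second target spacing.
    const RAMP_BLOCKS: u32 = 1728;

    if current_height >= activation_height {
        // At or after activation: always reject old peers.
        1.0
    } else if activation_height - current_height <= RAMP_BLOCKS {
        // Just before activation: reject old peers half the time.
        0.5
    } else {
        // Well before activation: accept old peers normally.
        0.0
    }
}

fn main() {
    // 1,000 blocks before activation is inside the ~1.5 day ramp window.
    println!("{}", old_peer_rejection_probability(499_000, 500_000));
}
```

A smooth ramp (linear in the remaining blocks) would also fit the sketch; the step function just keeps the example simple.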

Q 2. Here's a quick design sketch for disconnecting old peers:

  • for every peer, record its protocol version
  • at the activation height, disconnect old peers
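A minimal sketch of this Q2 design, assuming a hypothetical `PeerId` alias and placeholder protocol version numbers (none of these are Zebra's actual types):

```rust
use std::collections::HashMap;

/// Illustrative stand-in for a real peer connection handle.
type PeerId = u64;

/// Drop every tracked peer whose recorded protocol version is below the
/// minimum required after activation, returning the evicted peer ids.
fn evict_old_peers(
    peers: &mut HashMap<PeerId, u32>, // peer -> negotiated protocol version
    min_version: u32,
) -> Vec<PeerId> {
    // Collect the ids of peers below the minimum version...
    let evicted: Vec<PeerId> = peers
        .iter()
        .filter(|&(_, &version)| version < min_version)
        .map(|(&id, _)| id)
        .collect();
    // ...then remove them from the peer set.
    for id in &evicted {
        peers.remove(id);
    }
    evicted
}

fn main() {
    let mut peers = HashMap::from([(1u64, 170_100u32), (2, 170_015)]);
    let evicted = evict_old_peers(&mut peers, 170_100);
    println!("evicted peers: {evicted:?}");
}
```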

Q 3. If we are in a network with a large proportion of old peers, we want to cleanly separate from them at the activation height. Preferring new peers makes us gradually discover new peers, via the old peers we are connected to. So we gradually increase our chances of being connected to new peers.

Let's assume:

  • We are in a large network (compared with our peer set size), with a large proportion of old peers
  • By chance, we happen to mostly be connected to old peers

Here are a few different scenarios:

Scenario A:
  1. We struggle to download the activation block, because old peers won't validate it.
  2. The network partitions, and mines several different activation blocks, increasing the chance of chain reorgs across the activation height.

Scenario B:
  1. We abruptly disconnect from all the old peers, as soon as we validate the activation block.
  2. Every Zebra node suddenly goes looking for new peers, leading to a spike in network load.

Scenario C:
  1. (same as B)
  2. There is a chain reorganisation across the network upgrade height, so we suddenly accept old peers again, leading to thrashing.

Scenario D:
  1. We disconnect from old peers, isolating ourselves in a small network of new peers.
  2. We are vulnerable to an eclipse attack, which undermines the earlier assumption that "we connect to a very large number of peers, so we can afford to close many connections to peers that don't upgrade".

Q 4. I don't think we can guarantee that Zebra will always have a large number of peers.

Some users will want a small number of peers. For example, connecting to a single trusted node can be useful for:

  • wallet support,
  • network-level anonymity, or
  • sub-instances of a miner or exchange.

At the moment, we don't enforce having a large number of peers: users can set the peer limits as low as they like. (We don't seem to support a very low number of peers right now, but that's a separate issue; see #704.)

We could:

  • warn if the peer limit is low, or
  • warn if the actual number of peers is low.

But people don't read warnings, so we could:

  • shut down if the number of peers is very low, or
  • require users to configure a trusted peer mode to disable warnings or shutdowns.

I think this complicates the design, but for many users, a node shutdown is better than being attacked.

But I think it's easier for us to make sure Zebra has good peers.
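The warn-or-shutdown policy sketched above could look like the following; the thresholds and the `trusted_peer_mode` flag are illustrative assumptions, not an agreed design:

```rust
/// What the node should do about its current peer count.
#[derive(Debug, PartialEq)]
enum PeerCountAction {
    Ok,
    Warn,
    ShutDown,
}

/// Hypothetical policy: warn when the peer count is low, shut down when it
/// is very low, unless the user has explicitly opted into a trusted peer
/// mode. The threshold values are placeholders.
fn peer_count_action(peer_count: usize, trusted_peer_mode: bool) -> PeerCountAction {
    const WARN_BELOW: usize = 8;
    const SHUTDOWN_BELOW: usize = 2;

    if trusted_peer_mode {
        // Users who deliberately run with few peers opt out of the checks.
        PeerCountAction::Ok
    } else if peer_count < SHUTDOWN_BELOW {
        PeerCountAction::ShutDown
    } else if peer_count < WARN_BELOW {
        PeerCountAction::Warn
    } else {
        PeerCountAction::Ok
    }
}

fn main() {
    // A node with only 3 peers, and no trusted peer mode configured.
    println!("{:?}", peer_count_action(3, false));
}
```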

@dconnolly dconnolly added A-rust Area: Updates to Rust code C-design Category: Software design work S-needs-design Status: Needs a design decision labels Aug 27, 2020
@teor2345 teor2345 added this to the Sync and Network milestone Sep 29, 2020
@hdevalence

> Q 4. I don't think we can guarantee that Zebra will always have a large number of peers.
>
> Some users will want a small number of peers. For example, connecting to a single trusted node can be useful for:
>
> * wallet support,
> * network-level anonymity, or
> * sub-instances of a miner or exchange.

Supporting arbitrarily few peers was a non-goal of the architecture of the peer set (and the network stack generally), which aimed to treat peer connections as "cattle, not pets" (regrettable phrasing notwithstanding). So although some users may want a small number of peers, we should not seek to support that use-case.


teor2345 commented Mar 8, 2021

> > Q 4. I don't think we can guarantee that Zebra will always have a large number of peers.
> >
> > Some users will want a small number of peers. For example, connecting to a single trusted node can be useful for:
> >
> > * wallet support,
> > * network-level anonymity, or
> > * sub-instances of a miner or exchange.
>
> Supporting arbitrarily few peers was a non-goal of the architecture of the peer set (and the network stack generally), which aimed to treat peer connections as "cattle, not pets" (regrettable phrasing notwithstanding). So although some users may want a small number of peers, we should not seek to support that use-case.

Security arguments should be based on threat models, not supported use-cases.

"A Zebra node has a small number of peers" is something that can happen due to network conditions or a deliberate attack. (Even if we don't support configurations that deliberately seek to reduce the number of peers, they should be as secure as reasonably possible.)

@teor2345 teor2345 added C-bug Category: This is a bug C-security Category: Security issues I-heavy Problems with excessive memory, disk, or CPU usage I-slow Problems with performance or responsiveness I-unbounded-growth Zebra keeps using resources, without any limit labels Mar 8, 2021

teor2345 commented Mar 8, 2021

I don't think we need to make any code changes for this ticket, as long as the changes in other tickets satisfy the following conditions:

  • network convergence: Zebra knows the addresses of a large number of upgraded peers
    • Zebra doesn't need to be connected to these addresses at the network upgrade, and they don't need to be a large proportion of its address book - they just need to be in the address book
  • starvation: Zebra attempts other addresses before re-trying rejected or disconnected addresses
  • connection flooding / starvation: Zebra rate-limits closing existing connections

These changes should be made in #1334 and various peer set security tickets.
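The "rate-limits closing existing connections" condition could be sketched as a minimal limiter that allows at most one disconnect per interval, so nodes drain old peers gradually instead of all at once. The type name and interval are hypothetical, not Zebra's actual implementation:

```rust
use std::time::{Duration, Instant};

/// Hypothetical rate limiter: permits at most one peer disconnect
/// per `min_interval`.
struct DisconnectLimiter {
    min_interval: Duration,
    last_disconnect: Option<Instant>,
}

impl DisconnectLimiter {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_disconnect: None }
    }

    /// Returns true if a disconnect is allowed at `now`, recording it if so.
    fn try_disconnect(&mut self, now: Instant) -> bool {
        match self.last_disconnect {
            // Too soon since the last disconnect: keep the connection open.
            Some(last) if now.duration_since(last) < self.min_interval => false,
            // Otherwise, allow the disconnect and record its time.
            _ => {
                self.last_disconnect = Some(now);
                true
            }
        }
    }
}

fn main() {
    let mut limiter = DisconnectLimiter::new(Duration::from_secs(60));
    let now = Instant::now();
    println!("first disconnect allowed: {}", limiter.try_disconnect(now));
}
```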

@teor2345 teor2345 added C-enhancement Category: This is an improvement and removed S-needs-design Status: Needs a design decision labels Mar 8, 2021
@mpguerra mpguerra added this to the 2021 Sprint 12 milestone May 7, 2021
@mpguerra mpguerra removed this from the 2021 Sprint 13 milestone Jun 14, 2021
@mpguerra mpguerra added this to the 2021 Sprint 12 milestone Jun 14, 2021
@teor2345 teor2345 changed the title Do we need to prefer evicting pre-upgrade peers from the peer set across a network upgrade? Justify our alternative to evicting pre-upgrade peers from the peer set across a network upgrade Oct 13, 2021
@teor2345 teor2345 added the S-needs-investigation Status: Needs further investigation label Oct 13, 2021
@teor2345 teor2345 changed the title Justify our alternative to evicting pre-upgrade peers from the peer set across a network upgrade Justify our alternative to "evicting pre-upgrade peers from the peer set across a network upgrade" Oct 13, 2021
@teor2345 teor2345 assigned teor2345 and unassigned teor2345 Nov 25, 2021
@teor2345 teor2345 removed C-bug Category: This is a bug C-design Category: Software design work labels Nov 29, 2021
@teor2345 teor2345 removed the I-unbounded-growth Zebra keeps using resources, without any limit label Nov 29, 2021