
Justify our alternative to "evicting pre-upgrade peers from the peer set across a network upgrade" #706

Closed
1 of 4 tasks
hdevalence opened this issue Jul 21, 2020 · 7 comments · Fixed by #3181
Labels
A-rust Area: Updates to Rust code C-enhancement Category: This is an improvement C-security Category: Security issues I-heavy Problems with excessive memory, disk, or CPU usage I-slow Problems with performance or responsiveness S-needs-investigation Status: Needs further investigation

Comments


hdevalence commented Jul 21, 2020

TODO

Write tests or documentation for:

Motivation

Originally posted by @hdevalence in #702

#702 added a comment that we should attempt to evict pre-upgrade peers from the peer set and prefer choosing post-upgrade ready peers for queries.

We should be careful to handle everything a network upgrade requires, and eviction is worth considering (I presume this matches zcashd behaviour?). On the other hand, there are a few reasons to prefer not to do this:

  • in general, it's some additional code for us to write;
  • we would have to build additional plumbing to connect information about peer versions to the eviction logic;
  • we would have to change the details of the load balancing mechanism and build plumbing to connect peer versions to the load balancer;

I think that if we do nothing, we'll probably be fine:

  • we can close connections to peers that don't upgrade;
  • we connect to a very large number of peers, so we can afford to close many connections to peers that don't upgrade;
  • if all of the peers we were connected to didn't upgrade, we're in a network full of bad peers anyway, so trying to filter them would just bring the failure forward.

Does this analysis seem right, or am I missing some important details?

Related Issues

#1334 disconnecting old peers after network upgrade activation
#337 ZIP-200: Network Upgrade Mechanism

@hdevalence

cc @teor2345


str4d commented Jul 21, 2020

The zcashd behaviour is specified in ZIP 201.

@hdevalence

Thanks for the pointer! It looks like that spec is tied closely to the details of the zcashd network stack, which we don't replicate in Zebra, so I think we still need to work out a good answer to these questions.


teor2345 commented Jul 22, 2020

I think there are a few different questions here:

  1. How can Zebra prefer new peers before the activation height?
  2. How can Zebra disconnect old peers with existing connections at the activation height?
  3. Do we need to prefer new peers before the activation height?
  4. Can we guarantee that Zebra nodes always have a large number of peers?

Q 1. Here's a quick design sketch for preferring new peers:

  • for every peer that connects, if it is running an older version, reject the connection with some probability P. For example, we could set P to 0.5 just before the activation height (starting ~1.5 days before), and 1.0 at the activation height.

We might need to artificially increase the peer set churn, if long-lived connections to peers can last longer than 1.5 days. (We could lower the maximum peer connection time, or randomly disconnect peers.)
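As a sketch of this Q1 ramp, a hypothetical rejection-probability function might look like the following; the block counts, target spacing, and function name are illustrative assumptions, not Zebra's actual code:

```rust
/// Hypothetical probability of rejecting a connection from a pre-upgrade
/// peer, given the current and activation heights. The ramp (0.5 within
/// ~1.5 days of activation, 1.0 at activation) follows the sketch above.
fn old_peer_rejection_probability(current_height: u32, activation_height: u32) -> f64 {
    // ~1.5 days of blocks at an assumed 75-second target spacing.
    const RAMP_BLOCKS: u32 = 1728;

    if current_height >= activation_height {
        // At or after activation: always reject old peers.
        1.0
    } else if activation_height - current_height <= RAMP_BLOCKS {
        // Just before activation: reject old peers half the time.
        0.5
    } else {
        // Well before activation: accept old peers normally.
        0.0
    }
}

fn main() {
    // 1,000 blocks before activation is inside the ~1.5 day ramp window.
    println!("{}", old_peer_rejection_probability(499_000, 500_000));
}
```

A smooth ramp (linear in the remaining blocks) would also fit the sketch; the step function just keeps the example simple.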

Q 2. Here's a quick design sketch for disconnecting old peers:

  • for every peer, record its protocol version
  • at the activation height, disconnect old peers
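A minimal sketch of this Q2 design, assuming a hypothetical `PeerId` alias and placeholder protocol version numbers (none of these are Zebra's actual types):

```rust
use std::collections::HashMap;

/// Illustrative stand-in for a real peer connection handle.
type PeerId = u64;

/// Drop every tracked peer whose recorded protocol version is below the
/// minimum required after activation, returning the evicted peer ids.
fn evict_old_peers(
    peers: &mut HashMap<PeerId, u32>, // peer -> negotiated protocol version
    min_version: u32,
) -> Vec<PeerId> {
    // Collect the ids of peers below the minimum version...
    let evicted: Vec<PeerId> = peers
        .iter()
        .filter(|&(_, &version)| version < min_version)
        .map(|(&id, _)| id)
        .collect();
    // ...then remove them from the peer set.
    for id in &evicted {
        peers.remove(id);
    }
    evicted
}

fn main() {
    let mut peers = HashMap::from([(1u64, 170_100u32), (2, 170_015)]);
    let evicted = evict_old_peers(&mut peers, 170_100);
    println!("evicted peers: {evicted:?}");
}
```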

Q 3. If we are in a network with a large proportion of old peers, we want to cleanly separate from them at the activation height. Preferring new peers makes us gradually discover new peers, via the old peers we are connected to. So we gradually increase our chances of being connected to new peers.

Let's assume:

  • We are in a large network (compared with our peer set size), with a large proportion of old peers
  • By chance, we happen to mostly be connected to old peers

Here are a few different scenarios:

Scenario A:
  1. We struggle to download the activation block, because old peers won't validate it.
  2. The network partitions, and mines several different activation blocks, increasing the chance of chain reorgs across the activation height.

Scenario B:
  1. We abruptly disconnect from all the old peers, as soon as we validate the activation block.
  2. Every Zebra node suddenly goes looking for new peers, leading to a spike in network load.

Scenario C:
  1. (same as B)
  2. There is a chain reorganisation across the network upgrade height, so we suddenly accept old peers again, leading to thrashing.

Scenario D:
  1. We disconnect from old peers, isolating ourselves in a small network of new peers.
  2. We are vulnerable to an eclipse attack, which undermines the earlier assumption that "we connect to a very large number of peers, so we can afford to close many connections to peers that don't upgrade".

Q 4. I don't think we can guarantee that Zebra will always have a large number of peers.

Some users will want a small number of peers. For example, connecting to a single trusted node can be useful for:

  • wallet support,
  • network-level anonymity, or
  • sub-instances of a miner or exchange.

At the moment, we don't enforce having a large number of peers: users can set the peer limits as low as they like. (We don't seem to support a very low number of peers right now, but that's a separate issue; see #704.)

We could:

  • warn if the peer limit is low, or
  • warn if the actual number of peers is low.

But people don't read warnings, so we could:

  • shut down if the number of peers is very low, or
  • require users to configure a trusted peer mode to disable warnings or shutdowns.

I think this complicates the design, but for many users, a node shutdown is better than being attacked.

But I think it's easier for us to make sure Zebra has good peers.
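The warn-or-shutdown policy sketched above could look like the following; the thresholds and the `trusted_peer_mode` flag are illustrative assumptions, not an agreed design:

```rust
/// What the node should do about its current peer count.
#[derive(Debug, PartialEq)]
enum PeerCountAction {
    Ok,
    Warn,
    ShutDown,
}

/// Hypothetical policy: warn when the peer count is low, shut down when it
/// is very low, unless the user has explicitly opted into a trusted peer
/// mode. The threshold values are placeholders.
fn peer_count_action(peer_count: usize, trusted_peer_mode: bool) -> PeerCountAction {
    const WARN_BELOW: usize = 8;
    const SHUTDOWN_BELOW: usize = 2;

    if trusted_peer_mode {
        // Users who deliberately run with few peers opt out of the checks.
        PeerCountAction::Ok
    } else if peer_count < SHUTDOWN_BELOW {
        PeerCountAction::ShutDown
    } else if peer_count < WARN_BELOW {
        PeerCountAction::Warn
    } else {
        PeerCountAction::Ok
    }
}

fn main() {
    // A node with only 3 peers, and no trusted peer mode configured.
    println!("{:?}", peer_count_action(3, false));
}
```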

@dconnolly dconnolly added A-rust Area: Updates to Rust code C-design Category: Software design work S-needs-design Status: Needs a design decision labels Aug 27, 2020
@teor2345 teor2345 added this to the Sync and Network milestone Sep 29, 2020
@hdevalence

> Q 4. I don't think we can guarantee that Zebra will always have a large number of peers.
>
> Some users will want a small number of peers. For example, connecting to a single trusted node can be useful for:
>
> * wallet support,
> * network-level anonymity, or
> * sub-instances of a miner or exchange.

Supporting arbitrarily few peers was a non-goal of the architecture of the peer set (and the network stack generally), which aimed to treat peer connections as "cattle, not pets" (regrettable phrasing notwithstanding). So although some users may want a small number of peers, we should not seek to support that use-case.


teor2345 commented Mar 8, 2021

> > Q 4. I don't think we can guarantee that Zebra will always have a large number of peers.
> >
> > Some users will want a small number of peers. For example, connecting to a single trusted node can be useful for:
> >
> > * wallet support,
> > * network-level anonymity, or
> > * sub-instances of a miner or exchange.
>
> Supporting arbitrarily few peers was a non-goal of the architecture of the peer set (and the network stack generally), which aimed to treat peer connections as "cattle, not pets" (regrettable phrasing notwithstanding). So although some users may want a small number of peers, we should not seek to support that use-case.

Security arguments should be based on threat models, not supported use-cases.

"A Zebra node has a small number of peers" is something that can happen due to network conditions or a deliberate attack. (Even if we don't support configurations that deliberately seek to reduce the number of peers, they should be as secure as reasonably possible.)

@teor2345 teor2345 added C-bug Category: This is a bug C-security Category: Security issues I-heavy Problems with excessive memory, disk, or CPU usage I-slow Problems with performance or responsiveness I-unbounded-growth Zebra keeps using resources, without any limit labels Mar 8, 2021

teor2345 commented Mar 8, 2021

I don't think we need to make any code changes for this ticket, as long as the changes in other tickets satisfy the following conditions:

  • network convergence: Zebra knows the addresses of a large number of upgraded peers
    • Zebra doesn't need to be connected to these addresses at the network upgrade, and they don't need to be a large proportion of its address book - they just need to be in the address book
  • starvation: Zebra attempts other addresses before re-trying rejected or disconnected addresses
  • connection flooding / starvation: Zebra rate-limits closing existing connections

These changes should be made in #1334 and various peer set security tickets.
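The "rate-limits closing existing connections" condition could be sketched as a minimal limiter that allows at most one disconnect per interval, so nodes drain old peers gradually instead of all at once. The type name and interval are hypothetical, not Zebra's actual implementation:

```rust
use std::time::{Duration, Instant};

/// Hypothetical rate limiter: permits at most one peer disconnect
/// per `min_interval`.
struct DisconnectLimiter {
    min_interval: Duration,
    last_disconnect: Option<Instant>,
}

impl DisconnectLimiter {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_disconnect: None }
    }

    /// Returns true if a disconnect is allowed at `now`, recording it if so.
    fn try_disconnect(&mut self, now: Instant) -> bool {
        match self.last_disconnect {
            // Too soon since the last disconnect: keep the connection open.
            Some(last) if now.duration_since(last) < self.min_interval => false,
            // Otherwise, allow the disconnect and record its time.
            _ => {
                self.last_disconnect = Some(now);
                true
            }
        }
    }
}

fn main() {
    let mut limiter = DisconnectLimiter::new(Duration::from_secs(60));
    let now = Instant::now();
    println!("first disconnect allowed: {}", limiter.try_disconnect(now));
}
```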

@teor2345 teor2345 added C-enhancement Category: This is an improvement and removed S-needs-design Status: Needs a design decision labels Mar 8, 2021
@mpguerra mpguerra added this to the 2021 Sprint 12 milestone May 7, 2021
@mpguerra mpguerra removed this from the 2021 Sprint 13 milestone Jun 14, 2021
@mpguerra mpguerra added this to the 2021 Sprint 12 milestone Jun 14, 2021
@teor2345 teor2345 changed the title Do we need to prefer evicting pre-upgrade peers from the peer set across a network upgrade? Justify our alternative to evicting pre-upgrade peers from the peer set across a network upgrade Oct 13, 2021
@teor2345 teor2345 added the S-needs-investigation Status: Needs further investigation label Oct 13, 2021
@teor2345 teor2345 changed the title Justify our alternative to evicting pre-upgrade peers from the peer set across a network upgrade Justify our alternative to "evicting pre-upgrade peers from the peer set across a network upgrade" Oct 13, 2021
@teor2345 teor2345 assigned teor2345 and unassigned teor2345 Nov 25, 2021
@teor2345 teor2345 removed C-bug Category: This is a bug C-design Category: Software design work labels Nov 29, 2021
@teor2345 teor2345 removed the I-unbounded-growth Zebra keeps using resources, without any limit label Nov 29, 2021