Security: Limit reconnection rate to individual peers #2275

teor2345 · 2021-06-10T05:16:12Z

Motivation

Zebra's peer liveness check is only applied to peers in the Responded state. This can lead to repeated retries of Failed peers, particularly in small address books.

Zebra takes the most recent time from all the peer time fields, and uses that time for its retry order. This makes Zebra retry some peers multiple times, before retrying other peers. (And in general, we don't want to confuse trusted and untrusted data, or success and failure times.)

Specifications

Unfortunately, there are no Zcash or Bitcoin specifications for peer reconnection rate-limits, or reconnection order.

Designs

Here is Zebra's current peer liveness timeout:

zebra/zebra-network/src/constants.rs

Lines 35 to 46 in 96a1b66

    
    /// We expect to receive a message from a live peer at least once in this time duration.
    ///
    /// This is the sum of:
    /// - the interval between connection heartbeats
    /// - the timeout of a possible pending (already-sent) request
    /// - the timeout for a possible queued request
    /// - the timeout for the heartbeat request itself
    ///
    /// This avoids explicit synchronization, but relies on the peer
    /// connector actually setting up channels and these heartbeats in a
    /// specific manner that matches up with this math.
    pub const LIVE_PEER_DURATION: Duration = Duration::from_secs(60 + 20 + 20 + 20);

Solution

Reconnection Rate

Limit the reconnection rate to each individual peer by applying the liveness cutoff to the attempt, responded, and failure time fields. If any field is recent, the peer is skipped.

This new liveness cutoff skips any peers that have recently been connected, attempted or failed, regardless of their current state.

This change should close #1848.
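
As a rough illustration of the rate limit described above, here is a minimal sketch. It is not the code in this PR: `PeerTimes`, its field names, and the use of plain `u64` seconds are assumptions made for the example, and the real `MetaAddr` check also validates the address for outbound connections.

```rust
use std::time::Duration;

/// Liveness cutoff, copied from the constant quoted in the Designs section.
pub const LIVE_PEER_DURATION: Duration = Duration::from_secs(60 + 20 + 20 + 20);

/// Hypothetical, simplified stand-in for `MetaAddr`. Each field holds
/// seconds since the Unix epoch, or `None` if the event never happened.
#[derive(Clone, Copy, Debug, Default)]
pub struct PeerTimes {
    pub last_attempt: Option<u64>,
    pub last_response: Option<u64>,
    pub last_failure: Option<u64>,
}

/// Returns true if `time` is within `LIVE_PEER_DURATION` of `now`.
/// Times in the future saturate to an age of zero, so they count as recent.
fn is_recent(time: Option<u64>, now: u64) -> bool {
    match time {
        Some(t) => now.saturating_sub(t) <= LIVE_PEER_DURATION.as_secs(),
        None => false,
    }
}

impl PeerTimes {
    /// A peer is skipped if any of its three time fields is recent,
    /// regardless of the peer's current state.
    pub fn is_ready_for_attempt(&self, now: u64) -> bool {
        !is_recent(self.last_attempt, now)
            && !is_recent(self.last_response, now)
            && !is_recent(self.last_failure, now)
    }
}
```

In this sketch, a peer that failed 30 seconds ago is skipped, while a peer whose most recent event was more than 120 seconds ago is eligible again.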

Reconnection Order

Changes:

- make Zebra prefer peers in more useful states: responded, never attempted, failed, attempt pending
- if the states are equal, prefer the earliest attempted time, then the earliest failed time, then the earliest responded time, then the most recent gossiped last seen time

Unlike the previous order, the new order:

- tries all peers in each state, before re-trying any peer in that state, and
- only checks the gossiped untrusted last seen time if all other times are equal.
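
The ordering rules above could be sketched as a comparison function like the one below. This is an illustrative sketch under stated assumptions, not the `Ord` implementation in this PR: the `Peer` struct, its field names, and the `u64` timestamps are hypothetical.

```rust
use std::cmp::Ordering;

/// Peer states in priority order. The derived `Ord` sorts earlier
/// variants first, so the declaration order is the preference order.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum PeerState {
    Responded,
    NeverAttempted,
    Failed,
    AttemptPending,
}

/// Hypothetical, simplified peer record. Times are seconds since the
/// Unix epoch, or `None` if the event never happened.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Peer {
    pub name: &'static str,
    pub state: PeerState,
    pub last_attempt: Option<u64>,
    pub last_failure: Option<u64>,
    pub last_response: Option<u64>,
    pub untrusted_last_seen: Option<u64>,
}

/// Orders peers so that the next peer to try sorts first.
pub fn reconnection_order(a: &Peer, b: &Peer) -> Ordering {
    a.state
        .cmp(&b.state)
        // Earliest times first. `None` sorts before every `Some`, so peers
        // that have never had the event are tried before peers that have.
        .then(a.last_attempt.cmp(&b.last_attempt))
        .then(a.last_failure.cmp(&b.last_failure))
        .then(a.last_response.cmp(&b.last_response))
        // The gossiped time is untrusted, so it is only a final tie-breaker.
        // The operands are swapped to put the most recent time first.
        .then(b.untrusted_last_seen.cmp(&a.untrusted_last_seen))
}
```

Sorting a list with `peers.sort_by(reconnection_order)` puts the oldest attempt in each state first, so every peer in a state is tried before any of them is retried.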

Review

@jvff can review this change.

This change is important, but it doesn't seem to be currently causing any issues on the network.

Reviewer Checklist

Code implements Specs and Designs
Tests for Expected Behaviour
Tests for Errors

Related Work

This PR is based on #2273. It should automatically rebase on main once that PR merges.

This PR is part of a series of MetaAddr refactors. After this PR merges, we can close #1849.

teor2345 · 2021-06-10T05:22:34Z

I still need to write proptests to ensure:

- regardless of the changes applied to a MetaAddr, it never gets tried more than once per LIVE_PEER_DURATION
  - we'll need to prefer later times to earlier times to make this property hold
- all disconnected MetaAddrs in a particular state are retried once, before any are retried twice
  - there might be some exceptions to this property; the tests should show us what they are
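
To illustrate why preferring later times matters for the first property, here is a minimal sketch. The `merge_time` helper and the `u64` timestamps are assumptions for this example, not the code in this PR.

```rust
/// Hypothetical helper: combine a stored timestamp with one from an
/// incoming change, keeping the later of the two.
///
/// `Option`'s ordering puts `None` before every `Some`, so `max` also
/// keeps a known time over a missing one.
pub fn merge_time(previous: Option<u64>, update: Option<u64>) -> Option<u64> {
    previous.max(update)
}

/// Applies a sequence of changes to an initially empty time field.
pub fn apply_all(changes: &[Option<u64>]) -> Option<u64> {
    changes.iter().copied().fold(None, merge_time)
}
```

Because `max` is commutative and associative, the stored time is the same whatever order the changes arrive in. An older change that arrives late cannot move a timestamp backwards and make a recently tried peer look ready for another attempt.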

Reconnection Rate

Limit the reconnection rate to each individual peer by applying the liveness cutoff to the attempt, responded, and failure time fields. If any field is recent, the peer is skipped.

The new liveness cutoff skips any peers that have recently been attempted or failed. (Previously, the liveness check was only applied if the peer was in the `Responded` state, which could lead to repeated retries of `Failed` peers, particularly in small address books.)

Reconnection Order

Zebra prefers more useful peer states, then the earliest attempted, failed, and responded times, then the most recent gossiped last seen times.

Before this change, Zebra took the most recent time in all the peer time fields, and used that time for liveness and ordering. This led to confusion between trusted and untrusted data, and success and failure times.

Unlike the previous order, the new order:
- tries all peers in each state, before re-trying any peer in that state, and
- only checks the gossiped untrusted last seen time if all other times are equal.

teor2345 · 2021-06-15T07:45:30Z

zebra-network/src/address_book.rs

@@ -155,19 +185,22 @@ impl AddressBook {
        );

        if let Some(updated) = updated {
-            // If a node that we are directly connected to has changed to a client,
-            // remove it from the address book.
-            if updated.is_direct_client() && previous.is_some() {


We can't remove peers that are recently live.

If we get that removed peer as a gossiped or alternate address, we'll reconnect to it within the liveness interval. (The proptests discovered this bug.)

(But it's ok to ignore specific addresses or peers that were never attempted, because there's no risk of reconnecting to them.)

teor2345 · 2021-06-15T07:54:18Z

zebra-network/src/meta_addr.rs

+        // Prioritise older attempt times, so we try all peers in each state,
+        // before re-trying any of them. This avoids repeatedly reconnecting to
+        // peers that aren't working.


This is a core part of the security fix: try older peers first, to reduce rapid reconnections to the same peer.

teor2345 · 2021-06-15T07:54:42Z

zebra-network/src/meta_addr.rs

+    /// Is this address ready for a new outbound connection attempt?
+    pub fn is_ready_for_attempt(&self) -> bool {
+        self.last_known_info_is_valid_for_outbound()
+            && !self.was_recently_live()
+            && !self.was_recently_attempted()
+            && !self.was_recently_failed()
+    }


This is a core part of the security fix: skip peers that have recently been attempted, responded, or failed.

teor2345 · 2021-06-15T08:04:27Z

zebra-network/src/protocol/external/arbitrary.rs

+
+    fn arbitrary_with(_args: Self::Parameters) -> Self::Strategy {
+        any::<u64>()
+            .prop_map(PeerServices::from_bits_truncate)


Previously, derive(Arbitrary) was putting any u64 value in these bits, which caused spurious errors.
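
For readers unfamiliar with `from_bits_truncate` (generated by the `bitflags` crate), the sketch below shows the idea with a hand-written type. The `Services` type and its single flag are assumptions for illustration, not Zebra's `PeerServices` definition.

```rust
/// Hypothetical flag set, standing in for a `bitflags`-generated type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Services(u64);

impl Services {
    /// Illustrative flag bit (an assumption for this example).
    pub const NODE_NETWORK: u64 = 1;

    /// Mask of every bit that has a defined meaning.
    pub const ALL_KNOWN: u64 = Self::NODE_NETWORK;

    /// Builds a value from raw bits, silently dropping undefined bits.
    pub fn from_bits_truncate(bits: u64) -> Self {
        Services(bits & Self::ALL_KNOWN)
    }

    /// Returns the raw bits.
    pub fn bits(self) -> u64 {
        self.0
    }
}
```

Mapping an arbitrary `u64` through a truncating constructor means a generated value only ever contains defined flag bits, which is the effect the strategy in the diff above aims for.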

teor2345 · 2021-06-15T09:10:48Z

zebra-network/src/meta_addr/tests/prop.rs

+    /// themselves. It detects bugs in [`MetaAddr`]s, even if there are
+    /// compensating bugs in the [`CandidateSet`] or [`AddressBook`].
+    //
+    // TODO: write a similar test using the AddressBook and CandidateSet


The extra test in this TODO isn't a high priority, because outbound connection fairness itself isn't a high priority.

Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>

jvff

Already looking good! Just a few ideas in case any of them are useful 👍

teor2345 · 2021-06-18T09:20:32Z

@jvff I've just changed the order of the constants, feel free to merge once the release is out.

teor2345 added the A-rust (Area: Updates to Rust code), P-Medium, C-security (Category: Security issues), I-remote-node-overload (Zebra can overload other nodes on the network), and A-network (Area: Network protocol updates or fixes) labels Jun 10, 2021

teor2345 added this to the 2021 Sprint 11 - Zcon2 milestone Jun 10, 2021

teor2345 requested a review from jvff June 10, 2021 05:16

teor2345 self-assigned this Jun 10, 2021

mpguerra removed this from the 2021 Sprint 11 - Zcon2 milestone Jun 14, 2021

mpguerra linked an issue Jun 14, 2021 that may be closed by this pull request

Security: Limit reconnection rate to each individual peer address #1848

Closed

2 tasks

teor2345 mentioned this pull request Jun 14, 2021

Justify our alternative to "evicting pre-upgrade peers from the peer set across a network upgrade" #706

Closed

4 tasks

teor2345 marked this pull request as ready for review June 15, 2021 02:43

Base automatically changed from stop-failure-as-last-seen to main June 15, 2021 03:31

teor2345 force-pushed the limit-addr-reconnection-rate branch from ccdb6a8 to 2954ba3 Compare June 15, 2021 03:32

teor2345 added 7 commits June 15, 2021 13:34

Preserve the later time if changes arrive out of order

d046dc8

Update CandidateSet::next documentation

66e2d57

Update CandidateSet state diagram

b22ef8a

Fix variant names in comments

e9dff41

Explain why timestamps can be left out of MetaAddrChanges

2efb9ac

Add a simple test for the individual peer retry limit

318a495

teor2345 force-pushed the limit-addr-reconnection-rate branch from 2954ba3 to 318a495 Compare June 15, 2021 03:34

teor2345 marked this pull request as draft June 15, 2021 03:39

teor2345 added 2 commits June 15, 2021 14:56

Only generate valid Arbitrary PeerServices values

591ed96

Add an individual peer retry limit AddressBook and CandidateSet test

1e20bd0