Banned node for "Same state requested multiple times" #4924

girazoki · 2024-07-02T09:58:59Z

We have seen the "Banned: same state requested multiple times" using warp sync without apparent signs of the node doing anything wrong. The architecture is at follows:

Around roughly the same time, 3 different nodes warp sync against the same (and only) full-node.
They start syncing the state but at some point the full-node bans all 3 nodes with the same message: Reason: Same state request multiple times. Banned, disconnecting

Full node logs:

INFO tokio-runtime-worker substrate: [Parachain] 💤 Idle (3 peers), best: #822149 (0x8fa1…d27f), finalized #822147 (0x6e40…af94), ⬇ 2.7kiB/s ⬆ 4.4MiB/s
INFO tokio-runtime-worker substrate: [Parachain] 💤 Idle (3 peers), best: #822149 (0x8fa1…d27f), finalized #822147 (0x6e40…af94), ⬇ 0.5kiB/s ⬆ 2.7MiB/s
INFO tokio-runtime-worker substrate: [Parachain] 💤 Idle (3 peers), best: #822149 (0x8fa1…d27f), finalized #822147 (0x6e40…af94), ⬇ 0.9kiB/s ⬆ 4.8MiB/s
 Idle (3 peers), best: #822149 (0x8fa1…d27f), finalized #822147 (0x6e40…af94), ⬇ 2.0kiB/s ⬆ 10.1MiB/s
 INFO tokio-runtime-worker substrate: [Parachain] 💤 Idle (0 peers), best: #822149 (0x8fa1…d27f), finalized #822147 (0x6e40…af94), ⬇ 1.1kiB/s ⬆ 6.3MiB/s
 WARN tokio-runtime-worker peerset: [Parachain] Report 12D3KooWA4gQgEZgPka1UW5KWZYcjYMJUrpHgDWaJFJUaUUFknkx: -2147483648 to -2147483648. Reason: Same state request multiple times. Banned, disconnecting.
 WARN tokio-runtime-worker peerset: [Parachain] Report 12D3KooWKc1dwqMibdf5DYXM3B3vDxwpFXr6SKcvtJ8BSCZoehiZ: -2147483648 to -2147483648. Reason: Same state request multiple times. Banned, disconnecting.

We are wondering if the issue could be caused because the full-node cannot serve 3 nodes simultaneously asking for the latest state, which causes these 3 nodes to not receive the indicated bytes and therefore re-ask again for the same byttes, which causes the banning. Could this be any possible explanation?

From what we have seen from the substrate code, the banning happens after receiving the same request 3 times

The text was updated successfully, but these errors were encountered:

bkchr · 2024-07-02T12:00:46Z

CC @paritytech/networking

girazoki · 2024-07-02T16:56:02Z

Possibly related: #531 (comment)

lexnv · 2024-07-05T14:31:42Z

Possible root cause for multiple block requests, needs debug patch to test with a local net:

T0: When the peer is connected, the peer is added with a state to the self.peers

polkadot-sdk/substrate/client/network/sync/src/strategy/chain_sync.rs

Line 475 in d250dcb

    
           pub fn add_peer(&mut self, peer_id: PeerId, best_hash: B::Hash, best_number: NumberFor<B>) {

T1: The node makes a request to the peer setting the state to Downloading
T2: The remote peer takes longer than 20 secs to provide a response back
T3: keep-alive disconnects the peer causing us to lose the state about it

polkadot-sdk/substrate/client/network/sync/src/strategy/chain_sync.rs

Lines 1138 to 1144 in d250dcb

    
           /// Notify that a sync peer has disconnected. 
        
           pub fn remove_peer(&mut self, peer_id: &PeerId) { 
        
           	self.blocks.clear_peer_download(peer_id); 
        
           	if let Some(gap_sync) = &mut self.gap_sync { 
        
           		gap_sync.blocks.clear_peer_download(peer_id) 
        
           	} 
        
           	self.peers.remove(peer_id);

T4: We reconnect to the peer, set the state to Available

On one side, we store the peers in a HashMap (strategy/chain_sync logic):

polkadot-sdk/substrate/client/network/sync/src/strategy/chain_sync.rs

Line 252 in d250dcb

peers: HashMap<PeerId, PeerSync<B>>,

On the other side, we store in a LRU map the previous requests:

polkadot-sdk/substrate/client/network/sync/src/block_request_handler.rs

Line 155 in d250dcb

seen_requests: LruMap<SeenRequestsKey<B>, SeenRequestsValue>,

lexnv · 2024-07-05T16:20:34Z

Managed to reproduce this behavior with the following branch:

https://github.com/paritytech/polkadot-sdk/tree/lexnv/check-multiple-blocks-rootcause

2024-07-05 19:15:40 Finished sleeping for block request from 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu!
2024-07-05 19:15:40 Received block request from 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu simulating sleep!
2024-07-05 19:15:40 Peer 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu disconnected. Removed from internal mappings
2024-07-05 19:15:40 New peer 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu with best hash 0x1c69…fbf9 (0)
2024-07-05 19:15:40 Inserting peer 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu with state PeerSync { peer_id: PeerId("12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu"), common_number: 0, best_hash: 0x1c69d379ff8920602d3ccb392e6f8fdd319b47971775f7696b17a69ca06cfbf9, best_number: 0, state: Available } (InChainPruned)


2024-07-05 19:16:00 Finished sleeping for block request from 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu!

2024-07-05 19:16:00 Received block request from 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu simulating sleep!
2024-07-05 19:16:00 Report 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu: -2147483648 to -2147483648. Reason: Same block request multiple times. Banned, disconnecting.
2024-07-05 19:16:00 Peer 12D3KooWPfpK9xaCkg4eukAT9nJunVjc1drNisHFcE6qpdQWTHiu disconnected. Removed from internal mappings

Will come up with a PR to fix this next week

girazoki · 2024-07-10T08:28:01Z

Thank you @lexnv ! appreciated!

…g the same request multiple times (#5029) This PR avoids submitting the same block or state request multiple times to the same slow peer. Previously, we submitted the same request to the same slow peer, which resulted in reputation bans on the slow peer side. Furthermore, the strategy selected the same slow peer multiple times to submit queries to, although a better candidate may exist. Instead, in this PR we: - introduce a `DisconnectedPeers` via LRU with 512 peer capacity to only track the state of disconnected peers with a request in flight - when the `DisconnectedPeers` detects a peer disconnected with a request in flight, the peer is backed off - on the first disconnection: 60 seconds - on second disconnection: 120 seconds - on the third disconnection the peer is banned, and the peer remains banned until the peerstore decays its reputation This PR lifts the pressure from overloaded nodes that cannot process requests in due time. And if a peer is detected to be slow after backoffs, the peer is banned. Theoretically, submitting the same request multiple times can still happen when: - (a) we backoff and ban the peer - (b) the network does not discover other peers -- this may also be a test net - (c) the peer gets reconnected after the reputation decay and is still slow to respond Aims to improve: - #4924 - #531 Next Steps: - Investigate the network after this is deployed, possibly bumping the keep-alive timeout or seeing if there's something else misbehaving This PR builds on top of: - #4987 ### Testing Done - Added a couple of unit tests where test harness were set in place - Local testnet ```bash 13:13:25.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added first time peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD 13:14:39.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 2, last_disconnect: Instant { tv_sec: 93355, tv_nsec: 942016062 } }, should ban: false 13:16:49.107 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 3, last_disconnect: Instant { tv_sec: 93485, tv_nsec: 947551051 } }, should ban: true 13:16:49.108 WARN tokio-runtime-worker peerset: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648 to -2147483648. Reason: Slow peer after backoffs. Banned, disconnecting. ``` cc @paritytech/networking --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

…g the same request multiple times (paritytech#5029) This PR avoids submitting the same block or state request multiple times to the same slow peer. Previously, we submitted the same request to the same slow peer, which resulted in reputation bans on the slow peer side. Furthermore, the strategy selected the same slow peer multiple times to submit queries to, although a better candidate may exist. Instead, in this PR we: - introduce a `DisconnectedPeers` via LRU with 512 peer capacity to only track the state of disconnected peers with a request in flight - when the `DisconnectedPeers` detects a peer disconnected with a request in flight, the peer is backed off - on the first disconnection: 60 seconds - on second disconnection: 120 seconds - on the third disconnection the peer is banned, and the peer remains banned until the peerstore decays its reputation This PR lifts the pressure from overloaded nodes that cannot process requests in due time. And if a peer is detected to be slow after backoffs, the peer is banned. Theoretically, submitting the same request multiple times can still happen when: - (a) we backoff and ban the peer - (b) the network does not discover other peers -- this may also be a test net - (c) the peer gets reconnected after the reputation decay and is still slow to respond Aims to improve: - paritytech#4924 - paritytech#531 Next Steps: - Investigate the network after this is deployed, possibly bumping the keep-alive timeout or seeing if there's something else misbehaving This PR builds on top of: - paritytech#4987 ### Testing Done - Added a couple of unit tests where test harness were set in place - Local testnet ```bash 13:13:25.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added first time peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD 13:14:39.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 2, last_disconnect: Instant { tv_sec: 93355, tv_nsec: 942016062 } }, should ban: false 13:16:49.107 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 3, last_disconnect: Instant { tv_sec: 93485, tv_nsec: 947551051 } }, should ban: true 13:16:49.108 WARN tokio-runtime-worker peerset: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648 to -2147483648. Reason: Slow peer after backoffs. Banned, disconnecting. ``` cc @paritytech/networking --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

This was referenced Jul 9, 2024

network: Investigate importing block with unknown parent #4984

Open

chainSync: Avoid reputation bans for slower timed out peer block requests #4987

Closed

lexnv mentioned this issue Jul 16, 2024

network/strategy: Backoff and ban overloaded peers to avoid submitting the same request multiple times #5029

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Banned node for "Same state requested multiple times" #4924

Banned node for "Same state requested multiple times" #4924

girazoki commented Jul 2, 2024 •

edited

Loading

bkchr commented Jul 2, 2024

girazoki commented Jul 2, 2024

lexnv commented Jul 5, 2024

lexnv commented Jul 5, 2024

girazoki commented Jul 10, 2024

Banned node for "Same state requested multiple times" #4924

Banned node for "Same state requested multiple times" #4924

Comments

girazoki commented Jul 2, 2024 • edited Loading

bkchr commented Jul 2, 2024

girazoki commented Jul 2, 2024

lexnv commented Jul 5, 2024

lexnv commented Jul 5, 2024

girazoki commented Jul 10, 2024

girazoki commented Jul 2, 2024 •

edited

Loading