-
Notifications
You must be signed in to change notification settings - Fork 708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Banned node for "Same state requested multiple times" #4924
Comments
CC @paritytech/networking |
Possibly related: #531 (comment) |
Possible root cause for multiple block requests, needs debug patch to test with a local net:
polkadot-sdk/substrate/client/network/sync/src/strategy/chain_sync.rs Lines 1138 to 1144 in d250dcb
On one side, we store the peers in a HashMap (strategy/chain_sync logic):
On the other side, we store in a LRU map the previous requests:
|
Managed to reproduce this behavior with the following branch:
Will come up with a PR to fix this next week |
Thank you @lexnv ! appreciated! |
…g the same request multiple times (#5029) This PR avoids submitting the same block or state request multiple times to the same slow peer. Previously, we submitted the same request to the same slow peer, which resulted in reputation bans on the slow peer side. Furthermore, the strategy selected the same slow peer multiple times to submit queries to, although a better candidate may exist. Instead, in this PR we: - introduce a `DisconnectedPeers` via LRU with 512 peer capacity to only track the state of disconnected peers with a request in flight - when the `DisconnectedPeers` detects a peer disconnected with a request in flight, the peer is backed off - on the first disconnection: 60 seconds - on second disconnection: 120 seconds - on the third disconnection the peer is banned, and the peer remains banned until the peerstore decays its reputation This PR lifts the pressure from overloaded nodes that cannot process requests in due time. And if a peer is detected to be slow after backoffs, the peer is banned. Theoretically, submitting the same request multiple times can still happen when: - (a) we backoff and ban the peer - (b) the network does not discover other peers -- this may also be a test net - (c) the peer gets reconnected after the reputation decay and is still slow to respond Aims to improve: - #4924 - #531 Next Steps: - Investigate the network after this is deployed, possibly bumping the keep-alive timeout or seeing if there's something else misbehaving This PR builds on top of: - #4987 ### Testing Done - Added a couple of unit tests where test harness were set in place - Local testnet ```bash 13:13:25.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added first time peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD 13:14:39.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 2, last_disconnect: Instant { tv_sec: 93355, tv_nsec: 942016062 } }, should ban: false 13:16:49.107 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 3, last_disconnect: Instant { tv_sec: 93485, tv_nsec: 947551051 } }, should ban: true 13:16:49.108 WARN tokio-runtime-worker peerset: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648 to -2147483648. Reason: Slow peer after backoffs. Banned, disconnecting. ``` cc @paritytech/networking --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
…g the same request multiple times (paritytech#5029) This PR avoids submitting the same block or state request multiple times to the same slow peer. Previously, we submitted the same request to the same slow peer, which resulted in reputation bans on the slow peer side. Furthermore, the strategy selected the same slow peer multiple times to submit queries to, although a better candidate may exist. Instead, in this PR we: - introduce a `DisconnectedPeers` via LRU with 512 peer capacity to only track the state of disconnected peers with a request in flight - when the `DisconnectedPeers` detects a peer disconnected with a request in flight, the peer is backed off - on the first disconnection: 60 seconds - on second disconnection: 120 seconds - on the third disconnection the peer is banned, and the peer remains banned until the peerstore decays its reputation This PR lifts the pressure from overloaded nodes that cannot process requests in due time. And if a peer is detected to be slow after backoffs, the peer is banned. Theoretically, submitting the same request multiple times can still happen when: - (a) we backoff and ban the peer - (b) the network does not discover other peers -- this may also be a test net - (c) the peer gets reconnected after the reputation decay and is still slow to respond Aims to improve: - paritytech#4924 - paritytech#531 Next Steps: - Investigate the network after this is deployed, possibly bumping the keep-alive timeout or seeing if there's something else misbehaving This PR builds on top of: - paritytech#4987 ### Testing Done - Added a couple of unit tests where test harness were set in place - Local testnet ```bash 13:13:25.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added first time peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD 13:14:39.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 2, last_disconnect: Instant { tv_sec: 93355, tv_nsec: 942016062 } }, should ban: false 13:16:49.107 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 3, last_disconnect: Instant { tv_sec: 93485, tv_nsec: 947551051 } }, should ban: true 13:16:49.108 WARN tokio-runtime-worker peerset: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648 to -2147483648. Reason: Slow peer after backoffs. Banned, disconnecting. ``` cc @paritytech/networking --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
We have seen the
"Banned: same state requested multiple times"
using warp sync without apparent signs of the node doing anything wrong. The architecture is at follows:Reason: Same state request multiple times. Banned, disconnecting
Full node logs:
We are wondering if the issue could be caused because the full-node cannot serve 3 nodes simultaneously asking for the latest state, which causes these 3 nodes to not receive the indicated bytes and therefore re-ask again for the same byttes, which causes the banning. Could this be any possible explanation?
From what we have seen from the substrate code, the banning happens after receiving the same request 3 times
The text was updated successfully, but these errors were encountered: