Node can send the same block request multiple times while syncing from other nodes #531
Comments
What's the reply sent back to all these requests?
Here is node_B.log (the node is running in Docker), filtered with `cat node_B.log | grep '12D3KooWRkb1h3RqQ8aMNLoNvfNsceBKftMcRuTnS8bTnDphAewh'`:
{"log":"2022-06-22 08:34:56.712 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooWRkb1h3RqQ8aMNLoNvfNsceBKftMcRuTnS8bTnDphAewh: Starting at `BlockId::Number(2854)` with maximum blocks of `1`, direction `Ascending` and attributes `HEADER | JUSTIFICATION`. \n","stream":"stderr","time":"2022-06-22T08:34:56.730348545Z"}
{"log":"2022-06-22 08:34:56.712 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooWRkb1h3RqQ8aMNLoNvfNsceBKftMcRuTnS8bTnDphAewh. \n","stream":"stderr","time":"2022-06-22T08:34:56.730353307Z"}
{"log":"2022-06-22 08:34:57.622 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooWRkb1h3RqQ8aMNLoNvfNsceBKftMcRuTnS8bTnDphAewh: Starting at `BlockId::Number(2918)` with maximum blocks of `64`, direction `Descending` and attributes `HEADER | BODY | JUSTIFICATION`. \n","stream":"stderr","time":"2022-06-22T08:34:57.633748356Z"}
{"log":"2022-06-22 08:34:57.625 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooWRkb1h3RqQ8aMNLoNvfNsceBKftMcRuTnS8bTnDphAewh. \n","stream":"stderr","time":"2022-06-22T08:34:57.633750961Z"}
{"log":"2022-06-22 08:35:02.606 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooWRkb1h3RqQ8aMNLoNvfNsceBKftMcRuTnS8bTnDphAewh: Starting at `BlockId::Number(2982)` with maximum blocks of `64`, direction `Descending` and attributes `HEADER | BODY | JUSTIFICATION`. \n","stream":"stderr","time":"2022-06-22T08:35:02.649238914Z"}
{"log":"2022-06-22 08:35:02.608 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooWRkb1h3RqQ8aMNLoNvfNsceBKftMcRuTnS8bTnDphAewh. \n","stream":"stderr","time":"2022-06-22T08:35:02.649277629Z"}
{"log":"2022-06-22 08:35:28.043 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooWRkb1h3RqQ8aMNLoNvfNsceBKftMcRuTnS8bTnDphAewh: Starting at `BlockId::Number(2918)` with maximum blocks of `1`, direction `Ascending` and attributes `HEADER | JUSTIFICATION`. \n","stream":"stderr","time":"2022-06-22T08:35:28.082550206Z"}
{"log":"2022-06-22 08:35:28.043 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooWRkb1h3RqQ8aMNLoNvfNsceBKftMcRuTnS8bTnDphAewh. \n","stream":"stderr","time":"2022-06-22T08:35:28.082553911Z"}
{"log":"2022-06-22 08:35:29.563 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooWRkb1h3RqQ8aMNLoNvfNsceBKftMcRuTnS8bTnDphAewh: Starting at `BlockId::Number(2982)` with maximum blocks of `64`, direction `Descending` and attributes `HEADER | BODY | JUSTIFICATION`. \n","stream":"stderr","time":"2022-06-22T08:35:29.586889696Z"}
......
Block response handling could use more logging indeed. Could it be that there's a block over 16Mb in size?
Hmm, I don't think so. Node C, which is on the same machine as node A, can sync from B without any issue. This edge case could largely be caused by network uncertainty, but I believe there is something we can do to avoid the duplicate block requests.
Maybe this one here: paritytech/substrate#11094?
Might be, I thought that one had been merged long ago 😂
paritytech/substrate#11094 is related, but might be incomplete? With paritytech/substrate#11094 applied, I still see the duplicated requests :(
I don't think paritytech/substrate#11094 is related here. Duplicate requests are not the issue; they are to be expected. The node requests blocks but for some reason cannot receive them. This lowers the reputation and disconnects the nodes. Normally this means that the nodes will select a peer with a better reputation for this slot, but in your case the same peers keep reconnecting, probably because there are no other peers to choose from, as the number of slots is higher than the number of known peers on the network. After they reconnect, the same request is issued and the cycle repeats. What I'd like to figure out is why handling the block request fails, and whether this happens on the receiving side or the sending side.
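A minimal, illustrative model of the re-selection loop described above (this is not sc-network code; `Peer` and `select_sync_peer` are hypothetical names): when only one peer is known, every selection round for the free sync slot returns the same peer again, no matter how far its reputation has dropped.

```rust
#[derive(Clone, Debug)]
struct Peer {
    id: &'static str,
    reputation: i32,
    connected: bool,
}

// Pick the connected peer with the best reputation for a free sync slot.
fn select_sync_peer(peers: &[Peer]) -> Option<&Peer> {
    peers.iter().filter(|p| p.connected).max_by_key(|p| p.reputation)
}

fn main() {
    // With a single known peer (e.g. only one bootnode), every round
    // re-selects the same peer even after a failed block request and a
    // disconnect/reconnect cycle lowered its reputation.
    let peers = vec![Peer { id: "node_B", reputation: -500, connected: true }];
    for round in 0..3 {
        if let Some(p) = select_sync_peer(&peers) {
            println!("round {round}: re-selected {} (reputation {})", p.id, p.reputation);
        }
    }
}
```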
An interesting thing is that, with paritytech/substrate#11094, each restart now syncs a number of blocks before getting stuck again; restarting the node was not much help before. Yes, it's a small internal experimental network at present, and only one bootnode is available.
The link points to nowhere :( I think you meant this result of the block response. I will add more logs to collect more info on both the sending and receiving sides and update here.
Node A keeps sending the duplicated block request for 12309.
node_B.log will send
@arkpar Let me know if more info is needed.
So the very first request for 12309 was handled correctly on
It is not clear why
@arkpar Here is the full node_A.log.
Hmm, looks like A imported 12309 successfully and was even able to proceed to a higher block...
Add more info to the log to help debug the issue like https://github.com/paritytech/substrate/issues/11732.
@liuchengxu Could you check again with paritytech/substrate#11763 merged?
@arkpar It's fine to only patch the node on the sending side, right? |
@liuchengxu Yes, it should be fine
@arkpar Hmm, no good progress with paritytech/substrate#11763; not sure if it's caused by some other problem :( From the log, it looks like node B sent back the requested blocks the first time, but due to a network timeout node A wasn't able to receive and import these blocks and kept duplicating the requests.
…g the same request multiple times (#5029)

This PR avoids submitting the same block or state request multiple times to the same slow peer. Previously, we submitted the same request to the same slow peer, which resulted in reputation bans on the slow peer side. Furthermore, the strategy selected the same slow peer multiple times to submit queries to, although a better candidate may exist.

Instead, in this PR we:
- introduce a `DisconnectedPeers` via LRU with 512 peer capacity to only track the state of disconnected peers with a request in flight
- when the `DisconnectedPeers` detects a peer disconnected with a request in flight, the peer is backed off
  - on the first disconnection: 60 seconds
  - on second disconnection: 120 seconds
  - on the third disconnection the peer is banned, and the peer remains banned until the peerstore decays its reputation

This PR lifts the pressure from overloaded nodes that cannot process requests in due time. And if a peer is detected to be slow after backoffs, the peer is banned.

Theoretically, submitting the same request multiple times can still happen when:
- (a) we backoff and ban the peer
- (b) the network does not discover other peers -- this may also be a test net
- (c) the peer gets reconnected after the reputation decay and is still slow to respond

Aims to improve:
- #4924
- #531

Next Steps:
- Investigate the network after this is deployed, possibly bumping the keep-alive timeout or seeing if there's something else misbehaving

This PR builds on top of:
- #4987

### Testing Done
- Added a couple of unit tests where test harness were set in place
- Local testnet

```bash
13:13:25.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added first time peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD
13:14:39.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 2, last_disconnect: Instant { tv_sec: 93355, tv_nsec: 942016062 } }, should ban: false
13:16:49.107 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 3, last_disconnect: Instant { tv_sec: 93485, tv_nsec: 947551051 } }, should ban: true
13:16:49.108 WARN tokio-runtime-worker peerset: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648 to -2147483648. Reason: Slow peer after backoffs. Banned, disconnecting.
```

cc @paritytech/networking

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
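The backoff rule described in that PR summary can be sketched as a small state machine. This is a rough, self-contained illustration under the stated rule (60 s, then 120 s, then ban), not the actual `DisconnectedPeers` implementation; the function name is hypothetical.

```rust
use std::time::{Duration, Instant};

#[derive(Debug)]
struct DisconnectedPeerState {
    num_disconnects: u32,
    last_disconnect: Instant,
}

#[derive(Debug)]
enum Backoff {
    Wait(Duration),
    Ban,
}

// Called when a peer disconnects while we still have a request in flight to it.
fn on_disconnect_with_request_in_flight(state: &mut DisconnectedPeerState) -> Backoff {
    state.num_disconnects += 1;
    state.last_disconnect = Instant::now();
    match state.num_disconnects {
        1 => Backoff::Wait(Duration::from_secs(60)),   // first disconnection
        2 => Backoff::Wait(Duration::from_secs(120)),  // second disconnection
        _ => Backoff::Ban,                             // third: ban until reputation decays
    }
}

fn main() {
    let mut state = DisconnectedPeerState { num_disconnects: 0, last_disconnect: Instant::now() };
    for _ in 0..3 {
        let action = on_disconnect_with_request_in_flight(&mut state);
        println!("peer disconnected with a request in flight: {action:?}, state: {state:?}");
    }
}
```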
Description of bug
Let's say node A is trying to sync from B. The block requests from A are not protected against sending the same block request multiple times, which can cause B to mark A as a bad node for sending the same requests over and over again and then send back an error result, which will be considered as `Refused` on side A. From node_A.log, you can see the same block request was sent dozens of times.

If I read it correctly, once the request has been processed for the first time, after 2 more identical requests the reputation will change and `Err(())` will be returned; on the side of node A, `Err(())` will be converted to `RequestFailure::Refused`. Those two nodes will disconnect from each other afterwards.

https://github.com/paritytech/substrate/blob/3c99545cb062fbc56b84e2f3e44def3b933123cd/client/network/sync/src/block_request_handler.rs#L216
https://github.com/paritytech/substrate/blob/3c99545cb062fbc56b84e2f3e44def3b933123cd/client/network/src/request_responses.rs#L662
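To make the reported flow concrete, here is a hedged, simplified sketch of the idea. The names `BlockRequestKey`, `SeenRequests`, and `MAX_SAME_REQUESTS` are illustrative, not the actual identifiers in `block_request_handler.rs`: the handler answers the first occurrence of a request, tolerates a couple of identical repeats, and then returns `Err(())`, which the requesting side observes as `RequestFailure::Refused`.

```rust
use std::collections::HashMap;

// Illustrative limit: how many identical repeats of an already handled
// request are tolerated before refusing (assumption, not the real constant).
const MAX_SAME_REQUESTS: u32 = 2;

#[derive(Hash, PartialEq, Eq, Clone, Debug)]
struct BlockRequestKey {
    peer: String,
    start_block: u64,
    max_blocks: u32,
}

#[derive(Debug)]
enum RequestFailure {
    Refused,
}

#[derive(Default)]
struct SeenRequests(HashMap<BlockRequestKey, u32>);

impl SeenRequests {
    // Returns Err(()) once the identical request has been seen more than
    // 1 (original) + MAX_SAME_REQUESTS (tolerated repeats) times; the real
    // handler would also lower the peer's reputation at this point.
    fn handle(&mut self, key: BlockRequestKey) -> Result<(), ()> {
        let count = self.0.entry(key).or_insert(0);
        *count += 1;
        if *count > MAX_SAME_REQUESTS + 1 {
            Err(())
        } else {
            Ok(())
        }
    }
}

fn main() {
    let mut seen = SeenRequests::default();
    let key = BlockRequestKey { peer: "node_A".into(), start_block: 12309, max_blocks: 64 };
    for attempt in 1..=5 {
        // On the requesting side, Err(()) surfaces as RequestFailure::Refused.
        let outcome = seen.handle(key.clone()).map_err(|_| RequestFailure::Refused);
        println!("attempt {attempt}: {outcome:?}");
    }
}
```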
Steps to reproduce
No response