A larger data payload in the response of a block request can't be sent back due to KeepAliveTimeout and then peer disconnected #12105

liuchengxu · 2022-08-25T04:25:40Z

Is there an existing issue?

I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

This is not a support question.

Description of bug

When a node requests a chunk of blocks from another node, if these blocks happen to be huge, the block request is successfully handled on the server node but failed to be sent back to the requester due to the error KeepAliveTimeout

2022-08-24 17:33:04.394 DEBUG tokio-runtime-worker libp2p_swarm: [PrimaryChain] Connection closed with error KeepAliveTimeout: Connected { endpoint: Listener { local_addr: "/ip4/192.168.1.2/tcp/30333", send_back_addr: "/ip4/85.41.34.50/tcp/55227" }, peer_id: PeerId("12D3KooWG87xHK71UEKkw4pLCAkaa2C4vCnAsk5hVYD6dzQdua6G") }; Total (peer): 0.    
2022-08-24 17:33:04.394 DEBUG tokio-runtime-worker sub-libp2p: [PrimaryChain] Libp2p => Disconnected(PeerId("12D3KooWG87xHK71UEKkw4pLCAkaa2C4vCnAsk5hVYD6dzQdua6G"), Some(KeepAliveTimeout))

Here is the server node log, we can see the byte size of the first request result is 299402, which is sent successfully, the second request result size is 10534306 which failed to be sent back. The reason is that there are some data store transactions after block 64, causing the block size to increase significantly.

2022-08-24 17:28:48.407 DEBUG tokio-runtime-worker libp2p_dns: [PrimaryChain] Dialing /ip4/172.19.0.1/tcp/21000/p2p/12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS    
2022-08-24 17:28:52.713 DEBUG tokio-runtime-worker libp2p_swarm: [PrimaryChain] Connection established: PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS") Listener { local_addr: "/ip4/192.168.1.2/tcp/30333", send_back_addr: "/ip4/125.121.39.96/tcp/34940" }; Total (peer): 1. Total non-banned (peer): 1    
2022-08-24 17:28:52.713 DEBUG tokio-runtime-worker sub-libp2p: [PrimaryChain] Libp2p => Connected(PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS"))    
2022-08-24 17:28:53.318 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping sent to PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:28:53.319 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping received from PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:28:53.930 DEBUG tokio-runtime-worker sync: [PrimaryChain] Connected 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS    
2022-08-24 17:28:54.983 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: Starting at `BlockId::Number(64)` with maximum blocks of `64`, reputation_change: `None`, small_request `false`, direction `Descending` and attributes `HEADER | BODY | JUSTIFICATION`.    
2022-08-24 17:28:55.054 DEBUG tokio-runtime-worker sync: [PrimaryChain] Sending result of block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS starting at `BlockId::Number(64)`: blocks: Some(64), data: Some(299402)    
2022-08-24 17:28:55.055 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS.    
2022-08-24 17:28:57.831 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: Starting at `BlockId::Number(128)` with maximum blocks of `64`, reputation_change: `None`, small_request `false`, direction `Descending` and attributes `HEADER | BODY | JUSTIFICATION`.    
2022-08-24 17:28:57.837 DEBUG tokio-runtime-worker sync: [PrimaryChain] Sending result of block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS starting at `BlockId::Number(128)`: blocks: Some(3), data: Some(10534306)    
2022-08-24 17:28:57.840 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS.    
2022-08-24 17:29:08.407 DEBUG tokio-runtime-worker libp2p_kad::behaviour: [PrimaryChain] Address '/ip4/172.19.0.1/tcp/21000/p2p/12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS' removed from peer '12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS' due to error.    
2022-08-24 17:29:08.407 DEBUG tokio-runtime-worker libp2p_swarm: [PrimaryChain] Connection attempt to PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS") failed with Transport([("/ip4/172.19.0.1/tcp/21000/p2p/12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS", Other(Custom { kind: Other, error: Timeout }))]).    
2022-08-24 17:29:09.122 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping received from PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:29:11.989 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping sent to PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:29:26.115 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: Starting at `BlockId::Number(64)` with maximum blocks of `1`, reputation_change: `None`, small_request `true`, direction `Ascending` and attributes `HEADER | JUSTIFICATION`.    
2022-08-24 17:29:26.115 DEBUG tokio-runtime-worker sync: [PrimaryChain] Sending result of block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS starting at `BlockId::Number(64)`: blocks: Some(1), data: Some(4664)    
2022-08-24 17:29:26.115 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS.    
2022-08-24 17:29:26.760 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: Starting at `BlockId::Number(128)` with maximum blocks of `64`, reputation_change: `None`, small_request `false`, direction `Descending` and attributes `HEADER | BODY | JUSTIFICATION`.    
2022-08-24 17:29:26.767 DEBUG tokio-runtime-worker sync: [PrimaryChain] Sending result of block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS starting at `BlockId::Number(128)`: blocks: Some(3), data: Some(10534306)    
2022-08-24 17:29:26.770 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS.    
2022-08-24 17:29:26.989 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping received from PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:29:41.875 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping sent to PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:29:56.875 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping received from PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:30:01.218 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping sent to PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:30:01.721 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: Starting at `BlockId::Number(64)` with maximum blocks of `1`, reputation_change: `None`, small_request `true`, direction `Ascending` and attributes `HEADER | JUSTIFICATION`.    
2022-08-24 17:30:01.721 DEBUG tokio-runtime-worker sync: [PrimaryChain] Sending result of block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS starting at `BlockId::Number(64)`: blocks: Some(1), data: Some(4664)    
2022-08-24 17:30:01.721 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS.    
2022-08-24 17:30:02.747 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: Starting at `BlockId::Number(128)` with maximum blocks of `64`, reputation_change: `Some(ReputationChange { value: -2147483648, reason: "Same block request multiple times" })`, small_request `false`, direction `Descending` and attributes `HEADER | BODY | JUSTIFICATION`.    
2022-08-24 17:30:02.747 DEBUG tokio-runtime-worker sync: [PrimaryChain] Sending result of block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS starting at `BlockId::Number(128)`: blocks: None, data: None    
2022-08-24 17:30:02.747 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS.    
2022-08-24 17:30:02.747 DEBUG tokio-runtime-worker peerset: [PrimaryChain] Report

With interior networking and large blocks which are normal for a storage chain, this issue can occur quite often. I see a few ways to help this:

Allow increasing the timeout, both request and response. If a chain can have a lot of large blocks, a larger timeout should be used.
Send back just one block of data if the data result is huge. 10534306 is actually for 3 blocks, reducing it to one will help a lot. Furthermore, a chain has a nature block size limit, once configurable, the timeout may be adjusted based on this value.

I think paritytech/polkadot-sdk#531 is related, after Timeout, we see a flood of duplicated block requests, maybe we can also do something to improve that.

Steps to reproduce

No response

The text was updated successfully, but these errors were encountered:

bkchr · 2022-08-25T08:17:41Z

CC @arkpar

arkpar · 2022-08-25T09:08:50Z

@tomaka Could you pleas clarify when exactly KeepAliveTimeout is generated? Documentation says that the connection will be closed after being "idle" for specified time. Is it considered idle when the request is in progress? Here we hardcode keep-alive timeout to 10 seconds.

Also, does KeepAliveTimeout handling mean that the peer disconnects from the sync protocol as well?

@liuchengxu how fast is the network speed? Is it at least 100mbps?

tomaka · 2022-08-25T09:38:51Z

The keep-alive timeout only starts after there is no activity on a connection anymore. It's the mechanism that we use to kill inactive connections.

Each protocol decides what "no activity" means. In our situation, it means no notifications protocols open and no request (in the request/response system) in progress.

Looking at the issue, the logs are cropped, and I don't see any hint of the fact that the problem could be caused by the KeepAliveTimeout error.

arkpar · 2022-08-25T09:59:21Z

@liuchengxu Could you please full logs for both nodes for the same period of time.

nazar-pc · 2022-08-25T10:27:08Z

Here is logs of the node that Liu-Cheng's node tried to sync from (we had them added as reserved nodes of each other with --reserved-only): https://nextcloud.mokrynskyi.com/s/xNtYpAsZTFY8YN7
This is with debug log level for everything, thus quite big.

liuchengxu · 2022-08-25T10:44:56Z

Thanks Nazar.

@arkpar For the convient review, here is the server node log: https://files.slack.com/files-pri/T03LJ85UR5G-F03UZ31SHC3/untitled.txt, the client node log: https://files.slack.com/files-pri/T03LJ85UR5G-F03V6JZ7HQC/client_node.txt Let me know if they are not sufficient. Both nodes are running from home, I doubt the network speed can reach 100mbps.

melekes · 2023-01-06T07:35:42Z

I think paritytech/polkadot-sdk#531 is related, after Timeout, we see a flood of duplicated block requests, maybe we can also do something to improve that.

yeah, duplicated block requests lead to us disconnecting from that node

2022-08-24 17:30:02.747 DEBUG tokio-runtime-worker peerset: [PrimaryChain] Report 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: -2147483648 to -214748
3648. Reason: Same block request multiple times, Disconnecting
2022-08-24 17:30:02.747  WARN tokio-runtime-worker sync: [PrimaryChain] Reserved peer 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS disconnected

melekes · 2023-01-06T07:50:47Z

To me this looks like this is a duplicate of paritytech/polkadot-sdk#531

nazar-pc · 2023-01-06T08:09:00Z

I'd say this one is caused by paritytech/polkadot-sdk#531, but they are not quite the same as other slow network data transfers can potentially cause this timeout too.

nazar-pc mentioned this issue Aug 30, 2022

Node is banned by other node during sync autonomys/subspace#391

Closed

arkpar mentioned this issue Sep 5, 2022

Enforce blocks response limit #12146

Merged

nazar-pc mentioned this issue Sep 13, 2022

Node keeps loosing peers. paritytech/polkadot-sdk#528

Open

dmitry-markin mentioned this issue Feb 27, 2023

Lost peers due to state-cache #10130

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A larger data payload in the response of a block request can't be sent back due to KeepAliveTimeout and then peer disconnected #12105

A larger data payload in the response of a block request can't be sent back due to KeepAliveTimeout and then peer disconnected #12105

liuchengxu commented Aug 25, 2022

bkchr commented Aug 25, 2022

arkpar commented Aug 25, 2022

tomaka commented Aug 25, 2022

arkpar commented Aug 25, 2022

nazar-pc commented Aug 25, 2022

liuchengxu commented Aug 25, 2022 •

edited

Loading

melekes commented Jan 6, 2023

melekes commented Jan 6, 2023

nazar-pc commented Jan 6, 2023

A larger data payload in the response of a block request can't be sent back due to KeepAliveTimeout and then peer disconnected #12105

A larger data payload in the response of a block request can't be sent back due to KeepAliveTimeout and then peer disconnected #12105

Comments

liuchengxu commented Aug 25, 2022

Is there an existing issue?

Experiencing problems? Have you tried our Stack Exchange first?

Description of bug

Steps to reproduce

bkchr commented Aug 25, 2022

arkpar commented Aug 25, 2022

tomaka commented Aug 25, 2022

arkpar commented Aug 25, 2022

nazar-pc commented Aug 25, 2022

liuchengxu commented Aug 25, 2022 • edited Loading

melekes commented Jan 6, 2023

melekes commented Jan 6, 2023

nazar-pc commented Jan 6, 2023

liuchengxu commented Aug 25, 2022 •

edited

Loading