Rejected nodes keep reconnecting without delay #13778
Comments
The disconnect happens on the local side here: substrate/client/network/sync/src/engine.rs, lines 810 to 818 in 4bf67fb.
So, it's possible to reject a specific peer by […]
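For context, here is a minimal sketch of what that rejection path does. Every name below (`SyncEngine`, `on_sync_peer_connected`, `num_in_peers`, `max_in_peers`) is an assumption for illustration, not the actual engine.rs code:

```rust
// Hypothetical sketch of the sync engine's inbound-slot check; all names
// are assumptions for illustration, not the real engine.rs code.
struct SyncEngine {
    num_in_peers: usize,
    max_in_peers: usize,
}

impl SyncEngine {
    /// Decide whether a newly connected full node may take an inbound slot.
    fn on_sync_peer_connected(&mut self, inbound: bool, is_full: bool) -> Result<(), ()> {
        if inbound && is_full && self.num_in_peers >= self.max_in_peers {
            // All inbound slots are taken: reject, which disconnects the
            // peer on the local side.
            log::debug!(target: "sync", "Too many full nodes, rejecting");
            return Err(());
        }
        if inbound {
            self.num_in_peers += 1;
        }
        Ok(())
    }
}
```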
One possible explanation would be that a remote node thinks it has more than one connection with us, and so does not insert a back-off record here: substrate/client/network/src/protocol/notifications/behaviour.rs, lines 1323 to 1331 in 4bf67fb.
I've checked that the rejecting node observes only one connection with the remote one, though.
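To make the suspected gap concrete, here is a hedged sketch of that close-handling logic, with invented stand-ins for the real behaviour.rs types (`Behaviour`, `PeerState` and their fields are all assumptions): the back-off record is only inserted when the closed connection is believed to be the last one, so a stale connection count would skip the ban entirely.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Minimal stand-ins for the real types (assumptions, for illustration only).
type PeerId = u64;
type ConnectionId = u64;

enum PeerState {
    Enabled { connections: Vec<ConnectionId> },
    Backoff { until: Instant },
}

struct Behaviour {
    peers: HashMap<PeerId, PeerState>,
    backoff_duration: Duration,
}

impl Behaviour {
    /// Called when one connection with `peer_id` is closed.
    fn on_connection_closed(&mut self, peer_id: PeerId, connection_id: ConnectionId) {
        if let Some(PeerState::Enabled { mut connections }) = self.peers.remove(&peer_id) {
            connections.retain(|id| *id != connection_id);
            if connections.is_empty() {
                // Last connection closed: insert the back-off record.
                let until = Instant::now() + self.backoff_duration;
                self.peers.insert(peer_id, PeerState::Backoff { until });
            } else {
                // The node believes other connections are still open, so no
                // back-off record is inserted: the suspected hole.
                self.peers.insert(peer_id, PeerState::Enabled { connections });
            }
        }
    }
}
```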
Could you add your findings regarding this bug here? I can take a look at some point if I'm able to make any sense of this.
There is not much to add apart from what I shared above in the description and comments. It looks like the peers do in fact reconnect as soon as they are disconnected on the default peerset. The only place where the ban is not installed after receiving a disconnect event is highlighted in #13778 (comment), but I have not found how the state machine could end up in that state.
I think I know what the underlying issue is. As reported in paritytech/polkadot-sdk#512, […]. By providing […]. I'm not entirely sure what happens that causes the remote peer to open another substream right away, but the state machine is probably getting confused somewhere about a substream being accepted and then suddenly disconnected. It could be that the handshake of either substream is not properly finished if it's closed before being fully negotiated. I don't want to burn too much time on this, so I'll mark it as superseded by paritytech/polkadot-sdk#512.
I'm able to reproduce this locally for the misbehaving node with a custom notification protocol. Here's an excerpt from the logs: […]
It's caused by this code, because no back-off is applied to the outbound connection even if it was just closed: substrate/client/network/src/protocol/notifications/behaviour.rs, lines 628 to 641 in 94be94b
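Reusing the hypothetical `Behaviour`/`PeerState` types from the sketch above, the dial path this points at would look roughly like the following; an assumption-laden illustration, not the actual lines 628 to 641:

```rust
impl Behaviour {
    /// Hypothetical handler for a peerset request to open a substream.
    fn on_peerset_connect(&mut self, peer_id: PeerId) {
        // Honor an existing back-off record, if any.
        let backing_off = matches!(
            self.peers.get(&peer_id),
            Some(PeerState::Backoff { until }) if *until > Instant::now()
        );
        if backing_off {
            // Defer: dial again once the back-off timer expires.
        } else {
            // A just-closed outbound connection leaves no Backoff record
            // behind, so the dial goes out immediately, with no back-off.
            self.dial(peer_id);
        }
    }

    fn dial(&mut self, _peer_id: PeerId) {
        // Placeholder for the actual libp2p dial request.
    }
}
```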
It looks like the proper place to add back-off for outbound connections is […]
I think so too, and it relates to this issue as well: paritytech/polkadot-sdk#494
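Concretely, the fix being discussed could look something like this sketch (still using the hypothetical types from above; where exactly such a hook would live in behaviour.rs is the open question):

```rust
impl Behaviour {
    /// Hypothetical fix: record a back-off when an outbound connection
    /// closes, mirroring what already happens for rejected inbound ones.
    fn on_outbound_closed(&mut self, peer_id: PeerId) {
        let until = Instant::now() + self.backoff_duration;
        self.peers.insert(peer_id, PeerState::Backoff { until });
    }
}
```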
Is there an existing issue?
Experiencing problems? Have you tried our Stack Exchange first?
Description of bug
If the local node doesn't have incoming slots available and rejects incoming connections, the rejected nodes reconnect without delay, not respecting any back-off interval.
The ping time with the reconnecting nodes changes with the reconnection interval, which supports the hypothesis that this is in fact a remote node reconnecting with us, and not a local misinterpretation of network events being reported as reconnection attempts.
What complicates debugging is that it's not easy to observe the issue from the perspective of the reconnecting node. All attempts to reproduce the issue with two nodes connecting to each other have failed so far. The connecting node either removes the node it's being rejected by from its peer set and stops any subsequent connection attempt, or respects the back-off interval if the rejecting node is marked as reserved.
The issue was first mentioned in https://github.com/paritytech/infrastructure-alerts/issues/26#issuecomment-1411988935.
CC @altonen @melekes
Steps to reproduce
Run `polkadot` with `-l sync=debug`, and once all the incoming slots are filled up there will be plenty of `Too many full nodes, rejecting` messages in the logs with the same peer IDs.