This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Rejected nodes keep reconnecting without delay #13778

Closed
2 tasks done
dmitry-markin opened this issue Mar 31, 2023 · 8 comments · May be fixed by paritytech/polkadot-sdk#1563
Labels
I3-bug The node fails to follow expected behavior. U2-some_time_soon Issue is worth doing soon.

Comments

@dmitry-markin
Contributor

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

If the local node doesn't have incoming slots available and rejects incoming connections, the rejected nodes reconnect without any delay, not respecting the back-off interval.

The ping time measured for the reconnecting nodes is interrupted at the reconnection interval, which supports the hypothesis that the remote node is in fact reconnecting to us, rather than the local node misinterpreting network events and reporting them as reconnection attempts.

What complicates debugging is that it's not easy to observe the issue from the perspective of the reconnecting node. All attempts to reproduce the issue with two nodes connecting to each other have failed so far. The connecting node either removes the peer it's being rejected by from its peer set and stops any subsequent connection attempts, or respects the back-off interval if the rejecting node is marked as reserved.

The issue was first mentioned in https://github.com/paritytech/infrastructure-alerts/issues/26#issuecomment-1411988935.

CC @altonen @melekes

Steps to reproduce

Run polkadot with -l sync=debug. Once all the incoming slots are filled up, the logs contain plenty of "Too many full nodes, rejecting" messages with the same peer IDs:

[Screenshot: "Too many full nodes, rejecting" log messages]

@dmitry-markin
Contributor Author

dmitry-markin commented Mar 31, 2023

The disconnect happens on the local side here:

if status.roles.is_full() &&
	self.chain_sync.num_peers() >=
		self.default_peers_set_num_full +
			self.default_peers_set_no_slot_connected_peers.len() +
			this_peer_reserved_slot
{
	log::debug!(target: "sync", "Too many full nodes, rejecting {}", who);
	return Err(())
}

So, to debug the issue locally, it's possible to reject a specific peer by PeerId in a similar way.
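
A minimal sketch of what such a targeted rejection could look like, placed next to the check above; the hard-coded peer id is only an example taken from the logs later in this thread, and the whole snippet is a hypothetical debugging hack rather than part of the code base:

// Hypothetical debugging hack: unconditionally reject one specific peer so the
// reconnect behaviour can be observed locally. The peer id is a placeholder.
let debug_rejected_peer: PeerId = "12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3"
	.parse()
	.expect("valid peer id");

if who == debug_rejected_peer {
	log::debug!(target: "sync", "Debug: rejecting {} unconditionally", who);
	return Err(())
}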

@dmitry-markin dmitry-markin moved this to In Progress 🛠 in Networking Mar 31, 2023
@dmitry-markin dmitry-markin added I3-bug The node fails to follow expected behavior. U2-some_time_soon Issue is worth doing soon. labels Mar 31, 2023
@dmitry-markin
Contributor Author

dmitry-markin commented Mar 31, 2023

One possible explanation would be that a remote node thinks it has more than one connection with us, and does not insert a back-off record here:

} else if !connections.iter().any(|(_, s)| {
	matches!(s, ConnectionState::Opening | ConnectionState::Open(_))
}) {
	trace!(target: "sub-libp2p", "PSM <= Dropped({}, {:?})", peer_id, set_id);
	self.peerset.dropped(set_id, peer_id, DropReason::Unknown);
	*entry.get_mut() =
		PeerState::Disabled { connections, backoff_until: None };
} else {

I've checked that the rejecting node observes only one connection with the remote node, though.
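
For comparison, a rough sketch of what inserting a back-off record at that point would look like; the five-second duration below is an arbitrary placeholder, not the constant the code base actually uses:

// Hypothetical variant of the branch above: remember a back-off so the peer is
// not re-dialed immediately after being dropped. The duration is illustrative only.
let backoff_until = std::time::Instant::now() + std::time::Duration::from_secs(5);
*entry.get_mut() =
	PeerState::Disabled { connections, backoff_until: Some(backoff_until) };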

@altonen
Contributor

altonen commented Apr 5, 2023

@dmitry-markin

Could you add your findings regarding this bug here? I can take a look at some point if I'm able to make any sense of this.

@dmitry-markin
Contributor Author

> Could you add your findings regarding this bug here? I can take a look at some point if I'm able to make any sense of this.

There is not much to add apart from what I shared above in the description and comments. It looks like the peers in fact reconnect as soon as they are disconnected on the default peerset. The only place where the ban is not installed after receiving a disconnect event is highlighted in #13778 (comment), but I have not found how the state machine could end up in that state.

@altonen
Contributor

altonen commented Apr 6, 2023

I think I know what the underlying issue is. As reported in paritytech/polkadot-sdk#512, Peerset is not aware that some of the num_in slots are reserved for light clients, which causes it to accept connections from full nodes that then get rejected in SyncingEngine. This is the state of Peerset, and as can be seen, there are far more accepted inbound connections than there should be (48 vs 32):

[Screenshot: Peerset state dump]

By providing --in-peers-light 0, the constant inbound connection attempts stop. The node still sometimes gets redials from other peers but this time there is a proper back-off between the attempts. Setting --in-peers-light to 0 is obviously not a fix for this issue because we want to allow inbound light client connections.
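
To make the suspected accounting mismatch concrete, here is a rough sketch of the arithmetic; the 32/16 split is an assumption chosen to match the 48 vs 32 observation above, not the node's actual defaults:

// Illustration only: Peerset sees a single pool of inbound slots, while
// SyncingEngine enforces a smaller limit for full nodes.
let in_peers_full: u32 = 32; // inbound slots intended for full nodes (assumed)
let in_peers_light: u32 = 16; // inbound slots intended for light clients (assumed)

// Peerset keeps accepting inbound full nodes up to the combined limit...
let peerset_inbound_limit = in_peers_full + in_peers_light; // 48

// ...but SyncingEngine rejects every full node beyond its own limit.
let syncing_engine_limit = in_peers_full; // 32

assert!(peerset_inbound_limit > syncing_engine_limit);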

I'm not entirely sure what causes the remote peer to open another substream right away, but the state machine is probably getting confused somewhere by the substream being accepted and then suddenly disconnected. It could be that the handshake of either substream is not properly finished if it's closed before being fully negotiated. I don't want to burn too much time on this, so I'll mark it as superseded by paritytech/polkadot-sdk#512.

@altonen altonen closed this as completed Apr 6, 2023
@github-project-automation github-project-automation bot moved this from In Progress 🛠 to Blocked ⛔️ in Networking Apr 6, 2023
@altonen altonen moved this from Blocked ⛔️ to Done ✅ in Networking Apr 6, 2023
@altonen
Contributor

altonen commented Aug 19, 2023

I'm able to reproduce this locally for the misbehaving node with a custom notification protocol. Here's an excerpt from the logs:

2023-08-19T11:34:44.948 TRACE sub-libp2p: External API => Disconnect(12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3, SetId(0))
2023-08-19T11:34:44.948 TRACE sub-libp2p: PSM <= Dropped(12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3, SetId(0))
2023-08-19T11:34:44.948 TRACE sub-libp2p: External API <= Closed(12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3, SetId(0))    
2023-08-19T11:34:44.948 TRACE sub-libp2p: Handler(PeerId("12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3"), ConnectionId(1)) <= Close(SetId(0))    
2023-08-19T11:34:44.949 TRACE peerset: Peer 12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3 (Outbound) dropped from SetId(0).    
2023-08-19T11:34:44.949 TRACE peerset: Peer 12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3 disconnected, reputation: -256 to -256    
2023-08-19T11:34:44.949 TRACE sub-libp2p: Handler(12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3, ConnectionId(1)) => CloseResult(SetId(0))    
2023-08-19T11:34:45.818 TRACE peerset: Connecting to 12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3 on SetId(0) (1/1 num_out/max_out).    
2023-08-19T11:34:45.818 TRACE sub-libp2p: PSM => Connect(12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3, SetId(0)): Enabling connections.    
2023-08-19T11:34:45.818 TRACE sub-libp2p: Handler(PeerId("12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3"), ConnectionId(1)) <= Open(SetId(0))    

It's caused by the code below: no back-off is applied to the outbound connection, even if it was just closed:

// The first element of `closed` is chosen to open the notifications substream.
if let Some((connec_id, connec_state)) =
	connections.iter_mut().find(|(_, s)| matches!(s, ConnectionState::Closed))
{
	trace!(target: "sub-libp2p", "PSM => Connect({}, {:?}): Enabling connections.",
		occ_entry.key().0, set_id);
	trace!(target: "sub-libp2p", "Handler({:?}, {:?}) <= Open({:?})", peer_id, *connec_id, set_id);
	self.events.push_back(ToSwarm::NotifyHandler {
		peer_id,
		handler: NotifyHandler::One(*connec_id),
		event: NotifsHandlerIn::Open { protocol_index: set_id.into() },
	});
	*connec_state = ConnectionState::Opening;
	*occ_entry.into_mut() = PeerState::Enabled { connections };
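
A rough sketch of the kind of guard that would avoid the immediate re-open; the backoff_until field on ConnectionState::Closed is invented here purely for illustration and does not exist in the current enum:

// Hypothetical guard: only pick a closed connection whose (assumed) back-off has
// already expired before enabling it again.
if let Some((connec_id, connec_state)) = connections.iter_mut().find(|(_, s)| {
	matches!(
		s,
		ConnectionState::Closed { backoff_until }
			if *backoff_until <= std::time::Instant::now()
	)
}) {
	// ...proceed to open the notifications substream as in the snippet above.
}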

@dmitry-markin
Contributor Author

It looks like the proper place to add back-off for outbound connections is ProtocolController. Otherwise we would end up in a state where ProtocolController provides outbound candidates to Notifications, Notifications immediately reports back to ProtocolController that these peers have "dropped" (if Notifications is the one implementing the back-off), and everything would repeat every second.
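
A sketch of the idea; every name below (the struct, its fields, and its methods) is hypothetical and only meant to show outbound candidates being filtered by a back-off kept inside ProtocolController:

// Hypothetical back-off bookkeeping for outbound candidates. None of these names
// are taken from the actual ProtocolController implementation.
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct OutboundBackoff {
	until: HashMap<libp2p::PeerId, Instant>,
	duration: Duration,
}

impl OutboundBackoff {
	/// Record that a peer was just dropped, so it is skipped for a while.
	fn on_dropped(&mut self, peer: libp2p::PeerId) {
		self.until.insert(peer, Instant::now() + self.duration);
	}

	/// `true` if the peer must not be offered as an outbound candidate yet.
	fn is_backed_off(&self, peer: &libp2p::PeerId) -> bool {
		self.until.get(peer).map_or(false, |t| *t > Instant::now())
	}
}

Outbound candidates produced by the controller would then simply be filtered with is_backed_off before being handed to Notifications.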

@altonen
Contributor

altonen commented Aug 23, 2023

I think so too and it relates to this issue as well: paritytech/polkadot-sdk#494
