This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Rejected nodes keep reconnecting without delay #13778

Closed
2 tasks done
dmitry-markin opened this issue Mar 31, 2023 · 8 comments · May be fixed by paritytech/polkadot-sdk#1563
Labels
I3-bug The node fails to follow expected behavior. U2-some_time_soon Issue is worth doing soon.

Comments

@dmitry-markin
Contributor

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

If the local node doesn't have incoming slots available and rejects incoming connections, the rejected nodes reconnect without any delay, not respecting the back-off interval.

The ping time measured for the reconnecting nodes is interrupted at the reconnection interval, which supports the hypothesis that the remote node is in fact reconnecting to us, rather than the local node misinterpreting network events and reporting them as reconnection attempts.

What complicates debugging is that it's not easy to observe the issue from the perspective of the reconnecting node. All attempts to reproduce the issue with two nodes connecting to each other have failed so far. The connecting node either removes the peer it's being rejected by from its peer set and stops any subsequent connection attempts, or respects the back-off interval if the rejecting node is marked as reserved.

The issue was first mentioned in https://github.com/paritytech/infrastructure-alerts/issues/26#issuecomment-1411988935.

CC @altonen @melekes

Steps to reproduce

Run polkadot with -l sync=debug. Once all the incoming slots are filled up, the logs contain plenty of "Too many full nodes, rejecting" messages with the same peer IDs:

[Screenshot: "Too many full nodes, rejecting" log messages]

@dmitry-markin
Contributor Author

dmitry-markin commented Mar 31, 2023

The disconnect happens on the local side here:

if status.roles.is_full() &&
	self.chain_sync.num_peers() >=
		self.default_peers_set_num_full +
			self.default_peers_set_no_slot_connected_peers.len() +
			this_peer_reserved_slot
{
	log::debug!(target: "sync", "Too many full nodes, rejecting {}", who);
	return Err(())
}

So, to debug the issue locally, it's possible to reject a specific peer by PeerId in a similar way.
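
A minimal sketch of what such a targeted rejection could look like, placed next to the check above; the hard-coded peer id is only an example taken from the logs later in this thread, and the whole snippet is a hypothetical debugging hack rather than part of the code base:

// Hypothetical debugging hack: unconditionally reject one specific peer so the
// reconnect behaviour can be observed locally. The peer id is a placeholder.
let debug_rejected_peer: PeerId = "12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3"
	.parse()
	.expect("valid peer id");

if who == debug_rejected_peer {
	log::debug!(target: "sync", "Debug: rejecting {} unconditionally", who);
	return Err(())
}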

@dmitry-markin dmitry-markin moved this to In Progress 🛠 in Networking Mar 31, 2023
@dmitry-markin dmitry-markin added I3-bug The node fails to follow expected behavior. U2-some_time_soon Issue is worth doing soon. labels Mar 31, 2023
@dmitry-markin
Contributor Author

dmitry-markin commented Mar 31, 2023

One possible explanation would be that a remote node thinks it has more than one connection with us, and does not insert a back-off record here:

} else if !connections.iter().any(|(_, s)| {
	matches!(s, ConnectionState::Opening | ConnectionState::Open(_))
}) {
	trace!(target: "sub-libp2p", "PSM <= Dropped({}, {:?})", peer_id, set_id);
	self.peerset.dropped(set_id, peer_id, DropReason::Unknown);
	*entry.get_mut() =
		PeerState::Disabled { connections, backoff_until: None };
} else {

I've checked that the rejecting node observes only one connection with the remote node, though.
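
For comparison, a rough sketch of what inserting a back-off record at that point would look like; the five-second duration below is an arbitrary placeholder, not the constant the code base actually uses:

// Hypothetical variant of the branch above: remember a back-off so the peer is
// not re-dialed immediately after being dropped. The duration is illustrative only.
let backoff_until = std::time::Instant::now() + std::time::Duration::from_secs(5);
*entry.get_mut() =
	PeerState::Disabled { connections, backoff_until: Some(backoff_until) };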

@altonen
Contributor

altonen commented Apr 5, 2023

@dmitry-markin

Could you add your findings regarding this bug here? I can take a look at some point if I'm able to make any sense of this.

@dmitry-markin
Contributor Author

> Could you add your findings regarding this bug here? I can take a look at some point if I'm able to make any sense of this.

There is not much to add apart from what I shared above in the description and comments. It looks like the peers in fact reconnect as soon as they are disconnected on the default peerset. The only place where the ban is not installed after receiving a disconnect event is highlighted in #13778 (comment), but I have not found how the state machine could end up in that state.

@altonen
Contributor

altonen commented Apr 6, 2023

I think I know what the underlying issue is. As reported in paritytech/polkadot-sdk#512, Peerset is not aware that some of the num_in slots are reserved for light clients, which causes it to accept connections from full nodes that then get rejected in SyncingEngine. This is the state of Peerset, and as can be seen, there are far more accepted inbound connections than there should be (48 vs 32):

[Screenshot: Peerset state dump]

By providing --in-peers-light 0, the constant inbound connection attempts stop. The node still sometimes gets redials from other peers but this time there is a proper back-off between the attempts. Setting --in-peers-light to 0 is obviously not a fix for this issue because we want to allow inbound light client connections.
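
To make the suspected accounting mismatch concrete, here is a rough sketch of the arithmetic; the 32/16 split is an assumption chosen to match the 48 vs 32 observation above, not the node's actual defaults:

// Illustration only: Peerset sees a single pool of inbound slots, while
// SyncingEngine enforces a smaller limit for full nodes.
let in_peers_full: u32 = 32; // inbound slots intended for full nodes (assumed)
let in_peers_light: u32 = 16; // inbound slots intended for light clients (assumed)

// Peerset keeps accepting inbound full nodes up to the combined limit...
let peerset_inbound_limit = in_peers_full + in_peers_light; // 48

// ...but SyncingEngine rejects every full node beyond its own limit.
let syncing_engine_limit = in_peers_full; // 32

assert!(peerset_inbound_limit > syncing_engine_limit);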

I'm not entirely sure what causes the remote peer to open another substream right away, but the state machine is probably getting confused somewhere by the substream being accepted and then suddenly disconnected. It could be that the handshake of either substream is not properly finished if it's closed before being fully negotiated. I don't want to burn too much time on this, so I'll mark it as superseded by paritytech/polkadot-sdk#512.

@altonen altonen closed this as completed Apr 6, 2023
@github-project-automation github-project-automation bot moved this from In Progress 🛠 to Blocked ⛔️ in Networking Apr 6, 2023
@altonen altonen moved this from Blocked ⛔️ to Done ✅ in Networking Apr 6, 2023
@altonen
Contributor

altonen commented Aug 19, 2023

I'm able to reproduce this locally for the misbehaving node with a custom notification protocol. Here's an excerpt from the logs:

2023-08-19T11:34:44.948 TRACE sub-libp2p: External API => Disconnect(12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3, SetId(0))
2023-08-19T11:34:44.948 TRACE sub-libp2p: PSM <= Dropped(12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3, SetId(0))
2023-08-19T11:34:44.948 TRACE sub-libp2p: External API <= Closed(12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3, SetId(0))    
2023-08-19T11:34:44.948 TRACE sub-libp2p: Handler(PeerId("12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3"), ConnectionId(1)) <= Close(SetId(0))    
2023-08-19T11:34:44.949 TRACE peerset: Peer 12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3 (Outbound) dropped from SetId(0).    
2023-08-19T11:34:44.949 TRACE peerset: Peer 12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3 disconnected, reputation: -256 to -256    
2023-08-19T11:34:44.949 TRACE sub-libp2p: Handler(12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3, ConnectionId(1)) => CloseResult(SetId(0))    
2023-08-19T11:34:45.818 TRACE peerset: Connecting to 12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3 on SetId(0) (1/1 num_out/max_out).    
2023-08-19T11:34:45.818 TRACE sub-libp2p: PSM => Connect(12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3, SetId(0)): Enabling connections.    
2023-08-19T11:34:45.818 TRACE sub-libp2p: Handler(PeerId("12D3KooWKKagT5BccuPpumRpnX4kWrqM13aFbuiBDqUZ1i44bRz3"), ConnectionId(1)) <= Open(SetId(0))    

It's caused by the code below: no back-off is applied to the outbound connection, even if it was just closed:

// The first element of `closed` is chosen to open the notifications substream.
if let Some((connec_id, connec_state)) =
	connections.iter_mut().find(|(_, s)| matches!(s, ConnectionState::Closed))
{
	trace!(target: "sub-libp2p", "PSM => Connect({}, {:?}): Enabling connections.",
		occ_entry.key().0, set_id);
	trace!(target: "sub-libp2p", "Handler({:?}, {:?}) <= Open({:?})", peer_id, *connec_id, set_id);
	self.events.push_back(ToSwarm::NotifyHandler {
		peer_id,
		handler: NotifyHandler::One(*connec_id),
		event: NotifsHandlerIn::Open { protocol_index: set_id.into() },
	});
	*connec_state = ConnectionState::Opening;
	*occ_entry.into_mut() = PeerState::Enabled { connections };
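
A rough sketch of the kind of guard that would avoid the immediate re-open; the backoff_until field on ConnectionState::Closed is invented here purely for illustration and does not exist in the current enum:

// Hypothetical guard: only pick a closed connection whose (assumed) back-off has
// already expired before enabling it again.
if let Some((connec_id, connec_state)) = connections.iter_mut().find(|(_, s)| {
	matches!(
		s,
		ConnectionState::Closed { backoff_until }
			if *backoff_until <= std::time::Instant::now()
	)
}) {
	// ...proceed to open the notifications substream as in the snippet above.
}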

@dmitry-markin
Contributor Author

It looks like the proper place to add back-off for outbound connections is ProtocolController. Otherwise we would end up in a state where ProtocolController provides outbound candidates to Notifications, Notifications immediately reports back to ProtocolController that these peers have "dropped" (if Notifications is the one implementing the back-off), and everything would repeat every second.
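
A sketch of the idea; every name below (the struct, its fields, and its methods) is hypothetical and only meant to show outbound candidates being filtered by a back-off kept inside ProtocolController:

// Hypothetical back-off bookkeeping for outbound candidates. None of these names
// are taken from the actual ProtocolController implementation.
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct OutboundBackoff {
	until: HashMap<libp2p::PeerId, Instant>,
	duration: Duration,
}

impl OutboundBackoff {
	/// Record that a peer was just dropped, so it is skipped for a while.
	fn on_dropped(&mut self, peer: libp2p::PeerId) {
		self.until.insert(peer, Instant::now() + self.duration);
	}

	/// `true` if the peer must not be offered as an outbound candidate yet.
	fn is_backed_off(&self, peer: &libp2p::PeerId) -> bool {
		self.until.get(peer).map_or(false, |t| *t > Instant::now())
	}
}

Outbound candidates produced by the controller would then simply be filtered with is_backed_off before being handed to Notifications.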

@altonen
Contributor

altonen commented Aug 23, 2023

I think so too and it relates to this issue as well: paritytech/polkadot-sdk#494
