
Peers all drops when a bad peer try to connect #3346

Closed
2 tasks done
crystalin opened this issue Feb 15, 2024 · 11 comments
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@crystalin

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

Running a regular node (Moonriver, v0.35.0, based on polkadot-sdk 1.3.0) that maintains 40 peers (configured as 20 in, 20 out): when a bad node connects with a bad genesis, all the peers get dropped:

2024-02-15 23:06:31 [🌗] 💤 Idle (40 peers), best: #6153831 (0xf3ff…06ed), finalized #6153829 (0xb749…a45a), ⬇ 12.5kiB/s ⬆ 12.5kiB/s
2024-02-15 23:06:36 [Relaychain] Received imported block via RPC: #21894430 (0xef28…3f0c -> 0x59b1…1fad)
2024-02-15 23:06:36 [Relaychain] Failed to handle incoming network message err=ImplicitViewFetchError(ProspectiveParachainsUnavailable)
2024-02-15 23:06:36 [Relaychain] Received imported block via RPC: #21894430 (0xef28…3f0c -> 0x992a…2263)
2024-02-15 23:06:36 [Relaychain] Failed to handle incoming network message err=ImplicitViewFetchError(ProspectiveParachainsUnavailable)
2024-02-15 23:06:36 [🌗] 💤 Idle (40 peers), best: #6153831 (0xf3ff…06ed), finalized #6153829 (0xb749…a45a), ⬇ 11.1kiB/s ⬆ 11.4kiB/s
2024-02-15 23:06:38 [Relaychain] Received finalized block via RPC: #21894427 (0xdeae…4c01 -> 0x15ab…8cd9)
2024-02-15 23:06:40 Accepting new connection 1/100
2024-02-15 23:06:41 [🌗] 💤 Idle (40 peers), best: #6153831 (0xf3ff…06ed), finalized #6153830 (0x31fd…5597), ⬇ 15.7kiB/s ⬆ 15.8kiB/s
2024-02-15 23:06:42 [Relaychain] Received imported block via RPC: #21894431 (0x992a…2263 -> 0x0eed…e67f)
2024-02-15 23:06:42 [Relaychain] Failed to handle incoming network message err=ImplicitViewFetchError(ProspectiveParachainsUnavailable)
2024-02-15 23:06:46 [🌗] 💤 Idle (40 peers), best: #6153831 (0xf3ff…06ed), finalized #6153830 (0x31fd…5597), ⬇ 4.5kiB/s ⬆ 4.7kiB/s
2024-02-15 23:06:48 [Relaychain] Received imported block via RPC: #21894432 (0x0eed…e67f -> 0xdc8a…7a4f)
2024-02-15 23:06:48 [Relaychain] Failed to handle incoming network message err=ImplicitViewFetchError(ProspectiveParachainsUnavailable)
2024-02-15 23:06:50 [🌗] Report 12D3KooWJi6QVo4oVJK7Yfs8tGaUmw7MezZmJxd8LFpfoB5neggu: -2147483648 to -2147483648. Reason: Genesis mismatch. Banned, disconnecting.
2024-02-15 23:06:50 Accepting new connection 1/100
2024-02-15 23:06:51 [Relaychain] Received finalized block via RPC: #21894428 (0x15ab…8cd9 -> 0xade3…59c3)
2024-02-15 23:06:51 [🌗] Report 12D3KooWEdyRSv9z43ZyATSxvVYtAbHBVBo2J68JnCYxfzGbHqrB: -2147483648 to -2147483648. Reason: Genesis mismatch. Banned, disconnecting.
2024-02-15 23:06:51 [🌗] 💤 Idle (0 peers), best: #6153831 (0xf3ff…06ed), finalized #6153830 (0x31fd…5597), ⬇ 0.9kiB/s ⬆ 1.0kiB/s
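
For context on the `-2147483648 to -2147483648` report in the log above: peer reputations are `i32` values, and a "fatal" reputation change is `i32::MIN` (-2147483648), so the peer's reputation drops to the minimum and it is banned and disconnected. The following is a minimal, self-contained sketch of that idea, not the actual `sc-network` code; the ban rule here is deliberately simplified.

```rust
// Minimal illustration (not the actual sc-network code) of why the log shows
// "Report ...: -2147483648 to -2147483648": a "fatal" reputation change is
// i32::MIN, so the reputation saturates at the minimum and the peer is banned.

#[derive(Debug, Clone, Copy)]
struct ReputationChange {
    value: i32,
    reason: &'static str,
}

impl ReputationChange {
    /// A change so large the peer is banned immediately (i32::MIN == -2147483648).
    const fn new_fatal(reason: &'static str) -> Self {
        Self { value: i32::MIN, reason }
    }
}

/// Apply a reputation change and report whether the peer is now banned.
/// The real ban rule uses a threshold; "== i32::MIN" is a simplification.
fn report_peer(reputation: &mut i32, change: ReputationChange) -> bool {
    *reputation = reputation.saturating_add(change.value);
    let banned = *reputation == i32::MIN;
    println!(
        "Report: {} to {}. Reason: {}.{}",
        change.value,
        *reputation,
        change.reason,
        if banned { " Banned, disconnecting." } else { "" }
    );
    banned
}

fn main() {
    let mut reputation = 0_i32;
    assert!(report_peer(&mut reputation, ReputationChange::new_fatal("Genesis mismatch")));
}
```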

Steps to reproduce

Running Moonbeam v0.35.0 on the moonriver chain will likely reproduce the issue.

@crystalin crystalin added I10-unconfirmed Issue might be valid, but it's not yet known. I2-bug The node fails to follow expected behavior. labels Feb 15, 2024
@crystalin crystalin changed the title from "Peers all drops when a bad peers try to connect" to "Peers all drops when a bad peer try to connect" Feb 15, 2024
@altonen altonen moved this to In Progress 🛠 in Networking Feb 16, 2024
@altonen
Contributor

altonen commented Feb 16, 2024

I will try to reproduce, but in the meantime, could you post sync=trace logs? We've seen a similar issue of the peer count dropping to zero on Kusama AssetHub, but it was not related to genesis mismatches. Based on that short snippet, it looks like the parachain best block is not progressing, which could indicate syncing has stalled; SyncingEngine notices this, evicts all peers, and tries to establish new connections.
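
To make that behaviour concrete, here is a hypothetical sketch of an "evict everyone on inactivity" engine. The names (`last_notification_io`, `SyncingEngineSketch`) and the 30-second threshold are illustrative, not the exact polkadot-sdk implementation.

```rust
use std::time::{Duration, Instant};

const INACTIVITY_EVICT_THRESHOLD: Duration = Duration::from_secs(30);

struct SyncingEngineSketch {
    /// Updated whenever a block announcement is sent or received on any peer.
    last_notification_io: Instant,
    peers: Vec<String>,
}

impl SyncingEngineSketch {
    fn on_block_announcement(&mut self) {
        self.last_notification_io = Instant::now();
    }

    /// Called periodically: if nothing happened for the threshold, every peer is
    /// evicted so that new outbound connections can be established.
    fn maybe_evict_all(&mut self) {
        if self.last_notification_io.elapsed() >= INACTIVITY_EVICT_THRESHOLD {
            // This is the step the logs show as
            // "syncing has halted due to inactivity, evicting all peers".
            for peer in self.peers.drain(..) {
                println!("{peer} disconnected");
            }
        }
    }
}

fn main() {
    let mut engine = SyncingEngineSketch {
        // Pretend the last announcement was 31 seconds ago
        // (assumes the process has been up for at least that long).
        last_notification_io: Instant::now() - Duration::from_secs(31),
        peers: vec!["12D3KooW...A".into(), "12D3KooW...B".into()],
    };
    engine.maybe_evict_all();
    assert!(engine.peers.is_empty());
    engine.on_block_announcement(); // activity would reset the timer
}
```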

@crystalin
Author

That makes sense. Also, as Kusama (and its attached parachains) is having issues at the moment, block production is unstable.

@crystalin
Author

I think dropping all the peers deserves an INFO log line, as it is quite a drastic event.

@crystalin
Author

Using TRACE, I get:

2024-02-17 13:03:31.634 DEBUG tokio-runtime-worker sync: [🌗] Propagating transactions
2024-02-17 13:03:33.789  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 13.3kiB/s ⬆ 12.6kiB/s
2024-02-17 13:03:34.273 DEBUG tokio-runtime-worker sync: [🌗] syncing has halted due to inactivity, evicting all peers
2024-02-17 13:03:34.273 DEBUG tokio-runtime-worker sync: [🌗] 12D3KooWJFRnyQ1ZPrVTNk3YqWN4AiERWGYCU7htJCZTvMBtsRvt disconnected
2024-02-17 13:03:34.273 TRACE tokio-runtime-worker sync: [🌗] 0 blocks ready for import
2024-02-17 13:03:34.274 DEBUG tokio-runtime-worker sync: [🌗] 12D3KooWGGL4Y9ewgPd3iteMDPQPuKtccZaYUqa5gR16zQaKMsEu disconnected
2024-02-17 13:03:34.274 TRACE tokio-runtime-worker sync: [🌗] 0 blocks ready for import
2024-02-17 13:03:34.274 DEBUG tokio-runtime-worker sync: [🌗] 12D3KooWKxSMShP8pZwAc9yifedCyNGTnt9ZcrpHf4m5P5HhArdC disconnected

@crystalin
Author

crystalin commented Feb 17, 2024

When checking the timing:

2024-02-17 13:02:33.784  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163833 (0x739b…7058), finalized #6163832 (0x4c78…a1cf), ⬇ 34.4kiB/s ⬆ 35.9kiB/s
2024-02-17 13:02:36.828  INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #6163835 (0x550d…20d3)
2024-02-17 13:02:38.784  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163834 (0x9671…2c58), finalized #6163832 (0x4c78…a1cf), ⬇ 68.4kiB/s ⬆ 111.3kiB/s
2024-02-17 13:02:43.785  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163834 (0x9671…2c58), finalized #6163832 (0x4c78…a1cf), ⬇ 10.6kiB/s ⬆ 11.1kiB/s
2024-02-17 13:02:48.785  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163835 (0x550d…20d3), finalized #6163832 (0x4c78…a1cf), ⬇ 22.6kiB/s ⬆ 23.5kiB/s
2024-02-17 13:02:49.277  INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #6163836 (0x079d…ee45)
2024-02-17 13:02:53.786  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163835 (0x550d…20d3), finalized #6163833 (0x739b…7058), ⬇ 20.0kiB/s ⬆ 97.9kiB/s
2024-02-17 13:02:58.786  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163835 (0x550d…20d3), finalized #6163833 (0x739b…7058), ⬇ 13.0kiB/s ⬆ 13.6kiB/s
2024-02-17 13:03:03.642  INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #6163837 (0x0b54…607c)
2024-02-17 13:03:03.787  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163834 (0x9671…2c58), ⬇ 31.7kiB/s ⬆ 11.1kiB/s
2024-02-17 13:03:08.787  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163835 (0x550d…20d3), ⬇ 6.9kiB/s ⬆ 82.7kiB/s
2024-02-17 13:03:13.788  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163835 (0x550d…20d3), ⬇ 24.0kiB/s ⬆ 24.5kiB/s
2024-02-17 13:03:18.788  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163835 (0x550d…20d3), ⬇ 15.8kiB/s ⬆ 16.6kiB/s
2024-02-17 13:03:23.789  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 4.9kiB/s ⬆ 5.1kiB/s
2024-02-17 13:03:28.789  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 0.4kiB/s ⬆ 0.4kiB/s
2024-02-17 13:03:33.789  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 13.3kiB/s ⬆ 12.6kiB/s
2024-02-17 13:03:34.273 DEBUG tokio-runtime-worker sync: [🌗] syncing has halted due to inactivity, evicting all peers
2024-02-17 13:03:38.789  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 3.0kiB/s ⬆ 7.3kiB/s
2024-02-17 13:03:43.789  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 1.7kiB/s ⬆ 1.3kiB/s
2024-02-17 13:03:48.790  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 1.3kiB/s ⬆ 1.7kiB/s
2024-02-17 13:03:53.791  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 1.6kiB/s ⬆ 1.5kiB/s
2024-02-17 13:03:58.791  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 1.6kiB/s ⬆ 1.4kiB/s
2024-02-17 13:04:01.277  INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #6163837 (0x860b…0017)
2024-02-17 13:04:03.791  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 10.0kiB/s ⬆ 9.3kiB/s
2024-02-17 13:04:07.883  INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #6163837 (0xfe7a…e5be)
2024-02-17 13:04:08.792  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 8.5kiB/s ⬆ 0.9kiB/s
2024-02-17 13:04:13.792  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 2.0kiB/s ⬆ 2.0kiB/s
2024-02-17 13:04:18.793  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163837 (0xfe7a…e5be), finalized #6163836 (0x079d…ee45), ⬇ 1.0kiB/s ⬆ 0.9kiB/s
2024-02-17 13:04:19.785  INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #6163838 (0x9aed…2b7b)

It seems the peers were dropped after 30s, which doesn't seem like much for a parachain considering Kusama's current issues.
I think:

  1. dropping only half of the peers would probably be enough
  2. changing the log level of "syncing has halted due to inactivity, evicting all peers" to INFO would help administrators (see the sketch below)
  3. (maybe) increasing the delay? (not sure whether that is a good idea)
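
As a concrete illustration of suggestion 2, this is roughly what promoting the message from DEBUG to INFO would look like with the standard `log` macros. The exact call site in polkadot-sdk may differ; the `env_logger` crate is assumed here only to make the example runnable.

```rust
use log::{debug, info};

fn log_eviction_current() {
    // Current behaviour: only visible with something like `-lsync=debug`.
    debug!(target: "sync", "syncing has halted due to inactivity, evicting all peers");
}

fn log_eviction_proposed() {
    // Proposed behaviour: visible at the default log level.
    info!(target: "sync", "syncing has halted due to inactivity, evicting all peers");
}

fn main() {
    env_logger::builder().filter_level(log::LevelFilter::Info).init();
    log_eviction_current(); // filtered out at INFO
    log_eviction_proposed(); // printed
}
```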

@altonen
Contributor

altonen commented Feb 17, 2024

You're right, the eviction event should have been logged at INFO or even WARN. I don't think there is an upside to increasing the delay, because SyncingEngine is essentially stuck with peers that are not connected to it and it can't unstick itself by waiting longer. I also don't think dropping only some portion of the peers is very effective, because this happens when all of the connected nodes are "idle": dropping half of them would leave the other half of the idle peers still connected, and when an active peer connects and sends a new block announcement, the local node will announce that block to the "idle" peers and they will get disconnected. Essentially the whole set of connected peers is replaced either way.

I will prepare a fix that brings back the original version of eviction, where peers were tracked individually. This will result in a more unstable peer count, but it will not cause all of the peers to be dropped at the same time.

altonen added a commit that referenced this issue Feb 17, 2024
Previously `SyncingEngine` would use `last_notification_io` to track
when it last sent or received a block announcement. If there had been
no notification I/O for 30 seconds, `SyncingEngine` considered itself
stalled and evicted all peers which gave `ProtocolController` a chance
to establish new outbound connections.

The problem with this approach was that if `SyncingEngine` actually got
stalled, peer count would suddenly drop to zero after which it would
slowly climb back to the desired amount of peers.

Start tracking the notification I/O of peers individually and evict them
as they're deemed idle which prevents the peer count from suddenly
dropping to zero and instead makes `SyncingEngine` continuously look for
more active peers. This might make the peer count more unstable
on average if there are a lot of inactive peers in the network.

Fixes #3346
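
A hypothetical sketch of the per-peer eviction the commit message describes: each peer carries its own last-activity timestamp and only idle peers are evicted, so the peer count never drops to zero in a single step. Names and the 30-second threshold are illustrative, not the exact polkadot-sdk types.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const PEER_INACTIVITY_THRESHOLD: Duration = Duration::from_secs(30);

struct PeerState {
    last_notification_io: Instant,
}

#[derive(Default)]
struct SyncingEngineSketch {
    peers: HashMap<String, PeerState>,
}

impl SyncingEngineSketch {
    fn on_block_announcement(&mut self, peer: &str) {
        if let Some(state) = self.peers.get_mut(peer) {
            state.last_notification_io = Instant::now();
        }
    }

    /// Evict only the peers that have been idle; active peers stay connected.
    fn evict_idle_peers(&mut self) -> Vec<String> {
        let idle: Vec<String> = self
            .peers
            .iter()
            .filter(|(_, s)| s.last_notification_io.elapsed() >= PEER_INACTIVITY_THRESHOLD)
            .map(|(id, _)| id.clone())
            .collect();
        for peer in &idle {
            self.peers.remove(peer);
            println!("{peer} disconnected (idle)");
        }
        idle
    }
}

fn main() {
    let mut engine = SyncingEngineSketch::default();
    engine.peers.insert(
        "active-peer".into(),
        PeerState { last_notification_io: Instant::now() },
    );
    engine.peers.insert(
        "idle-peer".into(),
        // Assumes the process has been up for at least 31 seconds.
        PeerState { last_notification_io: Instant::now() - Duration::from_secs(31) },
    );
    engine.on_block_announcement("active-peer");
    let evicted = engine.evict_idle_peers();
    assert_eq!(evicted, vec!["idle-peer".to_string()]);
    assert!(engine.peers.contains_key("active-peer"));
}
```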
@crystalin
Author

@altonen I understand your point. I think we need to split the logic based on parachain vs relay chain behavior.
What you say makes sense for the relay chain, but it is problematic for parachains, where not producing a block for 30s is not uncommon, which makes ALL nodes disconnect ALL their peers. This creates significant networking chaos.

See how frequent this is:
[screenshot attached]
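
One possible shape for the split suggested here is selecting the inactivity threshold by chain role, as in this hypothetical sketch; the concrete values and names are assumptions, not anything taken from polkadot-sdk.

```rust
use std::time::Duration;

#[derive(Clone, Copy)]
enum ChainRole {
    RelayChain,
    Parachain,
}

fn inactivity_evict_threshold(role: ChainRole) -> Duration {
    match role {
        // Relay chains target roughly 6 second blocks, so 30 seconds of silence is suspicious.
        ChainRole::RelayChain => Duration::from_secs(30),
        // A parachain may legitimately go much longer without producing a block.
        ChainRole::Parachain => Duration::from_secs(5 * 60),
    }
}

fn main() {
    assert!(
        inactivity_evict_threshold(ChainRole::Parachain)
            > inactivity_evict_threshold(ChainRole::RelayChain)
    );
}
```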

@altonen
Contributor

altonen commented Feb 19, 2024

We ran a burn-in of the proposed fix on Kusama AssetHub and it didn't work quite as well as I had hoped, so I'm re-evaluating our approach.

@girazoki
Contributor

Hey @altonen, does #3983 fix the issue? We are also observing this in an internal network, and I think this is going to be a problem for parathreads (where peers will constantly be evicted, as the block production rate is much slower).

@bkchr
Member

bkchr commented Apr 22, 2024

@girazoki yes, that should be fixed by the mentioned PR, as the logic was removed. Aaro also no longer works for us.

Going to close the issue as it should be solved.

@bkchr bkchr closed this as completed Apr 22, 2024
@github-project-automation github-project-automation bot moved this from In Progress 🛠 to Blocked ⛔️ in Networking Apr 22, 2024
@girazoki
Contributor

thank you!
