
Peers all drops when a bad peer try to connect #3346

Closed
2 tasks done
crystalin opened this issue Feb 15, 2024 · 11 comments
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@crystalin

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

Running a regular node (Moonriver, v0.35.0, based on polkadot-sdk 1.3.0) that maintains 40 peers (configured as 20 in, 20 out): when a bad node connects with a bad genesis, all the peers get dropped:

2024-02-15 23:06:31 [🌗] 💤 Idle (40 peers), best: #6153831 (0xf3ff…06ed), finalized #6153829 (0xb749…a45a), ⬇ 12.5kiB/s ⬆ 12.5kiB/s
2024-02-15 23:06:36 [Relaychain] Received imported block via RPC: #21894430 (0xef28…3f0c -> 0x59b1…1fad)
2024-02-15 23:06:36 [Relaychain] Failed to handle incoming network message err=ImplicitViewFetchError(ProspectiveParachainsUnavailable)
2024-02-15 23:06:36 [Relaychain] Received imported block via RPC: #21894430 (0xef28…3f0c -> 0x992a…2263)
2024-02-15 23:06:36 [Relaychain] Failed to handle incoming network message err=ImplicitViewFetchError(ProspectiveParachainsUnavailable)
2024-02-15 23:06:36 [🌗] 💤 Idle (40 peers), best: #6153831 (0xf3ff…06ed), finalized #6153829 (0xb749…a45a), ⬇ 11.1kiB/s ⬆ 11.4kiB/s
2024-02-15 23:06:38 [Relaychain] Received finalized block via RPC: #21894427 (0xdeae…4c01 -> 0x15ab…8cd9)
2024-02-15 23:06:40 Accepting new connection 1/100
2024-02-15 23:06:41 [🌗] 💤 Idle (40 peers), best: #6153831 (0xf3ff…06ed), finalized #6153830 (0x31fd…5597), ⬇ 15.7kiB/s ⬆ 15.8kiB/s
2024-02-15 23:06:42 [Relaychain] Received imported block via RPC: #21894431 (0x992a…2263 -> 0x0eed…e67f)
2024-02-15 23:06:42 [Relaychain] Failed to handle incoming network message err=ImplicitViewFetchError(ProspectiveParachainsUnavailable)
2024-02-15 23:06:46 [🌗] 💤 Idle (40 peers), best: #6153831 (0xf3ff…06ed), finalized #6153830 (0x31fd…5597), ⬇ 4.5kiB/s ⬆ 4.7kiB/s
2024-02-15 23:06:48 [Relaychain] Received imported block via RPC: #21894432 (0x0eed…e67f -> 0xdc8a…7a4f)
2024-02-15 23:06:48 [Relaychain] Failed to handle incoming network message err=ImplicitViewFetchError(ProspectiveParachainsUnavailable)
2024-02-15 23:06:50 [🌗] Report 12D3KooWJi6QVo4oVJK7Yfs8tGaUmw7MezZmJxd8LFpfoB5neggu: -2147483648 to -2147483648. Reason: Genesis mismatch. Banned, disconnecting.
2024-02-15 23:06:50 Accepting new connection 1/100
2024-02-15 23:06:51 [Relaychain] Received finalized block via RPC: #21894428 (0x15ab…8cd9 -> 0xade3…59c3)
2024-02-15 23:06:51 [🌗] Report 12D3KooWEdyRSv9z43ZyATSxvVYtAbHBVBo2J68JnCYxfzGbHqrB: -2147483648 to -2147483648. Reason: Genesis mismatch. Banned, disconnecting.
2024-02-15 23:06:51 [🌗] 💤 Idle (0 peers), best: #6153831 (0xf3ff…06ed), finalized #6153830 (0x31fd…5597), ⬇ 0.9kiB/s ⬆ 1.0kiB/s
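
For context on the `-2147483648 to -2147483648` report in the log above: peer reputations are `i32` values, and a "fatal" reputation change is `i32::MIN` (-2147483648), so the peer's reputation drops to the minimum and it is banned and disconnected. The following is a minimal, self-contained sketch of that idea, not the actual `sc-network` code; the ban rule here is deliberately simplified.

```rust
// Minimal illustration (not the actual sc-network code) of why the log shows
// "Report ...: -2147483648 to -2147483648": a "fatal" reputation change is
// i32::MIN, so the reputation saturates at the minimum and the peer is banned.

#[derive(Debug, Clone, Copy)]
struct ReputationChange {
    value: i32,
    reason: &'static str,
}

impl ReputationChange {
    /// A change so large the peer is banned immediately (i32::MIN == -2147483648).
    const fn new_fatal(reason: &'static str) -> Self {
        Self { value: i32::MIN, reason }
    }
}

/// Apply a reputation change and report whether the peer is now banned.
/// The real ban rule uses a threshold; "== i32::MIN" is a simplification.
fn report_peer(reputation: &mut i32, change: ReputationChange) -> bool {
    *reputation = reputation.saturating_add(change.value);
    let banned = *reputation == i32::MIN;
    println!(
        "Report: {} to {}. Reason: {}.{}",
        change.value,
        *reputation,
        change.reason,
        if banned { " Banned, disconnecting." } else { "" }
    );
    banned
}

fn main() {
    let mut reputation = 0_i32;
    assert!(report_peer(&mut reputation, ReputationChange::new_fatal("Genesis mismatch")));
}
```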

Steps to reproduce

Running Moonbeam v0.35.0 on the moonriver chain will likely reproduce the issue.

@crystalin crystalin added I10-unconfirmed Issue might be valid, but it's not yet known. I2-bug The node fails to follow expected behavior. labels Feb 15, 2024
@crystalin crystalin changed the title from "Peers all drops when a bad peers try to connect" to "Peers all drops when a bad peer try to connect" Feb 15, 2024
@altonen altonen moved this to In Progress 🛠 in Networking Feb 16, 2024
@altonen
Contributor

altonen commented Feb 16, 2024

I will try to reproduce, but in the meantime, could you post sync=trace logs? We've seen a similar issue of the peer count dropping to zero on Kusama AssetHub, but it was not related to genesis mismatches. Based on that short snippet, it looks like the parachain best block is not progressing, which could indicate syncing has stalled; SyncingEngine notices this, evicts all peers, and tries to establish new connections.
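
To make that behaviour concrete, here is a hypothetical sketch of an "evict everyone on inactivity" engine. The names (`last_notification_io`, `SyncingEngineSketch`) and the 30-second threshold are illustrative, not the exact polkadot-sdk implementation.

```rust
use std::time::{Duration, Instant};

const INACTIVITY_EVICT_THRESHOLD: Duration = Duration::from_secs(30);

struct SyncingEngineSketch {
    /// Updated whenever a block announcement is sent or received on any peer.
    last_notification_io: Instant,
    peers: Vec<String>,
}

impl SyncingEngineSketch {
    fn on_block_announcement(&mut self) {
        self.last_notification_io = Instant::now();
    }

    /// Called periodically: if nothing happened for the threshold, every peer is
    /// evicted so that new outbound connections can be established.
    fn maybe_evict_all(&mut self) {
        if self.last_notification_io.elapsed() >= INACTIVITY_EVICT_THRESHOLD {
            // This is the step the logs show as
            // "syncing has halted due to inactivity, evicting all peers".
            for peer in self.peers.drain(..) {
                println!("{peer} disconnected");
            }
        }
    }
}

fn main() {
    let mut engine = SyncingEngineSketch {
        // Pretend the last announcement was 31 seconds ago
        // (assumes the process has been up for at least that long).
        last_notification_io: Instant::now() - Duration::from_secs(31),
        peers: vec!["12D3KooW...A".into(), "12D3KooW...B".into()],
    };
    engine.maybe_evict_all();
    assert!(engine.peers.is_empty());
    engine.on_block_announcement(); // activity would reset the timer
}
```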

@crystalin
Author

That makes sense. Also, as Kusama (and its attached parachains) is having issues at the moment, block production is unstable.

@crystalin
Author

I think dropping all the peers deserves an INFO log line, as it is quite a drastic event.

@crystalin
Author

Using TRACE, I get:

2024-02-17 13:03:31.634 DEBUG tokio-runtime-worker sync: [🌗] Propagating transactions
2024-02-17 13:03:33.789  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 13.3kiB/s ⬆ 12.6kiB/s
2024-02-17 13:03:34.273 DEBUG tokio-runtime-worker sync: [🌗] syncing has halted due to inactivity, evicting all peers
2024-02-17 13:03:34.273 DEBUG tokio-runtime-worker sync: [🌗] 12D3KooWJFRnyQ1ZPrVTNk3YqWN4AiERWGYCU7htJCZTvMBtsRvt disconnected
2024-02-17 13:03:34.273 TRACE tokio-runtime-worker sync: [🌗] 0 blocks ready for import
2024-02-17 13:03:34.274 DEBUG tokio-runtime-worker sync: [🌗] 12D3KooWGGL4Y9ewgPd3iteMDPQPuKtccZaYUqa5gR16zQaKMsEu disconnected
2024-02-17 13:03:34.274 TRACE tokio-runtime-worker sync: [🌗] 0 blocks ready for import
2024-02-17 13:03:34.274 DEBUG tokio-runtime-worker sync: [🌗] 12D3KooWKxSMShP8pZwAc9yifedCyNGTnt9ZcrpHf4m5P5HhArdC disconnected

@crystalin
Author

crystalin commented Feb 17, 2024

When checking the timing:

2024-02-17 13:02:33.784  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163833 (0x739b…7058), finalized #6163832 (0x4c78…a1cf), ⬇ 34.4kiB/s ⬆ 35.9kiB/s
2024-02-17 13:02:36.828  INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #6163835 (0x550d…20d3)
2024-02-17 13:02:38.784  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163834 (0x9671…2c58), finalized #6163832 (0x4c78…a1cf), ⬇ 68.4kiB/s ⬆ 111.3kiB/s
2024-02-17 13:02:43.785  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163834 (0x9671…2c58), finalized #6163832 (0x4c78…a1cf), ⬇ 10.6kiB/s ⬆ 11.1kiB/s
2024-02-17 13:02:48.785  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163835 (0x550d…20d3), finalized #6163832 (0x4c78…a1cf), ⬇ 22.6kiB/s ⬆ 23.5kiB/s
2024-02-17 13:02:49.277  INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #6163836 (0x079d…ee45)
2024-02-17 13:02:53.786  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163835 (0x550d…20d3), finalized #6163833 (0x739b…7058), ⬇ 20.0kiB/s ⬆ 97.9kiB/s
2024-02-17 13:02:58.786  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163835 (0x550d…20d3), finalized #6163833 (0x739b…7058), ⬇ 13.0kiB/s ⬆ 13.6kiB/s
2024-02-17 13:03:03.642  INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #6163837 (0x0b54…607c)
2024-02-17 13:03:03.787  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163834 (0x9671…2c58), ⬇ 31.7kiB/s ⬆ 11.1kiB/s
2024-02-17 13:03:08.787  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163835 (0x550d…20d3), ⬇ 6.9kiB/s ⬆ 82.7kiB/s
2024-02-17 13:03:13.788  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163835 (0x550d…20d3), ⬇ 24.0kiB/s ⬆ 24.5kiB/s
2024-02-17 13:03:18.788  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163835 (0x550d…20d3), ⬇ 15.8kiB/s ⬆ 16.6kiB/s
2024-02-17 13:03:23.789  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 4.9kiB/s ⬆ 5.1kiB/s
2024-02-17 13:03:28.789  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 0.4kiB/s ⬆ 0.4kiB/s
2024-02-17 13:03:33.789  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (40 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 13.3kiB/s ⬆ 12.6kiB/s
2024-02-17 13:03:34.273 DEBUG tokio-runtime-worker sync: [🌗] syncing has halted due to inactivity, evicting all peers
2024-02-17 13:03:38.789  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 3.0kiB/s ⬆ 7.3kiB/s
2024-02-17 13:03:43.789  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 1.7kiB/s ⬆ 1.3kiB/s
2024-02-17 13:03:48.790  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 1.3kiB/s ⬆ 1.7kiB/s
2024-02-17 13:03:53.791  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 1.6kiB/s ⬆ 1.5kiB/s
2024-02-17 13:03:58.791  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 1.6kiB/s ⬆ 1.4kiB/s
2024-02-17 13:04:01.277  INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #6163837 (0x860b…0017)
2024-02-17 13:04:03.791  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 10.0kiB/s ⬆ 9.3kiB/s
2024-02-17 13:04:07.883  INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #6163837 (0xfe7a…e5be)
2024-02-17 13:04:08.792  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 8.5kiB/s ⬆ 0.9kiB/s
2024-02-17 13:04:13.792  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163836 (0x079d…ee45), finalized #6163836 (0x079d…ee45), ⬇ 2.0kiB/s ⬆ 2.0kiB/s
2024-02-17 13:04:18.793  INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (2 peers), best: #6163837 (0xfe7a…e5be), finalized #6163836 (0x079d…ee45), ⬇ 1.0kiB/s ⬆ 0.9kiB/s
2024-02-17 13:04:19.785  INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #6163838 (0x9aed…2b7b)

It seems the peers were dropped after 30s, which doesn't seem like much for a parachain considering Kusama's current issues.
I think:

  1. dropping only half of the peers would probably be enough
  2. changing the log level of "syncing has halted due to inactivity, evicting all peers" to INFO would help administrators (see the sketch below)
  3. (maybe) increasing the delay? (not sure whether that is a good idea)
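
As a concrete illustration of suggestion 2, this is roughly what promoting the message from DEBUG to INFO would look like with the standard `log` macros. The exact call site in polkadot-sdk may differ; the `env_logger` crate is assumed here only to make the example runnable.

```rust
use log::{debug, info};

fn log_eviction_current() {
    // Current behaviour: only visible with something like `-lsync=debug`.
    debug!(target: "sync", "syncing has halted due to inactivity, evicting all peers");
}

fn log_eviction_proposed() {
    // Proposed behaviour: visible at the default log level.
    info!(target: "sync", "syncing has halted due to inactivity, evicting all peers");
}

fn main() {
    env_logger::builder().filter_level(log::LevelFilter::Info).init();
    log_eviction_current(); // filtered out at INFO
    log_eviction_proposed(); // printed
}
```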

@altonen
Contributor

altonen commented Feb 17, 2024

You're right, the eviction event should have been logged at INFO or even WARN. I don't think there is an upside to increasing the delay, because SyncingEngine is essentially stuck with peers that are not connected to it and it can't unstick itself by waiting longer. I also don't think dropping only some portion of the peers is very effective, because this happens when all of the connected nodes are "idle": dropping half of them would leave the other half of the idle peers still connected, and when an active peer connects and sends a new block announcement, the local node will announce that block to the "idle" peers and they will get disconnected. Essentially the whole set of connected peers is replaced either way.

I will prepare a fix that brings back the original version of eviction, where peers were tracked individually. This will result in a more unstable peer count, but it will not cause all of the peers to be dropped at the same time.

altonen added a commit that referenced this issue Feb 17, 2024
Previously `SyncingEngine` would use `last_notification_io` to track
when it last sent or received a block announcement. If there had been
no notification I/O for 30 seconds, `SyncingEngine` considered itself
stalled and evicted all peers which gave `ProtocolController` a chance
to establish new outbound connections.

The problem with this approach was that if `SyncingEngine` actually got
stalled, peer count would suddenly drop to zero after which it would
slowly climb back to the desired amount of peers.

Start tracking the notification I/O of peers individually and evict them
as they're deemed idle which prevents the peer count from suddenly
dropping to zero and instead makes `SyncingEngine` continuously look for
more active peers. This might make the peer count more unstable
on average if there are a lot of inactive peers in the network.

Fixes #3346
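
A hypothetical sketch of the per-peer eviction the commit message describes: each peer carries its own last-activity timestamp and only idle peers are evicted, so the peer count never drops to zero in a single step. Names and the 30-second threshold are illustrative, not the exact polkadot-sdk types.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const PEER_INACTIVITY_THRESHOLD: Duration = Duration::from_secs(30);

struct PeerState {
    last_notification_io: Instant,
}

#[derive(Default)]
struct SyncingEngineSketch {
    peers: HashMap<String, PeerState>,
}

impl SyncingEngineSketch {
    fn on_block_announcement(&mut self, peer: &str) {
        if let Some(state) = self.peers.get_mut(peer) {
            state.last_notification_io = Instant::now();
        }
    }

    /// Evict only the peers that have been idle; active peers stay connected.
    fn evict_idle_peers(&mut self) -> Vec<String> {
        let idle: Vec<String> = self
            .peers
            .iter()
            .filter(|(_, s)| s.last_notification_io.elapsed() >= PEER_INACTIVITY_THRESHOLD)
            .map(|(id, _)| id.clone())
            .collect();
        for peer in &idle {
            self.peers.remove(peer);
            println!("{peer} disconnected (idle)");
        }
        idle
    }
}

fn main() {
    let mut engine = SyncingEngineSketch::default();
    engine.peers.insert(
        "active-peer".into(),
        PeerState { last_notification_io: Instant::now() },
    );
    engine.peers.insert(
        "idle-peer".into(),
        // Assumes the process has been up for at least 31 seconds.
        PeerState { last_notification_io: Instant::now() - Duration::from_secs(31) },
    );
    engine.on_block_announcement("active-peer");
    let evicted = engine.evict_idle_peers();
    assert_eq!(evicted, vec!["idle-peer".to_string()]);
    assert!(engine.peers.contains_key("active-peer"));
}
```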
@crystalin
Author

@altonen I understand your point. I think we need to split the logic based on parachain vs relay chain behavior.
What you say makes sense for the relay chain, but it is problematic for parachains, where not producing a block for 30s is not uncommon, which makes ALL nodes disconnect ALL their peers. This creates significant networking chaos.

See how frequent this is:
[screenshot attached]
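
One possible shape for the split suggested here is selecting the inactivity threshold by chain role, as in this hypothetical sketch; the concrete values and names are assumptions, not anything taken from polkadot-sdk.

```rust
use std::time::Duration;

#[derive(Clone, Copy)]
enum ChainRole {
    RelayChain,
    Parachain,
}

fn inactivity_evict_threshold(role: ChainRole) -> Duration {
    match role {
        // Relay chains target roughly 6 second blocks, so 30 seconds of silence is suspicious.
        ChainRole::RelayChain => Duration::from_secs(30),
        // A parachain may legitimately go much longer without producing a block.
        ChainRole::Parachain => Duration::from_secs(5 * 60),
    }
}

fn main() {
    assert!(
        inactivity_evict_threshold(ChainRole::Parachain)
            > inactivity_evict_threshold(ChainRole::RelayChain)
    );
}
```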

@altonen
Contributor

altonen commented Feb 19, 2024

We ran a burn-in of the proposed fix on Kusama AssetHub and it didn't work quite as well as I had hoped, so I'm re-evaluating our approach.

@girazoki
Contributor

Hey @altonen, does #3983 fix the issue? We are also observing this in an internal network, and I think this is going to be a problem for parathreads (where peers will constantly be evicted, as the block production rate is much slower).

@bkchr
Member

bkchr commented Apr 22, 2024

@girazoki yes, that should be fixed by the mentioned PR, as the logic was removed. Aaro also no longer works for us.

Going to close the issue as it should be solved.

@bkchr bkchr closed this as completed Apr 22, 2024
@github-project-automation github-project-automation bot moved this from In Progress 🛠 to Blocked ⛔️ in Networking Apr 22, 2024
@girazoki
Contributor

thank you!
