Versi high number of PeerDisconnect when scaling up number of validators and parachains #1263

Closed
alexggh opened this issue Aug 29, 2023 · 3 comments

Comments

alexggh commented Aug 29, 2023

During my testing of #1191 I observed there is a high number of PeerDisconnects when running ~500 validators and ~100 parachains.

PeerDisconnects are very bad: when many of them happen, the network slows down and gets flooded with storms of unnecessary messages. For example, in approval-distribution, on reconnect all peers try to send their known messages to the reconnected peer: https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/network/approval-distribution/src/lib.rs#L383. This amounts to thousands of messages, all going to the same node from multiple sources.
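To get a feel for the scale, here is a rough back-of-the-envelope sketch of the reconnect burst described above. It assumes every sending peer forwards everything it knows, so it is an upper bound, not a model of the real approval-distribution logic. The function names and the example inputs are hypothetical and are not measurements from Versi.

```python
def reconnect_burst(sending_peers: int, known_messages_per_peer: int) -> int:
    """Upper bound on messages aimed at one node right after it reconnects,
    assuming each sending peer forwards all the messages it knows about."""
    if sending_peers < 0 or known_messages_per_peer < 0:
        raise ValueError("counts must be non-negative")
    return sending_peers * known_messages_per_peer


def burst_rate_per_minute(
    disconnects_per_minute: float,
    sending_peers: int,
    known_messages_per_peer: int,
) -> float:
    """Redundant messages per minute hitting one node if every disconnect
    is followed by a reconnect that triggers a full burst."""
    if disconnects_per_minute < 0:
        raise ValueError("rate must be non-negative")
    return disconnects_per_minute * reconnect_burst(
        sending_peers, known_messages_per_peer
    )


if __name__ == "__main__":
    # Hypothetical inputs chosen only for illustration.
    print(reconnect_burst(sending_peers=50, known_messages_per_peer=100))
    print(burst_rate_per_minute(4, sending_peers=50, known_messages_per_peer=100))
```

Even with modest inputs, the product grows quickly, which is consistent with the "thousands of messages" observation.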

Things I investigated/noticed.

  1. Some of the disconnects happen when we change the era and are caused by "sync: Invalid justification provided" (#1147). I disabled the new era change in Versi and the disconnects still happened.
  2. A few negative reputation updates arrive before the disconnects:
2023-08-24 04:46:25.110 DEBUG tokio-runtime-worker parachain::reputation-aggregator: Modify reputation peer=PeerId("12D3KooWQnESXDiEFtkVU2oMLgXVDTkQpmeAwGeRf8qqHfbGEgxL") rep=CostMajorRepeated("Statement sent more than once by peer")

But this is definitely not the root cause, because I disabled this reputation update and the disconnects kept reproducing.

  3. Once the disconnects start, we enter a loop: on reconnect we send all the knowledge we have to the peer, the peer complains that we sent duplicates, and that creates more disconnects.

  4. To exclude 2 & 3, I disabled the behaviour with 53f8556 and started logging errors when BANNED_THRESHOLD is passed with 5e004e1.

  5. With the changes from 4, I ran 380 validators and 90 parachains and still see a lot of disconnects continuously: https://grafana.teleport.parity.io/goto/G7di2ZzSR?orgId=1. Almost none of them are caused by reputation changes.

  6. With 10 parachains, I increased the number of validators in steps (60, 160, 260 .. 380). The disconnects stayed within reasonable bounds until we reached 260 validators, but after that we seem to always have 4-5 disconnects per minute for nodes, see https://grafana.teleport.parity.io/goto/iOrfTZkIg?orgId=1

  7. Once I increased the number of parachains to 90, the disconnects spiked even more, see https://grafana.teleport.parity.io/goto/13V_oZzSg?orgId=1

Note: the dashboards measure only the disconnects for protocol="validation/2", which is the protocol the validators use to talk to each other.
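For anyone who wants to check from node logs how often reputation changes precede disconnects, the reputation-aggregator line quoted above can be tallied with a small script. This is a minimal sketch: the regular expression is derived only from that single quoted log line, so other log formats are an open assumption, and the function name `tally_reputation_changes` is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the one reputation-aggregator line quoted in this
# issue; lines in any other format are skipped.
REP_LINE = re.compile(
    r'Modify reputation peer=PeerId\("(?P<peer>[^"]+)"\) '
    r'rep=(?P<kind>\w+)\("(?P<reason>[^"]*)"\)'
)


def tally_reputation_changes(lines):
    """Count reputation changes per (kind, reason) and per peer."""
    by_reason = Counter()
    by_peer = Counter()
    for line in lines:
        match = REP_LINE.search(line)
        if match is None:
            continue  # not a reputation change line
        by_reason[(match["kind"], match["reason"])] += 1
        by_peer[match["peer"]] += 1
    return by_reason, by_peer


SAMPLE = (
    '2023-08-24 04:46:25.110 DEBUG tokio-runtime-worker '
    'parachain::reputation-aggregator: Modify reputation '
    'peer=PeerId("12D3KooWQnESXDiEFtkVU2oMLgXVDTkQpmeAwGeRf8qqHfbGEgxL") '
    'rep=CostMajorRepeated("Statement sent more than once by peer")'
)

if __name__ == "__main__":
    reasons, peers = tally_reputation_changes([SAMPLE])
    for (kind, reason), count in reasons.most_common():
        print(f"{count:6d}  {kind}: {reason}")
```

Feeding it a node's full log (for example, `tally_reputation_changes(open(path))`) gives a quick view of which reasons dominate and which peers are affected most.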

Relevant timeline and dashboards:

altonen commented Aug 29, 2023

Is it possible to enable the sub-libp2p=trace logging target for any of the nodes that have a high number of peer disconnections?

alexggh commented Aug 29, 2023

> Is it possible to get sub-libp2p=trace logging target enabled for any of the nodes that have number of peer disconnections?

Will do.

bkchr pushed a commit that referenced this issue Apr 10, 2024
alexggh commented May 14, 2024

Stale issue.

alexggh closed this as completed May 14, 2024