During my testing of #1191 I observed a high number of PeerDisconnects when running ~500 validators and ~100 parachains.
PeerDisconnects are very bad: when many of them happen, the network slows down and gets flooded with storms of unnecessary messages. For example, in approval-distribution, on reconnect all peers will try to send their known messages to the peer (https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/network/approval-distribution/src/lib.rs#L383), which amounts to thousands of messages all going to the same node from multiple sources.
Things I investigated/noticed:
1. Some of the disconnects happen when the era changes and are caused by "sync: Invalid justification provided" (#1147). I disabled the new era change in Versi and disconnects still happened.
2. There are a few negative reputation updates coming before the disconnects:
2023-08-24 04:46:25.110 DEBUG tokio-runtime-worker parachain::reputation-aggregator: Modify reputation peer=PeerId("12D3KooWQnESXDiEFtkVU2oMLgXVDTkQpmeAwGeRf8qqHfbGEgxL") rep=CostMajorRepeated("Statement sent more than once by peer")
But this is definitely not the root cause, because I disabled this reputation update and the disconnects kept reproducing.
3. When the disconnects start happening, because on reconnect we send all the knowledge that we have to the peer, we enter a loop where the peer complains that we sent it duplicates, which creates more disconnects.
4. To exclude 2 & 3, I disabled the behaviour with 53f8556 and started logging errors when BANNED_THRESHOLD is passed with 5e004e1.
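The reconnect loop described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `Receiver` type, the `simulate` function and both constants are hypothetical stand-ins, not types or values taken from polkadot-sdk, and the model assumes the sender's reputation has recovered to zero by the time it reconnects.

```rust
use std::collections::HashSet;

// Illustrative constants only (assumptions, not the real polkadot-sdk values).
/// Stand-in for BANNED_THRESHOLD: at or below this, the peer is dropped.
const BANNED_THRESHOLD: i64 = -1_000;
/// Stand-in for the cost of CostMajorRepeated, charged per duplicate message.
const COST_REPEATED: i64 = -300;

/// Hypothetical receiving node that tracks one sender.
struct Receiver {
    /// Message ids already received from the sender.
    seen: HashSet<u32>,
    /// The sender's reputation as tracked by the receiver.
    reputation: i64,
}

impl Receiver {
    fn new() -> Self {
        Receiver { seen: HashSet::new(), reputation: 0 }
    }

    /// Handles one message. Returns `false` when the sender's reputation
    /// has fallen to the threshold and the sender must be disconnected.
    fn on_message(&mut self, id: u32) -> bool {
        if !self.seen.insert(id) {
            // Duplicate: charge the sender for repeating itself.
            self.reputation += COST_REPEATED;
        }
        self.reputation > BANNED_THRESHOLD
    }
}

/// The sender resends every known message on each (re)connect.
/// Returns how many of the `rounds` connection attempts end in a disconnect.
fn simulate(known: &[u32], rounds: usize) -> usize {
    let mut receiver = Receiver::new();
    let mut disconnects = 0;
    for _ in 0..rounds {
        // Simplification: reputation has decayed back to zero before reconnecting.
        receiver.reputation = 0;
        for &id in known {
            if !receiver.on_message(id) {
                disconnects += 1;
                break;
            }
        }
    }
    disconnects
}
```

With ten known messages and five connection attempts, `simulate` reports four disconnects in this model: the first connect is clean, and every later reconnect resends enough duplicates to trip the threshold again. With only three known messages the accumulated cost never reaches the threshold, so the loop does not start.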
With 4) in place, I ran 380 validators and 90 parachains and continuously see a lot of disconnects (https://grafana.teleport.parity.io/goto/G7di2ZzSR?orgId=1), almost none of them caused by reputation changes.
With 10 parachains I increased the number of validators in steps: 60, 160, 260 ... 380. The disconnects seem to stay within reasonable bounds until we get to 260 validators, but after that we seem to always have 4-5 disconnects per minute for nodes, see here: https://grafana.teleport.parity.io/goto/iOrfTZkIg?orgId=1
Once I increased the number of parachains to 90, disconnects spiked even more, see here: https://grafana.teleport.parity.io/goto/13V_oZzSg?orgId=1
Note! The dashboards measure only the disconnects for protocol="validation/2", which is the protocol the validators use to talk to each other.

Relevant timeline and dashboards: