Revert reversion of authority-discovery enabling #1544
Comments
I don't really have a suggestion on expected absolute numbers, but spawning 4k sockets so that we can connect to an extra 10 authorities is certainly excessive (I know it's because of the DHT querying). With authority discovery enabled, the lower bound on my local node was about 1.4k sockets. We should try to reduce this number before re-enabling it. |
Documenting insights looking at the metrics of a single sentry node running with
|
With libp2p/rust-libp2p#1698 included in Substrate via paritytech/substrate#6891, idle connections are now properly cleaned up. Below you see graphs of two nodes, While the connection count still spikes periodically (every 10 minutes) to ~1600 connections, the connection count returns to the same level (~220) of node The CPU load statistics of the two nodes do not show a correlation between periodic lookups and CPU usage.

With the above in mind, do you feel comfortable re-enabling authority discovery by default, @andresilva @tomaka? Instead of reverting the revert right away, I would suggest enabling it explicitly on all of Parity's Kusama nodes for a week and only then rolling it out to all Kusama and Polkadot nodes.

In the future the authority discovery module will also need to look up addresses of past session validators. At that point we will likely need to spread the lookups over the entire 10 minutes instead of doing them all at once. |
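To make the idea of spreading lookups over the whole 10-minute window concrete, here is a minimal sketch. It is not how the authority discovery module works; the function name `spread_lookups`, the authority identifiers, and the even spacing are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def spread_lookups(authorities: List[str], interval_secs: float) -> List[Tuple[float, str]]:
    """Give each authority a start offset, evenly spaced across the interval.

    Hypothetical helper: instead of starting every lookup at t=0, lookup i
    starts at i * (interval / number_of_authorities).
    """
    if not authorities:
        return []
    step = interval_secs / len(authorities)
    return [(i * step, authority) for i, authority in enumerate(authorities)]


if __name__ == "__main__":
    # Four made-up authority ids, spread over a 10-minute (600 s) window.
    for offset, authority in spread_lookups(["auth-a", "auth-b", "auth-c", "auth-d"], 600.0):
        print(f"{offset:6.1f}s  look up {authority}")
```

With evenly spaced start times, the number of lookups in flight at any moment depends on how long a single lookup takes rather than on the total number of authorities, which is what would flatten the periodic spike.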
I don't really see how fixing the keep-alive mechanism would significantly reduce the number of TCP connections? In the same release where we disabled authority-discovery by default, we also shipped paritytech/substrate#6549 and paritytech/substrate#6822. These two PRs alone might have fixed the problem entirely, but we can't know. Unfortunately I don't know what the threshold is before validators start complaining about reaching the limit. The Parity nodes in particular didn't reach the fdlimit. |
I have asked the validator community what the file descriptor limit on their nodes is. Sadly only 2 reported back. Both of them had a hard limit of In addition I reached out to our internal operations team. Our nodes are running with a hard limit of

With the above in mind, and given that the number of connections initiated through authority discovery is below 10_000, I have difficulty understanding how the authority discovery module exhausted the file descriptor limit of a node.

As suggested above, I would like to enable the authority discovery module on all Parity nodes once a new Polkadot version with libp2p 0.24.0 (paritytech/substrate#6891) is released. If those nodes run without issues, I will prepare a pull request reverting #1532, thereby enabling the authority discovery module by default. With paritytech/substrate#6946, Prometheus will alert us once the number of file descriptors on our nodes is high. |
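For operators who want to answer the file descriptor question for their own machine, the following sketch shows one way to compare open descriptors against the limit. It is an illustration only: the helper names are hypothetical, `/proc/<pid>/fd` is Linux-specific, and `resource.getrlimit` reports the limit of the calling Python process, not of a running node (for another process, `/proc/<pid>/limits` is the place to look).

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) RLIMIT_NOFILE of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_usage_ratio(open_fds, soft_limit):
    """Fraction of the soft limit in use; 0.0 if the limit is unlimited."""
    if soft_limit == resource.RLIM_INFINITY or soft_limit <= 0:
        return 0.0
    return open_fds / soft_limit


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    if os.path.isdir("/proc/self/fd"):
        used = open_fd_count()
        print(f"open fds: {used} ({fd_usage_ratio(used, soft):.1%} of soft limit)")
```

As a purely hypothetical example, a spike of 1600 descriptors against a soft limit of 10000 gives `fd_usage_ratio(1600, 10000)`, i.e. 16% of the limit in use.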
#1532 should be reverted again.
I don't know under which criteria, though.