Polkadot service stops listening on port 30333 when reaching some threshold of load. #856
Updated logs that happened on another sentry. This time, we received the alert at Feb 20, 2020 @ 18:10:00.000 UTC for when the service stopped listening on port 30333. (This is not a new message, as we've been seeing this since the beginning of February.) However, this latest failure to listen on this port happened with around 1,000 TCP connections.
Screenshots and evidence
WARN message:
Raw Logs:
TCP connections are still reaching 2k+ @tomaka
The few fixes that we made are not in Polkadot v0.7.20 (paritytech/substrate#4828, paritytech/substrate#4830, paritytech/substrate#4889).
For some control testing, I was configuring the file descriptor limit for the process. The testing that I did was to reduce the file descriptor count to a lower value. Based off some of this testing, it raises the question of whether or not we should increase the process limits for file descriptors. It is interesting to note that, under normal conditions, the total number of TCP connections remains steady but the file descriptor count continues to go up, so the two don't always track each other. Below are some logs on standard testing with screenshots.
Default configuration:
Test case configuration using a reduced file descriptor limit:
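A minimal sketch, assuming Linux and a known polkadot PID (the value below is hypothetical), of how the total file descriptor count and the number of socket descriptors can be sampled from /proc/<pid>/fd while the node runs:

```rust
use std::fs;

/// Count the open file descriptors of a process by listing /proc/<pid>/fd and
/// report how many of them are sockets. Under normal conditions the socket
/// count stays roughly steady while the total FD count keeps growing.
fn fd_counts(pid: u32) -> std::io::Result<(usize, usize)> {
    let mut total = 0;
    let mut sockets = 0;
    for entry in fs::read_dir(format!("/proc/{}/fd", pid))? {
        let entry = entry?;
        total += 1;
        // Each entry is a symlink; socket descriptors point at "socket:[inode]".
        if let Ok(target) = fs::read_link(entry.path()) {
            if target.to_string_lossy().starts_with("socket:") {
                sockets += 1;
            }
        }
    }
    Ok((total, sockets))
}

fn main() -> std::io::Result<()> {
    let pid = 12345; // hypothetical polkadot process ID (must be readable by this user)
    let (total, sockets) = fd_counts(pid)?;
    println!("pid {}: {} open fds, {} of them sockets", pid, total, sockets);
    Ok(())
}
```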
Ah thanks, so it is indeed related to the number of file descriptors. If that's the case, then libp2p/rust-libp2p#1427 (which is available in Polkadot's master branch but not released yet) should fix it. We're also trying to fix the fact that so many TCP connections are being held open all the time.
Looking forward to it. Thanks for the information @tomaka
This should normally be fixed (by libp2p/rust-libp2p#1427) |
System:
Setup:
This is a secure validator setup. We have 4 sentries that are the same spec, as well as 2 validators (primary/secondary). The configuration that we are using for the systemd service includes these flags:
Additional information about the setup:
When we first observed this behavior, it was brought to our attention by an external service called Pingdom, which is used to verify that our sentries are available to the public (it checks every 1 minute). With this service, if the host is unreachable (via a specified port such as 30333) from various servers around the world, we will get an alert. Since the process stops listening on port 30333 fairly consistently after upgrading to v0.7.20, we have implemented monit to check and restart the service automatically (this checks every 2 minutes) in such a scenario.
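A minimal sketch of the kind of external check Pingdom and monit perform here, namely a plain TCP connection attempt against the sentry's libp2p port; the address below is a placeholder:

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Attempt a plain TCP connection to the sentry's libp2p port and report
/// whether the port is currently accepting connections.
fn port_is_listening(addr: &str, timeout: Duration) -> bool {
    match addr.parse::<SocketAddr>() {
        Ok(sock_addr) => TcpStream::connect_timeout(&sock_addr, timeout).is_ok(),
        Err(_) => false,
    }
}

fn main() {
    let addr = "203.0.113.10:30333"; // placeholder sentry address
    if port_is_listening(addr, Duration::from_secs(5)) {
        println!("{} is accepting TCP connections", addr);
    } else {
        println!("{} is NOT reachable", addr);
    }
}
```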
Behavior of the problem:
The process stops listening on port 30333. From the logs and screenshots of metrics, the most recent example is around 18:20 UTC. The spikes in CPU/disk utilization are after the restart of the service. The service itself is syncing with the chain and communicating with our validator. We monitor the connected peers across all our nodes, particularly from the validator's perspective. When we receive alerts from our external service, we are still connected (from sentry to validator and from validator to sentry) when looking at the connected peers using the RPC call.
This is still a concern to us, as it is unclear what exactly causes this. If the process stops listening on port 30333, does that mean it can stop on other ports? How does this affect the network if others observe this behavior and their sentries are not reachable? These are more or less extreme scenarios, but it is a concern. Had we not had an external service such as Pingdom checking the libp2p port, this would have gone unnoticed, as it appears the process is still able to make outbound connections, which allows it to stay in sync with the chain. This may or may not affect the overall health of the network if only half the connections for sentries are properly communicating.
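For context, a minimal sketch of checking the connected peer count over RPC. The issue text does not say which RPC call is used; this sketch assumes Substrate's system_health method on the default local HTTP RPC port 9933, and uses the reqwest (blocking + json features) and serde_json crates:

```rust
use serde_json::{json, Value};

/// Query the node's system_health RPC and return the reported peer count.
fn peer_count(endpoint: &str) -> Result<u64, Box<dyn std::error::Error>> {
    let body = json!({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "system_health",
        "params": []
    });
    let resp: Value = reqwest::blocking::Client::new()
        .post(endpoint)
        .json(&body)
        .send()?
        .json()?;
    let peers = resp["result"]["peers"]
        .as_u64()
        .ok_or("unexpected system_health response")?;
    Ok(peers)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumes the node exposes its HTTP RPC on the default port 9933.
    let peers = peer_count("http://127.0.0.1:9933")?;
    println!("connected peers: {}", peers);
    Ok(())
}
```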
Logs:
There is no indication within the logs of anything that is the cause of the problem. There had, however, been an error introduced in the past when p2p had previously been updated (0.7.17), which was later removed in version 0.7.18. The error message is as follows:
ERROR yamux::connection 671ba303: socket error: decode error: i/o error: I/O error: Connection reset by peer (os error 104)
Raw logs:
Screenshot of metrics:
The spikes in CPU/disk utilization are after the restart of the service.
Observation:
By the looks of it, it appears that once a certain number of connections is reached (around 2k TCP connections), the process exceeds one of its limits in some way, either file descriptors or something else; however, the logs don't suggest there is a problem, and the process appears unaware of it. This was not a problem until the recent update to Polkadot v0.7.20; we upgraded from v0.7.18. Here's some information about our default constraints for the process:
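A minimal sketch, assuming Linux and the libc crate, of how the RLIMIT_NOFILE soft and hard limits can be read for the current process (for the running service, the same values appear in /proc/<pid>/limits):

```rust
use libc::{getrlimit, rlimit, RLIMIT_NOFILE};

/// Read the soft and hard limits on open file descriptors for this process.
fn nofile_limits() -> Option<(u64, u64)> {
    let mut lim = rlimit { rlim_cur: 0, rlim_max: 0 };
    // Safety: we pass a valid, writable pointer to an rlimit struct.
    let rc = unsafe { getrlimit(RLIMIT_NOFILE, &mut lim) };
    if rc == 0 {
        Some((lim.rlim_cur as u64, lim.rlim_max as u64))
    } else {
        None
    }
}

fn main() {
    match nofile_limits() {
        Some((soft, hard)) => println!("RLIMIT_NOFILE: soft={}, hard={}", soft, hard),
        None => eprintln!("getrlimit failed"),
    }
}
```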