Cluster.Bootstrap causes network socket port exhaustion due to socket leak during cluster formation #2571
Comments
Problem isolated to the Ceen.Httpd package. Socket usage was stable when Ceen was replaced with Kestrel.
Update here since I've been investigating this using our test lab - we can reproduce this issue on Linux too, and it's making me think that the problem might be related to how aggressively we try to re-connect during cluster formation.

Cluster Fully Formed (20/20 nodes)
Only about ~20 active TCP connections per node, which makes sense - most of these are Akka.Remote, an OTLP exporter, and maybe a few others.

Cluster Unable to Form (18/20 nodes)
About ~1100 active TCP connections per node. This looks like hyper-aggressive retries, not some kind of TCP handling issue.
Another piece of evidence in favor of the "aggressive retries" theory of the case: look at the step function of active TCP connections when cluster formation does occur. The oldest nodes have significantly more open TCP connections than the newer nodes that were started later during the deployment by Kubernetes. This looks more like a "Thundering Herd" problem than a resource leak.
We did some more work on this over the weekend and captured more data from more experiments - the problem is definitely caused by how frequently Akka.Management's cluster bootstrapper is HTTP-polling its peers:

TCP Connectivity Data
1s interval - ~1100 connections per node
5s interval - ~260-280 connections per node
10s interval - ~100-105 connections per node

The key setting at play here is the Cluster Formation Times
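The roughly inverse relationship between polling interval and connection count in the data above is what a simple steady-state model would predict. The sketch below is a back-of-the-envelope estimate, not a description of how Akka.Management actually behaves: it assumes every poll opens a fresh TCP connection to each peer (no keep-alive), that each closed socket lingers for 60 seconds (the fixed TIME_WAIT duration on Linux), and that each node polls 19 peers (a 20-node cluster). The function name and parameters are hypothetical.

```python
# Rough steady-state estimate of lingering sockets per node (Little's law):
#   sockets held ~= connections opened per second * seconds each socket lingers
# All inputs are assumptions made for illustration, not measured values.


def estimated_lingering_sockets(peers: int,
                                poll_interval_s: float,
                                linger_s: float = 60.0) -> float:
    """Estimate sockets held per node if every poll opens a new connection.

    peers           -- number of other nodes polled each round (assumed)
    poll_interval_s -- seconds between polling rounds
    linger_s        -- seconds a closed socket stays in TIME_WAIT (assumed 60 s)
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    connections_per_second = peers / poll_interval_s
    return connections_per_second * linger_s


if __name__ == "__main__":
    # 19 peers is an assumption based on a 20-node cluster.
    for interval in (1, 5, 10):
        estimate = estimated_lingering_sockets(peers=19, poll_interval_s=interval)
        print(f"{interval}s interval: ~{estimate:.0f} sockets per node")
```

Under these assumptions the model gives roughly 1140, 228 and 114 sockets per node for the 1s, 5s and 10s intervals, which is in the same range as the measured figures. That is consistent with the polling-frequency explanation, but it does not prove it.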
One other setting that can alleviate major stressors that contribute to this port exhaustion problem: Akka.Management/src/management/Akka.Management/Resources/reference.conf Lines 151 to 156 in d7812ff
Set that to
Related fix: #2589
Should we just change the default polling interval to 5s? That should help put this issue to bed.
It has been observed that Cluster.Bootstrap can cause network socket port exhaustion: if Cluster.Bootstrap fails to form a cluster immediately, TCP holds the closed sockets' ports open in the TIME_WAIT linger state.
This has been observed especially in conjunction with Akka.Discovery.Azure.
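To check whether ports on a Linux node are piling up in TIME_WAIT, one option is to tally socket states from `/proc/net/tcp`. The following is a minimal diagnostic sketch, not part of Akka.Management; the function name `count_tcp_states` is hypothetical, and it relies on the kernel's documented column layout, where the fourth column holds the hex state code.

```python
from collections import Counter
from pathlib import Path

# Hex state codes the Linux kernel reports in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text: str) -> Counter:
    """Count sockets per TCP state from the text of /proc/net/tcp (or tcp6)."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        state = TCP_STATES.get(fields[3].upper(), "UNKNOWN")
        counts[state] += 1
    return counts


if __name__ == "__main__":
    # Linux only: these files do not exist on other platforms.
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        proc_file = Path(path)
        if proc_file.exists():
            print(path, dict(count_tcp_states(proc_file.read_text())))
```

A large and growing TIME_WAIT count while the cluster is still forming would point toward connection churn rather than sockets that were never closed.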