Possible connection leak due to linkerd #9724
Comments
We've seen this with several of our applications as well, with no clear view of why. Here's a random example from today where the proxy OOMed (250 MB limit) for a newly onboarded service:
I raised a discussion on this topic a while back here. Interestingly, we also run GKE 1.22.12. My understanding (which could well be wrong) is that there are 4 stages of connections:
We typically see connection leaks either from an external client to the local proxy, or from the local app to the local proxy. I've never gotten any further with answers as to why, since we remove the affected applications from the mesh. I wonder if GKE is the culprit here; we began our Linkerd journey with the 2.12 edge releases, so I'm unable to comment on 2.10 behaviour.
Hi @Anthony-Bible. We'd definitely like to get to the bottom of this. You mentioned that this behavior is reproducible in statefulsets; can you provide steps that we can use to reproduce the connection leak so that we can investigate?
@adleong what can we do to help diagnose this, other than providing manifests? We see this a lot, but it seems pretty isolated to Java applications only, and I'm not sure how to isolate this outside of our GKE cluster. I have been digging into one of our applications showing this behaviour, running a container that just spits out tcpdump to stdout. An example request without the application being meshed:
The connection is alive for ~200-300ms. Once I enable Linkerd, the connection remains open indefinitely (I deleted the pod at
And these are the debug logs showing this connection (note, nothing happens between 16:07 and 16:12, when the pod terminated):
Not every request to the pod results in an open connection, but every request from a particular Java app does, and only when Linkerd is enabled. I'm particularly interested in
I would love to help get to the bottom of this, as it affects several of our applications. If there are additional logs I can provide, or if you want me to run a modified version of the proxy for extra logging etc., I'm happy to help. I'm just not sure how to repro outside of our cluster!
@adleong should the Linkerd proxy send FIN packets back to the app? In our testing app with a Java Netty client, we see a FIN back from the server, but this FIN isn't sent back to the application, so the connection provider doesn't see a closed event and the connection hangs.
Not necessarily. Linkerd treats both connections (client to Linkerd and Linkerd to server) as separate and reuses connections when possible. Based on what you're describing, it sounds like what might be happening is:
Does this seem consistent with what you're seeing? If so, it strikes me as unusual behavior on the part of the client. I would expect the client to either re-use an existing open connection or to close a connection it was no longer interested in using.
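For concreteness, a minimal Go sketch of the client behaviour described above: drain and close each response so the connection can go back to the pool and be reused, and close idle connections once the client is done with the upstream. This is an illustration only (the client in question here is Java), and the endpoint and pool size are made up.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// One shared transport: connections are pooled and reused across requests
	// as long as response bodies are fully read and closed.
	transport := &http.Transport{MaxIdleConnsPerHost: 10}
	client := &http.Client{Transport: transport}

	for i := 0; i < 3; i++ {
		// Hypothetical endpoint; in this issue it would be the meshed service.
		resp, err := client.Get("http://example-service:8080/healthz")
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		// Drain and close the body; otherwise the connection can't return to
		// the pool and the next request opens a brand-new connection.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}

	// When this client is done talking to the upstream, close the pooled
	// connections instead of leaving them open indefinitely.
	transport.CloseIdleConnections()
}
```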
Yes, that is consistent with what we are seeing. I had been under the impression that each request spawns two linked connections (#1 and #2), not that #1 can be kept alive separately.
This is the default behaviour seen when using
Where the connection is marked inactive but persists until it receives a FIN/RST, which doesn't happen until the proxy shuts down. We've found a potential workaround by setting
but this requires rebuilding all of our apps with a custom connection pool, which is no small task. It also doesn't directly address the problem as it just culls all active connections after X period, so it's more brute force than elegant. I will also look into why these connections aren't being reused.
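For the Go apps mentioned elsewhere in this thread, a loose analogue of that connection-pool workaround is the transport's idle-connection timeout. The sketch below is an assumption-laden illustration (the helper name, limits, and timeout values are invented), not the setting used above, and it shares the same limitation: it only bounds how long an unused connection lingers rather than explaining why connections stop being reused.

```go
package main

import (
	"net/http"
	"time"
)

// newCullingClient is a hypothetical helper: it returns an *http.Client whose
// idle connections are torn down after idleTimeout. This culls lingering
// connections after the fact; it does not fix the underlying reuse behaviour.
func newCullingClient(idleTimeout time.Duration) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			MaxIdleConns:        100,
			MaxIdleConnsPerHost: 10,
			IdleConnTimeout:     idleTimeout, // e.g. 30 * time.Second
		},
		Timeout: 10 * time.Second,
	}
}

func main() {
	client := newCullingClient(30 * time.Second)
	_ = client // used by the application's request paths
}
```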
@Anthony-Bible is your app also using Java and Netty?
Unfortunately not, ours is a Python app we package up with pex.
Sorry, correction: it's Golang, so no wrappers. And FWIW, the leaked connections are pod-to-pod communication within the same statefulset.
To add to @Anthony-Bible's case: our app relies on the standard golang http package (an illustrative sketch of this kind of setup follows this comment). Client init:
Request (sync):
Request (async):
We tested this in a very small environment, but the same tests executed on Friday generated the majority of the connections seen to the left of the graphs.
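An illustrative sketch of the kind of standard net/http setup being described: a shared client init, a synchronous request, and an asynchronous request fired from a goroutine. The endpoints, pool sizes, and helper names here are assumptions, not the actual code from the comment above.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// Client init: a single shared client with a pooled transport (assumed values).
var httpClient = &http.Client{
	Transport: &http.Transport{
		MaxIdleConnsPerHost: 20,
		IdleConnTimeout:     90 * time.Second,
	},
	Timeout: 10 * time.Second,
}

// Request (sync): issue the request and block until the response is consumed.
func doSync(url string) error {
	resp, err := httpClient.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

// Request (async): the same call fired from a goroutine; errors are just logged.
func doAsync(url string) {
	go func() {
		if err := doSync(url); err != nil {
			fmt.Println("async request failed:", err)
		}
	}()
}

func main() {
	// Hypothetical peer address; in the reported setup this would be another
	// pod in the same statefulset.
	url := "http://other-pod:8080/ping"
	_ = doSync(url)
	doAsync(url)
	time.Sleep(time.Second) // give the async request time to finish in this demo
}
```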
After months of agonising, we finally found the culprit for our issue with our Java apps. We are using Reactor Netty and otel-agent v1.9.0, and we found an open bug in the otel-agent that causes a new connection pool to be opened for each request, which never gets closed. We've upgraded otel and this has gone away: open-telemetry/opentelemetry-java-instrumentation#4862. Sadly this isn't relevant to your issue, but I wanted to follow up in case anyone stumbles upon this thread and sees my previous responses.
Just wanted to clarify: @SIGUSR2 is my coworker, so that's the solution we're currently using as a workaround.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
What is the issue?
One set of our pods has a connection leak when doing pod-to-pod communication within the same statefulset. This leaves connections open until around 2k concurrent connections, at which point linkerd-proxy crashes due to its memory limits.
If you want to look at it yourself, do the following:
How can it be reproduced?
This appears to be reproducible in statefulsets with pod-to-pod communication over HTTP.
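For illustration, a hedged sketch of a driver that generates this traffic pattern: each statefulset replica calls a peer replica directly over HTTP via the headless service's per-pod DNS names. The service, namespace, port, and path below are placeholders, and this is not a confirmed reproducer of the leak.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}

	// With a statefulset and a headless service, each replica is reachable at a
	// stable per-pod DNS name. The names below are placeholders.
	peers := []string{
		"http://my-app-0.my-app.default.svc.cluster.local:8080/ping",
		"http://my-app-1.my-app.default.svc.cluster.local:8080/ping",
	}

	for {
		for _, url := range peers {
			resp, err := client.Get(url)
			if err != nil {
				fmt.Println("request failed:", err)
				continue
			}
			// Drain and close so connections can be reused between iterations.
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
		}
		time.Sleep(500 * time.Millisecond)
	}
}
```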
Logs, error output, etc
Please see screenshots below
Output of linkerd check -o short:
Environment
Kubernetes version:
Environment:
GKE
linkerd version:
Additional context
This appears to have been introduced in 2.11; on 2.10, TCP connections stay consistently low.
2.11:
2.10:
Would you like to work on fixing this bug?
No response