Client should expose a mechanism to close underlying TCP connections #374
Is this failure mode unique to HTTP/2 connections? |
Is the problem that the timeout is wrong, or that the client doesn't notice when connections have been half-closed? |
I can't tell if it's limited to HTTP/2, but it looks to me like clients maintain persistent TCP connections no matter the application protocol, so they would be hit by the same failure. It's the latter problem - as far as we can tell, the load balancer is hanging up on decommissioned IP addresses without sending a FIN packet (or we never receive it), so the client thinks the connection is still open. The HTTP timeout is set correctly, but opening a new session appears to reuse the existing half-closed connection. |
The same issue also happens with an NLB in AWS. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
We have situations in which pods using the client have to be restarted in order to work around this problem. What can we do to get some movement here? |
There is sufficient plumbing in rest.Config to wrap the transport or set up a connection-tracking dialer yourself today, while this is being considered as a built-in feature
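For illustration of the interim workaround mentioned in the previous comment, a minimal sketch of a connection-tracking dialer wired in through rest.Config's Dial hook might look like the following. The trackingDialer type, its CloseAll method, and the wireInto helper are hypothetical names for this sketch, not client-go API:

```go
package example

import (
	"context"
	"net"
	"sync"
	"time"

	"k8s.io/client-go/rest"
)

// trackingDialer records every TCP connection it opens so that all of them
// can be force-closed later, e.g. when requests start timing out against a
// half-closed connection.
type trackingDialer struct {
	dialer net.Dialer
	mu     sync.Mutex
	conns  []net.Conn
}

func (d *trackingDialer) DialContext(ctx context.Context, network, address string) (net.Conn, error) {
	conn, err := d.dialer.DialContext(ctx, network, address)
	if err != nil {
		return nil, err
	}
	d.mu.Lock()
	d.conns = append(d.conns, conn)
	d.mu.Unlock()
	return conn, nil
}

// CloseAll tears down every tracked connection; the client has to dial
// fresh TCP connections for subsequent requests. (A real implementation
// would also prune connections that were already closed.)
func (d *trackingDialer) CloseAll() {
	d.mu.Lock()
	defer d.mu.Unlock()
	for _, c := range d.conns {
		_ = c.Close()
	}
	d.conns = nil
}

// wireInto installs the dialer on a rest.Config via its Dial hook.
func wireInto(cfg *rest.Config) *trackingDialer {
	td := &trackingDialer{dialer: net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}}
	cfg.Dial = td.DialContext
	return td
}
```

A caller that detects a wedged connection (for example, repeated request timeouts) could then invoke CloseAll to force subsequent requests onto new TCP connections.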
Right, I experimented a while ago with my own Transport. Unfortunately, the problem is deeper, at the TCP level (at least for us). In the Transport I can control some TCP keepalive settings via the dialer (this was my initial idea for a fix), but that does not help, not least because Go's default TCP keepalive settings for the transport are already quite good. It turned out that our problem was/is with incorrect load-balancer settings. The following happens for us:
What helps is to set quite low timeouts on the load balancers. |
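For context, the dialer-level keepalive experiment described in the comment above looks roughly like this; the timeout values are illustrative only, and, as noted, http.DefaultTransport already uses a 30-second keepalive, which is why tuning this alone did not fix the problem:

```go
package example

import (
	"net"
	"net/http"
	"time"
)

// newKeepaliveTransport builds an http.Transport whose dialer enables TCP
// keepalive probes on otherwise idle connections.
func newKeepaliveTransport() *http.Transport {
	dialer := &net.Dialer{
		Timeout:   10 * time.Second, // illustrative values
		KeepAlive: 15 * time.Second,
	}
	return &http.Transport{
		DialContext:         dialer.DialContext,
		TLSHandshakeTimeout: 10 * time.Second,
		IdleConnTimeout:     90 * time.Second,
	}
}
```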
How much overlap does this have with #65012? See: kubernetes/kubernetes#65012 (comment) |
@lavalamp It is not only h2 that is affected. There are some problems in the Go default Transport; I would guess this is broken basically everywhere in Kubernetes and in most third-party controllers. |
@szuecs thank you, that was helpful. I think there are multiple reasons this class of issue has existed for a long time:
A confounding factor is that we do need to reuse connections for performance. It still seems like calling Ping() periodically for HTTP/2 connections (like I suggested in #65012) and then recommending that everyone use HTTP/2 should solve all of the problems -- e.g. DNS should get re-resolved if the connection is closed. |
H2 + Ping() is probably the safest bet we have, because you test end-to-end. |
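For anyone landing here later: the x/net/http2 Transport has since gained ReadIdleTimeout and PingTimeout fields that implement exactly this kind of ping-based health check. A minimal sketch of enabling it on a standard transport (the timeout values are illustrative, not recommendations):

```go
package example

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// enableH2HealthChecks upgrades t to HTTP/2 and turns on ping-based health
// checking so half-closed connections get torn down instead of hanging
// until the kernel gives up on retransmissions.
func enableH2HealthChecks(t *http.Transport) error {
	h2, err := http2.ConfigureTransports(t)
	if err != nil {
		return err
	}
	h2.ReadIdleTimeout = 30 * time.Second // send a ping if no frames were read for 30s
	h2.PingTimeout = 15 * time.Second     // close the connection if the ping is not answered
	return nil
}
```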
- rest.Config Timeout setting does not work right now [1]
- the number of connections will not increase along with newly created clients: they always use a single connection...
[1] kubernetes/client-go#374 (comment)
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
@caesarxuchao @hakman I have created a k8s.io/apimachinery PR and a relevant golang/net PR for this problem, please take a look at them. |
Can anybody help me understand why TCP keepalive is not the answer here? If the load balancer is doing the right thing (health-checking apiservers and resetting existing sessions, or being a simple NATing LB using iptables), I think TCP keepalive should solve the problem. And Go already has it enabled by default (see Dialer.KeepAlive). So I would expect a dead connection to be closed in 2-3 minutes. Not the best, but at least better than the 10+ minutes that people are reporting. |
There are several kubernetes bugs [0,1,2] involving connection problems that seem related to the Go net/http2 library, where the stream state and connection state can get out of sync. This can manifest as a kubelet issue, where the node status gets stuck in a NotReady state, but can also happen elsewhere. In newer versions of the Go libraries some issues are fixed [3,4], but the fixes are not present in k8s 1.18. This change disables http2 in kube-apiserver and webhook-apiserver. This should be sufficient to avoid the majority of the issues, as disabling on one side of the connection is enough, and apiserver is generally either the client or the server.
0: kubernetes/kubernetes#87615
1: kubernetes/kubernetes#80313
2: kubernetes/client-go#374
3: golang/go#40423
4: golang/go#40201
Change-Id: Id693a7201acffccbc4b3db8f4e4b96290fd50288
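As an aside for client authors, the referenced change disables HTTP/2 on the server side; the equivalent client-side switch in plain net/http, per the net/http documentation, is to set Transport.TLSNextProto to a non-nil empty map. A minimal sketch (the function name is illustrative):

```go
package example

import (
	"crypto/tls"
	"net/http"
)

// newHTTP1OnlyTransport returns a transport that never negotiates HTTP/2:
// a non-nil, empty TLSNextProto map disables the automatic h2 upgrade on
// the client side.
func newHTTP1OnlyTransport() *http.Transport {
	return &http.Transport{
		TLSNextProto: map[string]func(authority string, c *tls.Conn) http.RoundTripper{},
	}
}
```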
@liuyuan10 I wondered the same thing a while ago. It turns out that for TCP keepalive to trigger, the connection cannot have anything in its transmit buffer. Since something is usually trying to use the client connection, there will almost always be some untransmitted bytes, so keepalive never triggers. |
@hpdvanwyk Thanks for sharing. Even in that case, because there is pending data in the transmit buffer, TCP will keep trying to send something, which to me basically acts like keepalive probes. Are you saying TCP will not shut down the session when an ACK is not received for more than 10 minutes? |
I must admit I'm not totally sure why TCP keepalive didn't help (but the http2 timeouts do), but at least in our case we assume the connections are getting lost "somewhere in the network", not on either the client or the server. Finding the root cause seems very difficult: in large setups this only happens occasionally (but often enough to trigger every few hours with a few hundred servers), multiple infrastructure teams are involved, and we cannot run and store network traces at that scale. We haven't been able to finally trace down what happens, but after weeks/months of trouble, enabling the http2 timeouts brought relief. In fact, by enforcing the timeouts over our entire code base (we "hacked" the vendored x/net library) with a timeout and a debug statement in case the timeout triggered, we were able to trace a whole bunch of random, weird issues all over our own code base, not only Kubernetes, back to those lost connections (and actually, we first observed the issue after upgrading Prometheus, after they started upgrading connections to http2). We assume two root causes in our setup:
Neither of these causes has been "frequent" at all ("frequent" being a matter of scale) -- but when they occurred, they resulted in 15-30 minute outages, including pages. The Prometheus authors in fact stated that a "normal" developer shouldn't even have to think about http2 timeouts and that Go should have reasonable defaults. I somehow tend to agree with them. |
@liuyuan10 TCP will retry sending the packet and eventually shut down the connection. From https://linux.die.net/man/7/tcp
This is why it takes 10+ minutes for the connection to be shut down if there is something in the transmit buffer instead of the 2 to 3 minutes you would expect if it was just TCP keepalives. |
@hpdvanwyk this pretty much exactly matches our observations, and sounds pretty reasonable. 👍 We had a 10-minute delay in getting paged (so 10+ minutes of broken connection), but when out of office, getting to your laptop and connecting to the clusters was most often too slow to still observe the issue -- it was already gone again. When we started measuring outage times, we never observed more than ~30 minutes until everything recovered on its own. I'm really glad we finally have some kind of explanation for the observed recovery times. Networking is way more complicated and involved than one would expect. |
In case it is of interest to anyone, Calico accessing the apiserver over IPVS also suffers [1] from this issue. [1] https://github.com/projectcalico/libcalico-go/issues/1267 |
@hpdvanwyk That's a very convincing explanation. Thanks a lot. This basically makes TCP keepalive useless in our case... |
@yousong we had the issues all over our own code base (lots of automation code written in golang accessing the API server). Enforcing |
Re kubernetes/kubernetes#52176, kubernetes/kubernetes#56720, #342
(Short form: a stalled TCP connection to apiserver from kubelet or kube-proxy can cause ~15 minutes of disruption across a substantial number of nodes until the local kernel closes the socket.)
I believe we need a mechanism for requestCanceler.CancelRequest to invoke transport.CloseIdleConnections based on config. It looks like we could do this with a small http.RoundTripper built to purpose. I hope to submit a PR with this change shortly.
I'm not sure if that behavior should be activated on a case-by-case basis using config.WrapTransport (less invasive, narrower change) or if it should be part of the core config used in transport.HTTPWrappersForConfig (needed in multiple use cases per the issues listed above). What's the convention here?
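For illustration, a minimal sketch of the kind of purpose-built wrapper described above, wired in through config.WrapTransport, might look like the following. The closableTransport type, the installCloser helper, and the wiring are assumptions for this sketch, not the eventual client-go implementation:

```go
package example

import (
	"net/http"

	"k8s.io/client-go/rest"
)

// closableTransport wraps the client transport so callers can force the
// underlying TCP connections to be dropped on demand.
type closableTransport struct {
	rt http.RoundTripper
}

func (t *closableTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return t.rt.RoundTrip(req)
}

// CloseIdleConnections forwards to the wrapped transport when it supports
// it, which the standard *http.Transport does.
func (t *closableTransport) CloseIdleConnections() {
	if ci, ok := t.rt.(interface{ CloseIdleConnections() }); ok {
		ci.CloseIdleConnections()
	}
}

// installCloser hooks the wrapper into a rest.Config. The returned holder
// is populated lazily, once the client machinery builds its transport and
// calls WrapTransport.
func installCloser(cfg *rest.Config) **closableTransport {
	holder := new(*closableTransport)
	cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		ct := &closableTransport{rt: rt}
		*holder = ct
		return ct
	}
	return holder
}
```

A caller that detects a stalled connection (for example, a cancelled or timed-out request) could then call CloseIdleConnections on the wrapper so that the next request has to establish a fresh TCP connection.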