
Client should expose a mechanism to close underlying TCP connections #374

Closed
jfoy opened this issue Mar 8, 2018 · 51 comments · Fixed by kubernetes/kubernetes#95981

@jfoy

jfoy commented Mar 8, 2018

Re kubernetes/kubernetes#52176, kubernetes/kubernetes#56720, #342

(Short form: a stalled TCP connection to apiserver from kubelet or kube-proxy can cause ~15 minutes of disruption across a substantial number of nodes until the local kernel closes the socket.)

I believe we need a mechanism for requestCanceler.CancelRequest to invoke transport.CloseIdleConnections based on config. It looks like we could do this with a small purpose-built http.RoundTripper. I hope to submit a PR with this change shortly.

I'm not sure if that behavior should be activated on a case-by-case basis using config.WrapTransport (less invasive, narrower change) or if it should be part of the core config used in transport.HTTPWrappersForConfig (needed in multiple use cases per issues listed above). What's the convention here?
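
To make the proposal concrete, here is a rough sketch (not client-go code; `idleClosingRoundTripper` is a made-up name) of a small wrapper that closes idle connections whenever a request is cancelled, so a half-closed TCP connection is not reused by the next call:

```go
package transportutil

import "net/http"

// idleClosingRoundTripper sketches the proposal above: when a request is
// cancelled (e.g. on timeout), also drop the delegate's idle connections so
// a possibly half-closed TCP connection is not picked up again.
type idleClosingRoundTripper struct {
	delegate http.RoundTripper
}

func (t *idleClosingRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	return t.delegate.RoundTrip(req)
}

// CancelRequest satisfies the requestCanceler interface that client-go looks for.
func (t *idleClosingRoundTripper) CancelRequest(req *http.Request) {
	if rc, ok := t.delegate.(interface{ CancelRequest(*http.Request) }); ok {
		rc.CancelRequest(req)
	}
	if ci, ok := t.delegate.(interface{ CloseIdleConnections() }); ok {
		ci.CloseIdleConnections()
	}
}
```

Wired up case-by-case it would look like `cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper { return &idleClosingRoundTripper{delegate: rt} }`.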

@lavalamp
Member

lavalamp commented Mar 8, 2018

Is this failure mode unique to HTTP/2 connections?

@lavalamp
Member

lavalamp commented Mar 8, 2018

Is the problem that the timeout is wrong, or that the client doesn't notice when connections have been half-closed?

@jfoy
Author

jfoy commented Mar 8, 2018

I can't tell if it's limited to HTTP/2, but it looks to me like clients maintain persistent TCP connections regardless of the application protocol, so they would be hit by the same failure.

It's the latter problem - as far as we can tell, the load balancer is hanging up on decommissioned IP addresses without sending a FIN packet (or we never receive it), so the client thinks the connection is still open. The HTTP timeout is set correctly, but opening a new session appears to reuse the existing half-closed connection.

@stephbu

stephbu commented Mar 30, 2018

The same issue also happens with NLB in AWS.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 28, 2018
@warmchang

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 5, 2018
@xh3b4sd

xh3b4sd commented Jul 24, 2018

We have situations in which pods using the client have to be restarted in order to work around this problem. What can we do to get some movement here?

@liggitt
Member

liggitt commented Jul 24, 2018

What can we do to get some movement here?

There is sufficient plumbing in rest.Config to wrap the transport or set up a connection-tracking dialer yourself today, while this is being considered as a built-in feature.
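
For example, a rough sketch of the connection-tracking-dialer approach (names like `trackingDialer` and `CloseAll` are illustrative, not part of client-go): remember every net.Conn handed to the transport so a health-check loop can force them closed when the apiserver looks unreachable.

```go
package main

import (
	"context"
	"net"
	"sync"
	"time"

	"k8s.io/client-go/rest"
)

// trackingDialer remembers every connection it hands to the transport so the
// caller can force them closed (for example from a health-check loop).
type trackingDialer struct {
	mu    sync.Mutex
	conns []net.Conn
}

func (d *trackingDialer) DialContext(ctx context.Context, network, addr string) (net.Conn, error) {
	conn, err := (&net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}).DialContext(ctx, network, addr)
	if err != nil {
		return nil, err
	}
	d.mu.Lock()
	d.conns = append(d.conns, conn)
	d.mu.Unlock()
	return conn, nil
}

// CloseAll tears down every tracked connection, forcing the transport to re-dial.
func (d *trackingDialer) CloseAll() {
	d.mu.Lock()
	defer d.mu.Unlock()
	for _, c := range d.conns {
		c.Close()
	}
	d.conns = nil
}

// configure plugs the dialer into rest.Config's Dial hook.
func configure(cfg *rest.Config) *trackingDialer {
	d := &trackingDialer{}
	cfg.Dial = d.DialContext
	return d
}
```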

@r7vme

r7vme commented Jul 25, 2018

There is sufficient plumbing in rest.Config to wrap the transport or set up a connection-tracking dialer yourself today, while this is being considered as a built-in feature.

Right, I experimented a while ago with my own Transport. Unfortunately the problem is deeper, at the TCP level (at least for us). In the Transport I can control some TCP keepalive settings via the dialer (this was my initial idea for a fix), but this does not help, not least because Go's default transport TCP keepalive settings are already quite good.

It turned out that our problem was/is incorrect load-balancer settings. The following happens for us:

  1. the endpoint (k8s API) is hard-killed (e.g. kill -9 on the qemu process)
  2. the load balancer detects after some timeout that the backend died
  3. the load balancer does not reset the existing connection to the client and keeps answering heartbeats
  4. on the client we see the connection stay in the ESTABLISHED state for ~15 min

What helps is to set quite low timeouts on the load balancers.
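
For reference, the keepalive tuning attempted above amounts to roughly this sketch (values are illustrative); as noted, it does not help when the failure is a connection silently dropped behind a load balancer:

```go
package main

import (
	"net"
	"time"

	"k8s.io/client-go/rest"
)

// tuneKeepalive is a sketch of dialer-level TCP keepalive tuning. Go's
// defaults (30s) are already reasonable, and keepalive probes are only sent
// on an otherwise idle socket, so this alone does not fix the ~15 min hang.
func tuneKeepalive(cfg *rest.Config) {
	dialer := &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 10 * time.Second, // illustrative: probe more often than the default
	}
	cfg.Dial = dialer.DialContext
}
```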

@lavalamp
Member

lavalamp commented Aug 6, 2018

How much overlap does this have with #65012? See: kubernetes/kubernetes#65012 (comment)

@szuecs
Member

szuecs commented Aug 17, 2018

@lavalamp It is not only h2 that is affected.

Basically there are some problems in the default Go Transport.
For example, it caches DNS responses forever (golang/go#23427).
The other, as described in #374 (comment), is that the load balancer (an iptables/ipvs service, for example) does not send a TCP RST packet to the client, which leaves us waiting on the Linux kernel's TCP timeout, because the current Timeout setting in client-go does not work.

I guess this is broken basically everywhere in Kubernetes and in most third-party controllers.
We also saw it in etcd-proxy, which we rewrote ourselves because it was easier to build than to get a fix merged upstream.

@lavalamp
Member

@szuecs thank you, that was helpful. I think there are multiple reasons this class of issue has existed for a long time:

  • There are actually multiple different issues with similar symptoms
    • DNS caching due to long connection retention time + apiserver IP changing in an HA environment
    • Long connection retention time + broken load balancer + apiserver restart in an HA environment
    • Long connection retention time + shared connection fate due to HTTP/2 + apiserver restart in an HA environment
  • Most people don't hit any of them, so reports are rare (for a long time I thought this was 100% broken load balancers)
  • Most reporters aren't clear about what exactly has gone on

A confounding factor is that we do need to reuse connections for performance.

It still seems like calling Ping() periodically for HTTP/2 connections (like I suggested in #65012) and then recommending that everyone use HTTP/2 should solve all of the problems -- e.g. DNS should get re-resolved if the connection is closed.
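
A sketch of that idea using the health-check knobs golang.org/x/net/http2 eventually gained (ReadIdleTimeout/PingTimeout; the values and helper name here are illustrative): if the connection sees no frames for ReadIdleTimeout, the client sends an h2 PING, and if no reply arrives within PingTimeout the connection is closed, so the next request re-dials and re-resolves DNS.

```go
package main

import (
	"crypto/tls"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newPingedTransport builds an HTTP/2-capable transport whose connections are
// periodically health-checked with PING frames and dropped when unresponsive.
func newPingedTransport(tlsConfig *tls.Config) (http.RoundTripper, error) {
	t := &http.Transport{
		TLSClientConfig:   tlsConfig,
		ForceAttemptHTTP2: true,
	}
	h2, err := http2.ConfigureTransports(t)
	if err != nil {
		return nil, err
	}
	h2.ReadIdleTimeout = 30 * time.Second // send a PING after 30s without frames
	h2.PingTimeout = 15 * time.Second     // close the connection if the PING goes unanswered
	return t, nil
}
```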

@szuecs
Member

szuecs commented Aug 20, 2018

H2 + Ping() is probably the safest bet we have, because you test end-to-end.

yunioncorp pushed a commit to yunionio/cloudpods that referenced this issue Aug 25, 2018
 - rest.Config Timeout setting does not work right now [1]
 - number of connections will not increase along with newed client: they
   use a single connection always...

 [1] kubernetes/client-go#374 (comment)
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 18, 2018
@george-angel

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 18, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2019
@szuecs
Member

szuecs commented Feb 16, 2019

/remove-lifecycle stale

@duyanghao

@caesarxuchao @hakman I have created a k8s.io/apimachinery PR and a related golang/net PR for this problem; please take a look.

@liuyuan10

liuyuan10 commented Oct 23, 2020

Can anybody help me understand why TCP keepalive is not the answer here? If the load balancer is doing the right thing (health-checking apiservers and resetting existing sessions, or being a simple NATing LB using iptables), I think TCP keepalive should solve the problem. And Go already has it enabled by default (see Dialer.KeepAlive). So I would expect a dead connection to be closed in 2-3 minutes. Not the best, but at least better than the 10+ minutes people are reporting.
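
For context, the default referred to here is the keepalive configured on the dialer that net/http's DefaultTransport uses, roughly (paraphrased sketch, not client-go code):

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// defaultishTransport mirrors the dialer settings net/http uses by default:
// TCP keepalive is enabled with a 30s period, so an *idle* dead connection is
// eventually detected by the kernel after several missed probes.
func defaultishTransport() *http.Transport {
	dialer := &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
	}
	return &http.Transport{DialContext: dialer.DialContext}
}
```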

airshipbot pushed a commit to airshipit/promenade that referenced this issue Oct 24, 2020
There are several kubernetes bugs [0,1,2] involving connection problems
that seem related to the Go net/http2 library, where the stream state
and connection state can get out of sync. This can manifest as a kubelet
issue, where the node status gets stuck in a NotReady state, but can
also happen elsewhere.

In newer versions of the Go libraries some issues are fixed [3,4], but
the fixes are not present in k8s 1.18.

This change disables http2 in kube-apiserver and webhook-apiserver. This
should be sufficient to avoid the majority of the issues, as disabling
on one side of the connection is enough, and apiserver is generally
either the client or the server.

0: kubernetes/kubernetes#87615
1: kubernetes/kubernetes#80313
2: kubernetes/client-go#374
3: golang/go#40423
4: golang/go#40201

Change-Id: Id693a7201acffccbc4b3db8f4e4b96290fd50288
@hpdvanwyk

@liuyuan10 I wondered the same thing a while ago. It turns out that for TCP keepalive to trigger, the connection cannot have anything in its transmit buffer. Since something is usually trying to use the client connection, there will almost always be some untransmitted bytes, so keepalive never triggers.

@liuyuan10

@hpdvanwyk Thanks for sharing. Even in that case, because there is pending data in the transmit buffer, TCP will always try to send something; to me that basically acts as keepalive probes. Are you saying TCP will not shut down the session when an ACK is not received for more than 10 minutes?

@JensErat

I must admit I'm not totally sure why TCP keepalive didn't help (but the http2 timeouts do), but at least in our case we assume connections get lost "somewhere in the network", not on either the client or the server. Finding the root cause seems very difficult: in large setups this only happens occasionally (but often enough to trigger every few hours with a few hundred servers), multiple infrastructure teams are involved, and we cannot run and store network traces at that scale.

We haven't been able to conclusively trace down what happens, but after weeks/months of trouble, enabling the http2 timeouts brought relief. In fact, by enforcing the timeouts across our entire code base (we "hacked" the vendored x/net library to add a timeout and a debug statement in case the timeout triggered), we were able to trace a whole bunch of random, weird issues back to those lost connections all over our own codebase, not only Kubernetes (and in fact we first observed the issue after upgrading Prometheus, once they started upgrading connections to http2).

In fact, we assume two root causes in our setup:

  • VMware VIO load balancers closing connections without sending an RST packet during reconfiguration, in specific releases. I think they have acknowledged this as a bug by now; maybe it is even resolved. I didn't personally track the issue down further after we applied our own http2 timeout patches. Still, this didn't break Kubernetes often enough for us to actually reproduce and debug it.
  • Some weird, very rare networking issues on a specific cross-datacenter connection. It happens so rarely that we cannot really blame anybody for it (a network will fail; we have to be able to deal with that).

Neither of these causes has been "frequent" at all ("frequent" being a matter of scale) -- but when they occurred, they resulted in 15-30 minute outages, including pages. The Prometheus authors in fact stated that a "normal" developer shouldn't even have to think about http2 timeouts and that Go should have reasonable defaults. I tend to agree with them.

@hpdvanwyk

@liuyuan10 TCP will retry sending the packet and eventually shut down the connection. From https://linux.die.net/man/7/tcp:

tcp_retries2 (integer; default: 15; since Linux 2.2)
The maximum number of times a TCP packet is retransmitted in established state before giving up. The default value is 15, which corresponds to a duration of approximately between 13 to 30 minutes, depending on the retransmission timeout. The RFC 1122 specified minimum limit of 100 seconds is typically deemed too short.

This is why it takes 10+ minutes for the connection to be shut down if there is something in the transmit buffer, instead of the 2 to 3 minutes you would expect if it were just TCP keepalives.
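
For completeness, that retransmission window can also be bounded per socket with Linux's TCP_USER_TIMEOUT instead of relying on the system-wide tcp_retries2. This is only a hedged sketch of that kernel knob (Linux-only, using golang.org/x/sys/unix; the helper name and value are illustrative), not something client-go does:

```go
package main

import (
	"net"
	"syscall"
	"time"

	"golang.org/x/sys/unix"
)

// newBoundedDialer sets TCP_USER_TIMEOUT so a connection with unacknowledged
// data is torn down by the kernel after roughly userTimeout instead of the
// ~13-30 minutes implied by the default tcp_retries2.
func newBoundedDialer(userTimeout time.Duration) *net.Dialer {
	return &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP,
					unix.TCP_USER_TIMEOUT, int(userTimeout.Milliseconds()))
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}
}
```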

@JensErat

@hpdvanwyk This pretty much exactly matches our observations and sounds very reasonable. 👍 We had a 10-minute delay in getting paged (so 10+ minutes of broken connection), but when not in the office, getting to a laptop and connecting to the clusters was usually too slow to observe the issue -- it was already gone again. When we started measuring outage times, we never observed more than ~30 minutes until everything recovered on its own.

I'm really glad we finally have some kind of explanation for the observed recovery times. Networking is way more complicated and involved than one would expect.

@yousong

yousong commented Oct 28, 2020

In case it is of interest to anyone, Calico accessing the apiserver over ipvs also suffers [1] from this issue.

[1] https://github.com/projectcalico/libcalico-go/issues/1267

@liuyuan10

@hpdvanwyk That's a very convincing explanation, thanks a lot. This basically makes TCP keepalive useless in our case...

@JensErat

@yousong We had the issues all over our own code base (lots of automation code written in Go accessing the API server). Enforcing ReadIdleTimeout just magically made our entire test base and e2e tests much more robust; lots of the "we need to track that down" flakiness was gone.
