
Client should expose a mechanism to close underlying TCP connections #374

Closed
jfoy opened this issue Mar 8, 2018 · 51 comments · Fixed by kubernetes/kubernetes#95981

@jfoy

jfoy commented Mar 8, 2018

Re kubernetes/kubernetes#52176, kubernetes/kubernetes#56720, #342

(Short form: a stalled TCP connection to apiserver from kubelet or kube-proxy can cause ~15 minutes of disruption across a substantial number of nodes until the local kernel closes the socket.)

I believe we need a mechanism for requestCanceler.CancelRequest to invoke transport.CloseIdleConnections based on config. It looks like we could do this with a small purpose-built http.RoundTripper. I hope to submit a PR with this change shortly.

I'm not sure if that behavior should be activated on a case-by-case basis using config.WrapTransport (less invasive, narrower change) or if it should be part of the core config used in transport.HTTPWrappersForConfig (needed in multiple use cases per issues listed above). What's the convention here?
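
To make the proposal concrete, here is a rough sketch (not client-go code; `idleClosingRoundTripper` is a made-up name) of a small wrapper that closes idle connections whenever a request is cancelled, so a half-closed TCP connection is not reused by the next call:

```go
package transportutil

import "net/http"

// idleClosingRoundTripper sketches the proposal above: when a request is
// cancelled (e.g. on timeout), also drop the delegate's idle connections so
// a possibly half-closed TCP connection is not picked up again.
type idleClosingRoundTripper struct {
	delegate http.RoundTripper
}

func (t *idleClosingRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	return t.delegate.RoundTrip(req)
}

// CancelRequest satisfies the requestCanceler interface that client-go looks for.
func (t *idleClosingRoundTripper) CancelRequest(req *http.Request) {
	if rc, ok := t.delegate.(interface{ CancelRequest(*http.Request) }); ok {
		rc.CancelRequest(req)
	}
	if ci, ok := t.delegate.(interface{ CloseIdleConnections() }); ok {
		ci.CloseIdleConnections()
	}
}
```

Wired up case-by-case it would look like `cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper { return &idleClosingRoundTripper{delegate: rt} }`.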

@lavalamp
Member

lavalamp commented Mar 8, 2018

Is this failure mode unique to HTTP/2 connections?

@lavalamp
Member

lavalamp commented Mar 8, 2018

Is the problem that the timeout is wrong, or that the client doesn't notice when connections have been half-closed?

@jfoy
Author

jfoy commented Mar 8, 2018

I can't tell if it's limited to HTTP/2, but it looks to me like clients maintain persistent TCP connections regardless of the application protocol, so they would be hit by the same failure.

It's the latter problem - as far as we can tell, the load balancer is hanging up on decommissioned IP addresses without sending a FIN packet (or we never receive it), so the client thinks the connection is still open. The HTTP timeout is set correctly, but opening a new session appears to reuse the existing half-closed connection.

@stephbu

stephbu commented Mar 30, 2018

The same issue also happens with NLB in AWS.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 28, 2018
@warmchang

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 5, 2018
@xh3b4sd

xh3b4sd commented Jul 24, 2018

We have situations in which pods using the client have to be restarted in order to work around this problem. What can we do to get some movement here?

@liggitt
Member

liggitt commented Jul 24, 2018

What can we do to get some movement here?

There is sufficient plumbing in rest.Config to wrap the transport or set up a connection-tracking dialer yourself today, while this is being considered as a built-in feature.
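
For example, a rough sketch of the connection-tracking-dialer approach (names like `trackingDialer` and `CloseAll` are illustrative, not part of client-go): remember every net.Conn handed to the transport so a health-check loop can force them closed when the apiserver looks unreachable.

```go
package main

import (
	"context"
	"net"
	"sync"
	"time"

	"k8s.io/client-go/rest"
)

// trackingDialer remembers every connection it hands to the transport so the
// caller can force them closed (for example from a health-check loop).
type trackingDialer struct {
	mu    sync.Mutex
	conns []net.Conn
}

func (d *trackingDialer) DialContext(ctx context.Context, network, addr string) (net.Conn, error) {
	conn, err := (&net.Dialer{Timeout: 30 * time.Second, KeepAlive: 30 * time.Second}).DialContext(ctx, network, addr)
	if err != nil {
		return nil, err
	}
	d.mu.Lock()
	d.conns = append(d.conns, conn)
	d.mu.Unlock()
	return conn, nil
}

// CloseAll tears down every tracked connection, forcing the transport to re-dial.
func (d *trackingDialer) CloseAll() {
	d.mu.Lock()
	defer d.mu.Unlock()
	for _, c := range d.conns {
		c.Close()
	}
	d.conns = nil
}

// configure plugs the dialer into rest.Config's Dial hook.
func configure(cfg *rest.Config) *trackingDialer {
	d := &trackingDialer{}
	cfg.Dial = d.DialContext
	return d
}
```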

@r7vme

r7vme commented Jul 25, 2018

There is sufficient plumbing in rest.Config to wrap the transport or set up a connection-tracking dialer yourself today, while this is being considered as a built-in feature.

Right, I experimented a while ago with my own Transport. Unfortunately the problem is deeper, at the TCP level (at least for us). In the Transport I can control some TCP keepalive settings via the dialer (this was my initial idea for a fix), but this does not help, not least because Go's default transport TCP keepalive settings are already quite good.

It turned out that our problem was/is incorrect load-balancer settings. The following happens for us:

  1. the endpoint (k8s API) is hard-killed (e.g. kill -9 on the qemu process)
  2. the load balancer detects after some timeout that the backend died
  3. the load balancer does not reset the existing connection to the client and keeps answering heartbeats
  4. on the client we see the connection stay in the ESTABLISHED state for ~15 min

What helps is to set quite low timeouts on the load balancers.
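
For reference, the keepalive tuning attempted above amounts to roughly this sketch (values are illustrative); as noted, it does not help when the failure is a connection silently dropped behind a load balancer:

```go
package main

import (
	"net"
	"time"

	"k8s.io/client-go/rest"
)

// tuneKeepalive is a sketch of dialer-level TCP keepalive tuning. Go's
// defaults (30s) are already reasonable, and keepalive probes are only sent
// on an otherwise idle socket, so this alone does not fix the ~15 min hang.
func tuneKeepalive(cfg *rest.Config) {
	dialer := &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 10 * time.Second, // illustrative: probe more often than the default
	}
	cfg.Dial = dialer.DialContext
}
```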

@lavalamp
Member

lavalamp commented Aug 6, 2018

How much overlap does this have with #65012? See: kubernetes/kubernetes#65012 (comment)

@szuecs
Member

szuecs commented Aug 17, 2018

@lavalamp It is not only h2 that is affected.

Basically there are some problems in the default Go Transport.
For example, it caches DNS responses forever (golang/go#23427).
The other, as described in #374 (comment), is that the load balancer (an iptables/ipvs service, for example) does not send a TCP RST packet to the client, which leaves us waiting on the Linux kernel's TCP timeout, because the current Timeout setting in client-go does not work.

I guess this is broken basically everywhere in Kubernetes and in most third-party controllers.
We also saw it in etcd-proxy, which we rewrote ourselves because it was easier to build than to get a fix merged upstream.

@lavalamp
Member

@szuecs thank you, that was helpful. I think there are multiple reasons this class of issue has existed for a long time:

  • There are actually multiple different issues with similar symptoms
    • DNS caching due to long connection retention time + apiserver IP changing in an HA environment
    • Long connection retention time + broken load balancer + apiserver restart in an HA environment
    • Long connection retention time + shared connection fate due to HTTP/2 + apiserver restart in an HA environment
  • Most people don't hit any of them, so reports are rare (for a long time I thought this was 100% broken load balancers)
  • Most reporters aren't clear about what exactly has gone on

A confounding factor is that we do need to reuse connections for performance.

It still seems like calling Ping() periodically for HTTP/2 connections (like I suggested in #65012) and then recommending that everyone use HTTP/2 should solve all of the problems -- e.g. DNS should get re-resolved if the connection is closed.
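
A sketch of that idea using the health-check knobs golang.org/x/net/http2 eventually gained (ReadIdleTimeout/PingTimeout; the values and helper name here are illustrative): if the connection sees no frames for ReadIdleTimeout, the client sends an h2 PING, and if no reply arrives within PingTimeout the connection is closed, so the next request re-dials and re-resolves DNS.

```go
package main

import (
	"crypto/tls"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newPingedTransport builds an HTTP/2-capable transport whose connections are
// periodically health-checked with PING frames and dropped when unresponsive.
func newPingedTransport(tlsConfig *tls.Config) (http.RoundTripper, error) {
	t := &http.Transport{
		TLSClientConfig:   tlsConfig,
		ForceAttemptHTTP2: true,
	}
	h2, err := http2.ConfigureTransports(t)
	if err != nil {
		return nil, err
	}
	h2.ReadIdleTimeout = 30 * time.Second // send a PING after 30s without frames
	h2.PingTimeout = 15 * time.Second     // close the connection if the PING goes unanswered
	return t, nil
}
```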

@szuecs
Member

szuecs commented Aug 20, 2018

H2 + Ping() is probably the safest bet we have, because you test end-to-end.

yunioncorp pushed a commit to yunionio/cloudpods that referenced this issue Aug 25, 2018
 - rest.Config Timeout setting does not work right now [1]
 - number of connections will not increase along with newed client: they
   use a single connection always...

 [1] kubernetes/client-go#374 (comment)
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 18, 2018
@george-angel

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 18, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2019
@szuecs
Member

szuecs commented Feb 16, 2019

/remove-lifecycle stale

@duyanghao

@caesarxuchao @hakman I have created a k8s.io/apimachinery PR and a related golang/net PR for this problem; please take a look.

@liuyuan10

liuyuan10 commented Oct 23, 2020

Can anybody help me understand why TCP keepalive is not the answer here? If the load balancer is doing the right thing (health-checking apiservers and resetting existing sessions, or being a simple NATing LB using iptables), I think TCP keepalive should solve the problem. And Go already has it enabled by default (see Dialer.KeepAlive). So I would expect a dead connection to be closed in 2-3 minutes. Not the best, but at least better than the 10+ minutes people are reporting.
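
For context, the default referred to here is the keepalive configured on the dialer that net/http's DefaultTransport uses, roughly (paraphrased sketch, not client-go code):

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// defaultishTransport mirrors the dialer settings net/http uses by default:
// TCP keepalive is enabled with a 30s period, so an *idle* dead connection is
// eventually detected by the kernel after several missed probes.
func defaultishTransport() *http.Transport {
	dialer := &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
	}
	return &http.Transport{DialContext: dialer.DialContext}
}
```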

airshipbot pushed a commit to airshipit/promenade that referenced this issue Oct 24, 2020
There are several kubernetes bugs [0,1,2] involving connection problems
that seem related to the Go net/http2 library, where the stream state
and connection state can get out of sync. This can manifest as a kubelet
issue, where the node status gets stuck in a NotReady state, but can
also happen elsewhere.

In newer versions of the Go libraries some issues are fixed [3,4], but
the fixes are not present in k8s 1.18.

This change disables http2 in kube-apiserver and webhook-apiserver. This
should be sufficient to avoid the majority of the issues, as disabling
on one side of the connection is enough, and apiserver is generally
either the client or the server.

0: kubernetes/kubernetes#87615
1: kubernetes/kubernetes#80313
2: kubernetes/client-go#374
3: golang/go#40423
4: golang/go#40201

Change-Id: Id693a7201acffccbc4b3db8f4e4b96290fd50288
@hpdvanwyk

@liuyuan10 I wondered the same thing a while ago. It turns out that for TCP keepalive to trigger, the connection cannot have anything in its transmit buffer. Since something is usually trying to use the client connection, there will almost always be some untransmitted bytes, so keepalive never triggers.

@liuyuan10

@hpdvanwyk Thanks for sharing. Even in that case, because there is pending data in the transmit buffer, TCP will always try to send something; to me that basically acts as keepalive probes. Are you saying TCP will not shut down the session when an ACK is not received for more than 10 minutes?

@JensErat

I must admit I'm not totally sure why TCP keepalive didn't help (but the http2 timeouts do), but at least in our case we assume connections get lost "somewhere in the network", not on either the client or the server. Finding the root cause seems very difficult: in large setups this only happens occasionally (but often enough to trigger every few hours with a few hundred servers), multiple infrastructure teams are involved, and we cannot run and store network traces at that scale.

We haven't been able to conclusively trace down what happens, but after weeks/months of trouble, enabling the http2 timeouts brought relief. In fact, by enforcing the timeouts across our entire code base (we "hacked" the vendored x/net library to add a timeout and a debug statement in case the timeout triggered), we were able to trace a whole bunch of random, weird issues back to those lost connections all over our own codebase, not only Kubernetes (and in fact we first observed the issue after upgrading Prometheus, once they started upgrading connections to http2).

In fact, we assume two root causes in our setup:

  • VMware VIO load balancers closing connections without sending an RST packet during reconfiguration, in specific releases. I think they have acknowledged this as a bug by now; maybe it is even resolved. I didn't personally track the issue down further after we applied our own http2 timeout patches. Still, this didn't break Kubernetes often enough for us to actually reproduce and debug it.
  • Some weird, very rare networking issues on a specific cross-datacenter connection. It happens so rarely that we cannot really blame anybody for it (a network will fail; we have to be able to deal with that).

Neither of these causes has been "frequent" at all ("frequent" being a matter of scale) -- but when they occurred, they resulted in 15-30 minute outages, including pages. The Prometheus authors in fact stated that a "normal" developer shouldn't even have to think about http2 timeouts and that Go should have reasonable defaults. I tend to agree with them.

@hpdvanwyk

@liuyuan10 TCP will retry sending the packet and eventually shut down the connection. From https://linux.die.net/man/7/tcp:

tcp_retries2 (integer; default: 15; since Linux 2.2)
The maximum number of times a TCP packet is retransmitted in established state before giving up. The default value is 15, which corresponds to a duration of approximately between 13 to 30 minutes, depending on the retransmission timeout. The RFC 1122 specified minimum limit of 100 seconds is typically deemed too short.

This is why it takes 10+ minutes for the connection to be shut down if there is something in the transmit buffer, instead of the 2 to 3 minutes you would expect if it were just TCP keepalives.
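
For completeness, that retransmission window can also be bounded per socket with Linux's TCP_USER_TIMEOUT instead of relying on the system-wide tcp_retries2. This is only a hedged sketch of that kernel knob (Linux-only, using golang.org/x/sys/unix; the helper name and value are illustrative), not something client-go does:

```go
package main

import (
	"net"
	"syscall"
	"time"

	"golang.org/x/sys/unix"
)

// newBoundedDialer sets TCP_USER_TIMEOUT so a connection with unacknowledged
// data is torn down by the kernel after roughly userTimeout instead of the
// ~13-30 minutes implied by the default tcp_retries2.
func newBoundedDialer(userTimeout time.Duration) *net.Dialer {
	return &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP,
					unix.TCP_USER_TIMEOUT, int(userTimeout.Milliseconds()))
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}
}
```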

@JensErat

@hpdvanwyk This pretty much exactly matches our observations and sounds very reasonable. 👍 We had a 10-minute delay in getting paged (so 10+ minutes of broken connection), but when not in the office, getting to a laptop and connecting to the clusters was usually too slow to observe the issue -- it was already gone again. When we started measuring outage times, we never observed more than ~30 minutes until everything recovered on its own.

I'm really glad we finally have some kind of explanation for the observed recovery times. Networking is way more complicated and involved than one would expect.

@yousong

yousong commented Oct 28, 2020

In case it is of interest to anyone, Calico accessing the apiserver over ipvs also suffers [1] from this issue.

[1] https://github.com/projectcalico/libcalico-go/issues/1267

@liuyuan10

@hpdvanwyk That's a very convincing explanation, thanks a lot. This basically makes TCP keepalive useless in our case...

@JensErat

@yousong We had the issues all over our own code base (lots of automation code written in Go accessing the API server). Enforcing ReadIdleTimeout just magically made our entire test base and e2e tests much more robust; lots of the "we need to track that down" flakiness was gone.
