-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enables HTTP/2 health check #95981
Enables HTTP/2 health check #95981
Conversation
I disagree with setting update: I think the |
@cheftako, this health check is very similar to the GRPC health check we want to add to the apiserver-network-proxy to solve kubernetes-sigs/apiserver-network-proxy#152. Their performance implication is also similar. |
@cheftako
It would be more convincing if we can test the behavior in a large cluster. @kubernetes/sig-scalability is there an easy way to test this PR in a large cluster? @JensErat do you mind sharing the size of the clusters where you turned on the http/2 health check feature? Thanks. |
/test |
@BenTheElder: The
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I think https://github.com/kubernetes/test-infra/blob/b40bcac24d0ec96cc7d440ca5ece690cce7a62fe/config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml#L67 is what you want, there are some optional scale presubmit. On second thought though, even if that were the right one, we should check with someone from the scale team before using these resources 🙃 |
Thanks, @BenTheElder ! The pull-kubernetes-e2e-gce-large-performance uses 2000 nodes, that seems to be the right test. I'll check with the scalability sig. |
Seems relatively harmless from scalability point of view. I'm fine with running 2k presubmit to verify, however our approval is not needed: @tosi3k - for visibility as next oncaller |
I think we should backport to 1.18, given:
I opened #100376 to pick this to 1.18 |
Given the consequences when this fails, I'm in favor of backporting this change if we can get an ack (ideally from @caesarxuchao) that the backport is fairly safe. My knowledge of the golang stack at this depth is fairly limited. Is there a particular minimum golang level that is required or was it just the |
@caesarxuchao acked the backport PR (#100376)
It was contained to |
Cherrypick #95981 to 1.18, Enables HTTP/2 health check
Following up from cilium#19259, this commit implements the same logic as kubelet [1] to close all idle connections. We can rely on HTTP/2 to determine which connections are idle (HTTP/2 pings), instead of closing all connections indiscriminately (when the heartbeat fails) via the custom dialer in setDialer(). We can assume HTTP/2 support as it has been supported since K8s 1.20 [2]. Users can disable HTTP/2 by setting the DISABLE_HTTP2 environment variable, however this only really meant from very advanced users, hence no documentation. Kubernetes has been running with HTTP/2 as the default for a long time (~since Dec 2020) without major issues. CloseIdleConnectionsFor() was introduced in K8s 1.23 [3], hence this commit cannot be backported to any earlier version of Cilium which doesn't support K8s 1.23. [1]: kubernetes/kubernetes#104844 [2]: kubernetes/kubernetes#95981 [3]: kubernetes/kubernetes@b9d865a Suggested-by: André Martins <andre@cilium.io> Signed-off-by: Chris Tarazi <chris@isovalent.com>
Following up from #19259, this commit implements the same logic as kubelet [1] to close all idle connections. We can rely on HTTP/2 to determine which connections are idle (HTTP/2 pings), instead of closing all connections indiscriminately (when the heartbeat fails) via the custom dialer in setDialer(). We can assume HTTP/2 support as it has been supported since K8s 1.20 [2]. Users can disable HTTP/2 by setting the DISABLE_HTTP2 environment variable, however this only really meant from very advanced users, hence no documentation. Kubernetes has been running with HTTP/2 as the default for a long time (~since Dec 2020) without major issues. CloseIdleConnectionsFor() was introduced in K8s 1.23 [3], hence this commit cannot be backported to any earlier version of Cilium which doesn't support K8s 1.23. [1]: kubernetes/kubernetes#104844 [2]: kubernetes/kubernetes#95981 [3]: kubernetes/kubernetes@b9d865a Suggested-by: André Martins <andre@cilium.io> Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 365c788 ] Following up from cilium#19259, this commit implements the same logic as kubelet [1] to close all idle connections. We can rely on HTTP/2 to determine which connections are idle (HTTP/2 pings), instead of closing all connections indiscriminately (when the heartbeat fails) via the custom dialer in setDialer(). We can assume HTTP/2 support as it has been supported since K8s 1.20 [2]. Users can disable HTTP/2 by setting the DISABLE_HTTP2 environment variable, however this only really meant from very advanced users, hence no documentation. Kubernetes has been running with HTTP/2 as the default for a long time (~since Dec 2020) without major issues. CloseIdleConnectionsFor() was introduced in K8s 1.23 [3], hence this commit cannot be backported to any earlier version of Cilium which doesn't support K8s 1.23. [1]: kubernetes/kubernetes#104844 [2]: kubernetes/kubernetes#95981 [3]: kubernetes/kubernetes@b9d865a Suggested-by: André Martins <andre@cilium.io> Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 365c788 ] Following up from #19259, this commit implements the same logic as kubelet [1] to close all idle connections. We can rely on HTTP/2 to determine which connections are idle (HTTP/2 pings), instead of closing all connections indiscriminately (when the heartbeat fails) via the custom dialer in setDialer(). We can assume HTTP/2 support as it has been supported since K8s 1.20 [2]. Users can disable HTTP/2 by setting the DISABLE_HTTP2 environment variable, however this only really meant from very advanced users, hence no documentation. Kubernetes has been running with HTTP/2 as the default for a long time (~since Dec 2020) without major issues. CloseIdleConnectionsFor() was introduced in K8s 1.23 [3], hence this commit cannot be backported to any earlier version of Cilium which doesn't support K8s 1.23. [1]: kubernetes/kubernetes#104844 [2]: kubernetes/kubernetes#95981 [3]: kubernetes/kubernetes@b9d865a Suggested-by: André Martins <andre@cilium.io> Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
The env vars were implemented in kubernetes#95981. Effectively no impacts if using the default values here. BUG=255296578 Change-Id: I2441ef1a181a0fb202af7893c25cff5ff15ef17a
The env vars were implemented in kubernetes#95981. Effectively no impacts if using the default values here. BUG=255296578 Change-Id: I2441ef1a181a0fb202af7893c25cff5ff15ef17a Remove GODEBUG=x509sha1=1 workaround Bug: 227456358a Revert changes made in I5165e6c2fe73e8e1b2a617ced591133228b6d275 Change-Id: I7dd1ca8f9796c542c89074703e44e6b6f4b6edab
The env vars were implemented in kubernetes#95981. Effectively no impacts if using the default values here. BUG=255296578 Change-Id: I2441ef1a181a0fb202af7893c25cff5ff15ef17a Remove GODEBUG=x509sha1=1 workaround Bug: 227456358a Revert changes made in I5165e6c2fe73e8e1b2a617ced591133228b6d275 Change-Id: I7dd1ca8f9796c542c89074703e44e6b6f4b6edab
This PR enables the HTTP/2 connection health check for all the clients that call
SetTransportDefaults
. This includes all clients created via k8s.io/client-go. Without the health check, a broken transport layer connection can stay in the HTTP/2 connection pool for a long time, causing requests to the same host to fail. For example, a broken TCP connection can linger for 15 minutes before being closed.The HTTP/2 feature exposes two parameters,
ReadIdleTimeout
: if the HTTP/2 client has not received any frame from the connection for this amount of time, it starts sending periodic pings to the server. The period of the pings is alsoReadIdleTimeout
.PingTimeout
: if an ACK to the ping is not received by the client after this amount of time, the connection is considered broken and will be closed by the client.For Kubernetes, I default the
ReadIdleTimeout
to 30s, which I think is not going to cause performance issues even in large clusters, becauseI'll wait for results from pull-kubernetes-e2e-gce-large-performance to confirm.
I default the
PingTimeout
to 15s, which is relaxed enough in most network conditions.Although I think most users don't need to tune these two parameters, I add environment variables to allow user to tune them, or even disable the health check, by setting
HTTP2_READ_IDLE_TIMEOUT_SECONDS=0
.Fixes kubernetes/client-go#374 and
#87615.
Ref. #95898
#94844
/kind bug
/sig api-machinery