Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubelet don't share transport by default after #95427 #100849

Closed
chenyw1990 opened this issue Apr 6, 2021 · 17 comments · Fixed by #104844
Closed

kubelet don't share transport by default after #95427 #100849

chenyw1990 opened this issue Apr 6, 2021 · 17 comments · Fixed by #104844
Assignees
Labels
area/kubelet kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@chenyw1990
Copy link
Contributor

chenyw1990 commented Apr 6, 2021

What happened:

In #95427, client-go don't share transport when c.Dial is not nil,

if c.TLS.GetCert != nil || c.Dial != nil || c.Proxy != nil {

but kubelet custorm Dial by default,
clientConfig.Dial = d.DialContext

After PR is integrated, the connections between kubelet and kube-apiserver in our cluster with 4000 nodes increases by five times.

What you expected to happen:

kubelet share transport by default, one kubelet only keep one connection to kube-apiserver.
@liggitt

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

start kubelet with rotate-certificates=false

Environment:

  • Kubernetes version (use kubectl version): v1.19.4
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):EulerOS 2.9
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
@chenyw1990 chenyw1990 added the kind/bug Categorizes issue or PR as related to a bug. label Apr 6, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 6, 2021
@chenyw1990
Copy link
Contributor Author

/sig api-machinery
/sig node

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 6, 2021
@jpbetz
Copy link
Contributor

jpbetz commented Apr 6, 2021

/triage accepted
/assign @yliaog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 6, 2021
@yliaog
Copy link
Contributor

yliaog commented Apr 6, 2021

how many connections between kubelet and kube apiserver on one node after PR #95427 in your case? it's mentioned it increases by five times, does that mean before the PR, there is one connection, after the PR, there are 5 or 6 connections?

@chenyw1990
Copy link
Contributor Author

before the PR, there is one connection, after the PR, there 3 connections,
there are some customization capabilities in our product, and we'll have another two connections.
In our kubelet startup parameters, rotate-certificates=false

@liggitt
Copy link
Member

liggitt commented Apr 7, 2021

Some increase was expected, since multiple REST clients constructed from a config with a custom dialer cannot safely share a transport, and client-go constructs a REST client for each API group/version accessed. Something like #97821 would be required to rework how client-go constructs clients to start sharing REST clients above the transport level between API groups/versions.

I think this is working as intended until client-go client construction is reworked.

@yliaog
Copy link
Contributor

yliaog commented Apr 7, 2021

do you know where in the code the 5 connections are created? are they all safe to share the transport?

also are the 5 connections all active all the time? if not, IdleConnTimeout should cause the connection to be closed.

@chenyw1990
Copy link
Contributor Author

I test the code on master branch, if I change rotateCertificates to false, there are 6 connections between kubelet and kube-apiserver

image

image

@chenyw1990
Copy link
Contributor Author

chenyw1990 commented Apr 7, 2021

transport will share when config.Transport not nil

if rotateCertificates is true, config.Transport will been set in kubeletcertificate.UpdateTransport

closeAllConns, err := kubeletcertificate.UpdateTransport(wait.NeverStop, transportConfig, clientCertificateManager, 5*time.Minute)

but if rotateCertificates is false, config.Transport will not been set.

@gjkim42
Copy link
Member

gjkim42 commented Jun 24, 2021

/area kubelet

@chenyw1990 @yliaog
Any update on this issue?

@chenyw1990
Copy link
Contributor Author

@gjkim42 I will work on this issue

@chenyw1990
Copy link
Contributor Author

@liggitt

Some increase was expected, since multiple REST clients constructed from a config with a custom dialer cannot safely share a transport, and client-go constructs a REST client for each API group/version accessed. Something like #97821 would be required to rework how client-go constructs clients to start sharing REST clients above the transport level between API groups/versions.

I think this is working as intended until client-go client construction is reworked.

Hi @liggitt, if rotate-certificates=false kubelet customizes the dialer to provide a closeAllConns to close the connection when the connection is dead but not been closed.

In #78016, there is a discuss about another solution

As discussed in kubernetes/client-go#374, another way to fix this is using http/2.0's ping frame to keep connection alive and identify failed connections. but it seems to be a long-term solution.

after #96778 client-go has HTTP 2.0 health check, so we don't need this change.

So we can solve the current problem by reverting #78016.

I have submitted #103149 to revert #78016, and this change requires your final confirmation.

Thanks you.

@liggitt
Copy link
Member

liggitt commented Sep 7, 2021

are we guaranteed kubelets always speak to kube-apiserver over http/2? if they use a load-balancer that is not http/2 capable, isn't the "close all connections" fix still required as a backstop?

@aojea
Copy link
Member

aojea commented Sep 7, 2021

I thought http1.1 wasn't affected because it will always try to connect a new connection
kubernetes/client-go#374 (comment)

@liggitt
Copy link
Member

liggitt commented Sep 8, 2021

I thought http1.1 wasn't affected because it will always try to connect a new connection

I'm not confident in that. The original issue dated back to before client-go defaulted to http/2, and we definitely saw issues with stuck TCP connections then.

@aojea
Copy link
Member

aojea commented Sep 8, 2021

I thought http1.1 wasn't affected because it will always try to connect a new connection

I'm not confident in that. The original issue dated back to before client-go defaulted to http/2, and we definitely saw issues with stuck TCP connections then.

yeah, I could reproduce it with http1.1, however, it can be solved using the transport CloseIdleConnections method

@liggitt please take a look #104844, the last commit needs more work, but I think that the 2 first commits exposeing the CloseIdleConnections() are useful

@aojea
Copy link
Member

aojea commented Sep 13, 2021

ok, this seems to work #104844, but it will be nice to have more feedback
@chenyw1990 would you be able to test #104844 and see if this solve your issue?

@chenyw1990
Copy link
Contributor Author

ok, this seems to work #104844, but it will be nice to have more feedback
@chenyw1990 would you be able to test #104844 and see if this solve your issue?

Hi aojea, I have test #104844 on v1.21.1
before merge #104844, there are 6 connections to kube-apiserver from kubelet
image

after merge #104844, there is only 1 connection to kube-apiserver from kubelet
image

so, #104844 can solve my issue.
thank you very much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kubelet kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
7 participants