-
Notifications
You must be signed in to change notification settings - Fork 39.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
track/close kubelet->API connections on heartbeat failure #63492
track/close kubelet->API connections on heartbeat failure #63492
Conversation
@kubernetes/sig-node-pr-reviews |
60fd863
to
05116f9
Compare
cc @sjenning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend adding a test to make sure OnHeartbeatFailure
triggers.
cmd/kubelet/app/server.go
Outdated
@@ -541,19 +541,19 @@ func run(s *options.KubeletServer, kubeDeps *kubelet.Dependencies) (err error) { | |||
return fmt.Errorf("invalid kubeconfig: %v", err) | |||
} | |||
|
|||
var clientCertificateManager certificate.Manager | |||
var clientCertificateManager certificate.Manager = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to do anything useful, remove = nil
@@ -51,80 +51,93 @@ import ( | |||
// | |||
// stopCh should be used to indicate when the transport is unused and doesn't need | |||
// to continue checking the manager. | |||
func UpdateTransport(stopCh <-chan struct{}, clientConfig *restclient.Config, clientCertificateManager certificate.Manager, exitAfter time.Duration) error { | |||
func UpdateTransport(stopCh <-chan struct{}, clientConfig *restclient.Config, clientCertificateManager certificate.Manager, exitAfter time.Duration) (func(), error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document the new return value in func comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
8a0a991
to
c47a7a9
Compare
done |
c47a7a9
to
f18a52d
Compare
@@ -350,6 +350,9 @@ func (kl *Kubelet) updateNodeStatus() error { | |||
glog.V(5).Infof("Updating node status") | |||
for i := 0; i < nodeStatusUpdateRetry; i++ { | |||
if err := kl.tryUpdateNodeStatus(i); err != nil { | |||
if i > 0 && kl.onRepeatedHeartbeatFailure != nil { | |||
kl.onRepeatedHeartbeatFailure() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to set onRepeatedHeartbeatFailure to nil or something? (once we invoke the method)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that would mean the kubelet would hit the same issue if the network condition was encountered twice during a single process lifetime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack thanks!
f18a52d
to
814b065
Compare
/test pull-kubernetes-e2e-gke |
/test pull-kubernetes-local-e2e |
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here. |
I plan to open picks to 1.8, 1.9, and 1.10, but will hold them until this makes it through serial/soak/scale CI tests |
green serial run containing this PR: many green slow suite CI runs containing this PR, e.g.: scale test showed no regressions (https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-scale-performance, run 154) |
xref #48638
xref kubernetes-retired/kube-aws#598
we're already typically tracking kubelet -> API connections and have the ability to force close them as part of client cert rotation. if we do that tracking unconditionally, we gain the ability to also force close connections on heartbeat failure as well. it's a big hammer (means reestablishing pod watches, etc), but so is having all your pods evicted because you didn't heartbeat.
this intentionally does minimal refactoring/extraction of the cert connection tracking transport in case we want to backport this
follow-ups:
/sig node
/sig api-machinery