Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

track/close kubelet->API connections on heartbeat failure #63492

Merged

Conversation

liggitt
Copy link
Member

@liggitt liggitt commented May 7, 2018

xref #48638
xref kubernetes-retired/kube-aws#598

we're already typically tracking kubelet -> API connections and have the ability to force close them as part of client cert rotation. if we do that tracking unconditionally, we gain the ability to also force close connections on heartbeat failure as well. it's a big hammer (means reestablishing pod watches, etc), but so is having all your pods evicted because you didn't heartbeat.

this intentionally does minimal refactoring/extraction of the cert connection tracking transport in case we want to backport this

  • first commit unconditionally sets up the connection-tracking dialer, and moves all the cert management logic inside an if-block that gets skipped if no certificate manager is provided (view with whitespace ignored to see what actually changed)
  • second commit plumbs the connection-closing function to the heartbeat loop and calls it on repeated failures

follow-ups:

  • consider backporting this to 1.10, 1.9, 1.8
  • refactor the connection managing dialer to not be so tightly bound to the client certificate management

/sig node
/sig api-machinery

kubelet: fix hangs in updating Node status after network interruptions/changes between the kubelet and API server

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 7, 2018
@k8s-ci-robot k8s-ci-robot requested review from awly, dims and yifan-gu May 7, 2018 16:17
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels May 7, 2018
@liggitt
Copy link
Member Author

liggitt commented May 7, 2018

@kubernetes/sig-node-pr-reviews

@liggitt liggitt force-pushed the node-heartbeat-close-connections branch from 60fd863 to 05116f9 Compare May 7, 2018 16:40
@liggitt
Copy link
Member Author

liggitt commented May 7, 2018

cc @sjenning

Copy link
Contributor

@awly awly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend adding a test to make sure OnHeartbeatFailure triggers.

@@ -541,19 +541,19 @@ func run(s *options.KubeletServer, kubeDeps *kubelet.Dependencies) (err error) {
return fmt.Errorf("invalid kubeconfig: %v", err)
}

var clientCertificateManager certificate.Manager
var clientCertificateManager certificate.Manager = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to do anything useful, remove = nil

@@ -51,80 +51,93 @@ import (
//
// stopCh should be used to indicate when the transport is unused and doesn't need
// to continue checking the manager.
func UpdateTransport(stopCh <-chan struct{}, clientConfig *restclient.Config, clientCertificateManager certificate.Manager, exitAfter time.Duration) error {
func UpdateTransport(stopCh <-chan struct{}, clientConfig *restclient.Config, clientCertificateManager certificate.Manager, exitAfter time.Duration) (func(), error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document the new return value in func comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@liggitt liggitt force-pushed the node-heartbeat-close-connections branch 2 times, most recently from 8a0a991 to c47a7a9 Compare May 7, 2018 17:10
@liggitt
Copy link
Member Author

liggitt commented May 7, 2018

Recommend adding a test to make sure OnHeartbeatFailure triggers.

done

@liggitt liggitt changed the title WIP - track/close kubelet->API connections on heartbeat failure track/close kubelet->API connections on heartbeat failure May 7, 2018
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 7, 2018
@liggitt liggitt force-pushed the node-heartbeat-close-connections branch from c47a7a9 to f18a52d Compare May 7, 2018 17:33
@@ -350,6 +350,9 @@ func (kl *Kubelet) updateNodeStatus() error {
glog.V(5).Infof("Updating node status")
for i := 0; i < nodeStatusUpdateRetry; i++ {
if err := kl.tryUpdateNodeStatus(i); err != nil {
if i > 0 && kl.onRepeatedHeartbeatFailure != nil {
kl.onRepeatedHeartbeatFailure()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to set onRepeatedHeartbeatFailure to nil or something? (once we invoke the method)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that would mean the kubelet would hit the same issue if the network condition was encountered twice during a single process lifetime

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack thanks!

@liggitt liggitt changed the title track/close kubelet->API connections on heartbeat failure WIP - track/close kubelet->API connections on heartbeat failure May 7, 2018
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2018
@liggitt liggitt force-pushed the node-heartbeat-close-connections branch from f18a52d to 814b065 Compare May 7, 2018 19:07
@liggitt
Copy link
Member Author

liggitt commented May 7, 2018

/test pull-kubernetes-e2e-gke

@liggitt
Copy link
Member Author

liggitt commented May 7, 2018

/test pull-kubernetes-local-e2e

@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 8220171 into kubernetes:master May 14, 2018
@liggitt
Copy link
Member Author

liggitt commented May 14, 2018

I plan to open picks to 1.8, 1.9, and 1.10, but will hold them until this makes it through serial/soak/scale CI tests

@liggitt
Copy link
Member Author

liggitt commented May 15, 2018

k8s-github-robot pushed a commit that referenced this pull request May 17, 2018
…2-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #63492: Always track kubelet -> API connections

Cherry pick of #63492 on release-1.10.

#63492: Always track kubelet -> API connections
k8s-github-robot pushed a commit that referenced this pull request May 17, 2018
…2-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #63492: Always track kubelet -> API connections

Cherry pick of #63492 on release-1.8.

#63492: Always track kubelet -> API connections
k8s-github-robot pushed a commit that referenced this pull request Jun 27, 2018
…2-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #63492: Always track kubelet -> API connections

Cherry pick of #63492 on release-1.9.

#63492: Always track kubelet -> API connections
jackfrancis added a commit to jackfrancis/aks-engine that referenced this pull request Jan 3, 2019
jackfrancis added a commit to Azure/aks-engine that referenced this pull request Jan 3, 2019
juhacket pushed a commit to juhacket/aks-engine that referenced this pull request Mar 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.