A watchdog to restart nodes in NotReady state #15606

victor-sudakov · 2023-07-08T16:23:12Z

/kind feature

1. Describe IN DETAIL the feature/behavior/change you would like to see.

There are cases when a node is NotReady from the point of view of Kubernets/kOps, but is healthy from the point of view of the corresponding AWS autoscaling group. The easiest way to reproduce this situation is to stop the kubelet service on a node. A node will stay in the NotReady state forever after that. What is worse, pods using PVCs will never be rescheduled from such a node to other nodes because EBS volumes are still attached to the NotReady node and cannot be detached until it is stopped or terminated.

Can we add some watchdog addon which would signal aws autoscaling set-instance-health --health-status Unhealthy or something similar to AWS when a node has been NotReady for a certain configured time?

This would allow clusters to heal themselves.

The text was updated successfully, but these errors were encountered:

hakman · 2023-07-08T18:22:37Z

ClusterAutoscaler might be a good fit for this use case. They recently added a new flag --scale-down-unready-enabled.
kubernetes/autoscaler#5537

victor-sudakov · 2023-07-09T06:22:17Z

I have always been wary of the ClusterAutoscaler due to its complexity, so I have never used it. For example from https://kops.sigs.k8s.io/addons/#cluster-autoscaler I get the idea that I have to specify the latest supported image of ClusterAutocaler for the specified kubernetes version - does it mean that I will have to change this manually in the manifest on each kops upgrade cluster run?

In short, I think it's an overkill for my rather simple tasks and will add a lot of admin overhead. Do you think there is an alternative simpler solution?

hakman · 2023-07-09T06:51:30Z

I have always been wary of the ClusterAutoscaler due to its complexity, so I have never used it. For example from https://kops.sigs.k8s.io/addons/#cluster-autoscaler I get the idea that I have to specify the latest supported image of ClusterAutocaler for the specified kubernetes version - does it mean that I will have to change this manually in the manifest on each kops upgrade cluster run?

There are 3 options here:

You wait for a contributor to add these args for cluster-autoscaler to kOps
You make a PR to add these args to kOps
You deploy and manage cluster-autoscaler yourself with these args

In short, I think it's an overkill for my rather simple tasks and will add a lot of admin overhead. Do you think there is an alternative simpler solution?

Check if https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#defaulttolerationseconds helps. Once nodes are empty, cluster-autoscaler should terminate them.

victor-sudakov · 2023-07-09T07:22:36Z

There are 3 options here:
You wait for a contributor to add these args for cluster-autoscaler to kOps

Can you please confirm, do I understand correctly then that the installation of ClusterAutoscaler in its current state will indeed break the automation of kops upgrade cluster? Or was it my erroneous interpretation of the docs?

Once nodes are empty, cluster-autoscaler should terminate them.

The affected nodes will never be empty because their pods using PVCs will never be rescheduled from such nodes to other ones because EBS volumes are still attached to the NotReady node and cannot be detached until it is stopped or terminated. You need a node to be really terminated to free and reattach the EBS volume which backs the PV.

Snippet from the log of such a situation:

Warning  FailedAttachVolume  3m9s  attachdetach-controller  Multi-Attach error for volume "pvc-bb834416-d65f-4d85-b7bf-8f64e2e62786" Volume is already exclusively attached to one node and can't be attached to another
Warning  FailedMount         66s   kubelet                  Unable to attach or mount volumes: unmounted volumes=[db-rm], unattached volumes=[db-rm kube-api-access-ss8w5]: timed out waiting for the condition

hakman · 2023-07-09T07:26:35Z

Can you please confirm, do I understand correctly then that the installation of ClusterAutoscaler in its current state will indeed break the automation of kops upgrade cluster? Or was it my erroneous interpretation of the docs?

You cannot use kOps to manage an addon (like cluster-autoscaler) and modify the manifests later. kOps will eventually notice and remove the changes.

victor-sudakov · 2023-07-09T07:34:35Z

You cannot use kOps to manage an addon (like cluster-autoscaler) and modify the manifests later. kOps will eventually notice and remove the changes.

Sorry, I don't quite understand you. So, when I add the autoscaler addon, I specify a particular version in spec.image, correct? Then, when I have to upgrade the cluster via kops upgrade cluster, what should I do with the addon and with the value of spec.image? Update it manually, and at what moment?

hakman · 2023-07-09T07:46:06Z

The problem is not the spec.image. That will be kept. You will not be able to change --scale-down-unneeded-time.

PS: Just try, it's easy to test your assumptions on a test cluster.

victor-sudakov · 2023-07-09T07:52:33Z

I guess I don't need it kept, I need it updated automatically to the latest version the cluster is running.

Of course, I'll experiment with the addon, but I still think it's an overkill for the simple task of restarting NotReady nodes just to free EBS volumes. The documentation itself for the addon is overwhelming.

victor-sudakov · 2023-07-10T12:56:27Z

Want something like this just for kOps: https://docs.digitalocean.com/developer-center/automatic-node-repair-on-digitalocean-kubernetes/

hakman · 2023-07-10T15:48:07Z

What you want will mostly work on AWS, not on most other supported cloud providers. kOps uses cluster autoscaler in general, which already has the feature, so most likely there will be no such feature added.
The --scale-down-unready-enabled is just added to disable the feature you want. The functionality was there for some time (probably 4+ years).

Add flag '--scale-down-unready-enabled' to enable or disable scale-down of unready nodes. Default value set to true for backwards compatibility (i.e., allow scale-down of unready nodes).
There are cases where a user may not want the unready nodes to be removed from a cluster.
As an example, but not limited to, this might be useful in case a node is unreachable for a period of time and local data live there, the node shall remain in the cluster, and possibly an admin may want to take any actions for recovering it.

victor-sudakov · 2023-07-11T08:34:41Z

I have not received a definite answer, can you please tell me: does the autoscaler bring additional administrative overhead after installation into the cluster?

hakman · 2023-07-11T09:28:04Z

All you have to do is to enable it. If you want a newer image, kOps will not overwrite it on update/upgrade.

  clusterAutoscaler:
    enabled: true

victor-sudakov · 2023-07-18T08:34:48Z

Sorry, does not solve the problem. I installed the autoscaler and metrics server:

clusterAutoscaler:
  enabled: true
metricsServer:
  enabled: true
  insecure: true

did a rolling update of the cluster, then SSH-ed into a node and stopped the kubelet. The node became NotReady in the "kubectl get nodes" output and has been in this state for 28 minutes already. No attempt has been made to restart or terminate it.

Probably to use the autoscaler as a NotReady watchdog, some more non-default configuration is needed?

hakman · 2023-07-18T08:39:19Z

Could be combined with other timers. I would suggest to wait 1 hour and see.
Also, check cluster-autoscaler logs to see what it things of the node.

DefaultScaleDownUnreadyTime = 20 * time.Minute

victor-sudakov · 2023-07-18T08:40:00Z

An hour is too much for cluster recovery, but I will wait and report.

victor-sudakov · 2023-07-18T08:47:14Z

PS an hour is too much because a node in a NotReady state does not release its EBS volumes and the statefulSets with PVCs stop working for the whole time a node is not ready. In this context, even 5 minutes is too much.

olemarkus · 2023-07-18T08:54:35Z

I think you are perhaps looking for the wrong solution. When nodes become not-ready, pods should be evicted. The time before eviction is also configurable. See https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions

PVCs certainly follow the pod regardless of the instance state. EBS controller will take care to unattach and reattach EBS volumes as needed.

victor-sudakov · 2023-07-18T09:06:44Z

When nodes become not-ready, pods should be evicted.

That is a different story. Even if all the pods are successfully evicted from the node, it is not good to keep the NotReady node forever. There should be some self-repair mechanism in the cluster, like the one used in the Digital Ocean at the link 9 messages above.

olemarkus · 2023-07-18T09:12:28Z

If the node has been (mostly) evicted, i.e its utilisation is below the configured CAS threshold, then CAS will terminate it. There is a huge difference between failing workloads and failing instances. Besides, you also want the not-ready node to stick around so you can determine the cause of failure.

victor-sudakov · 2023-07-18T09:13:17Z

I would suggest to wait 1 hour and see.

An hour has passed and still ... nada. The log of one of the three autoscaler pods: https://termbin.com/z2vu

victor-sudakov · 2023-07-18T09:15:54Z

Besides, you also want the not-ready node to stick around so you can determine the cause of failure.

Maybe I do if a replacement node has been started. But not the way it works now when the failed node is just marked NotReady and no replacement is started.

olemarkus · 2023-07-18T09:16:23Z

The logs say that it shouldn't scale down because you are at minimum capacity anyway. And there is no need for additional capacity in the cluster. This is not a very likely scenario in a cluster that has actual workloads.

victor-sudakov · 2023-07-18T09:25:23Z

And no eviction happens either. The statefulSet-managed pods are just stuck in the "Terminating" status on the NotReady node. I can see the NotReady node has the node.kubernetes.io/unreachable:NoExecute and node.kubernetes.io/unreachable:NoSchedule taints but the pod is there and has not been rescheduled anywhere.

You know what? Could you reproduce this for me?

Create a kOps cluster on AWS.
Create a statefulSet with one test pod and a PVC (make sure you configure a volumeClaimTemplate in the statefulSet and the pod has a volume attached).
Disable kubelet on the node where your test pod is running (e.g. via SSH to the node). This we emulate a faulty node.
Watch the node in the NotReady state forever and the test pod in the "Terminating" status forever, never rescheduled.

victor-sudakov · 2023-07-18T09:31:03Z

The logs say that it shouldn't scale down because you are at minimum capacity anyway. And there is no need for additional capacity in the cluster. This is not a very likely scenario in a cluster that has actual workloads.

This is because I really don't need autoscaling, I just need faulty node replacement. And you are right, this one is an experimental cluster.

olemarkus · 2023-07-18T09:45:17Z

Sorry I am not able to reproduce this now. If the pod hangs on termination, there is a finaliser blocking it, most likely.

hakman · 2023-07-18T09:58:04Z

See: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-ca-deal-with-unready-nodes

victor-sudakov · 2023-07-18T10:13:37Z

See: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-ca-deal-with-unready-nodes

Once there are more unready nodes in the cluster, CA stops all operations until the situation improves.

Totally does not sound like it would restart or recreate any nodes. Looks like the autoscaler just stops working if there are more unready nodes than a certain threshold.

victor-sudakov · 2023-07-18T10:15:21Z

Sorry I am not able to reproduce this now. If the pod hangs on termination, there is a finaliser blocking it, most likely.

Whatever is the problem with eviction it is not for this issue. I'll debug it and may open another issue or find the reason. Let's stick to kOps' self healing.

olemarkus · 2023-07-18T11:39:40Z

Sure. Happy to review a PR.

k8s-triage-robot · 2024-03-26T22:18:55Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-03-26T22:18:59Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

victor-sudakov · 2024-03-27T02:37:10Z

/reopen

k8s-ci-robot · 2024-03-27T02:37:14Z

@victor-sudakov: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-triage-robot · 2024-04-26T03:17:44Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-04-26T03:17:48Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

victor-sudakov · 2024-04-26T03:51:00Z

/reopen

k8s-ci-robot · 2024-04-26T03:51:04Z

@victor-sudakov: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-triage-robot · 2024-05-26T04:27:17Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-05-26T04:27:22Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

victor-sudakov · 2024-05-27T05:48:31Z

/reopen

k8s-ci-robot · 2024-05-27T05:48:36Z

@victor-sudakov: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-triage-robot · 2024-06-26T05:58:33Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-06-26T05:58:38Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

stl-victor-sudakov · 2024-06-26T06:09:28Z

/reopen

k8s-ci-robot · 2024-06-26T06:09:32Z

@stl-victor-sudakov: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

victor-sudakov · 2024-06-26T06:10:17Z

/reopen

k8s-ci-robot · 2024-06-26T06:10:22Z

@victor-sudakov: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-triage-robot · 2024-07-26T06:31:33Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-07-26T06:31:38Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 8, 2023

k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 25, 2024

k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 26, 2024

k8s-ci-robot reopened this Mar 27, 2024

k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 26, 2024

k8s-ci-robot reopened this Apr 26, 2024

k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale May 26, 2024

k8s-ci-robot reopened this May 27, 2024

k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 26, 2024

k8s-ci-robot reopened this Jun 26, 2024

k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 26, 2024

A watchdog to restart nodes in NotReady state #15606

A watchdog to restart nodes in NotReady state #15606

Comments

victor-sudakov commented Jul 8, 2023

hakman commented Jul 8, 2023 • edited Loading

victor-sudakov commented Jul 9, 2023

hakman commented Jul 9, 2023

victor-sudakov commented Jul 9, 2023

hakman commented Jul 9, 2023

victor-sudakov commented Jul 9, 2023

hakman commented Jul 9, 2023

victor-sudakov commented Jul 9, 2023

victor-sudakov commented Jul 10, 2023

hakman commented Jul 10, 2023

victor-sudakov commented Jul 11, 2023

hakman commented Jul 11, 2023

victor-sudakov commented Jul 18, 2023

hakman commented Jul 18, 2023

victor-sudakov commented Jul 18, 2023

victor-sudakov commented Jul 18, 2023

olemarkus commented Jul 18, 2023

victor-sudakov commented Jul 18, 2023

olemarkus commented Jul 18, 2023

victor-sudakov commented Jul 18, 2023

victor-sudakov commented Jul 18, 2023

olemarkus commented Jul 18, 2023

victor-sudakov commented Jul 18, 2023 • edited Loading

victor-sudakov commented Jul 18, 2023 • edited Loading

olemarkus commented Jul 18, 2023

hakman commented Jul 18, 2023

victor-sudakov commented Jul 18, 2023

victor-sudakov commented Jul 18, 2023

olemarkus commented Jul 18, 2023

k8s-triage-robot commented Mar 26, 2024

k8s-ci-robot commented Mar 26, 2024

victor-sudakov commented Mar 27, 2024

k8s-ci-robot commented Mar 27, 2024

k8s-triage-robot commented Apr 26, 2024

k8s-ci-robot commented Apr 26, 2024

victor-sudakov commented Apr 26, 2024

k8s-ci-robot commented Apr 26, 2024

k8s-triage-robot commented May 26, 2024

k8s-ci-robot commented May 26, 2024

victor-sudakov commented May 27, 2024

k8s-ci-robot commented May 27, 2024

k8s-triage-robot commented Jun 26, 2024

k8s-ci-robot commented Jun 26, 2024

stl-victor-sudakov commented Jun 26, 2024

k8s-ci-robot commented Jun 26, 2024

victor-sudakov commented Jun 26, 2024

k8s-ci-robot commented Jun 26, 2024

k8s-triage-robot commented Jul 26, 2024

k8s-ci-robot commented Jul 26, 2024

hakman commented Jul 8, 2023 •

edited

Loading

victor-sudakov commented Jul 18, 2023 •

edited

Loading

victor-sudakov commented Jul 18, 2023 •

edited

Loading