
Consolidate nodes that are tainted due to certain node conditions #2544

Closed
andrewhibbert opened this issue Sep 23, 2022 · 3 comments
Labels
feature New feature or request

Comments

@andrewhibbert
Contributor

Tell us about your request

For example, if you are running node-problem-detector with TaintNodesByCondition enabled (as it is in EKS), nodes get tainted when problems are detected, and no further pods will schedule on them until the condition clears. Perhaps the consolidation feature could remove such nodes after a period of time, to ensure pods are running on healthy nodes.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

For example, given the unreachable taints:

Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule

You could check that the condition has been occurring for a certain amount of time:

Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason                    Message
  ----                 ------    -----------------                 ------------------                ------                    -------
  KernelDeadlock       False     Fri, 23 Sep 2022 16:21:53 +0100   Fri, 23 Sep 2022 15:36:22 +0100   KernelHasNoDeadlock       kernel has no deadlock
  ReadonlyFilesystem   False     Fri, 23 Sep 2022 16:21:53 +0100   Fri, 23 Sep 2022 15:36:22 +0100   FilesystemIsNotReadOnly   Filesystem is not read-only
  Ready                Unknown   Fri, 23 Sep 2022 16:15:40 +0100   Fri, 23 Sep 2022 16:16:20 +0100   NodeStatusUnknown         Kubelet stopped posting node status.
  MemoryPressure       Unknown   Fri, 23 Sep 2022 16:15:40 +0100   Fri, 23 Sep 2022 16:16:20 +0100   NodeStatusUnknown         Kubelet stopped posting node status.
  DiskPressure         Unknown   Fri, 23 Sep 2022 16:15:40 +0100   Fri, 23 Sep 2022 16:16:20 +0100   NodeStatusUnknown         Kubelet stopped posting node status.
  PIDPressure          Unknown   Fri, 23 Sep 2022 16:15:40 +0100   Fri, 23 Sep 2022 16:16:20 +0100   NodeStatusUnknown         Kubelet stopped posting node status.

Then, if the condition has persisted longer than that time, consolidate the node.
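
A minimal sketch of the kind of check being requested, written as a small standalone client-go loop rather than Karpenter's actual consolidation code; the unhealthyFor threshold is an illustrative assumption, and the real feature would presumably act on the node (cordon/deprovision it) instead of just logging:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// unhealthyFor is a hypothetical threshold; the requested feature would
// presumably make this configurable (possibly per condition type).
const unhealthyFor = 15 * time.Minute

func main() {
	// In-cluster config; a kubeconfig-based config would also work outside the cluster.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			// A node whose Ready condition has not been True for longer than the
			// threshold is a candidate for replacement, letting the provisioner
			// bring up a healthy node for the displaced pods.
			if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue &&
				time.Since(cond.LastTransitionTime.Time) > unhealthyFor {
				fmt.Printf("node %s not Ready since %s, candidate for consolidation\n",
					node.Name, cond.LastTransitionTime.Format(time.RFC3339))
			}
		}
	}
}
```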

Are you currently working around this issue?

N/A

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@andrewhibbert added the feature (New feature or request) label on Sep 23, 2022
@dewjam
Contributor

dewjam commented Sep 27, 2022

This is an interesting request. I think it also sort of correlates with #2235.

Are you thinking the taints would be configurable as well, @andrewhibbert?

@yevhen-harmonizehr

+1 for this feature. I can see two use cases for it:

  • Something changed in the launch template (e.g. I created a new IAM role and assigned it to the aws_iam_instance_profile used by Karpenter) and all live nodes went "NotReady". In this case I want Karpenter to rotate all "NotReady" nodes and auto-heal itself.
  • Disaster recovery: let's assume we have a 3-zone cluster and at some point one zone goes down (e.g. it loses network connectivity); in this scenario it would be great if Karpenter moved the failed zone's instances to an available zone. I know this one looks harder, because Karpenter would need to track zone status somehow. But even a simple flag like "TTLSecondsAfterNotReady" would be a good start.

@ellistarn
Contributor

Closing as duplicate of kubernetes-sigs/karpenter#750
