
Consolidate nodes that are tainted due to certain node conditions #2544

Closed
andrewhibbert opened this issue Sep 23, 2022 · 3 comments
Labels
feature New feature or request

Comments

@andrewhibbert
Contributor

Tell us about your request

For example, if you are running node-problem-detector with TaintNodesByCondition enabled (as it is in EKS), nodes get tainted when problems are detected, and no further pods will schedule on them until the condition clears. Perhaps the consolidation feature could remove such nodes after a period of time, to ensure pods are running on healthy nodes.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

For example, given the unreachable taints:

Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule

You could check that the condition has been occurring for a certain amount of time:

Conditions:
  Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason                    Message
  ----                 ------    -----------------                 ------------------                ------                    -------
  KernelDeadlock       False     Fri, 23 Sep 2022 16:21:53 +0100   Fri, 23 Sep 2022 15:36:22 +0100   KernelHasNoDeadlock       kernel has no deadlock
  ReadonlyFilesystem   False     Fri, 23 Sep 2022 16:21:53 +0100   Fri, 23 Sep 2022 15:36:22 +0100   FilesystemIsNotReadOnly   Filesystem is not read-only
  Ready                Unknown   Fri, 23 Sep 2022 16:15:40 +0100   Fri, 23 Sep 2022 16:16:20 +0100   NodeStatusUnknown         Kubelet stopped posting node status.
  MemoryPressure       Unknown   Fri, 23 Sep 2022 16:15:40 +0100   Fri, 23 Sep 2022 16:16:20 +0100   NodeStatusUnknown         Kubelet stopped posting node status.
  DiskPressure         Unknown   Fri, 23 Sep 2022 16:15:40 +0100   Fri, 23 Sep 2022 16:16:20 +0100   NodeStatusUnknown         Kubelet stopped posting node status.
  PIDPressure          Unknown   Fri, 23 Sep 2022 16:15:40 +0100   Fri, 23 Sep 2022 16:16:20 +0100   NodeStatusUnknown         Kubelet stopped posting node status.

Then, if the condition has persisted longer than that time, consolidate the node.
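
A minimal sketch of the kind of check being requested, written as a small standalone client-go loop rather than Karpenter's actual consolidation code; the unhealthyFor threshold is an illustrative assumption, and the real feature would presumably act on the node (cordon/deprovision it) instead of just logging:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// unhealthyFor is a hypothetical threshold; the requested feature would
// presumably make this configurable (possibly per condition type).
const unhealthyFor = 15 * time.Minute

func main() {
	// In-cluster config; a kubeconfig-based config would also work outside the cluster.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			// A node whose Ready condition has not been True for longer than the
			// threshold is a candidate for replacement, letting the provisioner
			// bring up a healthy node for the displaced pods.
			if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue &&
				time.Since(cond.LastTransitionTime.Time) > unhealthyFor {
				fmt.Printf("node %s not Ready since %s, candidate for consolidation\n",
					node.Name, cond.LastTransitionTime.Format(time.RFC3339))
			}
		}
	}
}
```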

Are you currently working around this issue?

N/A

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@andrewhibbert added the feature (New feature or request) label on Sep 23, 2022
@dewjam
Contributor

dewjam commented Sep 27, 2022

This is an interesting request. I think it also sort of correlates with #2235.

Are you thinking the taints would be configurable as well, @andrewhibbert?

@yevhen-harmonizehr

+1 for this feature. I can see two use cases for it:

  • Something changed in the launch template (e.g. I created a new IAM role and assigned it to the aws_iam_instance_profile used by Karpenter) and all live nodes went "NotReady". In this case I want Karpenter to rotate all "NotReady" nodes and auto-heal itself.
  • Disaster recovery: let's assume we have a 3-zone cluster and at some point one zone goes down (e.g. it loses network connectivity); in this scenario it would be great if Karpenter moved the failed zone's instances to an available zone. I know this one looks harder, because Karpenter would need to track zone status somehow. But even a simple flag like "TTLSecondsAfterNotReady" would be a good start.

@ellistarn
Contributor

Closing as duplicate of kubernetes-sigs/karpenter#750
