
Karpenter should show Disrupting or Terminating through kubectl get nodes when it has tainted Nodes #1152

Open
jonathan-innis opened this issue Apr 2, 2024 · 9 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@jonathan-innis
Member

jonathan-innis commented Apr 2, 2024

Description

What problem are you trying to solve?

Currently, Kubernetes uses the node.kubernetes.io/unschedulable taint and the spec.unschedulable field on the node to mark that a node is cordoned and may be about to be drained for maintenance or removal. This is visible through the printer columns you get when you call kubectl get nodes, like the following:

NAME                                                    STATUS                     ROLES    AGE    VERSION
ip-192-168-10-60.us-west-2.compute.internal             Ready                      <none>   2d4h   v1.28.5-eks-5e0fdde
ip-192-168-125-1.us-west-2.compute.internal             Ready                      <none>   2d4h   v1.28.5-eks-5e0fdde
ip-192-168-9-118.us-west-2.compute.internal             Ready,SchedulingDisabled   <none>   2d4h   v1.28.5-eks-5e0fdde

The code for this handling can be seen in the printer columns logic for kubectl here.
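For reference, that STATUS handling boils down to roughly the following (a paraphrased Go sketch of the kubectl node printer behavior, not the verbatim upstream code). The key point is that the SchedulingDisabled suffix is driven entirely by spec.unschedulable, so a taint on its own never changes the column.

// Paraphrased sketch of the STATUS column logic in kubectl's node printer;
// not the verbatim upstream code.
package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

func nodeStatus(node *corev1.Node) string {
	status := []string{}
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			if c.Status == corev1.ConditionTrue {
				status = append(status, "Ready")
			} else {
				status = append(status, "NotReady")
			}
		}
	}
	if len(status) == 0 {
		status = append(status, "Unknown")
	}
	// This is the part Karpenter's taint never triggers: the suffix is keyed
	// off spec.unschedulable, not off any taint on the node.
	if node.Spec.Unschedulable {
		status = append(status, "SchedulingDisabled")
	}
	return strings.Join(status, ",")
}

func main() {
	n := &corev1.Node{}
	n.Spec.Unschedulable = true
	n.Status.Conditions = []corev1.NodeCondition{{Type: corev1.NodeReady, Status: corev1.ConditionTrue}}
	fmt.Println(nodeStatus(n)) // prints: Ready,SchedulingDisabled
}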

This is nice visibility for users when Kubernetes is using this specific field; however, nothing is surfaced when Karpenter adds its own taint and is actively draining the node, since Karpenter doesn't update the spec.unschedulable field that the printer relies on to add the SchedulingDisabled suffix to the node's status.

It would be a really nice UX if we could add something similar to SchedulingDisabled (perhaps something like Disrupting or Terminating) to the node so that users get visibility through the printer that Karpenter is acting on the node.
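Purely as an illustration of the idea (hypothetical: the taint key and the Disrupting label below are assumptions, not an agreed design), the same printer logic could also key off a well-known disruption taint:

// Hypothetical extension of the sketch above: surface a "Disrupting" suffix
// when a well-known autoscaler disruption taint is present. The taint key is
// an assumption for illustration, not a settled upstream convention.
const disruptionTaintKey = "karpenter.sh/disrupted"

func nodeStatusWithDisruption(node *corev1.Node) string {
	status := nodeStatus(node) // nodeStatus from the sketch above
	for _, t := range node.Spec.Taints {
		if t.Key == disruptionTaintKey {
			return status + ",Disrupting"
		}
	}
	return status
}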

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis jonathan-innis added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 2, 2024
@jonathan-innis
Member Author

jonathan-innis commented Apr 2, 2024

Also, as part of the alignment effort with Cluster Autoscaler, I imagine whatever change we suggest making upstream would also apply to Cluster Autoscaler. Perhaps aligning on the taint that we both want to use, and proposing a way for that taint to get special handling in the node printer columns, is a change we could try to get into upstream?

cc: @MaciekPytel @towca

@jonathan-innis
Member Author

Also, there was a discussion in the K8s Slack over whether Karpenter should be using the unschedulable field on the node to piggy-back on this logic. Practically, we have chosen to separate ourselves from the basic "cordon" handling for the following reasons:

  1. We want to know that Karpenter was the one that initiated the tainting so that we can recover when we taint a node and the process gets killed (either due to crashing or upgrading)
  2. We want to make sure that we can drain DaemonSets while draining the node if the user wants it (right now Kubernetes adds a toleration for the node.kubernetes.io/unschedulable taint to DaemonSets by default; see the sketch after this list)
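To make reason 2 concrete, here is a small sketch using the same corev1 types as the earlier snippet (the Karpenter taint key shown is an assumption and differs between releases; check your version):

// DaemonSet pods automatically receive a toleration equivalent to this one,
// so the standard cordon taint alone can never drain DaemonSet pods off a node:
var unschedulableToleration = corev1.Toleration{
	Key:      "node.kubernetes.io/unschedulable",
	Operator: corev1.TolerationOpExists,
	Effect:   corev1.TaintEffectNoSchedule,
}

// A separate, controller-owned taint is not tolerated by default, so it both
// "signs" the disruption as Karpenter's and keeps the option of draining
// DaemonSets. The key below is an assumed example, not a fixed contract:
var karpenterDisruptionTaint = corev1.Taint{
	Key:    "karpenter.sh/disrupted",
	Effect: corev1.TaintEffectNoSchedule,
}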

@jonathan-innis jonathan-innis removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 2, 2024
@jonathan-innis jonathan-innis changed the title Karpenter Should Show through kubectl get nodes when it has tainted Nodes Karpenter should show Disrupting or Terminating through kubectl get nodes when it has tainted Nodes Apr 2, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2024
@simon-wessel

I would like to note that other controllers rely on the node.kubernetes.io/unschedulable (SchedulingDisabled) taint to see whether a node is shutting down.

The CloudNativePG operator, for example, has a PDB that disallows deleting the pod. If a node with a CNPG pod receives the unschedulable taint, the operator will start to migrate that pod itself. Since Karpenter does not use that taint, the node is stuck.

@simon-wessel

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2024
@simon-wessel

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2024
@mark-webster-catalyst

mark-webster-catalyst commented Dec 10, 2024

I understand and support the reasons for using the Karpenter-specific taint, but it does break other things that look for the standard unschedulable taint. Would it be feasible to add both, so Karpenter would know it initiated the cordon, but other operators would also be aware that the node has been cordoned and can do what they need to do (fail over, in the case of cloudnative-pg)?
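For illustration, here is a minimal client-go sketch of what "add both" could mean in practice. This is hypothetical: the helper name and the Karpenter taint key are assumptions, and it is not how Karpenter behaves today.

// Hypothetical "add both": mark the node unschedulable (which the printer and
// other operators already understand) in addition to a Karpenter-style taint.
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func cordonAlongsideDisruption(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true // standard cordon; surfaces SchedulingDisabled
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "karpenter.sh/disrupted", // assumed key for illustration
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}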

@jonathan-innis
Member Author

"the operator will start to migrate that pod itself. Since Karpenter does not use the taint, the node is stuck"

I think what we have seen in general is that these other operators can watch for the taint that the autoscaler uses. The EBS CSI driver did something similar, since it was also hooking into the knowledge that Karpenter was deleting the NodeClaim.

In general, I'm a little wary of things hooking into taints that can be added with a kubectl cordon from a user anyway, since the taint might be added as a manual action and then you are going to get automation flywheeling off of it.
