A node lives forever if it failed to join the cluster #1014

Closed
Noksa opened this issue Dec 16, 2021 · 5 comments
Labels
api (Issues that require API changes), feature (New feature or request), termination (Issues related to node termination)

Comments

@Noksa

Noksa commented Dec 16, 2021

Version

Karpenter: v0.5.2

Kubernetes: v1.21.1

Expected Behavior

A node that is stuck in the NotReady state, but for a reason other than NodeStatusNeverUpdated, goes away after the LivenessTimeout expires.

Actual Behavior

A node that is stuck in the NotReady state, but for a reason other than NodeStatusNeverUpdated, lives forever, despite the fact that it is empty.

Steps to Reproduce the Problem

Just use a 'bad' securityGroupSelector for a node, for example one that matches a security group without any inbound rules (see the Provisioner sketch below).
In my case, if I don't specify a securityGroupSelector, Karpenter chooses a 'bad' security group from an AWS ELB that was created by Istio.
This security group has only the following inbound rules:
[screenshot in the original issue: the security group's inbound rules]
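
For context, a minimal sketch of the kind of Provisioner configuration involved; the field layout assumes the v0.5.x AWS provider block, and the tag key/value are hypothetical:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  provider:
    # Sketch only: if this selector (or the default selection when it is
    # omitted) matches a security group without the inbound rules the CNI
    # needs, nodes come up NotReady as described below.
    securityGroupSelector:
      kubernetes.io/cluster/my-cluster: owned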

So because of the 'bad' security group, the node has the following condition:

  - lastHeartbeatTime: "2021-12-16T20:24:16Z"
    lastTransitionTime: "2021-12-16T19:53:42Z"
    message: 'container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady
      message:Network plugin returns error: cni plugin not initialized'
    reason: KubeletNotReady
    status: "False"
    type: Ready

So the aws-node daemonset is not ready, and Karpenter ignores this node after creation because it only checks the condition's Reason:

	if condition.Reason != "" && condition.Reason != "NodeStatusNeverUpdated" {
		return reconcile.Result{}, nil
	}

So could you consider checking for KubeletNotReady, or something like that, in addition to NodeStatusNeverUpdated?
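
A minimal sketch of what that might look like, mirroring the fragment quoted above (exactly which reasons to allow through is an open question, not something decided in this issue):

	// Sketch only: also let the liveness timeout apply to nodes whose kubelet
	// reports KubeletNotReady, instead of returning early for every reason
	// other than NodeStatusNeverUpdated.
	if condition.Reason != "" &&
		condition.Reason != "NodeStatusNeverUpdated" &&
		condition.Reason != "KubeletNotReady" {
		return reconcile.Result{}, nil
	}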

Resource Specs and Logs

Noksa added the bug (Something isn't working) label on Dec 16, 2021
ellistarn added the termination (Issues related to node termination) label on Dec 16, 2021
@ellistarn
Contributor

This is a great suggestion and something I'd like to move forward with in our liveness controller. I think we should terminate after a specified period, regardless of status conditions. Further, I think we should allow the user control over how long to wait, or whether or not to terminate.

spec.ttlSecondsAfterNotReady: nil // Never terminate
spec.ttlSecondsAfterNotReady: 300 // wait 5min
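
As a sketch of where that knob might live (the field is only proposed here and is not part of any released API), a Provisioner could carry it directly on its spec:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Hypothetical field from the proposal above: terminate a node that stays
  # NotReady for longer than this many seconds; leave it unset to never
  # terminate on NotReady alone.
  ttlSecondsAfterNotReady: 300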

ellistarn added the api (Issues that require API changes) label on Dec 16, 2021
@olemarkus
Contributor

If a kubelet cannot properly connect to the cluster, why would it be able to do so on a second attempt?
I would definitely want the instance kept around for investigation.
Worst case, one would end up in a provisioning loop that does nothing more than add load to the cluster.

@ellistarn
Contributor

Another thing we can do is look at the tolerations of pods on the node that is not ready. If pods tolerate NotReady for a long time, we shouldn't terminate the node. If pods don't tolerate NotReady, they will be evicted by the pod lifecycle controller, so terminating the node won't harm anything. As @olemarkus mentions, it should be possible to disable this feature.
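
A minimal, self-contained sketch of that toleration check, assuming the controller already has the node's pods in hand; the package and function names are illustrative, not Karpenter's API:

package liveness

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// shouldTerminateNotReadyNode reports whether a NotReady node is a termination
// candidate: it is not, as long as any of its pods tolerates the not-ready
// taint for longer than the configured TTL (or indefinitely).
func shouldTerminateNotReadyNode(pods []v1.Pod, ttl time.Duration) bool {
	for _, pod := range pods {
		for _, t := range pod.Spec.Tolerations {
			if t.Key != v1.TaintNodeNotReady {
				continue
			}
			if t.Effect != "" && t.Effect != v1.TaintEffectNoExecute {
				continue
			}
			// A nil TolerationSeconds means the pod tolerates NotReady forever.
			if t.TolerationSeconds == nil || time.Duration(*t.TolerationSeconds)*time.Second > ttl {
				return false
			}
		}
	}
	return true
}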

@Noksa
Author

Noksa commented Dec 17, 2021

Also, if I delete a pending pod that was bound to a new node that got stuck, the node will live forever in the NotReady state, even if there are no pods on it except DaemonSets.

ellistarn added the feature (New feature or request) label and removed the bug (Something isn't working) label on Dec 23, 2021
@ellistarn
Contributor

Closing in favor of kubernetes-sigs/karpenter#750
