
Node Repair #750
Open
jbouricius opened this issue Jun 29, 2022 · 51 comments
Assignees
Labels
  • deprovisioning: Issues related to node deprovisioning
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • v1.x: Issues prioritized for post-1.0

Comments

@jbouricius

Tell us about your request
Allow a configurable expiration of NotReady nodes.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I am observing some behavior in my cluster where occasionally nodes fail to join the cluster, due to some transient error in the kubelet bootstrapping process. These nodes stay in NotReady status. Karpenter continues to assign pods to these nodes, but the k8s scheduler won't schedule to them, leaving pods in limbo for extended periods of time. I would like to be able to configure Karpenter with a TTL for nodes that failed to become Ready. The existing configuration spec.provider.ttlSecondsUntilExpiration doesn't really work for my use case because it will terminate healthy nodes.

Are you currently working around this issue?
Manually deleting stuck nodes.

Additional context
Not sure if this is useful context, but I observed this error on one such stuck node. From /var/log/userdata.log:

Job for sandbox-image.service failed because the control process exited with error code. See "systemctl status sandbox-image.service" and "journalctl -xe" for details.

and then systemctl status sandbox-image.service:

  sandbox-image.service - pull sandbox image defined in containerd config.toml
   Loaded: loaded (/etc/systemd/system/sandbox-image.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2022-06-28 18:47:42 UTC; 2h 9min ago
  Process: 4091 ExecStart=/etc/eks/containerd/pull-sandbox-image.sh (code=exited, status=2)
 Main PID: 4091 (code=exited, status=2)

From reading other issues it looks like this AMI script failed, possibly in the call to ECR: https://github.com/awslabs/amazon-eks-ami/blob/master/files/pull-sandbox-image.sh

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jbouricius jbouricius added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 29, 2022

@htoo97

htoo97 commented Jul 29, 2022

We recently started using Karpenter for some batch jobs and are running into this as well: nodes that get stuck in NotReady cause the pods to never get scheduled. The underlying reason turned out to be the subnets being full and the CNI pods never coming up as a result, but regardless of the cause, big +1 to having a configurable way in Karpenter to ensure bad nodes get terminated automatically if they never come up within some TTL.

@ellistarn ellistarn changed the title Configurable expiration of NotReady nodes Support a Liveness TTL that terminates NotReady nodes after a certain period Jul 29, 2022
@ellistarn
Contributor

I'd love to get this prioritized. It should be straightforward to implement in the node controller.
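For illustration, a minimal sketch of what such a check could look like, assuming a plain client-go clientset and a configurable TTL (the helper name and label selector are illustrative, not Karpenter's actual controller code):

package noderepair

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reapNotReadyNodes deletes nodes whose Ready condition has been False or
// Unknown for longer than ttl. The label selector only matters insofar as we
// want to restrict this to Karpenter-managed nodes (label key illustrative).
func reapNotReadyNodes(ctx context.Context, kube kubernetes.Interface, ttl time.Duration) error {
	nodes, err := kube.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "karpenter.sh/provisioner-name",
	})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type != corev1.NodeReady || cond.Status == corev1.ConditionTrue {
				continue
			}
			// Node is NotReady; check how long it has been in this state.
			if time.Since(cond.LastTransitionTime.Time) > ttl {
				if err := kube.CoreV1().Nodes().Delete(ctx, node.Name, metav1.DeleteOptions{}); err != nil {
					return err
				}
			}
			break
		}
	}
	return nil
}

The real version would also need to check for running pods and respect deletion/termination semantics, as discussed below.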

@htoo97

htoo97 commented Jul 29, 2022

If it's not being worked on internally yet, I can take a stab at this!

@korenyoni

korenyoni commented Jan 5, 2023

@tzneal makes a good point here that this auto-repair feature could get out of hand if every provisioned node becomes NotReady, for example because of bad userdata configured at the Provisioner / NodeTemplate level.

This could possibly also be an issue with EC2 service outages.

Maybe you would have to implement some sort of exponential backoff per Provisioner to prevent this endless cycle of provisioning nodes that will always come up as NotReady.
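For what it's worth, a rough sketch of that idea, assuming a simple in-memory tracker keyed by Provisioner name (the type and method names are made up for illustration, not an existing Karpenter API):

package noderepair

import (
	"sync"
	"time"
)

// launchBackoff tracks consecutive failed launches per Provisioner and
// suggests how long to wait before the next launch attempt.
type launchBackoff struct {
	mu       sync.Mutex
	failures map[string]int // Provisioner name -> consecutive failed launches
}

func newLaunchBackoff() *launchBackoff {
	return &launchBackoff{failures: map[string]int{}}
}

// Delay doubles with each consecutive failure (30s, 1m, 2m, ...), capped at 30m.
func (b *launchBackoff) Delay(provisioner string) time.Duration {
	b.mu.Lock()
	defer b.mu.Unlock()
	n := b.failures[provisioner]
	if n == 0 {
		return 0
	}
	if n > 6 {
		n = 6 // cap the exponent; the duration is capped at 30m below regardless
	}
	d := 30 * time.Second * time.Duration(1<<uint(n-1))
	if d > 30*time.Minute {
		d = 30 * time.Minute
	}
	return d
}

// RecordFailure is called when a launched node never becomes Ready;
// RecordSuccess resets the counter once a node initializes normally.
func (b *launchBackoff) RecordFailure(provisioner string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures[provisioner]++
}

func (b *launchBackoff) RecordSuccess(provisioner string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.failures, provisioner)
}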

@wkaczynski
Contributor

👍
We're occasionally seeing cases where a node has been launched but never properly initialized (it never gets karpenter.sh/initialized=true). Because these nodes are treated as capacity that has already been arranged and will become available, they can prevent cluster expansion and leave pods permanently stuck (constantly nominated by Karpenter to run on a node that will never complete initialization).

@ellistarn
Contributor

@wkaczynski, it's a bit tricky.

  • If we delete nodes that fail to initialize, users will have a hard time debugging.
  • If we ignore nodes that fail to initialize, you can get runaway scaling.

@dschaaff

We currently have a support ticket open with AWS for occasional Bottlerocket boot failures on our Kubernetes nodes. The failure rate is very low, and it's important that we are able to get logs off a node and potentially take a snapshot of its volumes. In this scenario it's vital that we can opt out of Karpenter auto removing the node. I'd be in favor of this at least being configurable so users can decide.

@ellistarn
Contributor

@njtran re: the behaviors API.

@wkaczynski
Contributor

wkaczynski commented Jan 24, 2023

it's vital that we can opt out of Karpenter auto removing the node

If we delete nodes that fail to initialize, users will have a hard time debugging

I also think that if we do decide to delete nodes that failed to initialize, there should be an option to opt out so that users can debug (or, if we don't delete by default, an opt-in option to enable the cleanup).

The cleanup does not even need to be a provisioner config. Initially, until there is a better way to address this issue, it could be enabled via a Helm chart value and exposed as either a ConfigMap setting, a command-line option, or an env var.

Another thing is - are these nodes considered as in-flight indefinitely? If yes, is there currently at least an option to set a timeout on the in-flight status? If there isn't an option for these nodes to eventually stop being considered in-flight, do I understand correctly that this can effectively block cluster expansion even after a one-off node initialization failure (which we sometimes experience with AWS)?

If we ignore nodes that fail to initialize, you can get runaway scaling.

There are cases in which runaway scaling is preferable to a service interruption; it would be good to have an opt-in cleanup option.

@ellistarn
Contributor

ellistarn commented Jan 24, 2023

Another thing is - are these nodes considered as in-flight indefinitely?

I like to think about this as ttlAfterNotReady. @wkaczynski, do you think this is reasonable? You could repurpose the same mechanism to cover cases where nodes fail to connect, or eventually disconnect. We'd need to be careful to not kill nodes that have any pods on them (unless they're stuck deleting), since we don't want to burn down the fleet during an AZ outage. I'd also like this to fall under our maintenance windows controls as discussed in aws/karpenter-provider-aws#1738.
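To make that concrete, a hypothetical shape for such a knob (illustrative only; this is not Karpenter's actual API, and the field name is simply borrowed from the discussion above):

package v1alpha5

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// LivenessSpec is a hypothetical sketch of a ttlAfterNotReady setting.
type LivenessSpec struct {
	// TTLAfterNotReady is how long a node may remain NotReady (including
	// never becoming Ready after launch) before Karpenter deletes and
	// replaces it, provided no running pods would be lost.
	// A nil value disables the behavior.
	TTLAfterNotReady *metav1.Duration `json:"ttlAfterNotReady,omitempty"`
}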

@hitsub2

hitsub2 commented Nov 11, 2023

In my case, when we provided kubelet args (currently not supported by Karpenter), some nodes (2 out of 400) never became Ready and Karpenter could not disrupt them, leaving them around forever. After changing AMIFamily to Custom, this issue has not happened again.

@tip-dteller

Hi, I've been referred to this ticket and am adding the case we've hit.

Description

Observed Behavior:
Background:
Windows application deployed on Karpenter Windows Nodes - Ami family 2019, OnDemand nodes.
When the application becomes unresponsive and enters CrashLoopBackOff, it breaks containerd 1.6.18.
The given error is:

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing open //./pipe/containerd-containerd: The system cannot find the file specified."
  • Side note: AWS recently updated Windows containerd from 1.6.8 to 1.6.18.

Why does it matter for this case, and how is it relevant to Karpenter?

The node was created and was functional; it was in "Ready" state as noted by Karpenter, and the Windows pods were successfully scheduled onto it. So far so good.
When the application unexpectedly broke containerd, and subsequently the kubelet, the node entered "NotReady" state.

At this point, the node is not deprovisioned, and Karpenter prompts this in the events:

karpenter  Cannot deprovision Node: Nominated for a pending pod

Summary of events:

  1. Deploy the Windows application.
  2. Application is in Pending state.
  3. Karpenter provisions a functional Windows node.
  4. Application loads and enters Running state (it's functional).
  5. Application breaks containerd after N time.
  6. Node is unresponsive.
  7. Karpenter cannot deprovision the node because the old pods have terminated and the new pods are Pending.

Expected Behavior:

Detect that the node is unresponsive and roll it (i.e., create a replacement node).

Reproduction Steps (Please include YAML):
Provisioner:

---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: windows-provisioner
spec:
  consolidation:
    enabled: false
  limits:
    resources:
      cpu: 200
  labels:
    app: workflow
  ttlSecondsAfterEmpty: 300
  taints:
    - key: "company.io/workflow"
      value: "true"
      effect: "NoSchedule"
  requirements:
    - key: "karpenter.k8s.aws/instance-category"
      operator: In
      values: ["c", "m", "r"]
    - key: "karpenter.k8s.aws/instance-cpu"
      operator: In
      values: ["4", "8", "16", "32", "48", "64"]
    - key: "karpenter.k8s.aws/instance-generation"
      operator: Gt
      values: ["4"]
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-east-1a", "us-east-1b"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type" 
      operator: In
      values: ["on-demand"] 
    - key: kubernetes.io/os
      operator: In
      values: ["windows"]

  providerRef:
    name: windows

I cannot provide the Windows application as it entails business logic.

I couldn't find anything in the Karpenter documentation that states this is normal behavior, and I hope for some clarity here.

Versions:

  • Chart Version: v0.31.0
  • Kubernetes Version (kubectl version): Server Version: v1.24.17-eks-f8587cb, Client Version: v1.28.3

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2024
@sylr

sylr commented Feb 10, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2024
@Bryce-Soghigian
Member

Bryce-Soghigian commented Feb 11, 2024

Maybe you would have to implement some sort of exponential backoff per Provisioner to prevent this endless cycle of provisioning nodes that will always come up as NotReady.

Cluster Autoscaler has the concepts of IsClusterHealthy and IsNodegroupHealthy, alongside the ok-total-unready-count and max-total-unready-percentage flags, which control the threshold at which IsClusterHealthy is triggered.

IsClusterHealthy blocks autoscaling until the health resolves. I'm not convinced that karpenter core is the right place to solve provisioning failures with this type of IsClusterHealthy concept, but it's worth mentioning. CAS has historically dealt with a lot of bug reports for this very blocking behavior, and Karpenter nodepools are not of a single type, so GPU provisioning being broken shouldn't block all other instance types.

Instead, it might make sense for provisioning backoff to live inside the cloud provider and leverage the unavailable offerings cache inside of Karpenter. If a given SKU and node image has failed x times for this nodeclaim, we add it to the unavailable offerings cache? Then let it expire and we retry that permutation later. (This pattern would work with Azure; I'd have to read through the AWS pattern on this.)

It would be much better not to block all provisioning for a given nodepool, and instead back off per instance type, roughly as sketched below.
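A minimal sketch of what such a cache could look like (names are illustrative; this is not the existing unavailable-offerings implementation):

package noderepair

import (
	"sync"
	"time"
)

// offeringKey identifies an instance type + node image combination.
type offeringKey struct {
	InstanceType string
	ImageID      string
}

// unavailableOfferings is a tiny TTL cache: offerings that repeatedly fail to
// become Ready are skipped until their entry expires.
type unavailableOfferings struct {
	mu      sync.Mutex
	expires map[offeringKey]time.Time
}

func newUnavailableOfferings() *unavailableOfferings {
	return &unavailableOfferings{expires: map[offeringKey]time.Time{}}
}

func (u *unavailableOfferings) MarkUnavailable(k offeringKey, ttl time.Duration) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.expires[k] = time.Now().Add(ttl)
}

func (u *unavailableOfferings) IsUnavailable(k offeringKey) bool {
	u.mu.Lock()
	defer u.mu.Unlock()
	until, ok := u.expires[k]
	if !ok {
		return false
	}
	if time.Now().After(until) {
		delete(u.expires, k) // entry expired; the offering can be retried
		return false
	}
	return true
}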

Node Auto Repair General Notes From my AKS Experience

I was the engineer who built the AKS node auto-repair framework. Some notes based on that experience:

The expectation generally is that Cluster Autoscaler garbage collects the unready nodes after 20 minutes (max-total-unready-time flag).

Separately from the CAS lifecycle, AKS will attempt three auto-healing actions on a node that is not Ready:

  1. Restart the VM: useful for rebooting the kubelet, etc.
  2. Reimage the VM: solves for corrupted states, etc.
  3. Redeploy the VM: this will solve any problem due to a host-level error.

These actions fix many customer nodes each day, but it would be good to unify the autoscaler's repair attempts with the remediator.

While I am all for moving node lifecycle actions from other places into Karpenter, it would have to be solved through cloud provider APIs. The remediation actions defined for one cloud provider may not have an equivalent in another cloud provider. We would have to design that relationship carefully.

@1ms-ms

1ms-ms commented Apr 16, 2024

@Bryce-Soghigian I'm not an expert, but these 3 actions you mentioned

  1. Restart the VM: useful for rebooting the kubelet, etc.
  2. Reimage the VM: solves for corrupted states, etc.
  3. Redeploy the VM: this will solve any problem due to a host-level error.

seem possible to implement via the SDKs of all major cloud providers.
From the thread I can't tell what the obstacle is right now, especially since rebooting/terminating would solve most of the problems with an unresponsive kubelet.
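As a point of reference for the first action, here is roughly what a reboot looks like with the AWS SDK for Go v2 (a minimal sketch; credentials/permissions and the decision of when to trigger it are out of scope):

package noderepair

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// rebootInstance issues an EC2 reboot for the given instance ID, i.e. the
// "restart the VM" action from the list above, expressed with the AWS SDK.
func rebootInstance(ctx context.Context, instanceID string) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := ec2.NewFromConfig(cfg)
	_, err = client.RebootInstances(ctx, &ec2.RebootInstancesInput{
		InstanceIds: []string{instanceID},
	})
	return err
}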

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2024
@jessebye

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2024
@ibalat

ibalat commented Aug 14, 2024

Hi, I guess this issue is related to this topic. I can provide any logs needed for debugging: #1573

@tculp

tculp commented Aug 15, 2024

Another use case is to recover when a node runs out of memory and goes down, never to come up again without manual intervention.

@JacobHenner

We periodically encounter nodes that get stuck NotReady due to hung kernel tasks related to elevated iowait. If Karpenter was able to terminate these unhealthy nodes after some brief period of time, it'd be quite helpful for recovering from this situation.

@leoxlin

leoxlin commented Sep 6, 2024

We've been seeing some transient NotReady node states related to aws/karpenter-provider-aws#5043 as well. We end up having to catch this via our pager and manually delete these nodes, so an automated way to repair would be much appreciated!

@diranged

Just chiming in here... it really feels like adding another disruption type, NodeNotReady, and being able to tell Karpenter that we want to terminate nodes that get into this state is pretty reasonable. We are struggling right now trying to move to Karpenter because we operate a very large-scale environment and nodes become unready periodically... and we do not want humans to be involved in fixing them.

@JacobHenner

@mariuskimmina has opened a relevant pull request here

@awoimbee

Relates to aws/karpenter-provider-aws#6803

@njtran njtran assigned engedaam and unassigned htoo97 Oct 23, 2024