
Node Repair #750
Open
jbouricius opened this issue Jun 29, 2022 · 51 comments
Assignees
Labels
  • deprovisioning: Issues related to node deprovisioning
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • v1.x: Issues prioritized for post-1.0

Comments

@jbouricius

Tell us about your request
Allow a configurable expiration of NotReady nodes.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I am observing some behavior in my cluster where occasionally nodes fail to join the cluster, due to some transient error in the kubelet bootstrapping process. These nodes stay in NotReady status. Karpenter continues to assign pods to these nodes, but the k8s scheduler won't schedule to them, leaving pods in limbo for extended periods of time. I would like to be able to configure Karpenter with a TTL for nodes that failed to become Ready. The existing configuration spec.provider.ttlSecondsUntilExpiration doesn't really work for my use case because it will terminate healthy nodes.

Are you currently working around this issue?
Manually deleting stuck nodes.

Additional context
Not sure if this is useful context, but I observed this error on one such stuck node. From /var/log/userdata.log:

Job for sandbox-image.service failed because the control process exited with error code. See "systemctl status sandbox-image.service" and "journalctl -xe" for details.

and then systemctl status sandbox-image.service:

  sandbox-image.service - pull sandbox image defined in containerd config.toml
   Loaded: loaded (/etc/systemd/system/sandbox-image.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2022-06-28 18:47:42 UTC; 2h 9min ago
  Process: 4091 ExecStart=/etc/eks/containerd/pull-sandbox-image.sh (code=exited, status=2)
 Main PID: 4091 (code=exited, status=2)

From reading other issues it looks like this AMI script failed, possibly in the call to ECR: https://github.com/awslabs/amazon-eks-ami/blob/master/files/pull-sandbox-image.sh

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jbouricius jbouricius added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 29, 2022

@htoo97

htoo97 commented Jul 29, 2022

We recently started using Karpenter for some batch jobs and are running into this as well: nodes that get stuck in NotReady cause the pods to never get scheduled. The underlying reason turned out to be the subnets being full and the CNI pods never coming up as a result, but regardless of the cause, big +1 to having a configurable way in Karpenter to ensure bad nodes get terminated automatically if they never come up within some TTL.

@ellistarn ellistarn changed the title Configurable expiration of NotReady nodes Support a Liveness TTL that terminates NotReady nodes after a certain period Jul 29, 2022
@ellistarn
Contributor

I'd love to get this prioritized. It should be straightforward to implement in the node controller.
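For illustration, a minimal sketch of what such a check could look like, assuming a plain client-go clientset and a configurable TTL (the helper name and label selector are illustrative, not Karpenter's actual controller code):

package noderepair

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reapNotReadyNodes deletes nodes whose Ready condition has been False or
// Unknown for longer than ttl. The label selector only matters insofar as we
// want to restrict this to Karpenter-managed nodes (label key illustrative).
func reapNotReadyNodes(ctx context.Context, kube kubernetes.Interface, ttl time.Duration) error {
	nodes, err := kube.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "karpenter.sh/provisioner-name",
	})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type != corev1.NodeReady || cond.Status == corev1.ConditionTrue {
				continue
			}
			// Node is NotReady; check how long it has been in this state.
			if time.Since(cond.LastTransitionTime.Time) > ttl {
				if err := kube.CoreV1().Nodes().Delete(ctx, node.Name, metav1.DeleteOptions{}); err != nil {
					return err
				}
			}
			break
		}
	}
	return nil
}

The real version would also need to check for running pods and respect deletion/termination semantics, as discussed below.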

@htoo97

htoo97 commented Jul 29, 2022

If it's not being worked on internally yet, I can take a stab at this!

@korenyoni

korenyoni commented Jan 5, 2023

@tzneal makes a good point here that this auto-repair feature could get out of hand if every provisioned node becomes NotReady, for example because of bad userdata configured at the Provisioner / NodeTemplate level.

This could possibly also be an issue with EC2 service outages.

Maybe you would have to implement some sort of exponential backoff per Provisioner to prevent this endless cycle of provisioning nodes that will always come up as NotReady.
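For what it's worth, a rough sketch of that idea, assuming a simple in-memory tracker keyed by Provisioner name (the type and method names are made up for illustration, not an existing Karpenter API):

package noderepair

import (
	"sync"
	"time"
)

// launchBackoff tracks consecutive failed launches per Provisioner and
// suggests how long to wait before the next launch attempt.
type launchBackoff struct {
	mu       sync.Mutex
	failures map[string]int // Provisioner name -> consecutive failed launches
}

func newLaunchBackoff() *launchBackoff {
	return &launchBackoff{failures: map[string]int{}}
}

// Delay doubles with each consecutive failure (30s, 1m, 2m, ...), capped at 30m.
func (b *launchBackoff) Delay(provisioner string) time.Duration {
	b.mu.Lock()
	defer b.mu.Unlock()
	n := b.failures[provisioner]
	if n == 0 {
		return 0
	}
	if n > 6 {
		n = 6 // cap the exponent; the duration is capped at 30m below regardless
	}
	d := 30 * time.Second * time.Duration(1<<uint(n-1))
	if d > 30*time.Minute {
		d = 30 * time.Minute
	}
	return d
}

// RecordFailure is called when a launched node never becomes Ready;
// RecordSuccess resets the counter once a node initializes normally.
func (b *launchBackoff) RecordFailure(provisioner string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures[provisioner]++
}

func (b *launchBackoff) RecordSuccess(provisioner string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.failures, provisioner)
}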

@wkaczynski
Contributor

👍
We're occasionally seeing cases where a node has been launched but never properly initialized (it never gets karpenter.sh/initialized=true). Because these nodes are treated as capacity that has already been arranged and will become available, they can prevent cluster expansion and leave pods permanently stuck (constantly nominated by Karpenter to run on a node that will never complete initialization).

@ellistarn
Contributor

@wkaczynski, it's a bit tricky.

  • If we delete nodes that fail to initialize, users will have a hard time debugging.
  • If we ignore nodes that fail to initialize, you can get runaway scaling.

@dschaaff

We currently have a support ticket open with AWS for occasional Bottlerocket boot failures on our Kubernetes nodes. The failure rate is very low, and it's important that we are able to get logs off a node and potentially take a snapshot of its volumes. In this scenario it's vital that we can opt out of Karpenter auto removing the node. I'd be in favor of this at least being configurable so users can decide.

@ellistarn
Contributor

@njtran re: the behaviors API.

@wkaczynski
Contributor

wkaczynski commented Jan 24, 2023

it's vital that we can opt out of Karpenter auto removing the node

If we delete nodes that fail to initialize, users will have a hard time debugging

I also think that if we do decide to delete nodes that failed to initialize, there should be an option to opt out so that users can debug (or, if we don't delete by default, an opt-in option to enable the cleanup).

The cleanup does not even need to be a provisioner config. Initially, until there is a better way to address this issue, it could be enabled via a Helm chart value and exposed as either a ConfigMap setting, a command-line option, or an env var.

Another thing is - are these nodes considered as in-flight indefinitely? If yes, is there currently at least an option to set a timeout on the in-flight status? If there isn't an option for these nodes to eventually stop being considered in-flight, do I understand correctly that this can effectively block cluster expansion even after a one-off node initialization failure (which we sometimes experience with AWS)?

If we ignore nodes that fail to initialize, you can get runaway scaling.

There are cases in which runaway scaling is preferable to a service interruption; it would be good to have an opt-in cleanup option.

@ellistarn
Contributor

ellistarn commented Jan 24, 2023

Another thing is - are these nodes considered as in-flight indefinitely?

I like to think about this as ttlAfterNotReady. @wkaczynski, do you think this is reasonable? You could repurpose the same mechanism to cover cases where nodes fail to connect, or eventually disconnect. We'd need to be careful to not kill nodes that have any pods on them (unless they're stuck deleting), since we don't want to burn down the fleet during an AZ outage. I'd also like this to fall under our maintenance windows controls as discussed in aws/karpenter-provider-aws#1738.
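To make that concrete, a hypothetical shape for such a knob (illustrative only; this is not Karpenter's actual API, and the field name is simply borrowed from the discussion above):

package v1alpha5

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// LivenessSpec is a hypothetical sketch of a ttlAfterNotReady setting.
type LivenessSpec struct {
	// TTLAfterNotReady is how long a node may remain NotReady (including
	// never becoming Ready after launch) before Karpenter deletes and
	// replaces it, provided no running pods would be lost.
	// A nil value disables the behavior.
	TTLAfterNotReady *metav1.Duration `json:"ttlAfterNotReady,omitempty"`
}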

@hitsub2

hitsub2 commented Nov 11, 2023

In my case, when we provided kubelet args (currently not supported by Karpenter), some nodes (2 out of 400) never became Ready and Karpenter could not disrupt them, leaving them around forever. After changing AMIFamily to Custom, this issue has not happened again.

@tip-dteller

Hi, I've been referred to this ticket and am adding the case we've hit.

Description

Observed Behavior:
Background:
Windows application deployed on Karpenter Windows Nodes - Ami family 2019, OnDemand nodes.
When the application becomes unresponsive and enters CrashLoopBackOff, it breaks containerd 1.6.18.
The given error is:

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing open //./pipe/containerd-containerd: The system cannot find the file specified."
  • Side note: AWS recently updated Windows containerd from 1.6.8 to 1.6.18.

Why does it matter for this case, and how is it relevant to Karpenter?

The node was created and was functional; it was in "Ready" state as noted by Karpenter, and the Windows pods were successfully scheduled onto it. So far so good.
When the application unexpectedly broke containerd, and subsequently the kubelet, the node entered "NotReady" state.

At this point, the node is not deprovisioned, and Karpenter prompts this in the events:

karpenter  Cannot deprovision Node: Nominated for a pending pod

Summary of events:

  1. Deploy the Windows application.
  2. Application is in Pending state.
  3. Karpenter provisions a functional Windows node.
  4. Application loads and enters Running state (it's functional).
  5. Application breaks containerd after N time.
  6. Node is unresponsive.
  7. Karpenter cannot deprovision the node because the old pods have terminated and the new pods are Pending.

Expected Behavior:

Detect that the node is unresponsive and roll it (i.e., create a replacement node).

Reproduction Steps (Please include YAML):
Provisioner:

---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: windows-provisioner
spec:
  consolidation:
    enabled: false
  limits:
    resources:
      cpu: 200
  labels:
    app: workflow
  ttlSecondsAfterEmpty: 300
  taints:
    - key: "company.io/workflow"
      value: "true"
      effect: "NoSchedule"
  requirements:
    - key: "karpenter.k8s.aws/instance-category"
      operator: In
      values: ["c", "m", "r"]
    - key: "karpenter.k8s.aws/instance-cpu"
      operator: In
      values: ["4", "8", "16", "32", "48", "64"]
    - key: "karpenter.k8s.aws/instance-generation"
      operator: Gt
      values: ["4"]
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-east-1a", "us-east-1b"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type" 
      operator: In
      values: ["on-demand"] 
    - key: kubernetes.io/os
      operator: In
      values: ["windows"]

  providerRef:
    name: windows

I cannot provide the Windows application as it entails business logic.

I couldn't find anything in the Karpenter documentation that states this is normal behavior, and I hope for some clarity here.

Versions:

  • Chart Version: v0.31.0
  • Kubernetes Version (kubectl version): Server Version: v1.24.17-eks-f8587cb, Client Version: v1.28.3

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2024
@sylr

sylr commented Feb 10, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2024
@Bryce-Soghigian
Member

Bryce-Soghigian commented Feb 11, 2024

Maybe you would have to implement some sort of exponential backoff per Provisioner to prevent this endless cycle of provisioning nodes that will always come up as NotReady.

Cluster Autoscaler has the concepts of IsClusterHealthy and IsNodegroupHealthy, alongside the ok-total-unready-count and max-total-unready-percentage flags, which control the threshold at which IsClusterHealthy is triggered.

IsClusterHealthy blocks autoscaling until the health resolves. I'm not convinced that karpenter core is the right place to solve provisioning failures with this type of IsClusterHealthy concept, but it's worth mentioning. CAS has historically dealt with a lot of bug reports for this very blocking behavior, and Karpenter nodepools are not of a single type, so GPU provisioning being broken shouldn't block all other instance types.

Instead, it might make sense for provisioning backoff to live inside the cloud provider and leverage the unavailable offerings cache inside of Karpenter. If a given SKU and node image has failed x times for this nodeclaim, we add it to the unavailable offerings cache? Then let it expire and we retry that permutation later. (This pattern would work with Azure; I'd have to read through the AWS pattern on this.)

It would be much better not to block all provisioning for a given nodepool, and instead back off per instance type, roughly as sketched below.
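A minimal sketch of what such a cache could look like (names are illustrative; this is not the existing unavailable-offerings implementation):

package noderepair

import (
	"sync"
	"time"
)

// offeringKey identifies an instance type + node image combination.
type offeringKey struct {
	InstanceType string
	ImageID      string
}

// unavailableOfferings is a tiny TTL cache: offerings that repeatedly fail to
// become Ready are skipped until their entry expires.
type unavailableOfferings struct {
	mu      sync.Mutex
	expires map[offeringKey]time.Time
}

func newUnavailableOfferings() *unavailableOfferings {
	return &unavailableOfferings{expires: map[offeringKey]time.Time{}}
}

func (u *unavailableOfferings) MarkUnavailable(k offeringKey, ttl time.Duration) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.expires[k] = time.Now().Add(ttl)
}

func (u *unavailableOfferings) IsUnavailable(k offeringKey) bool {
	u.mu.Lock()
	defer u.mu.Unlock()
	until, ok := u.expires[k]
	if !ok {
		return false
	}
	if time.Now().After(until) {
		delete(u.expires, k) // entry expired; the offering can be retried
		return false
	}
	return true
}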

Node Auto Repair General Notes From my AKS Experience

I was the engineer who built the AKS node auto-repair framework. Some notes based on that experience:

The expectation generally is that Cluster Autoscaler garbage collects the unready nodes after 20 minutes (max-total-unready-time flag).

Separately from the CAS lifecycle, AKS will attempt three auto-healing actions on a node that is not Ready:

  1. Restart the VM: useful for rebooting the kubelet, etc.
  2. Reimage the VM: solves for corrupted states, etc.
  3. Redeploy the VM: this will solve any problem due to a host-level error.

These actions fix many customer nodes each day, but it would be good to unify the autoscaler's repair attempts with the remediator.

While I am all for moving node lifecycle actions from other places into Karpenter, it would have to be solved through cloud provider APIs. The remediation actions defined for one cloud provider may not have an equivalent in another cloud provider. We would have to design that relationship carefully.

@1ms-ms

1ms-ms commented Apr 16, 2024

@Bryce-Soghigian I'm not an expert, but these 3 actions you mentioned

  1. Restart the VM: useful for rebooting the kubelet, etc.
  2. Reimage the VM: solves for corrupted states, etc.
  3. Redeploy the VM: this will solve any problem due to a host-level error.

seem possible to implement via the SDKs of all major cloud providers.
From the thread I can't tell what the obstacle is right now, especially since rebooting/terminating would solve most of the problems with an unresponsive kubelet.
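As a point of reference for the first action, here is roughly what a reboot looks like with the AWS SDK for Go v2 (a minimal sketch; credentials/permissions and the decision of when to trigger it are out of scope):

package noderepair

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// rebootInstance issues an EC2 reboot for the given instance ID, i.e. the
// "restart the VM" action from the list above, expressed with the AWS SDK.
func rebootInstance(ctx context.Context, instanceID string) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := ec2.NewFromConfig(cfg)
	_, err = client.RebootInstances(ctx, &ec2.RebootInstancesInput{
		InstanceIds: []string{instanceID},
	})
	return err
}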

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2024
@jessebye

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2024
@ibalat

ibalat commented Aug 14, 2024

Hi, I guess this issue is related to this topic. I can provide any logs needed for debugging: #1573

@tculp

tculp commented Aug 15, 2024

Another use case is to recover when a node runs out of memory and goes down, never to come up again without manual intervention.

@JacobHenner

We periodically encounter nodes that get stuck NotReady due to hung kernel tasks related to elevated iowait. If Karpenter was able to terminate these unhealthy nodes after some brief period of time, it'd be quite helpful for recovering from this situation.

@leoxlin

leoxlin commented Sep 6, 2024

We've been seeing some transient NotReady node states related to aws/karpenter-provider-aws#5043 as well. We end up having to catch this via our pager and manually delete these nodes, so an automated way to repair would be much appreciated!

@diranged

Just chiming in here... it really feels like adding another disruption type, NodeNotReady, and being able to tell Karpenter that we want to terminate nodes that get into this state is pretty reasonable. We are struggling right now trying to move to Karpenter because we operate a very large-scale environment and nodes become unready periodically... and we do not want humans to be involved in fixing them.

@JacobHenner

@mariuskimmina has opened a relevant pull request here

@awoimbee

Relates to aws/karpenter-provider-aws#6803

@njtran njtran assigned engedaam and unassigned htoo97 Oct 23, 2024