Node Repair #750
We recently started using Karpenter for some batch jobs and are running into this as well, where nodes that get stuck on
I'd love to get this prioritized. It should be straightforward to implement in the node controller.
If it's not being worked on internally yet, I can take a stab at this!
@tzneal makes a good point here that this auto-repair feature can potentially get out of hand if every node provisioned becomes NotReady, for example because of bad userdata configured at the Provisioner / NodeTemplate level. This could also be an issue during EC2 service outages. You would probably have to implement some sort of exponential backoff per Provisioner to prevent an endless cycle of provisioning nodes that will always come up NotReady.
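For illustration, here is a minimal Go sketch of the kind of per-Provisioner exponential backoff described above. All names (backoffTracker, RecordFailure, the 30-minute cap) are hypothetical and not part of Karpenter; this is only one way such a guard could be structured.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// backoffTracker records consecutive NotReady launches per provisioner and
// returns how long new launches should be held off. Purely illustrative.
type backoffTracker struct {
	mu       sync.Mutex
	failures map[string]int
	until    map[string]time.Time
}

func newBackoffTracker() *backoffTracker {
	return &backoffTracker{failures: map[string]int{}, until: map[string]time.Time{}}
}

// RecordFailure bumps the failure count and doubles the hold-off window,
// capped at 30 minutes so a broken provisioner eventually gets retried.
func (b *backoffTracker) RecordFailure(provisioner string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures[provisioner]++
	delay := time.Duration(1<<uint(b.failures[provisioner])) * time.Second
	if delay > 30*time.Minute {
		delay = 30 * time.Minute
	}
	b.until[provisioner] = time.Now().Add(delay)
}

// RecordSuccess resets the backoff once a node for this provisioner goes Ready.
func (b *backoffTracker) RecordSuccess(provisioner string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.failures, provisioner)
	delete(b.until, provisioner)
}

// Allowed reports whether the provisioner may launch a replacement node now.
func (b *backoffTracker) Allowed(provisioner string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.until[provisioner])
}

func main() {
	t := newBackoffTracker()
	t.RecordFailure("gpu-provisioner")
	fmt.Println("may launch immediately:", t.Allowed("gpu-provisioner"))
}
```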
👍
@wkaczynski, it's a bit tricky.
We currently have a support ticket open with AWS for occasional Bottlerocket boot failures on our Kubernetes nodes. The failure rate is very low, and it's important that we are able to get logs off a node and potentially take a snapshot of its volumes. In this scenario it's vital that we can opt out of Karpenter automatically removing the node. I'd be in favor of this at least being configurable so users can decide.
@njtran re: the behaviors API.
I also think that if we do decide to delete nodes that failed to initialize, there should be an option to opt out so we can debug (or, if we don't delete by default, an opt-in option to enable the cleanup). The cleanup does not even need to be a provisioner config; initially, until there is a better way to address this issue, it could be enabled via a Helm chart value and exposed as either a ConfigMap setting, a command-line option, or an env var. Another thing: are these nodes considered in-flight indefinitely? If yes, is there currently an option to at least configure the in-flight status timeout? If there isn't an option for these nodes to stop being considered in-flight, do I understand correctly that this can effectively block cluster expansion even after a one-off node initialization failure (which we sometimes experience with AWS)?
There are cases in which runaway scaling is preferred over a service interruption; it would be good to have an opt-in cleanup option.
I like to think about this as
In my case, when provided kubelet args (currently not supported by Karpenter), some nodes (2 out of 400) are not Ready and Karpenter cannot disrupt them, leaving them there forever. After changing AMIFamily to Custom, this issue does not happen again.
Hi, I've been referred to this ticket and am adding the case we've hit.
Description / Observed Behavior:
Why does it matter for this case, and how is it relevant to Karpenter? The node was created and was functional: it was in the "Ready" state as noted by Karpenter, and the Windows pods were successfully scheduled onto it. So far so good. At this point, the node is not deprovisioned and produces this in the events:
Summary of events:
Expected Behavior: Detect that the node is unresponsive and roll it (i.e. create a replacement node).
Reproduction Steps (Please include YAML):
I cannot provide the Windows application as it entails business logic. I couldn't find anything in the Karpenter documentation that states this is normal behavior, and I hope for some clarity here.
Versions:
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Cluster Autoscaler has the concept of IsClusterHealthy and IsNodegroupHealthy, alongside the ok-total-unready-count flag and the max-unready-percentage flag to control the threshold at which IsClusterHealthy should be triggered. IsClusterHealthy blocks autoscaling until the health resolves. I'm not convinced that karpenter-core is the right place to solve provisioning failures with this kind of IsClusterHealthy concept, but it's worth mentioning. CAS has historically dealt with a lot of bug reports for this very blocking behavior, and Karpenter NodePools are not of a single type, so GPU provisioning being broken shouldn't block all other instance types. Instead, it might make sense for provisioning backoff to live inside the cloud provider and leverage the unavailable offerings cache inside of Karpenter: if a given SKU and node image has failed x times for this NodeClaim, we add it to the unavailable offerings cache, then let the entry expire and retry that permutation later. (This pattern would work with Azure; I'd have to read through the AWS pattern on this.) It would be much better not to block all provisioning for a given NodePool, and instead do it per instance type.
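As a rough illustration of the unavailable-offerings idea above, here is a minimal Go sketch of a TTL cache keyed by (instance type, node image). The type names, failure threshold, and TTL are assumptions for illustration, not Karpenter's actual cache implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// offeringKey identifies one (instance type, node image) permutation.
type offeringKey struct {
	InstanceType string
	Image        string
}

// unavailableOfferings is a minimal TTL cache sketch: after maxFails launch
// failures a permutation is skipped until its entry expires, so one broken SKU
// does not block the rest of the NodePool.
type unavailableOfferings struct {
	mu       sync.Mutex
	failures map[offeringKey]int
	expires  map[offeringKey]time.Time
	maxFails int
	ttl      time.Duration
}

func newUnavailableOfferings() *unavailableOfferings {
	return &unavailableOfferings{
		failures: map[offeringKey]int{},
		expires:  map[offeringKey]time.Time{},
		maxFails: 3,
		ttl:      15 * time.Minute,
	}
}

// MarkFailed records a failed launch; after maxFails the offering is held out for ttl.
func (u *unavailableOfferings) MarkFailed(k offeringKey) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.failures[k]++
	if u.failures[k] >= u.maxFails {
		u.expires[k] = time.Now().Add(u.ttl)
	}
}

// Available reports whether the permutation may be tried; expired entries are retried.
func (u *unavailableOfferings) Available(k offeringKey) bool {
	u.mu.Lock()
	defer u.mu.Unlock()
	if exp, ok := u.expires[k]; ok && time.Now().Before(exp) {
		return false
	}
	return true
}

func main() {
	cache := newUnavailableOfferings()
	k := offeringKey{InstanceType: "standard_nc6", Image: "example-node-image"}
	for i := 0; i < 3; i++ {
		cache.MarkFailed(k)
	}
	fmt.Println("offering available:", cache.Available(k)) // false until the entry expires
}
```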
I was the engineer that built the AKS node auto-repair framework. Some notes based on that experience: the expectation generally is that Cluster Autoscaler garbage-collects the unready nodes after 20 minutes (the max-total-unready-time flag). Separately from the CAS lifecycle, AKS will attempt 3 auto-healing actions on a node that is not Ready.
These actions fix many customer nodes each day, but it would be good to unify the autoscaler's repair attempts alongside the remediator. While I am all for moving node lifecycle actions from other places into Karpenter, it would have to be solved by cloud-provider APIs. The remediation actions defined in one cloud provider may not have an equivalent action in another cloud provider; we would have to design this symbiotic relationship carefully.
@Bryce-Soghigian I'm not an expert, but these 3 actions you mentioned seem possible to implement via the SDK for all major cloud providers.
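To make the cloud-provider-API point concrete, here is a hedged Go sketch of what a provider-neutral repair interface could look like. The NodeRepairer interface and the action names are hypothetical; Karpenter's actual CloudProvider interface does not currently expose anything like this.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// RepairAction is one remediation step a cloud provider can attempt on an
// unhealthy node. The action names here are examples only.
type RepairAction string

const (
	ActionRestart  RepairAction = "Restart"
	ActionRedeploy RepairAction = "Redeploy"
	ActionReimage  RepairAction = "Reimage"
)

// NodeRepairer is what karpenter-core could call; each provider maps the
// actions it actually supports onto its own SDK and rejects the rest.
type NodeRepairer interface {
	SupportedActions() []RepairAction
	Repair(ctx context.Context, providerID string, action RepairAction) error
}

// fakeProvider is a stand-in showing how a provider that only supports
// restart and reimage would satisfy the interface.
type fakeProvider struct{}

func (f fakeProvider) SupportedActions() []RepairAction {
	return []RepairAction{ActionRestart, ActionReimage}
}

func (f fakeProvider) Repair(ctx context.Context, providerID string, action RepairAction) error {
	for _, a := range f.SupportedActions() {
		if a == action {
			fmt.Printf("repairing %s via %s\n", providerID, action)
			return nil
		}
	}
	return errors.New("action not supported by this cloud provider")
}

func main() {
	var r NodeRepairer = fakeProvider{}
	_ = r.Repair(context.Background(), "example-provider-id", ActionRestart)
}
```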
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Hi, I guess this issue is related to this topic. I can provide any logs needed for debugging: #1573
Another use case is to recover when a node runs out of memory and goes down, never to come up again without manual intervention.
We periodically encounter nodes that get stuck NotReady due to hung kernel tasks related to elevated iowait. If Karpenter were able to terminate these unhealthy nodes after some brief period of time, it would be quite helpful for recovering from this situation.
We've been seeing some transient
Just chiming in here... it really feels like adding another disruption type of
@mariuskimmina has opened a relevant pull request here.
Relates to aws/karpenter-provider-aws#6803
Tell us about your request
Allow a configurable expiration of NotReady nodes.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I am observing some behavior in my cluster where occasionally nodes fail to join the cluster due to some transient error in the kubelet bootstrapping process. These nodes stay in NotReady status. Karpenter continues to assign pods to these nodes, but the k8s scheduler won't schedule to them, leaving pods in limbo for extended periods of time. I would like to be able to configure Karpenter with a TTL for nodes that failed to become Ready. The existing configuration spec.provider.ttlSecondsUntilExpiration doesn't really work for my use case because it will terminate healthy nodes.
Are you currently working around this issue?
Manually deleting stuck nodes.
Additional context
Not sure if this is useful context, but I observed this error on one such stuck node, in /var/log/userdata.log and in systemctl status sandbox-image.service (output not captured here). From reading other issues it looks like this AMI script failed, possibly in the call to ECR: https://github.com/awslabs/amazon-eks-ami/blob/master/files/pull-sandbox-image.sh
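For illustration of the requested behavior, here is a minimal Go sketch of a helper that decides whether a node's Ready condition has been false or unknown for longer than a configurable TTL, using the standard Kubernetes API types. The function name and the TTL setting are assumptions for this sketch, not an existing Karpenter option.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// notReadyTooLong reports whether a node's Ready condition has been false or
// unknown for longer than ttl. A controller could delete nodes for which this
// returns true; the threshold would come from a hypothetical user setting.
func notReadyTooLong(node *corev1.Node, ttl time.Duration, now time.Time) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type != corev1.NodeReady {
			continue
		}
		if cond.Status == corev1.ConditionTrue {
			return false // node is healthy
		}
		return now.Sub(cond.LastTransitionTime.Time) > ttl
	}
	return false // no Ready condition reported yet; leave the node alone
}

func main() {
	node := &corev1.Node{Status: corev1.NodeStatus{Conditions: []corev1.NodeCondition{{
		Type:               corev1.NodeReady,
		Status:             corev1.ConditionFalse,
		LastTransitionTime: metav1.Time{Time: time.Now().Add(-45 * time.Minute)},
	}}}}
	fmt.Println("expire node:", notReadyTooLong(node, 30*time.Minute, time.Now()))
}
```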