Bug: Inflight check failed for node, Instance Type "" not found #3156
So the node is missing the beta.kubernetes.io/instance-type label. From the logs and a quick look at the code, I'm not sure which check is tripping. @tzneal, as the author of these inflight checks, do you know whether the "deadlock" I describe is indeed caused by one of these inflight checks failing? Or do these checks simply add verbosity to the k8s events?
It looks like the node never started successfully, so kubelet never came up. This is a known issue: if the node fails to register correctly, it currently remains in place, allowing you to troubleshoot why it failed to come up. There is a feature request at kubernetes-sigs/karpenter#750 for a node auto-repair feature, which would automatically remove nodes that experience startup issues. There are some complexities to it, as removing the node may not help. E.g. if your userdata is just bad, removing the node and launching another will just fail again, and we'll get into a cycle of launching and terminating nodes over and over again.
I see, thanks for the response. I think for kubernetes-sigs/karpenter#750 you would need some sort of exponential backoff per provisioner to avoid the infinite cycle you were describing.
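A minimal sketch of what that per-provisioner backoff might look like. This is illustrative only, not Karpenter code: the delay values, cap, and retry count are all hypothetical, and a real implementation would track the delay per provisioner between reconcile loops.

```shell
# Hypothetical sketch: exponential backoff with a cap between
# node-replacement attempts. All names and values are illustrative.
delay=30          # initial backoff in seconds
max_delay=1800    # cap retries at 30 minutes
for attempt in 1 2 3 4 5; do
  echo "attempt ${attempt}: would wait ${delay}s before replacing the node"
  # sleep "${delay}"  # a real controller would wait here; skipped in this sketch
  delay=$(( delay * 2 ))
  if [ "${delay}" -gt "${max_delay}" ]; then
    delay="${max_delay}"
  fi
done
```

The cap matters for the scenario above: with bad userdata, every replacement fails, so the delay should grow quickly but settle at a bounded maximum rather than retrying forever at full speed.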
I'm going to close this in favor of kubernetes-sigs/karpenter#750 |
Version
Karpenter Version:
v0.21.1
Kubernetes Version:
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.7-eks-fb459a0", GitCommit:"c240013134c03a740781ffa1436ba2688b50b494", GitTreeState:"clean", BuildDate:"2022-10-24T20:36:26Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}
Expected Behavior
We have one spot instance that failed its AWS EC2 health checks. It is running 0 pods and Karpenter is not removing it (it's been this way for the past 11 hours).
Karpenter should gracefully remove said node, instead of leaving the cluster in a state where one of the nodes reports NotReady but Karpenter cannot do anything about it, since the node is failing the inflight checks.
Actual Behavior
EDIT: upon further investigation, I'm seeing that the node is missing the beta.kubernetes.io/instance-type label. I think the EC2 health checks failing somehow prevented this label from being created. Either way, Karpenter should not be in this deadlock just because the label is missing.

Steps to Reproduce the Problem
I suppose you can reproduce it by rolling the dice until one of your Spot instances fails its EC2 health checks and somehow ends up missing the beta.kubernetes.io/instance-type label.

Resource Specs and Logs
See previous sections.
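As a quick way to spot affected nodes, a filter like the one below lists any node whose instance-type label is empty. The kubectl jsonpath line in the comment is an assumption about how you'd pull the data from a live cluster; the filter itself runs on sample data so it can be tried anywhere.

```shell
# Hypothetical check: find nodes missing the beta.kubernetes.io/instance-type label.
# Against a live cluster you would feed kubectl output into the filter, e.g.:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.beta\.kubernetes\.io/instance-type}{"\n"}{end}'
# Here we use sample data (node names and instance type are made up):
printf 'node-a\tm5.large\nnode-b\t\n' \
  | awk -F'\t' '$2 == "" {print $1}'   # prints node-b (the node missing the label)
```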