Karpenter leaves orphaned ec2 instances #2734
Comments
There are a couple of things that can help here:
Thanks for reporting this case in such detail. We will prioritize this and explore a long-term fix to ensure that this can't happen.
@bwagner5 One thing I would like to point out: Resource Based Naming only works on EKS v1.23.
@ig-leonov @yws-ss We discussed this one internally. We have a longer-term solution for this race condition in the works as part of an upgrade to a new API version of the Provisioner. @bwagner5 added all the potential medium-term solutions that you can try until we implement the long-term solution that will solve this problem. Let me know if the medium-term solutions work for you.
By the way, Resource Based Naming also breaks Fargate (we run Karpenter on Fargate).
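For clusters where this mitigation is viable (i.e., not running Karpenter on Fargate), Resource Based Naming is a per-subnet setting that makes new instances get hostnames derived from their instance ID rather than a reusable private IP. A minimal sketch of enabling it with boto3, using a placeholder subnet ID and region, could look like this:

```python
# Sketch: enable Resource Based Naming on a node subnet so newly launched
# instances get instance-ID-based hostnames instead of IP-based ones.
# The subnet ID and region below are placeholders; repeat for each node subnet.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# ModifySubnetAttribute accepts one attribute per call.
ec2.modify_subnet_attribute(
    SubnetId="subnet-0123456789abcdef0",              # placeholder subnet ID
    PrivateDnsHostnameTypeOnLaunch="resource-name",   # instance-ID-based hostnames
)
ec2.modify_subnet_attribute(
    SubnetId="subnet-0123456789abcdef0",
    EnableResourceNameDnsARecordOnLaunch={"Value": True},  # publish matching DNS A records
)
```

This only affects instances launched after the change, and per the comments above it requires EKS 1.23 and conflicts with running Karpenter on Fargate.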
Is there some periodic check of all existing EC2 instances with Karpenter tags to verify that they still exist in Kubernetes and are managed by Karpenter?
We don't do a check for this inside of Karpenter. We hand this off to the cloud controller, since it should be the component responsible for watching instances and removing nodes if the backing instance doesn't exist.
Okay, but does the cloud controller do that? Can we be sure that there won't be any left-behind EC2 instances?
The cloud controller removes nodes if the backing instance is gone. What we are observing here is the reverse, where instances are left behind but the projected node is gone. Fundamentally, we are thinking of solving this with a longer-term solution where Karpenter never creates the node object but lets the kubelet create and own it, so we don't have to deal with these conflicts and race conditions around maintaining the object. This should solve a lot of the problems, including the one you are observing, where we orphan an instance because the projected node for the instance doesn't exist on the api-server.
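In the meantime, nothing prevents running such a check yourself outside of Karpenter. Below is a rough sketch, not an official tool, assuming boto3, the Kubernetes Python client, a working kubeconfig, and that Karpenter tags its instances with karpenter.sh/provisioner-name; adjust the tag, region, and node-name matching for your setup:

```python
# Sketch: flag EC2 instances tagged by Karpenter that have no matching Node object.
# Assumptions: boto3 and the kubernetes client are installed, kubeconfig is set up,
# and instances carry the karpenter.sh/provisioner-name tag (adjust as needed).
import boto3
from kubernetes import client, config

config.load_kube_config()
node_names = {n.metadata.name for n in client.CoreV1Api().list_node().items}

ec2 = boto3.client("ec2", region_name="us-west-2")
pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[
        {"Name": "tag-key", "Values": ["karpenter.sh/provisioner-name"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            hostname = instance.get("PrivateDnsName", "")
            # Nodes here are named by private DNS name; treat a missing Node as a candidate orphan.
            if hostname and hostname not in node_names:
                print(f"Possible orphan: {instance['InstanceId']} ({hostname})")
```

Matching on the private DNS name is only a heuristic; as the logs further down show, two instances can briefly share a hostname, so treat any hit as something to inspect rather than terminate automatically.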
@jonathan-innis Just to confirm: will #2546 fix only the case where a Spot node was terminated, or any kind of terminated node (e.g. terminated by a health check)?
#2546 should alleviate the issue regardless of capacity type. This feature gives awareness into EC2 instances that are terminating and stopping.
@ig-leonov, we have a design doc up now that should address this issue: #2944
Thanks, @jonathan-innis. Do we have any ETA for when it will be released?
@ig-leonov Have you been able to install version
@jonathan-innis Unfortunately we don't have time to test new releases right now. We will see how it goes when we start testing it.
We probably have a similar issue. Looking at the logs, multiple instances were launched with the same hostname ip-10-13-165-204.us-west-2.compute.internal, and in the end the node is orphaned. We have a limited number of IPs. Karpenter logs:
2023-10-08T08:23:09.294Z INFO controller.provisioning.cloudprovider Launched instance: i-0922fb02a03b8ddc1, hostname: ip-10-13-165-204.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.375Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.375Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-z6q5s","uid":"3abed46d-24a2-4d7e-b2be-ace640fb5c82","apiVersion":"v1","resourceVersion":"590250151"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:23:09.375Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.375Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-4sq55","uid":"f0cb9e5f-26e5-4bf2-bf45-acc301b6550e","apiVersion":"v1","resourceVersion":"590250154"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:09.377Z INFO controller.provisioning.cloudprovider Launched instance: i-0c9b0616c9ff40d11, hostname: ip-10-13-165-64.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.399Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.399Z INFO controller.provisioning Waiting for unschedulable pods {"commit": "062a029"}
2023-10-08T08:23:09.401Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-ll229","uid":"5f1655db-357b-4a36-9b9f-96e424747f34","apiVersion":"v1","resourceVersion":"590250146"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-64.us-west-2.compute.internal"}
2023-10-08T08:23:12.118Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-z6q5s","uid":"3abed46d-24a2-4d7e-b2be-ace640fb5c82","apiVersion":"v1","resourceVersion":"590250220"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-64.us-west-2.compute.internal"}
2023-10-08T08:23:12.118Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-ll229","uid":"5f1655db-357b-4a36-9b9f-96e424747f34","apiVersion":"v1","resourceVersion":"590250212"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:23:27.692Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-z6q5s","uid":"3abed46d-24a2-4d7e-b2be-ace640fb5c82","apiVersion":"v1","resourceVersion":"590250402"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:27.692Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-4sq55","uid":"f0cb9e5f-26e5-4bf2-bf45-acc301b6550e","apiVersion":"v1","resourceVersion":"590250406"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:23:42.054Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-ll229","uid":"5f1655db-357b-4a36-9b9f-96e424747f34","apiVersion":"v1","resourceVersion":"590250769"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:52.926Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-64.us-west-2.compute.internal"}
2023-10-08T08:23:53.756Z INFO controller.node Removed emptiness TTL from node {"commit": "062a029", "node": "ip-10-13-165-64.us-west-2.compute.internal"}
2023-10-08T08:23:56.356Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:56.400Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:56.574Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:24:02.766Z INFO controller.node Removed emptiness TTL from node {"commit": "062a029", "node": "ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:24:03.763Z INFO controller.node Removed emptiness TTL from node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:46:46.021Z INFO controller.node Triggering termination after 2m0s for empty node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:46:46.136Z INFO controller.termination Cordoned node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:46:46.346Z INFO controller.termination Deleted node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:37:20.453Z INFO controller.provisioning.cloudprovider Launched instance: i-0220ce5f0225e5a14, hostname: ip-10-13-165-204.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T10:37:20.501Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T10:37:20.501Z INFO controller.provisioning Waiting for unschedulable pods {"commit": "062a029"}
2023-10-08T10:37:20.501Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-wsgc7","uid":"86befd60-4ca1-46a3-a8cf-1bd5c7a102cb","apiVersion":"v1","resourceVersion":"590358112"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:37:43.212Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-wsgc7","uid":"86befd60-4ca1-46a3-a8cf-1bd5c7a102cb","apiVersion":"v1","resourceVersion":"590358305"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-201.us-west-2.compute.internal"}
2023-10-08T10:37:43.212Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-w7xr2","uid":"6348b78e-105f-4893-b751-a83357ff6f1a","apiVersion":"v1","resourceVersion":"590358464"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:38:03.259Z INFO controller.provisioning Found 2 provisionable pod(s) {"commit": "062a029"}
2023-10-08T10:38:03.259Z INFO controller.provisioning Computed 1 new node(s) will fit 1 pod(s) {"commit": "062a029"}
2023-10-08T10:38:03.259Z INFO controller.provisioning Computed 1 unready node(s) will fit 1 pod(s) {"commit": "062a029"}
2023-10-08T10:38:03.572Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:38:03.849Z INFO controller.node Removed emptiness TTL from node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:38:05.416Z INFO controller.provisioning.cloudprovider Launched instance: i-0a68604f8f6f7013b, hostname: ip-10-13-165-223.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T11:32:31.947Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T11:32:42.459Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-201.us-west-2.compute.internal"}
2023-10-08T11:32:46.991Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-227.us-west-2.compute.internal"}
2023-10-08T11:32:47.391Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-137.us-west-2.compute.internal"}
2023-10-08T11:32:51.581Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-167.us-west-2.compute.internal"}
2023-10-08T11:32:52.799Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-186.us-west-2.compute.internal"}
2023-10-08T11:32:56.810Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-30.us-west-2.compute.internal"}
2023-10-08T11:34:26.001Z INFO controller.node Triggering termination after 2m0s for empty node {"commit": "062a029", "node": "ip-10-13-165-49.us-west-2.compute.internal"}
2023-10-08T11:34:26.056Z INFO controller.termination Cordoned node {"commit": "062a029", "node": "ip-10-13-165-49.us-west-2.compute.internal"}
2023-10-08T11:34:26.362Z INFO controller.termination Deleted node {"commit": "062a029", "node": "ip-10-13-165-49.us-west-2.compute.internal"}
2023-10-08T11:34:31.001Z INFO controller.node Triggering termination after 2m0s for empty node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T11:34:31.093Z INFO controller.termination Cordoned node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T11:34:31.370Z INFO controller.termination Deleted node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T14:17:11.736Z INFO controller.provisioning.cloudprovider Launched instance: i-0bdab6f708f7a9a8a, hostname: ip-10-13-165-204.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T14:17:11.745Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T14:17:11.745Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-pq86z","uid":"b94f656b-1324-46c9-a293-c1cc8462af06","apiVersion":"v1","resourceVersion":"590509310"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T14:17:11.971Z INFO controller.provisioning.cloudprovider Launched instance: i-0e2a23ab3123fa0f2, hostname: ip-10-13-165-25.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T14:17:11.985Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T14:17:11.985Z INFO controller.provisioning Waiting for unschedulable pods {"commit": "062a029"}
2023-10-08T14:17:11.986Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-tzjfj","uid":"cdcc1056-67e7-465e-8300-e073d7de575e","apiVersion":"v1","resourceVersion":"590509307"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-25.us-west-2.compute.internal"}
2023-10-08T14:17:14.175Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-tzjfj","uid":"cdcc1056-67e7-465e-8300-e073d7de575e","apiVersion":"v1","resourceVersion":"590509307"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T14:17:14.175Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-pq86z","uid":"b94f656b-1324-46c9-a293-c1cc8462af06","apiVersion":"v1","resourceVersion":"590509310"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-25.us-west-2.compute.internal"}
2023-10-08T14:17:17.121Z DEBUG controller.provisioning Discovered 733 EC2 instance types {"commit": "062a029"}
2023-10-08T14:17:17.299Z DEBUG controller.provisioning Discovered EC2 instance types zonal offerings {"commit": "062a029"}
2023-10-08T14:17:54.560Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T14:18:00.869Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-25.us-west-2.compute.internal"}
2023-10-08T14:18:00.911Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-25.us-west-2.compute.internal"}
2023-10-08T14:18:02.169Z INFO controller.node Removed emptiness TTL from node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-09T06:57:15.875Z INFO controller.provisioning.cloudprovider Launched instance: i-0d140c1ef2f56852f, hostname: ip-10-13-165-204.us-west-2.compute.internal, type: m6a.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-09T06:57:15.889Z DEBUG controller.provisioning node ip-10-13-165-204.us-west-2.compute.internal already registered {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-09T06:57:15.889Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m5.8xlarge, m6i.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-09T06:57:15.889Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-j22xd","uid":"90062ac0-f255-459b-8e43-ec5e04d2602b","apiVersion":"v1","resourceVersion":"591455506"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-09T06:57:15.890Z INFO controller.provisioning.cloudprovider Launched instance: i-0c7d67805cbe0d5ac, hostname: ip-10-13-165-38.us-west-2.compute.internal, type: m6a.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-09T06:57:15.930Z INFO controller.provisioning.cloudprovider Launched instance: i-0e27db93ea7e42115, hostname: ip-10-13-165-237.us-west-2.compute.internal, type: m6a.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
Just want to confirm: is this the same issue?
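One quick way to confirm whether you are hitting the same duplicate-hostname pattern is to scan the Karpenter logs for "Launched instance" lines and group them by hostname. A small sketch (the log format is taken from the excerpt above; the script and file names are just examples):

```python
# Sketch: scan saved Karpenter logs for "Launched instance" lines and report
# hostnames that were handed out to more than one instance ID, which is the
# symptom discussed in this issue.
import re
import sys
from collections import defaultdict

pattern = re.compile(r"Launched instance: (i-[0-9a-f]+), hostname: (\S+),")
instances_by_hostname = defaultdict(set)

for line in sys.stdin:
    match = pattern.search(line)
    if match:
        instance_id, hostname = match.groups()
        instances_by_hostname[hostname].add(instance_id)

for hostname, ids in sorted(instances_by_hostname.items()):
    if len(ids) > 1:
        print(f"{hostname}: {sorted(ids)}")
```

Run against the excerpt above (e.g. `python3 check_hostnames.py < karpenter.log`), it reports ip-10-13-165-204.us-west-2.compute.internal handed out to four different instance IDs, which is the same hostname-reuse symptom described in this issue.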
Version
v0.18.0
Expected Behavior
No EC2 instances should be left running when no pods are scheduled in the cluster.
Actual Behavior
We had an EKS cluster running for 10 days. After that we decided to delete it and found that 20 EC2 instances were still running. We verified that all 20 of these instances show the same pattern in the logs.
As I see it, the process is as follows:
Steps to Reproduce the Problem
I haven't tried to reproduce it, but I see the following steps:
Resource Specs and Logs
"2022-10-10T10:10:37.000-0400","Stopping Getty on tty1..."
"2022-10-10T10:10:37.000-0400","Stopping Serial Getty on ttyS0..."
"2022-10-10T10:10:37.000-0400","Stopped target Login Prompts."
"2022-10-10T10:10:37.000-0400","Stopping NTP client/server..."
"2022-10-10T10:10:37.000-0400","Stopping Postfix Mail Transport Agent..."
"2022-10-10T10:10:37.000-0400","chronyd exiting"
"2022-10-10T10:10:37.000-0400","I1010 14:10:37.486257 4863 dynamic_cafile_content.go:170] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt""
"2022-10-10T10:10:37.000-0400","Stopping D-Bus System Message Bus..."
"2022-10-10T10:10:37.000-0400","Stopping Command Scheduler..."
"2022-10-10T10:10:37.000-0400","Stopping Kubernetes Kubelet..."
"2022-10-10T10:10:37.000-0400","Stopped target Multi-User System."
"2022-10-10T10:10:37.000-0400","Stopped target Graphical Interface."
"2022-10-10T10:10:37.000-0400","Unmounting RPC Pipe File System..."
"2022-10-10T10:10:37.000-0400","Stopped target Cloud-config availability."
"2022-10-10T10:10:37.000-0400","Removed slice system-selinux\x2dpolicy\x2dmigrate\x2dlocal\x2dchanges.slice."
"2022-10-10T10:10:37.000-0400","Stopping Session 1 of user root."
"2022-10-10T10:10:37.000-0400","Stopped target rpc_pipefs.target."
"2022-10-10T10:10:37.000-0400","Stopped Dump dmesg to /var/log/dmesg."
"2022-10-10T10:10:37.000-0400","Closed LVM2 poll daemon socket."
"2022-10-10T10:10:37.000-0400","Stopped target Cloud-init target."
"2022-10-10T10:10:37.000-0400","System is powering down."
"2022-10-10T10:10:37.000-0400","Powering Off..."
"2022-10-10T10:10:37.000-0400","Power key pressed."
"2022-10-10T10:10:37.907-0400","time="2022-10-10T14:10:37.796153867Z" level=info msg="Processing signal 'terminated'""
"2022-10-10T10:10:37.938-0400","I1010 14:10:37.486257 4863 dynamic_cafile_content.go:170] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt""
"2022-10-10T10:10:37.966-0400","time="2022-10-10T14:10:37.803943575Z" level=info msg="Daemon shutdown complete""
"2022-10-10T10:11:19.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotReady"
"2022-10-10T10:11:19.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotReady"
"2022-10-10T10:11:46.000-0400","Node ip-172-24-55-12.ec2.internal event: Registered Node ip-172-24-55-12.ec2.internal in Controller"
"2022-10-10T10:11:46.000-0400","Node ip-172-24-55-12.ec2.internal event: Registered Node ip-172-24-55-12.ec2.internal in Controller"
"2022-10-10T10:12:14.000-0400","Deleting node ip-172-24-55-12.ec2.internal because it does not exist in the cloud provider"
"2022-10-10T10:12:14.333-0400","2022-10-10T14:12:14.333Z INFO controller.termination Cordoned node {"commit": "b157d45", "node": "ip-172-24-55-12.ec2.internal"}"
"2022-10-10T10:12:23.000-0400","Pod should schedule on ip-172-24-55-12.ec2.internal"
"2022-10-10T10:12:23.000-0400","Pod should schedule on ip-172-24-55-12.ec2.internal"
"2022-10-10T10:12:23.000-0400","Deleting node ip-172-24-55-12.ec2.internal because it does not exist in the cloud provider"
"2022-10-10T10:12:23.862-0400","2022-10-10T14:12:23.862Z INFO controller.provisioning.cloudprovider Launched instance: i-0f10f3b5b671e554b, hostname: ip-172-24-55-12.ec2.internal, type: m5.4xlarge, zone: us-east-1d, capacityType: on-demand {"commit": "b157d45", "provisioner": "my-provisioner"}"
"2022-10-10T10:12:23.926-0400","2022-10-10T14:12:23.926Z DEBUG controller.provisioning node ip-172-24-55-12.ec2.internal already registered {"commit": "b157d45", "provisioner": "my-provisioner"}"
"2022-10-10T10:12:54.000-0400","Deleting node ip-172-24-55-12.ec2.internal because it does not exist in the cloud provider"
"2022-10-10T10:12:54.000-0400","Deleting node ip-172-24-55-12.ec2.internal because it does not exist in the cloud provider"
"2022-10-10T10:12:55.000-0400","Updated Node Allocatable limit across pods"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:12:55.000-0400","Starting kubelet."
"2022-10-10T10:12:55.000-0400","Updated Node Allocatable limit across pods"
"2022-10-10T10:12:55.000-0400","Starting kubelet."
"2022-10-10T10:12:56.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:12:56.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotSchedulable"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotReady"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal has been rebooted, boot id: 95521faa-d86b-4a16-b21e-d0296fe1620d"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotSchedulable"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotReady"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal has been rebooted, boot id: 95521faa-d86b-4a16-b21e-d0296fe1620d"
"2022-10-10T10:13:05.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:05.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:07.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:07.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeSchedulable"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeSchedulable"
"2022-10-10T10:13:07.831-0400","2022-10-10T14:13:07.831Z INFO controller.termination Deleted node {"commit": "b157d45", "node": "ip-172-24-55-12.ec2.internal"}"
"2022-10-10T10:13:09.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:09.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:11.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:11.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
..... more logs follow for another 2 minutes, until all running pods (including logging) are stopped.