Karpenter leaves orphaned ec2 instances #2734
Comments
There are a couple of things that can help here:
Thanks for reporting this case in such detail. We will prioritize this and explore a long-term fix to ensure that this can't happen.
@bwagner5 One thing I would like to point out: Resource Based Naming only works on EKS v1.23.
@ig-leonov @yws-ss We discussed this one internally. We have a longer-term solution for this race condition in the works as part of an upgrade to a new API version of the Provisioner. @bwagner5 added all the potential medium-term solutions that you can try until we implement the long-term solution that will solve this problem. Let me know if the medium-term solutions work for you.
By the way, Resource Based Naming also breaks Fargate (we run Karpenter on Fargate).
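For clusters where this mitigation is viable (i.e., not running Karpenter on Fargate), Resource Based Naming is a per-subnet setting that makes new instances get hostnames derived from their instance ID rather than a reusable private IP. A minimal sketch of enabling it with boto3, using a placeholder subnet ID and region, could look like this:

```python
# Sketch: enable Resource Based Naming on a node subnet so newly launched
# instances get instance-ID-based hostnames instead of IP-based ones.
# The subnet ID and region below are placeholders; repeat for each node subnet.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# ModifySubnetAttribute accepts one attribute per call.
ec2.modify_subnet_attribute(
    SubnetId="subnet-0123456789abcdef0",              # placeholder subnet ID
    PrivateDnsHostnameTypeOnLaunch="resource-name",   # instance-ID-based hostnames
)
ec2.modify_subnet_attribute(
    SubnetId="subnet-0123456789abcdef0",
    EnableResourceNameDnsARecordOnLaunch={"Value": True},  # publish matching DNS A records
)
```

This only affects instances launched after the change, and per the comments above it requires EKS 1.23 and conflicts with running Karpenter on Fargate.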
Is there some periodic check of all existing EC2 instances with Karpenter tags to verify that they still exist in Kubernetes and are managed by Karpenter?
We don't do a check for this inside of Karpenter. We hand this off to the cloud controller, since it should be the component responsible for watching instances and removing nodes if the backing instance doesn't exist.
Okay, but does the cloud controller do that? Can we be sure that there won't be any left-behind EC2 instances?
The cloud controller removes nodes if the backing instance is gone. What we are observing here is the reverse, where instances are left behind but the projected node is gone. Fundamentally, we are thinking of solving this with a longer-term solution where Karpenter never creates the node object but lets the kubelet create and own it, so we don't have to deal with these conflicts and race conditions around maintaining the object. This should solve a lot of the problems, including the one you are observing, where we orphan an instance because the projected node for the instance doesn't exist on the api-server.
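In the meantime, nothing prevents running such a check yourself outside of Karpenter. Below is a rough sketch, not an official tool, assuming boto3, the Kubernetes Python client, a working kubeconfig, and that Karpenter tags its instances with karpenter.sh/provisioner-name; adjust the tag, region, and node-name matching for your setup:

```python
# Sketch: flag EC2 instances tagged by Karpenter that have no matching Node object.
# Assumptions: boto3 and the kubernetes client are installed, kubeconfig is set up,
# and instances carry the karpenter.sh/provisioner-name tag (adjust as needed).
import boto3
from kubernetes import client, config

config.load_kube_config()
node_names = {n.metadata.name for n in client.CoreV1Api().list_node().items}

ec2 = boto3.client("ec2", region_name="us-west-2")
pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[
        {"Name": "tag-key", "Values": ["karpenter.sh/provisioner-name"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            hostname = instance.get("PrivateDnsName", "")
            # Nodes here are named by private DNS name; treat a missing Node as a candidate orphan.
            if hostname and hostname not in node_names:
                print(f"Possible orphan: {instance['InstanceId']} ({hostname})")
```

Matching on the private DNS name is only a heuristic; as the logs further down show, two instances can briefly share a hostname, so treat any hit as something to inspect rather than terminate automatically.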
@jonathan-innis Just to confirm: will #2546 fix only the case where a Spot node was terminated, or any kind of terminated node (e.g. terminated by a health check)?
#2546 should alleviate the issue regardless of capacity type. This feature gives awareness into EC2 instances that are terminating and stopping.
@ig-leonov, we have a design doc up now that should address this issue: #2944
Thanks, @jonathan-innis. Do we have any ETA for when it will be released?
@ig-leonov Have you been able to install version
@jonathan-innis Unfortunately we don't have time to test new releases right now. We will see how it goes when we start testing it.
We probably have a similar issue. Looking at the logs, multiple instances were launched with the same hostname ip-10-13-165-204.us-west-2.compute.internal, and in the end the node is orphaned. We have a limited number of IPs. Karpenter logs:
2023-10-08T08:23:09.294Z INFO controller.provisioning.cloudprovider Launched instance: i-0922fb02a03b8ddc1, hostname: ip-10-13-165-204.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.375Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.375Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-z6q5s","uid":"3abed46d-24a2-4d7e-b2be-ace640fb5c82","apiVersion":"v1","resourceVersion":"590250151"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:23:09.375Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.375Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-4sq55","uid":"f0cb9e5f-26e5-4bf2-bf45-acc301b6550e","apiVersion":"v1","resourceVersion":"590250154"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:09.377Z INFO controller.provisioning.cloudprovider Launched instance: i-0c9b0616c9ff40d11, hostname: ip-10-13-165-64.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.399Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.399Z INFO controller.provisioning Waiting for unschedulable pods {"commit": "062a029"}
2023-10-08T08:23:09.401Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-ll229","uid":"5f1655db-357b-4a36-9b9f-96e424747f34","apiVersion":"v1","resourceVersion":"590250146"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-64.us-west-2.compute.internal"}
2023-10-08T08:23:12.118Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-z6q5s","uid":"3abed46d-24a2-4d7e-b2be-ace640fb5c82","apiVersion":"v1","resourceVersion":"590250220"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-64.us-west-2.compute.internal"}
2023-10-08T08:23:12.118Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-ll229","uid":"5f1655db-357b-4a36-9b9f-96e424747f34","apiVersion":"v1","resourceVersion":"590250212"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:23:27.692Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-z6q5s","uid":"3abed46d-24a2-4d7e-b2be-ace640fb5c82","apiVersion":"v1","resourceVersion":"590250402"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:27.692Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-4sq55","uid":"f0cb9e5f-26e5-4bf2-bf45-acc301b6550e","apiVersion":"v1","resourceVersion":"590250406"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:23:42.054Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-ll229","uid":"5f1655db-357b-4a36-9b9f-96e424747f34","apiVersion":"v1","resourceVersion":"590250769"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:52.926Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-64.us-west-2.compute.internal"}
2023-10-08T08:23:53.756Z INFO controller.node Removed emptiness TTL from node {"commit": "062a029", "node": "ip-10-13-165-64.us-west-2.compute.internal"}
2023-10-08T08:23:56.356Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:56.400Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:56.574Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:24:02.766Z INFO controller.node Removed emptiness TTL from node {"commit": "062a029", "node": "ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:24:03.763Z INFO controller.node Removed emptiness TTL from node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:46:46.021Z INFO controller.node Triggering termination after 2m0s for empty node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:46:46.136Z INFO controller.termination Cordoned node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:46:46.346Z INFO controller.termination Deleted node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:37:20.453Z INFO controller.provisioning.cloudprovider Launched instance: i-0220ce5f0225e5a14, hostname: ip-10-13-165-204.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T10:37:20.501Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T10:37:20.501Z INFO controller.provisioning Waiting for unschedulable pods {"commit": "062a029"}
2023-10-08T10:37:20.501Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-wsgc7","uid":"86befd60-4ca1-46a3-a8cf-1bd5c7a102cb","apiVersion":"v1","resourceVersion":"590358112"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:37:43.212Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-wsgc7","uid":"86befd60-4ca1-46a3-a8cf-1bd5c7a102cb","apiVersion":"v1","resourceVersion":"590358305"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-201.us-west-2.compute.internal"}
2023-10-08T10:37:43.212Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-w7xr2","uid":"6348b78e-105f-4893-b751-a83357ff6f1a","apiVersion":"v1","resourceVersion":"590358464"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:38:03.259Z INFO controller.provisioning Found 2 provisionable pod(s) {"commit": "062a029"}
2023-10-08T10:38:03.259Z INFO controller.provisioning Computed 1 new node(s) will fit 1 pod(s) {"commit": "062a029"}
2023-10-08T10:38:03.259Z INFO controller.provisioning Computed 1 unready node(s) will fit 1 pod(s) {"commit": "062a029"}
2023-10-08T10:38:03.572Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:38:03.849Z INFO controller.node Removed emptiness TTL from node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:38:05.416Z INFO controller.provisioning.cloudprovider Launched instance: i-0a68604f8f6f7013b, hostname: ip-10-13-165-223.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T11:32:31.947Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T11:32:42.459Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-201.us-west-2.compute.internal"}
2023-10-08T11:32:46.991Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-227.us-west-2.compute.internal"}
2023-10-08T11:32:47.391Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-137.us-west-2.compute.internal"}
2023-10-08T11:32:51.581Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-167.us-west-2.compute.internal"}
2023-10-08T11:32:52.799Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-186.us-west-2.compute.internal"}
2023-10-08T11:32:56.810Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-30.us-west-2.compute.internal"}
2023-10-08T11:34:26.001Z INFO controller.node Triggering termination after 2m0s for empty node {"commit": "062a029", "node": "ip-10-13-165-49.us-west-2.compute.internal"}
2023-10-08T11:34:26.056Z INFO controller.termination Cordoned node {"commit": "062a029", "node": "ip-10-13-165-49.us-west-2.compute.internal"}
2023-10-08T11:34:26.362Z INFO controller.termination Deleted node {"commit": "062a029", "node": "ip-10-13-165-49.us-west-2.compute.internal"}
2023-10-08T11:34:31.001Z INFO controller.node Triggering termination after 2m0s for empty node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T11:34:31.093Z INFO controller.termination Cordoned node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T11:34:31.370Z INFO controller.termination Deleted node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T14:17:11.736Z INFO controller.provisioning.cloudprovider Launched instance: i-0bdab6f708f7a9a8a, hostname: ip-10-13-165-204.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T14:17:11.745Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T14:17:11.745Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-pq86z","uid":"b94f656b-1324-46c9-a293-c1cc8462af06","apiVersion":"v1","resourceVersion":"590509310"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T14:17:11.971Z INFO controller.provisioning.cloudprovider Launched instance: i-0e2a23ab3123fa0f2, hostname: ip-10-13-165-25.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T14:17:11.985Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T14:17:11.985Z INFO controller.provisioning Waiting for unschedulable pods {"commit": "062a029"}
2023-10-08T14:17:11.986Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-tzjfj","uid":"cdcc1056-67e7-465e-8300-e073d7de575e","apiVersion":"v1","resourceVersion":"590509307"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-25.us-west-2.compute.internal"}
2023-10-08T14:17:14.175Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-tzjfj","uid":"cdcc1056-67e7-465e-8300-e073d7de575e","apiVersion":"v1","resourceVersion":"590509307"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T14:17:14.175Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-pq86z","uid":"b94f656b-1324-46c9-a293-c1cc8462af06","apiVersion":"v1","resourceVersion":"590509310"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-25.us-west-2.compute.internal"}
2023-10-08T14:17:17.121Z DEBUG controller.provisioning Discovered 733 EC2 instance types {"commit": "062a029"}
2023-10-08T14:17:17.299Z DEBUG controller.provisioning Discovered EC2 instance types zonal offerings {"commit": "062a029"}
2023-10-08T14:17:54.560Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T14:18:00.869Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-25.us-west-2.compute.internal"}
2023-10-08T14:18:00.911Z INFO controller.node Added TTL to empty node {"commit": "062a029", "node": "ip-10-13-165-25.us-west-2.compute.internal"}
2023-10-08T14:18:02.169Z INFO controller.node Removed emptiness TTL from node {"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-09T06:57:15.875Z INFO controller.provisioning.cloudprovider Launched instance: i-0d140c1ef2f56852f, hostname: ip-10-13-165-204.us-west-2.compute.internal, type: m6a.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-09T06:57:15.889Z DEBUG controller.provisioning node ip-10-13-165-204.us-west-2.compute.internal already registered {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-09T06:57:15.889Z INFO controller.provisioning Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m5.8xlarge, m6i.8xlarge, m5ad.8xlarge and 9 other(s) {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-09T06:57:15.889Z DEBUG controller.events Normal {"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-j22xd","uid":"90062ac0-f255-459b-8e43-ec5e04d2602b","apiVersion":"v1","resourceVersion":"591455506"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-09T06:57:15.890Z INFO controller.provisioning.cloudprovider Launched instance: i-0c7d67805cbe0d5ac, hostname: ip-10-13-165-38.us-west-2.compute.internal, type: m6a.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-09T06:57:15.930Z INFO controller.provisioning.cloudprovider Launched instance: i-0e27db93ea7e42115, hostname: ip-10-13-165-237.us-west-2.compute.internal, type: m6a.8xlarge, zone: us-west-2a, capacityType: spot {"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
Just want to confirm: is this the same issue?
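One quick way to confirm whether you are hitting the same duplicate-hostname pattern is to scan the Karpenter logs for "Launched instance" lines and group them by hostname. A small sketch (the log format is taken from the excerpt above; the script and file names are just examples):

```python
# Sketch: scan saved Karpenter logs for "Launched instance" lines and report
# hostnames that were handed out to more than one instance ID, which is the
# symptom discussed in this issue.
import re
import sys
from collections import defaultdict

pattern = re.compile(r"Launched instance: (i-[0-9a-f]+), hostname: (\S+),")
instances_by_hostname = defaultdict(set)

for line in sys.stdin:
    match = pattern.search(line)
    if match:
        instance_id, hostname = match.groups()
        instances_by_hostname[hostname].add(instance_id)

for hostname, ids in sorted(instances_by_hostname.items()):
    if len(ids) > 1:
        print(f"{hostname}: {sorted(ids)}")
```

Run against the excerpt above (e.g. `python3 check_hostnames.py < karpenter.log`), it reports ip-10-13-165-204.us-west-2.compute.internal handed out to four different instance IDs, which is the same hostname-reuse symptom described in this issue.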
Version
v0.18.0
Expected Behavior
No EC2 instances should be left running when no pods are scheduled in the cluster.
Actual Behavior
We had an EKS cluster running for 10 days. After that we decided to delete it and found that 20 EC2 instances were still running. We verified that all 20 of these instances show the same pattern in the logs.
As I see it, the process is as follows:
Steps to Reproduce the Problem
I haven't tried to reproduce it, but I see the following steps:
Resource Specs and Logs
"2022-10-10T10:10:37.000-0400","Stopping Getty on tty1..."
"2022-10-10T10:10:37.000-0400","Stopping Serial Getty on ttyS0..."
"2022-10-10T10:10:37.000-0400","Stopped target Login Prompts."
"2022-10-10T10:10:37.000-0400","Stopping NTP client/server..."
"2022-10-10T10:10:37.000-0400","Stopping Postfix Mail Transport Agent..."
"2022-10-10T10:10:37.000-0400","chronyd exiting"
"2022-10-10T10:10:37.000-0400","I1010 14:10:37.486257 4863 dynamic_cafile_content.go:170] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt""
"2022-10-10T10:10:37.000-0400","Stopping D-Bus System Message Bus..."
"2022-10-10T10:10:37.000-0400","Stopping Command Scheduler..."
"2022-10-10T10:10:37.000-0400","Stopping Kubernetes Kubelet..."
"2022-10-10T10:10:37.000-0400","Stopped target Multi-User System."
"2022-10-10T10:10:37.000-0400","Stopped target Graphical Interface."
"2022-10-10T10:10:37.000-0400","Unmounting RPC Pipe File System..."
"2022-10-10T10:10:37.000-0400","Stopped target Cloud-config availability."
"2022-10-10T10:10:37.000-0400","Removed slice system-selinux\x2dpolicy\x2dmigrate\x2dlocal\x2dchanges.slice."
"2022-10-10T10:10:37.000-0400","Stopping Session 1 of user root."
"2022-10-10T10:10:37.000-0400","Stopped target rpc_pipefs.target."
"2022-10-10T10:10:37.000-0400","Stopped Dump dmesg to /var/log/dmesg."
"2022-10-10T10:10:37.000-0400","Closed LVM2 poll daemon socket."
"2022-10-10T10:10:37.000-0400","Stopped target Cloud-init target."
"2022-10-10T10:10:37.000-0400","System is powering down."
"2022-10-10T10:10:37.000-0400","Powering Off..."
"2022-10-10T10:10:37.000-0400","Power key pressed."
"2022-10-10T10:10:37.907-0400","time="2022-10-10T14:10:37.796153867Z" level=info msg="Processing signal 'terminated'""
"2022-10-10T10:10:37.938-0400","I1010 14:10:37.486257 4863 dynamic_cafile_content.go:170] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt""
"2022-10-10T10:10:37.966-0400","time="2022-10-10T14:10:37.803943575Z" level=info msg="Daemon shutdown complete""
"2022-10-10T10:11:19.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotReady"
"2022-10-10T10:11:19.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotReady"
"2022-10-10T10:11:46.000-0400","Node ip-172-24-55-12.ec2.internal event: Registered Node ip-172-24-55-12.ec2.internal in Controller"
"2022-10-10T10:11:46.000-0400","Node ip-172-24-55-12.ec2.internal event: Registered Node ip-172-24-55-12.ec2.internal in Controller"
"2022-10-10T10:12:14.000-0400","Deleting node ip-172-24-55-12.ec2.internal because it does not exist in the cloud provider"
"2022-10-10T10:12:14.333-0400","2022-10-10T14:12:14.333Z INFO controller.termination Cordoned node {"commit": "b157d45", "node": "ip-172-24-55-12.ec2.internal"}"
"2022-10-10T10:12:23.000-0400","Pod should schedule on ip-172-24-55-12.ec2.internal"
"2022-10-10T10:12:23.000-0400","Pod should schedule on ip-172-24-55-12.ec2.internal"
"2022-10-10T10:12:23.000-0400","Deleting node ip-172-24-55-12.ec2.internal because it does not exist in the cloud provider"
"2022-10-10T10:12:23.862-0400","2022-10-10T14:12:23.862Z INFO controller.provisioning.cloudprovider Launched instance: i-0f10f3b5b671e554b, hostname: ip-172-24-55-12.ec2.internal, type: m5.4xlarge, zone: us-east-1d, capacityType: on-demand {"commit": "b157d45", "provisioner": "my-provisioner"}"
"2022-10-10T10:12:23.926-0400","2022-10-10T14:12:23.926Z DEBUG controller.provisioning node ip-172-24-55-12.ec2.internal already registered {"commit": "b157d45", "provisioner": "my-provisioner"}"
"2022-10-10T10:12:54.000-0400","Deleting node ip-172-24-55-12.ec2.internal because it does not exist in the cloud provider"
"2022-10-10T10:12:54.000-0400","Deleting node ip-172-24-55-12.ec2.internal because it does not exist in the cloud provider"
"2022-10-10T10:12:55.000-0400","Updated Node Allocatable limit across pods"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:12:55.000-0400","Starting kubelet."
"2022-10-10T10:12:55.000-0400","Updated Node Allocatable limit across pods"
"2022-10-10T10:12:55.000-0400","Starting kubelet."
"2022-10-10T10:12:56.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:12:56.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotSchedulable"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotReady"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal has been rebooted, boot id: 95521faa-d86b-4a16-b21e-d0296fe1620d"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotSchedulable"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotReady"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal has been rebooted, boot id: 95521faa-d86b-4a16-b21e-d0296fe1620d"
"2022-10-10T10:13:05.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:05.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:07.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:07.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeSchedulable"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeSchedulable"
"2022-10-10T10:13:07.831-0400","2022-10-10T14:13:07.831Z INFO controller.termination Deleted node {"commit": "b157d45", "node": "ip-172-24-55-12.ec2.internal"}"
"2022-10-10T10:13:09.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:09.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:11.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:11.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
..... more logs follow for another 2 minutes, until all running pods (including logging) are stopped.