
Karpenter leaves orphaned ec2 instances #2734

Closed
ig-leonov opened this issue Oct 26, 2022 · 17 comments · Fixed by #3408
Labels
bug Something isn't working

Comments

@ig-leonov

Version

v0.18.0

Expected Behavior

No EC2 instances left running when no pods are scheduled in the cluster

Actual Behavior

We have an EKS cluster that had been running for 10 days. We then decided to delete it and found that 20 EC2 instances were still running. We verified that all 20 of these instances show the same pattern in their logs.
As I see it, the sequence is as follows (a quick way to confirm the instance-ID mismatch is sketched after the list):

  1. Karpenter spins up a node (instance1). The node name is something like "123.123.123.123.ec2.internal"
  2. After some time the EC2 instance's health check fails and the instance is shut down
  3. Karpenter spins up a new instance (instance2), which gets the same IP and therefore the same node name
  4. Karpenter can't register the new instance in the API server, getting "node already registered"
  5. Karpenter deletes the "123.123.123.123.ec2.internal" node, but the node object still points at the old instance ID (instance1)
  6. instance2 keeps running forever
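
To confirm the mismatch on a live cluster, a minimal check could look like the following (the node name is taken from the logs below and is only illustrative; this assumes IP-based private DNS hostnames and that the node object carries a providerID):

  NODE=ip-172-24-55-12.ec2.internal

  # Instance ID the stale Kubernetes node object still points at (instance1)
  kubectl get node "$NODE" -o jsonpath='{.spec.providerID}'

  # Instance ID that currently owns that private DNS name (instance2)
  aws ec2 describe-instances \
    --filters "Name=private-dns-name,Values=$NODE" "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' --output text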

Steps to Reproduce the Problem

I haven't tried this myself, but I expect the following steps would reproduce it (a rough sketch follows the list):

  1. Have a subnet with only 1 free address
  2. Make Karpenter spin up a node with that address
  3. Kill the instance
  4. Have another pod scheduled, so Karpenter reuses the IP when it creates the new node
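
Purely as a sketch of steps 3 and 4 (the instance ID, pod name, and provisioner name are placeholders, not a tested procedure):

  # Simulate the failed health check by terminating the backing instance directly
  aws ec2 terminate-instances --instance-ids i-0123456789abcdef0

  # Create an unschedulable pod so Karpenter launches a replacement; with only one
  # free address in the subnet, the new instance gets the same IP and node name
  kubectl run orphan-test --image=busybox --restart=Never \
    --overrides='{"spec":{"nodeSelector":{"karpenter.sh/provisioner-name":"my-provisioner"}}}' \
    -- sleep 3600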

Resource Specs and Logs

"2022-10-10T10:10:37.000-0400","Stopping Getty on tty1..."
"2022-10-10T10:10:37.000-0400","Stopping Serial Getty on ttyS0..."
"2022-10-10T10:10:37.000-0400","Stopped target Login Prompts."
"2022-10-10T10:10:37.000-0400","Stopping NTP client/server..."
"2022-10-10T10:10:37.000-0400","Stopping Postfix Mail Transport Agent..."
"2022-10-10T10:10:37.000-0400","chronyd exiting"
"2022-10-10T10:10:37.000-0400","I1010 14:10:37.486257 4863 dynamic_cafile_content.go:170] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt""
"2022-10-10T10:10:37.000-0400","Stopping D-Bus System Message Bus..."
"2022-10-10T10:10:37.000-0400","Stopping Command Scheduler..."
"2022-10-10T10:10:37.000-0400","Stopping Kubernetes Kubelet..."
"2022-10-10T10:10:37.000-0400","Stopped target Multi-User System."
"2022-10-10T10:10:37.000-0400","Stopped target Graphical Interface."
"2022-10-10T10:10:37.000-0400","Unmounting RPC Pipe File System..."
"2022-10-10T10:10:37.000-0400","Stopped target Cloud-config availability."
"2022-10-10T10:10:37.000-0400","Removed slice system-selinux\x2dpolicy\x2dmigrate\x2dlocal\x2dchanges.slice."
"2022-10-10T10:10:37.000-0400","Stopping Session 1 of user root."
"2022-10-10T10:10:37.000-0400","Stopped target rpc_pipefs.target."
"2022-10-10T10:10:37.000-0400","Stopped Dump dmesg to /var/log/dmesg."
"2022-10-10T10:10:37.000-0400","Closed LVM2 poll daemon socket."
"2022-10-10T10:10:37.000-0400","Stopped target Cloud-init target."
"2022-10-10T10:10:37.000-0400","System is powering down."
"2022-10-10T10:10:37.000-0400","Powering Off..."
"2022-10-10T10:10:37.000-0400","Power key pressed."
"2022-10-10T10:10:37.907-0400","time="2022-10-10T14:10:37.796153867Z" level=info msg="Processing signal 'terminated'""
"2022-10-10T10:10:37.938-0400","I1010 14:10:37.486257 4863 dynamic_cafile_content.go:170] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt""
"2022-10-10T10:10:37.966-0400","time="2022-10-10T14:10:37.803943575Z" level=info msg="Daemon shutdown complete""
"2022-10-10T10:11:19.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotReady"
"2022-10-10T10:11:19.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotReady"
"2022-10-10T10:11:46.000-0400","Node ip-172-24-55-12.ec2.internal event: Registered Node ip-172-24-55-12.ec2.internal in Controller"
"2022-10-10T10:11:46.000-0400","Node ip-172-24-55-12.ec2.internal event: Registered Node ip-172-24-55-12.ec2.internal in Controller"
"2022-10-10T10:12:14.000-0400","Deleting node ip-172-24-55-12.ec2.internal because it does not exist in the cloud provider"
"2022-10-10T10:12:14.333-0400","2022-10-10T14:12:14.333Z INFO controller.termination Cordoned node {"commit": "b157d45", "node": "ip-172-24-55-12.ec2.internal"}"
"2022-10-10T10:12:23.000-0400","Pod should schedule on ip-172-24-55-12.ec2.internal"
"2022-10-10T10:12:23.000-0400","Pod should schedule on ip-172-24-55-12.ec2.internal"
"2022-10-10T10:12:23.000-0400","Deleting node ip-172-24-55-12.ec2.internal because it does not exist in the cloud provider"
"2022-10-10T10:12:23.862-0400","2022-10-10T14:12:23.862Z INFO controller.provisioning.cloudprovider Launched instance: i-0f10f3b5b671e554b, hostname: ip-172-24-55-12.ec2.internal, type: m5.4xlarge, zone: us-east-1d, capacityType: on-demand {"commit": "b157d45", "provisioner": "my-provisioner"}"
"2022-10-10T10:12:23.926-0400","2022-10-10T14:12:23.926Z DEBUG controller.provisioning node ip-172-24-55-12.ec2.internal already registered {"commit": "b157d45", "provisioner": "my-provisioner"}"
"2022-10-10T10:12:54.000-0400","Deleting node ip-172-24-55-12.ec2.internal because it does not exist in the cloud provider"
"2022-10-10T10:12:54.000-0400","Deleting node ip-172-24-55-12.ec2.internal because it does not exist in the cloud provider"
"2022-10-10T10:12:55.000-0400","Updated Node Allocatable limit across pods"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:12:55.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:12:55.000-0400","Starting kubelet."
"2022-10-10T10:12:55.000-0400","Updated Node Allocatable limit across pods"
"2022-10-10T10:12:55.000-0400","Starting kubelet."
"2022-10-10T10:12:56.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:12:56.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotSchedulable"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotReady"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal has been rebooted, boot id: 95521faa-d86b-4a16-b21e-d0296fe1620d"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotSchedulable"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeNotReady"
"2022-10-10T10:12:56.000-0400","Node ip-172-24-55-12.ec2.internal has been rebooted, boot id: 95521faa-d86b-4a16-b21e-d0296fe1620d"
"2022-10-10T10:13:05.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:05.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:07.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:07.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeSchedulable"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:07.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeSchedulable"
"2022-10-10T10:13:07.831-0400","2022-10-10T14:13:07.831Z INFO controller.termination Deleted node {"commit": "b157d45", "node": "ip-172-24-55-12.ec2.internal"}"
"2022-10-10T10:13:09.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:09.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientPID"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasNoDiskPressure"
"2022-10-10T10:13:09.000-0400","Node ip-172-24-55-12.ec2.internal status is now: NodeHasSufficientMemory"
"2022-10-10T10:13:11.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
"2022-10-10T10:13:11.000-0400","network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"

..... more logs continue for another 2 minutes until all running pods (including logging) are stopped.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@ig-leonov added the bug (Something isn't working) label on Oct 26, 2022
@bwagner5
Contributor

There are a couple of things that can help here:

  1. feat: Add Native Spot Termination Handling #2546 will add the ability for Karpenter to catch AWS Health events, including instance status events, which will be used to delete nodes when a health check fails. Until this is released (hopefully very soon), you can run the aws-node-termination-handler (in queue-processor mode), which does a similar thing to that PR.
  2. You can also switch your VPC subnets to instance-id hostnames, which will result in your nodes having instance-id based names instead of private-IP based names, so a reused IP no longer produces the same node name (an example command follows this list). Here are the docs for "Resource Naming": https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-naming.html#:~:text=.compute.internal-,Resource%20name,-When%20you%20launch
  3. We've considered having a Cloud Provider reconciler that would delete the K8s node when the EC2 instance disappears. I thought the Cloud Controller Manager would do this as well, but I may be wrong about that given what you're seeing.
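
For option 2, the hostname type is a per-subnet setting that only affects new launches; something like the following should work (the subnet ID is a placeholder, and existing instances keep their current hostnames):

  aws ec2 modify-subnet-attribute \
    --subnet-id subnet-0123456789abcdef0 \
    --private-dns-hostname-type-on-launch resource-name

  # Optionally also create DNS A records for the resource-name hostnames
  aws ec2 modify-subnet-attribute \
    --subnet-id subnet-0123456789abcdef0 \
    --enable-resource-name-dns-a-record-on-launch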

@ellistarn
Contributor

ellistarn commented Oct 29, 2022

Thanks for reporting this case in such detail. We will prioritize this and explore a long-term fix to ensure that this can't happen.

@spring1843 added the burning (Time sensitive issues) label on Oct 31, 2022
@yws-ss

yws-ss commented Nov 1, 2022

@bwagner5 One thing I'd like to point out: Resource Naming only works on EKS v1.23.

aws/containers-roadmap#1723

@jonathan-innis self-assigned this on Nov 2, 2022
@jonathan-innis
Contributor

jonathan-innis commented Nov 8, 2022

@ig-leonov @yws-ss We discussed this one internally. We have a longer-term solution for this race condition in the works as part of an upgrade to a new API version of the Provisioner.

@bwagner5 listed the potential medium-term solutions that you can try until we implement the long-term fix that will solve this problem.

Let me know if the medium-term solutions work for you.

@runningman84

By the way, Resource Naming also breaks Fargate (we run Karpenter on Fargate).


@runningman84

Is there some periodic check of all existing EC2 instances with Karpenter tags to verify that they are still registered in Kubernetes and managed by Karpenter?

@jonathan-innis
Contributor

We don't do a check for this inside of Karpenter. We hand this off to the cloud controller, since it should be the component responsible for watching instances and removing nodes when the backing instance no longer exists.
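
As a stopgap, one could periodically cross-check Karpenter-launched instances against the cluster's node objects and flag any instance whose node is gone. A rough sketch (the tag keys, cluster name, and the assumption of IP-based node names all depend on your Karpenter version and configuration):

  CLUSTER=my-cluster
  aws ec2 describe-instances \
    --filters "Name=tag-key,Values=karpenter.sh/provisioner-name" \
              "Name=tag:kubernetes.io/cluster/$CLUSTER,Values=owned" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].[InstanceId,PrivateDnsName]' --output text |
  while read -r instance_id private_dns; do
    kubectl get node "$private_dns" >/dev/null 2>&1 || echo "possible orphan: $instance_id ($private_dns)"
  done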

@runningman84

Okay, but does the cloud controller actually do that? So we can be sure there won't be any left-behind EC2 instances?

@jonathan-innis
Contributor

The cloud controller removes nodes if the backing instance is gone. What we are observing here is the reverse: the instance is left behind while the projected node object is gone.

Fundamentally, we are thinking of solving this with a longer-term change where Karpenter never creates the node object but instead lets the kubelet create and own it, so we don't have to deal with these conflicts and race conditions around maintaining the object.

This should solve a lot of problems, including the one you are observing, where we orphan an instance because the projected node for that instance no longer exists on the API server.

@ig-leonov
Author

@jonathan-innis just to confirm: will #2546 fix only the case where a spot node is terminated, or any kind of terminated node (e.g. terminated after a failed health check)?

@jonathan-innis
Contributor

jonathan-innis commented Nov 11, 2022

#2546 should alleviate the issue regardless of capacity type. The feature gives Karpenter awareness of EC2 instances that are terminating or stopping.

@jonathan-innis
Contributor

@ig-leonov, we have a design doc up now that should address this issue: #2944

@ig-leonov
Author

Thanks, @jonathan-innis. Is there an ETA for when it will be released?

@bwagner5 removed the burning (Time sensitive issues) label on Feb 27, 2023
@jonathan-innis
Contributor

@ig-leonov Have you been able to install version v0.28.0-rc.1 to check whether the orphaned EC2 instance issue you were seeing is resolved?

@ig-leonov
Author

@jonathan-innis unfortunately we don't have time to test new releases right now. We'll see how it goes once we start testing it.

@hitsub2

hitsub2 commented Oct 9, 2023

We probably have a similar issue.
EKS version: 1.21
Karpenter version: 0.13
Orphaned EC2 hostname: ip-10-13-165-204.us-west-2.compute.internal

Looking at the logs, multiple instances were launched with the same hostname ip-10-13-165-204.us-west-2.compute.internal, and eventually one instance was left orphaned.

We have a limited number of available IPs.

Karpenter Logs:

2023-10-08T08:23:09.294Z	INFO	controller.provisioning.cloudprovider	Launched instance: i-0922fb02a03b8ddc1, hostname: ip-10-13-165-204.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.375Z	INFO	controller.provisioning	Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s)	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.375Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-z6q5s","uid":"3abed46d-24a2-4d7e-b2be-ace640fb5c82","apiVersion":"v1","resourceVersion":"590250151"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:23:09.375Z	INFO	controller.provisioning	Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s)	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.375Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-4sq55","uid":"f0cb9e5f-26e5-4bf2-bf45-acc301b6550e","apiVersion":"v1","resourceVersion":"590250154"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:09.377Z	INFO	controller.provisioning.cloudprovider	Launched instance: i-0c9b0616c9ff40d11, hostname: ip-10-13-165-64.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.399Z	INFO	controller.provisioning	Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s)	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T08:23:09.399Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "062a029"}
2023-10-08T08:23:09.401Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-ll229","uid":"5f1655db-357b-4a36-9b9f-96e424747f34","apiVersion":"v1","resourceVersion":"590250146"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-64.us-west-2.compute.internal"}
2023-10-08T08:23:12.118Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-z6q5s","uid":"3abed46d-24a2-4d7e-b2be-ace640fb5c82","apiVersion":"v1","resourceVersion":"590250220"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-64.us-west-2.compute.internal"}
2023-10-08T08:23:12.118Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-ll229","uid":"5f1655db-357b-4a36-9b9f-96e424747f34","apiVersion":"v1","resourceVersion":"590250212"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:23:27.692Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-z6q5s","uid":"3abed46d-24a2-4d7e-b2be-ace640fb5c82","apiVersion":"v1","resourceVersion":"590250402"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:27.692Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-4sq55","uid":"f0cb9e5f-26e5-4bf2-bf45-acc301b6550e","apiVersion":"v1","resourceVersion":"590250406"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:23:42.054Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-ll229","uid":"5f1655db-357b-4a36-9b9f-96e424747f34","apiVersion":"v1","resourceVersion":"590250769"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:52.926Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-64.us-west-2.compute.internal"}
2023-10-08T08:23:53.756Z	INFO	controller.node	Removed emptiness TTL from node	{"commit": "062a029", "node": "ip-10-13-165-64.us-west-2.compute.internal"}
2023-10-08T08:23:56.356Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:56.400Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:23:56.574Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:24:02.766Z	INFO	controller.node	Removed emptiness TTL from node	{"commit": "062a029", "node": "ip-10-13-165-159.us-west-2.compute.internal"}
2023-10-08T08:24:03.763Z	INFO	controller.node	Removed emptiness TTL from node	{"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:46:46.021Z	INFO	controller.node	Triggering termination after 2m0s for empty node	{"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:46:46.136Z	INFO	controller.termination	Cordoned node	{"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T08:46:46.346Z	INFO	controller.termination	Deleted node	{"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:37:20.453Z	INFO	controller.provisioning.cloudprovider	Launched instance: i-0220ce5f0225e5a14, hostname: ip-10-13-165-204.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T10:37:20.501Z	INFO	controller.provisioning	Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s)	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T10:37:20.501Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "062a029"}
2023-10-08T10:37:20.501Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-wsgc7","uid":"86befd60-4ca1-46a3-a8cf-1bd5c7a102cb","apiVersion":"v1","resourceVersion":"590358112"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:37:43.212Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-wsgc7","uid":"86befd60-4ca1-46a3-a8cf-1bd5c7a102cb","apiVersion":"v1","resourceVersion":"590358305"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-201.us-west-2.compute.internal"}
2023-10-08T10:37:43.212Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-w7xr2","uid":"6348b78e-105f-4893-b751-a83357ff6f1a","apiVersion":"v1","resourceVersion":"590358464"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:38:03.259Z	INFO	controller.provisioning	Found 2 provisionable pod(s)	{"commit": "062a029"}
2023-10-08T10:38:03.259Z	INFO	controller.provisioning	Computed 1 new node(s) will fit 1 pod(s)	{"commit": "062a029"}
2023-10-08T10:38:03.259Z	INFO	controller.provisioning	Computed 1 unready node(s) will fit 1 pod(s)	{"commit": "062a029"}
2023-10-08T10:38:03.572Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:38:03.849Z	INFO	controller.node	Removed emptiness TTL from node	{"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T10:38:05.416Z	INFO	controller.provisioning.cloudprovider	Launched instance: i-0a68604f8f6f7013b, hostname: ip-10-13-165-223.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T11:32:31.947Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T11:32:42.459Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-201.us-west-2.compute.internal"}
2023-10-08T11:32:46.991Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-227.us-west-2.compute.internal"}
2023-10-08T11:32:47.391Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-137.us-west-2.compute.internal"}
2023-10-08T11:32:51.581Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-167.us-west-2.compute.internal"}
2023-10-08T11:32:52.799Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-186.us-west-2.compute.internal"}
2023-10-08T11:32:56.810Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-30.us-west-2.compute.internal"}
2023-10-08T11:34:26.001Z	INFO	controller.node	Triggering termination after 2m0s for empty node	{"commit": "062a029", "node": "ip-10-13-165-49.us-west-2.compute.internal"}
2023-10-08T11:34:26.056Z	INFO	controller.termination	Cordoned node	{"commit": "062a029", "node": "ip-10-13-165-49.us-west-2.compute.internal"}
2023-10-08T11:34:26.362Z	INFO	controller.termination	Deleted node	{"commit": "062a029", "node": "ip-10-13-165-49.us-west-2.compute.internal"}
2023-10-08T11:34:31.001Z	INFO	controller.node	Triggering termination after 2m0s for empty node	{"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T11:34:31.093Z	INFO	controller.termination	Cordoned node	{"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T11:34:31.370Z	INFO	controller.termination	Deleted node	{"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T14:17:11.736Z	INFO	controller.provisioning.cloudprovider	Launched instance: i-0bdab6f708f7a9a8a, hostname: ip-10-13-165-204.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T14:17:11.745Z	INFO	controller.provisioning	Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s)	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T14:17:11.745Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-pq86z","uid":"b94f656b-1324-46c9-a293-c1cc8462af06","apiVersion":"v1","resourceVersion":"590509310"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T14:17:11.971Z	INFO	controller.provisioning.cloudprovider	Launched instance: i-0e2a23ab3123fa0f2, hostname: ip-10-13-165-25.us-west-2.compute.internal, type: m5.8xlarge, zone: us-west-2a, capacityType: spot	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T14:17:11.985Z	INFO	controller.provisioning	Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m6i.8xlarge, m5.8xlarge, m5ad.8xlarge and 9 other(s)	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-08T14:17:11.985Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "062a029"}
2023-10-08T14:17:11.986Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-tzjfj","uid":"cdcc1056-67e7-465e-8300-e073d7de575e","apiVersion":"v1","resourceVersion":"590509307"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-25.us-west-2.compute.internal"}
2023-10-08T14:17:14.175Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-tzjfj","uid":"cdcc1056-67e7-465e-8300-e073d7de575e","apiVersion":"v1","resourceVersion":"590509307"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T14:17:14.175Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-pq86z","uid":"b94f656b-1324-46c9-a293-c1cc8462af06","apiVersion":"v1","resourceVersion":"590509310"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-25.us-west-2.compute.internal"}
2023-10-08T14:17:17.121Z	DEBUG	controller.provisioning	Discovered 733 EC2 instance types	{"commit": "062a029"}
2023-10-08T14:17:17.299Z	DEBUG	controller.provisioning	Discovered EC2 instance types zonal offerings	{"commit": "062a029"}
2023-10-08T14:17:54.560Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-08T14:18:00.869Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-25.us-west-2.compute.internal"}
2023-10-08T14:18:00.911Z	INFO	controller.node	Added TTL to empty node	{"commit": "062a029", "node": "ip-10-13-165-25.us-west-2.compute.internal"}
2023-10-08T14:18:02.169Z	INFO	controller.node	Removed emptiness TTL from node	{"commit": "062a029", "node": "ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-09T06:57:15.875Z	INFO	controller.provisioning.cloudprovider	Launched instance: i-0d140c1ef2f56852f, hostname: ip-10-13-165-204.us-west-2.compute.internal, type: m6a.8xlarge, zone: us-west-2a, capacityType: spot	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-09T06:57:15.889Z	DEBUG	controller.provisioning	node ip-10-13-165-204.us-west-2.compute.internal already registered	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-09T06:57:15.889Z	INFO	controller.provisioning	Created node with 1 pods requesting {"cpu":"31435m","memory":"115428Mi","pods":"6"} from types m5a.8xlarge, m6a.8xlarge, m5.8xlarge, m6i.8xlarge, m5ad.8xlarge and 9 other(s)	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-09T06:57:15.889Z	DEBUG	controller.events	Normal	{"commit": "062a029", "object": {"kind":"Pod","namespace":"ta-cluster","name":"trino-worker-ch-extra-project-674d9cd7f8-j22xd","uid":"90062ac0-f255-459b-8e43-ec5e04d2602b","apiVersion":"v1","resourceVersion":"591455506"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-13-165-204.us-west-2.compute.internal"}
2023-10-09T06:57:15.890Z	INFO	controller.provisioning.cloudprovider	Launched instance: i-0c7d67805cbe0d5ac, hostname: ip-10-13-165-38.us-west-2.compute.internal, type: m6a.8xlarge, zone: us-west-2a, capacityType: spot	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}
2023-10-09T06:57:15.930Z	INFO	controller.provisioning.cloudprovider	Launched instance: i-0e27db93ea7e42115, hostname: ip-10-13-165-237.us-west-2.compute.internal, type: m6a.8xlarge, zone: us-west-2a, capacityType: spot	{"commit": "062a029", "provisioner": "trino-worker-ch-extra-project"}

Just want to confirm: is this the same issue?
