
cluster autoscaler deleted non-empty node as ScaleDownEmpty #5790

Closed
infa-ddeore opened this issue May 22, 2023 · 9 comments
Labels
area/cluster-autoscaler, area/core-autoscaler, kind/bug, lifecycle/rotten

Comments

@infa-ddeore

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
1.23

What k8s version are you using (kubectl version)?: 1.23


What environment is this in?:
EKS

What did you expect to happen?:
cluster autoscaler should re-check whether the node is empty before deleting it

What happened instead?:
cluster autoscaler deleted a non-empty node

How to reproduce it (as minimally and precisely as possible):
It is difficult to reproduce the issue; the behaviour shows up in the following logs:

  1. 06:46:13.400298 --> a pod is scheduled, per the k8s API server logs:
I0322 06:46:13.400298      11 scheduler.go:675] "Successfully bound pod to node" pod="namespace-name/pod-name-6744f894fb-2nl44" node="ip-10-10-10-100.us-west-2.compute.internal" evaluatedNodes=16 feasibleNodes=4
  2. 06:46:13.772090 --> ~371 ms later, CA considers the node empty, even though the log above shows a pod was just scheduled onto it:
I0322 06:46:13.772090       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"e02fdcd5-21d0-48df-8dba-4bbddd8ee247", APIVersion:"v1", ResourceVersion:"503217222", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node ip-10-10-10-100.us-west-2.compute.internal
  3. 06:46:13.786697 --> CA adds the ToBeDeletedTaint taint to the node:
I0322 06:46:13.786697       1 delete.go:103] Successfully added ToBeDeletedTaint on node ip-10-10-10-100.us-west-2.compute.internal
  4. 06:46:14.056277 --> CA terminates the EC2 instance directly, believing the node is empty:
I0322 06:46:14.056277       1 auto_scaling_groups.go:277] Terminating EC2 instance: i-xxxxxxxxxxxxxx
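For illustration, here is a minimal sketch of the re-check described above: before terminating the instance, list the node's pods directly from the API server rather than from the informer cache, so a pod bound in the preceding few hundred milliseconds is visible. This is not cluster-autoscaler's actual code; nodeStillEmpty is a hypothetical helper built on standard client-go calls.

```go
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodeStillEmpty returns true only if a fresh (uncached) List against the
// API server shows no non-DaemonSet, non-terminated pods bound to the node.
// A hypothetical guard to run after tainting and before terminating.
func nodeStillEmpty(ctx context.Context, cs kubernetes.Interface, nodeName string) (bool, error) {
	pods, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return false, fmt.Errorf("listing pods on %s: %w", nodeName, err)
	}
	for i := range pods.Items {
		p := &pods.Items[i]
		// Skip pods that are already terminating or finished.
		if p.DeletionTimestamp != nil || p.Status.Phase == "Succeeded" || p.Status.Phase == "Failed" {
			continue
		}
		// Skip DaemonSet pods; they do not block an "empty" scale-down.
		if owner := metav1.GetControllerOf(p); owner != nil && owner.Kind == "DaemonSet" {
			continue
		}
		return false, nil
	}
	return true, nil
}
```

Note that even with such a guard, a bind could still land between the List and the terminate call; the ToBeDeletedTaint is what is supposed to close that window, so a re-check would only narrow the race, not eliminate it.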

Anything else we need to know?:

@infa-ddeore infa-ddeore added the kind/bug Categorizes issue or PR as related to a bug. label May 22, 2023
@vadasambar
Member

Seems like CA's knowledge of the node is stale: since the ToBeDeleted taint is added after the pod is scheduled, it's possible that the pod was scheduled after CA obtained the node info. Something like this:

  1. CA gets info about the node (including pods running on the node)
  2. Scheduler schedules a pod on the node
  3. CA adds ToBeDeleted taint (thinking the node has no running pods) and deletes the node
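To make that staleness concrete, here is a toy sketch (not CA code; cachedVsLive is a made-up name) that counts the pods on a node as seen by an informer cache versus a direct List. Immediately after a scheduler bind, the two views can briefly disagree, which is exactly the window between steps 1 and 3 above.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// cachedVsLive counts the pods bound to nodeName as seen by an informer
// cache versus a direct List against the API server. Right after a bind,
// the cached count can lag the live count.
func cachedVsLive(ctx context.Context, cs kubernetes.Interface, nodeName string) (cached, live int, err error) {
	factory := informers.NewSharedInformerFactory(cs, 0)
	lister := factory.Core().V1().Pods().Lister()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())

	// Cached view: served from the informer's local store.
	fromCache, err := lister.List(labels.Everything())
	if err != nil {
		return 0, 0, err
	}
	for _, p := range fromCache {
		if p.Spec.NodeName == nodeName {
			cached++
		}
	}

	// Live view: a read straight from the API server.
	fromAPI, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return 0, 0, err
	}
	return cached, len(fromAPI.Items), nil
}
```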

@infa-ddeore
Author

Seems like CA's knowledge of the node is stale: since the ToBeDeleted taint is added after the pod is scheduled, it's possible that the pod was scheduled after CA obtained the node info. Something like this:

  1. CA gets info about the node (including pods running on the node)
  2. Scheduler schedules a pod on the node
  3. CA adds ToBeDeleted taint (thinking the node has no running pods) and deletes the node

Yeah, this is what it looks like. Is there a way to force CA to drain the node in this scenario, or any other way to improve on this?

@infa-ddeore
Author

Is there any flag or workaround to tackle this issue?

@vadasambar
Member

@infa-ddeore sorry I don't have enough bandwidth to look at this. If this is important, please bring it up in the sig-autoscaling meeting so that someone else can take a look. 🙏

@leoryu

leoryu commented Aug 3, 2023

I think this issue is not just related to the ScaleDownEmpty case, but also to the DrainNode case. From

https://github.com/kubernetes/autoscaler/blob/702e9685d6c1d002f4a448f2d88007141dfa6d56/cluster-autoscaler/core/scaledown/actuation/drain.go#L84

I think the nodeInfo might be stale, and CA will not know that a new pod has been scheduled onto the node.
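If the drain path does work from a cached NodeInfo snapshot, the same fix direction as above would apply there: re-list the node's pods from the API server at drain time and evict that set. A rough sketch under that assumption (drainWithFreshPodList is hypothetical; the eviction call is client-go's standard EvictV1):

```go
package sketch

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// drainWithFreshPodList evicts whatever a fresh List returns for the node,
// rather than a pod set captured earlier in a possibly stale snapshot.
// Hypothetical sketch; real drain logic also handles DaemonSets, mirror
// pods, timeouts, and retries.
func drainWithFreshPodList(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	pods, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		p := &pods.Items[i]
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: p.Name, Namespace: p.Namespace},
		}
		// EvictV1 goes through the eviction subresource, so
		// PodDisruptionBudgets are honored (unlike a bare Delete).
		if err := cs.CoreV1().Pods(p.Namespace).EvictV1(ctx, eviction); err != nil {
			return err
		}
	}
	return nil
}
```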

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 24, 2024
@towca towca added the area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. label Mar 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 20, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
