
cluster-autoscaler.kubernetes.io/safe-to-evict: false annotation not disabling eviction #5668

Closed
fkennedy1 opened this issue Apr 6, 2023 · 12 comments
Labels
area/cluster-autoscaler
area/core-autoscaler: Denotes an issue that is related to the core autoscaler and is not specific to any provider.
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@fkennedy1

fkennedy1 commented Apr 6, 2023

Which component are you using?: cluster-autoscaler

What version of the component are you using?: v1.23.0 / helm chart v9.15.0

Component version: v1.23.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5", GitCommit:"5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e", GitTreeState:"clean", BuildDate:"2021-12-16T08:38:33Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.17-eks-48e63af", GitCommit:"47b89ea2caa1f7958bc6539d6865820c86b4bf60", GitTreeState:"clean", BuildDate:"2023-01-24T09:34:06Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: production

What did you expect to happen?: We are using EKS. The pod should not have been evicted, because the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation is present in the spec.template portion of the Kubernetes Deployment. I checked all the relevant pods and they have the annotation.

What happened instead?: The pod and node were evicted even though the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation was present.
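For reference, a minimal sketch of where the annotation is expected to sit, i.e. on the pod template's metadata rather than the Deployment's top-level metadata (resource names and image are placeholders, not taken from this issue):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app              # placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # Must be on the pod template so it ends up on the pods themselves.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
        - name: app
          image: example:latest  # placeholder

Since annotation values are strings, the value needs to be the quoted string "false"; an unquoted boolean is not a valid annotation value.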

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

I checked the logs and did not see this message at all: "Fast evaluation: node xxxx.ec2.internal cannot be removed: pod annotated as not safe to evict present: xxxx".

I even increased the scale-down threshold temporarily to see whether the above log message would appear, without success.

We also have 1 AZ per ASG (not managed node groups), so I am not sure whether suspending AZRebalance would have an effect.

Relevant component configuration:

     ./cluster-autoscaler
     --cloud-provider=aws
     --namespace=kube-system
     --node-group-auto-discovery=xxxx
     --balance-similar-node-groups=true
     --expander=least-waste
     --logtostderr=true
     --scale-down-enabled=true
     --scale-down-utilization-threshold=0.5
     --stderrthreshold=info
     --v=4

@fkennedy1 fkennedy1 added the kind/bug label Apr 6, 2023
@fkennedy1 fkennedy1 changed the title from cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation not disabling eviction to cluster-autoscaler.kubernetes.io/safe-to-evict: false annotation not disabling eviction Apr 6, 2023
@fkennedy1
Author

fkennedy1 commented Apr 8, 2023

It seems that the "pod with local storage present" log message takes precedence over the "not safe to evict" annotation log message.

@vadasambar
Member

Yes. CA checks for blocking local storage before the safe-to-evict: "false" annotation:

if HasLocalStorage(pod) && skipNodesWithLocalStorage {
	return []*apiv1.Pod{}, []*apiv1.Pod{}, &BlockingPod{Pod: pod, Reason: LocalStorageRequested}, fmt.Errorf("pod with local storage present: %s", pod.Name)
}
if hasNotSafeToEvictAnnotation(pod) {
	return []*apiv1.Pod{}, []*apiv1.Pod{}, &BlockingPod{Pod: pod, Reason: NotSafeToEvictAnnotation}, fmt.Errorf("pod annotated as not safe to evict present: %s", pod.Name)
}
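When both conditions apply, the function returns at the local-storage check, so only the "pod with local storage present" message is logged and the annotation message is never reached; the pod is treated as a blocking pod in either branch. For illustration only (a hypothetical pod spec fragment, not taken from this issue), a volume like the following is what makes HasLocalStorage(pod) return true:

  # Hypothetical fragment of a pod template's spec; names are placeholders.
  # A plain emptyDir volume counts as local storage for this check.
  volumes:
    - name: scratch
      emptyDir: {}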

@vadasambar
Member

I think it will be a little hard to say why the node is getting removed without the logs.

@comeonyo
Contributor

Can you show us the YAML of the pod and Deployment, with sensitive information removed?

@fkennedy1
Author

I solved the problem by lowering the scale-down threshold and adding skip-nodes-with-local-storage: false, which now shows the correct log message and works as expected.
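For anyone landing here later, a rough sketch of that flag change as it might look in the chart's values, assuming flags are passed through the helm chart's extraArgs map (the lowered scale-down threshold value was not shared in this thread, so it is omitted):

extraArgs:
  # Renders as --skip-nodes-with-local-storage=false on the container args.
  skip-nodes-with-local-storage: false

As discussed below, with this flag set to false, pods that use local storage but do not carry the safe-to-evict: "false" annotation are no longer treated as blocking and can be evicted during scale-down.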

@comeonyo
Contributor

If you set skip-nodes-with-local-storage to false, then the pod will be moved to another node by CA.
But you wanted the pod to be protected.

@msardana94
Copy link

> I solved the problem by lowering the scale-down threshold and adding skip-nodes-with-local-storage: false, which now shows the correct log message and works as expected.

@fkennedy1 did this work for you?

> If you set skip-nodes-with-local-storage to false, then the pod will be moved to another node by CA. But you wanted the pod to be protected.

I am facing exactly this issue after changing the config. This seems like a bug to me?

@fkennedy1
Author

@msardana94 This did work for me. However, I also updated the scale-down threshold. In my case, all the workloads with local storage are using the annotation.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jan 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 28, 2024
@towca towca added the area/core-autoscaler label Mar 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Apr 20, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to the triage robot's /close not-planned command above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
