
cluster-autoscaler.kubernetes.io/safe-to-evict: false annotation not disabling eviction #5668

Closed
fkennedy1 opened this issue Apr 6, 2023 · 12 comments
Labels
area/cluster-autoscaler
area/core-autoscaler: Denotes an issue that is related to the core autoscaler and is not specific to any provider.
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@fkennedy1

fkennedy1 commented Apr 6, 2023

Which component are you using?: cluster-autoscaler

What version of the component are you using?: v1.23.0 / helm chart v9.15.0

Component version: v1.23.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5", GitCommit:"5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e", GitTreeState:"clean", BuildDate:"2021-12-16T08:38:33Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.17-eks-48e63af", GitCommit:"47b89ea2caa1f7958bc6539d6865820c86b4bf60", GitTreeState:"clean", BuildDate:"2023-01-24T09:34:06Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: production

What did you expect to happen?: We are using EKS. The pod should not have been evicted, because the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation is present in the spec.template portion of the Kubernetes Deployment. I checked all the relevant pods and they have the annotation.

What happened instead?: The pod and node were evicted even though the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation was present.
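For reference, a minimal sketch of where the annotation is expected to sit, i.e. on the pod template's metadata rather than the Deployment's top-level metadata (resource names and image are placeholders, not taken from this issue):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app              # placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # Must be on the pod template so it ends up on the pods themselves.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
        - name: app
          image: example:latest  # placeholder

Since annotation values are strings, the value needs to be the quoted string "false"; an unquoted boolean is not a valid annotation value.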

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

I checked the logs and did not see this message at all: "Fast evaluation: node xxxx.ec2.internal cannot be removed: pod annotated as not safe to evict present: xxxx".

I even increased the scale-down threshold temporarily to see whether the above log message would appear, without success.

We also have 1 AZ per ASG (not managed node groups), so I am not sure whether suspending AZRebalance would have an effect.

Relevant component configuration:

     ./cluster-autoscaler
     --cloud-provider=aws
     --namespace=kube-system
     --node-group-auto-discovery=xxxx
     --balance-similar-node-groups=true
     --expander=least-waste
     --logtostderr=true
     --scale-down-enabled=true
     --scale-down-utilization-threshold=0.5
     --stderrthreshold=info
     --v=4

@fkennedy1 fkennedy1 added the kind/bug label Apr 6, 2023
@fkennedy1 fkennedy1 changed the title from cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation not disabling eviction to cluster-autoscaler.kubernetes.io/safe-to-evict: false annotation not disabling eviction Apr 6, 2023
@fkennedy1
Author

fkennedy1 commented Apr 8, 2023

It seems that the "pod with local storage present" log message takes precedence over the "not safe to evict" annotation log message.

@vadasambar
Member

Yes. CA checks for blocking local storage before the safe-to-evict: "false" annotation:

if HasLocalStorage(pod) && skipNodesWithLocalStorage {
	return []*apiv1.Pod{}, []*apiv1.Pod{}, &BlockingPod{Pod: pod, Reason: LocalStorageRequested}, fmt.Errorf("pod with local storage present: %s", pod.Name)
}
if hasNotSafeToEvictAnnotation(pod) {
	return []*apiv1.Pod{}, []*apiv1.Pod{}, &BlockingPod{Pod: pod, Reason: NotSafeToEvictAnnotation}, fmt.Errorf("pod annotated as not safe to evict present: %s", pod.Name)
}
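When both conditions apply, the function returns at the local-storage check, so only the "pod with local storage present" message is logged and the annotation message is never reached; the pod is treated as a blocking pod in either branch. For illustration only (a hypothetical pod spec fragment, not taken from this issue), a volume like the following is what makes HasLocalStorage(pod) return true:

  # Hypothetical fragment of a pod template's spec; names are placeholders.
  # A plain emptyDir volume counts as local storage for this check.
  volumes:
    - name: scratch
      emptyDir: {}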

@vadasambar
Member

I think it will be a little hard to say why the node is getting removed without the logs.

@comeonyo
Contributor

Can you show us the YAML of the pod and Deployment, with sensitive information removed?

@fkennedy1
Author

I solved the problem by lowering the scale-down threshold and adding skip-nodes-with-local-storage: false, which now shows the correct log message and works as expected.
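For anyone landing here later, a rough sketch of that flag change as it might look in the chart's values, assuming flags are passed through the helm chart's extraArgs map (the lowered scale-down threshold value was not shared in this thread, so it is omitted):

extraArgs:
  # Renders as --skip-nodes-with-local-storage=false on the container args.
  skip-nodes-with-local-storage: false

As discussed below, with this flag set to false, pods that use local storage but do not carry the safe-to-evict: "false" annotation are no longer treated as blocking and can be evicted during scale-down.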

@comeonyo
Contributor

If you set skip-nodes-with-local-storage to false, then the pod will be moved to another node by CA.
But you wanted the pod to be protected.

@msardana94
Copy link

> I solved the problem by lowering the scale-down threshold and adding skip-nodes-with-local-storage: false, which now shows the correct log message and works as expected.

@fkennedy1 did this work for you?

> If you set skip-nodes-with-local-storage to false, then the pod will be moved to another node by CA. But you wanted the pod to be protected.

I am facing exactly this issue after changing the config. This seems like a bug to me?

@fkennedy1
Author

@msardana94 This did work for me. However, I also updated the scale-down threshold. In my case, all the workloads with local storage are using the annotation.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jan 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 28, 2024
@towca towca added the area/core-autoscaler label Mar 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Apr 20, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to the triage robot's /close not-planned command above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
