Cluster autoscaler version 1.16.0 doesn't notice pending pods #2345
Comments
For what it's worth, we're attempting to run this new CA version 1.16.0 on Kubernetes version 1.15.1. Would that version skew cause this change in behavior?
This is an unsupported and untested configuration. The CA simulates the k8s scheduler's behaviour, so the minor version of the CA binary and of k8s should always match. That said, I am still a bit surprised that it does not work at all; I would rather expect bad handling of corner cases due to slight scheduler changes. All the E2E tests pass (https://k8s-testgrid.appspot.com/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling). Those run on GCE, but the part which lists the pods is cloud-provider agnostic, so the fact that you are running it on AWS should not matter.
I'm still looking into this, mystified as I am. This morning I noticed something new: the cluster autoscaler container has failed its liveness probe 54 times overnight.
Checking the metrics published by the cluster autoscaler, I see one error emitted every ten seconds of type "cloudProviderError." These errors don't appear in the cluster autoscaler log file, though, even at verbosity level 6. I'll have to build a custom container image in order to see what the errors are about.
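For anyone who wants to peek at those counters directly, here is a minimal sketch of pulling the autoscaler's Prometheus metrics and filtering for the error counter. It assumes the metrics endpoint is reachable on localhost:8085 (the default --address, e.g. via a port-forward) and that the counter is named cluster_autoscaler_errors_total; adjust both if your deployment differs.

```go
// Sketch only: dump cluster-autoscaler error counters from its metrics endpoint.
// Assumes the autoscaler's metrics are reachable on localhost:8085 (the default
// --address), e.g. via `kubectl port-forward`. The metric name is assumed to be
// cluster_autoscaler_errors_total; check your /metrics output if it differs.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://localhost:8085/metrics")
	if err != nil {
		log.Fatalf("fetching metrics: %v", err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Print only the error counters, e.g.
		// cluster_autoscaler_errors_total{type="cloudProviderError"} 42
		if strings.HasPrefix(line, "cluster_autoscaler_errors_total") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("reading metrics: %v", err)
	}
}
```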
Note, though, that there is indeed a Kubernetes node named "i-0c0bb28525d36a503" on an EC2 instance with the same ID. That happens to be one of our master machines, not intended to be under the cluster autoscaler's control. That machine sits within an ASG that is not tagged with "kubernetes.io/cluster-autoscaler/enabled." There's a second master machine from the same ASG that's also mentioned in similar errors.
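For background on why that master ASG should never be considered: with tag-based auto-discovery, the autoscaler only manages ASGs carrying the enabled tag. The sketch below is illustrative only (it is not the autoscaler's actual discovery code) and uses aws-sdk-go with a placeholder region, just to show the kind of tag filtering involved.

```go
// Illustrative only — not the cluster autoscaler's real discovery code. Lists
// ASGs carrying the kubernetes.io/cluster-autoscaler/enabled tag, which is how
// tag-based auto-discovery decides what to manage. The region is a placeholder.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := autoscaling.New(sess)

	err := svc.DescribeAutoScalingGroupsPages(&autoscaling.DescribeAutoScalingGroupsInput{},
		func(page *autoscaling.DescribeAutoScalingGroupsOutput, lastPage bool) bool {
			for _, g := range page.AutoScalingGroups {
				for _, t := range g.Tags {
					if aws.StringValue(t.Key) == "kubernetes.io/cluster-autoscaler/enabled" {
						fmt.Println("managed ASG:", aws.StringValue(g.AutoScalingGroupName))
					}
				}
			}
			return true // keep paging
		})
	if err != nil {
		log.Fatalf("describing ASGs: %v", err)
	}
}
```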
It looks like #2247 changed what happens when the cluster autoscaler goes looking for information about a Kubernetes node that's not part of a managed ASG. @Jeffwan, have you tested that change with a cluster that has ASGs not meant to be managed by the cluster autoscaler? There's a unit test for it.
Due to this early failure, the autoscaler never reaches unschedulablePods, err := unschedulablePodLister.List(). That explains why the log file never shows it recognizing and doing anything about these pending pods.
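To make that ordering concrete, here is a hypothetical skeleton of the loop being described. It is not the real static_autoscaler RunOnce code, and the function names and messages are made up, but it shows how an error during the cluster-state refresh returns before the unschedulable-pod listing (and the logging that follows it) is ever reached.

```go
// Hypothetical skeleton of the ordering being discussed — not the cluster
// autoscaler's real RunOnce implementation.
package main

import (
	"errors"
	"fmt"
)

// updateClusterState stands in for the cloud-provider/cluster-state refresh that
// was failing with a cloudProviderError for nodes in unmanaged ASGs.
func updateClusterState() error {
	return errors.New("cloudProviderError: instance i-0c0bb28525d36a503 not found in any managed ASG")
}

// listUnschedulablePods stands in for unschedulablePodLister.List().
func listUnschedulablePods() ([]string, error) {
	return []string{"example-pending-pod"}, nil
}

func runOnce() error {
	// The refresh runs first; if it fails, the iteration bails out here...
	if err := updateClusterState(); err != nil {
		return fmt.Errorf("failed to update cluster state: %w", err)
	}
	// ...so this listing (and all the scale-up logging that follows it)
	// is never reached while the error persists.
	pods, err := listUnschedulablePods()
	if err != nil {
		return err
	}
	fmt.Printf("found %d unschedulable pods\n", len(pods))
	return nil
}

func main() {
	if err := runOnce(); err != nil {
		fmt.Println("iteration failed:", err)
	}
}
```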
@seh At the time I fixed the issue I did test it. Let me check out the latest changes and see if there are any problems there.
Please file a PR reverting #2247, and I can approve it.
/platform aws
Hi @losipiuk, this PR was originally filed to resolve the issue in the 1.15 branch. We have not cherry-picked it to 1.15 yet. @seh's case may only exist in 1.16 and later versions. I need to do more tests on this, since there is extra logic checking in autoscaler/cluster-autoscaler/clusterstate/clusterstate.go (lines 588 to 592 in bc73ded).
Let me double check the differences between 1.15 and 1.16. It's safer to add
Given the way that the cluster autoscaler releases track the Kubernetes version names so closely, can we expect this patch to appear in a container image release before Kubernetes version 1.16.1 is available?
Yes. I will prepare the release over the weekend.
Just to double check: can you confirm that a version built from the tip of cluster-autoscaler-release-1.16 works fine?
What I had tested was an image built with my patch in it, which was the tip of the "cluster-autoscaler-release-1.16" branch plus that one reverting commit. I think that's a "yes" to your question.
I released 1.16.1 (gcr.io/google-containers/cluster-autoscaler:v1.16.1). @seh can you verify that the issue is fixed in that release?
Testing this new cluster autoscaler image at version 1.16.1 with a Kubernetes cluster in AWS at version 1.15.1, I find that the autoscaler once again works as expected. Thank you for the prompt release.
Thanks for testing, and for reporting the issue in the first place :)
Hello @seh @losipiuk,
We've been using cluster autoscaler version 1.15.0 patched with #2008 for some time in AWS, to good effect. Today we attempted to put the new version 1.16.0 into service using all the same configuration, and found that it no longer seems to notice pending pods.
The cluster autoscaler starts fine, and the logs don't indicate anything failing. It goes through its periodic "main loop" and the "Regenerating instance to ASG map for ASGs" step regularly, again without any obvious problems. However, when we create pods that require that the cluster autoscaler take note and adjust a suitable ASG's size, the cluster autoscaler's logs don't show any evidence of it noticing these pods. In prior versions, we see messages to the following effect:
Instead, the new cluster autoscaler exhibits no reaction to these pending pods.
Here are the flag values reported at start time:
I didn't see any other issues complaining of this problem. Given that this release has been out for seven days now, I assume someone else would have run into this same problem.
Reverting to our previous patched container image worked fine, but we'd like to move forward. Is there some new configuration that we need to adjust in order to restore the previous behavior of the cluster autoscaler?