Add filtering out DS pods from scale-up, refactor default pod list processor #5442

BigDarkClown · 2023-01-23T17:20:17Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR improves the way the CA handles unschedulable DS pods. These pods are unhelpable from the scale-upside, as each of them can only get scheduled on a specific existing node. Currently these pods are passed to scale-up logic, which increases the computational complexity, even though we know it is unnecessary.

This change filters out pending DS pods, which solves the described issue. It additionally refactors the defaultPodListProcessor to make it more readable.

Does this PR introduce a user-facing change?

Pending Daemon Set pods will no longer emit scale up events

…ocessor

towca · 2023-01-25T12:28:22Z

/lgtm
/approve

k8s-ci-robot · 2023-01-25T12:28:40Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BigDarkClown, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/OWNERS~~ [towca]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* recommendation post processor for integer CPU * update flag cpu-integer-post-processor-enabled description * directly comparing recommendation with rescaling the values * Capping post-processor should always be the last one * add documentation for CPU integer post processor * fix typo in FAQ * Uncomment myself now that I am a Kubernetes org member. * aws: allow setting max retries from AWS_MAX_ATTEMPTS env var * Move PredicateChecker initialization before processors initialization * allow setting kuberentes client burst and qps to avoid rate limiting * State expectations around cloudprovider OWNERS * Stop asking PR creators to name modified components Rely on labels instead. * Cap logs logged by HintingSimulator. To avoid excesive logging of the HintingSimulator we should use klogx to cap the number of logged lines depending on the verbosity level. * Configurable difference ratios * fixing linting pipeline * fix: alicloud example yaml config * Mark VPA KEP 4902 as deprecated in favor of using PDB feature * Added the RBAC Permission to alicloud. * update clusterapi readme with table of contents this change will make navigating the readme easier for users. * add a note to clusterapi readme about ignored labels this change adds a section to the readme that provides advice for clusterapi users about which labels they might want to ignore when using the balance similar node groups flag on various cloud providers. * remove clusterapi nodegroupset processor as discussed with the cluster api community[0], the nodegroupset processor is being removed from the clusterapi provider implementation in favor of instructing our community on the use of the --balancing-ignore-label flag. due to the wide variety of provider infrastructures that clusterapi can be deployed on, we would prefer to not encode all of these labels in the autoscaler itself. see the linked recording for more information. [0] https://www.youtube.com/watch?v=jbhca_9oPuQ * gRPC expander: allow realistic server responses, and log errors Expander requests' payloads can be rather heavy under upscale pressure, as they're compounding all candidates options and unschedulable pods that could fit each options. Expander responses are a subset of the requests' payload items. We're allowing ourself to send arbitrary payload sizes (gRPC `defaultClientMaxSendMessageSize` is `math.MaxInt32`), but we're prone to drop expander servers responses to the floor, due to the `4MiB` `defaultClientMaxReceiveMessageSize`. The arbitrary 128MiB value is meant to be huge (enough to support eg. several dozen fat 1MiB pods) but not unlimited. Let me know if you'd rather see that turned to be a command line flag, or an other value. Also logging the possible gRPC call errors, as that of great help to diagnose that kind of issues. * Added link for RFC3986 in host.go * add an extra note to clusterapi readme about gpus this change adds a little more detail to ensure that users understand how to use the GPU label feature. * Rephrase error messages specific for GPUs. Some error messages related to custom resources assumes that the custom resource is always GPU. Rephrasing these error messages to clear up any confusion. * Add scale down candidates observer * Add previous scale down candidate sorting * OCI OKE autoscaller requires different set of permissions. Without this autoscaller is not working * Add os parameters to MigOsInfo interface * Cleanup version from osReserved calculator * Cleanup Version() method for gce Mig * Clean up DS utils: remove unused cluster snapshot and predicate checker * Add filtering out DS pods from scale-up, refactor default pod list processor * update vendor to v1.27.0-alpha.1 * return argument of framework.RunFilterPlugins has changed from a map[string]*Status to *Status. The func returns after the first failed plugin. Augmented schedulerbased.go to reflect this change. * update vendor to 1.26.3 * update go.mod --------- Co-authored-by: David Benque <david.benque@datadoghq.com> Co-authored-by: Junwon Kwon <snsxhdz@naver.com> Co-authored-by: jesse.millan <jesse.millan@oracle.com> Co-authored-by: Michael Grosser <michael@grosser.it> Co-authored-by: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com> Co-authored-by: Bartłomiej Wróblewski <bwroblewski@google.com> Co-authored-by: Daniel Kłobuszewski <danielmk@google.com> Co-authored-by: Daniel Gutowski <danielgutowski@google.com> Co-authored-by: bsoghigian <bsoghigian@microsoft.com> Co-authored-by: ZhenRan <zhenran.sz@gmail.com> Co-authored-by: Joachim Bartosik <jbartosik@google.com> Co-authored-by: shubham82 <shubham.kuchhal@india.nec.com> Co-authored-by: michael mccune <msm@opbstudios.com> Co-authored-by: Benjamin Pineau <benjamin.pineau@datadoghq.com> Co-authored-by: Hakan Bostan <hbostan@google.com> Co-authored-by: Yaroslava Serdiuk <yaroslava@google.com> Co-authored-by: janis.orlovs <janis@cwise.eu> Co-authored-by: Jayant Jain <jayantj@google.com>

Sync upto `kubernetes#5442` PR of upstream

Add filtering out DS pods from scale-up, refactor default pod list pr…

d4b812e

…ocessor

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 23, 2023

k8s-ci-robot requested a review from feiskyer January 23, 2023 17:20

k8s-ci-robot added the area/cluster-autoscaler label Jan 23, 2023

k8s-ci-robot requested a review from x13n January 23, 2023 17:20

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 23, 2023

k8s-ci-robot assigned towca Jan 25, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 25, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 25, 2023

k8s-ci-robot merged commit bdf033e into kubernetes:master Jan 25, 2023

rishabh-11 mentioned this pull request May 8, 2023

Sync upto #5442 PR of upstream gardener/autoscaler#214

Merged

rishabh-11 pushed a commit to rishabh-11/autoscaler that referenced this pull request May 8, 2023

Merge pull request gardener#214 from rishabh-11/sync-mcmp-branch

1d00c0c

Sync upto `kubernetes#5442` PR of upstream

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add filtering out DS pods from scale-up, refactor default pod list processor #5442

Add filtering out DS pods from scale-up, refactor default pod list processor #5442

BigDarkClown commented Jan 23, 2023 •

edited

Loading

towca commented Jan 25, 2023

k8s-ci-robot commented Jan 25, 2023

Add filtering out DS pods from scale-up, refactor default pod list processor #5442

Add filtering out DS pods from scale-up, refactor default pod list processor #5442

Conversation

BigDarkClown commented Jan 23, 2023 • edited Loading

What type of PR is this?

What this PR does / why we need it:

Does this PR introduce a user-facing change?

towca commented Jan 25, 2023

k8s-ci-robot commented Jan 25, 2023

BigDarkClown commented Jan 23, 2023 •

edited

Loading