Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add filtering out DS pods from scale-up, refactor default pod list processor #5442

Merged
merged 1 commit into from
Jan 25, 2023

Conversation

BigDarkClown
Copy link
Contributor

@BigDarkClown BigDarkClown commented Jan 23, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR improves the way the CA handles unschedulable DS pods. These pods are unhelpable from the scale-upside, as each of them can only get scheduled on a specific existing node. Currently these pods are passed to scale-up logic, which increases the computational complexity, even though we know it is unnecessary.

This change filters out pending DS pods, which solves the described issue. It additionally refactors the defaultPodListProcessor to make it more readable.

Does this PR introduce a user-facing change?

Pending Daemon Set pods will no longer emit scale up events

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 23, 2023
@k8s-ci-robot k8s-ci-robot requested a review from feiskyer January 23, 2023 17:20
@k8s-ci-robot k8s-ci-robot requested a review from x13n January 23, 2023 17:20
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 23, 2023
@towca
Copy link
Collaborator

towca commented Jan 25, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 25, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BigDarkClown, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 25, 2023
@k8s-ci-robot k8s-ci-robot merged commit bdf033e into kubernetes:master Jan 25, 2023
rishabh-11 added a commit to rishabh-11/autoscaler that referenced this pull request May 8, 2023
* recommendation post processor for integer CPU

* update flag cpu-integer-post-processor-enabled description

* directly comparing recommendation with rescaling the values

* Capping post-processor should always be the last one

* add documentation for CPU integer post processor

* fix typo in FAQ

* Uncomment myself now that I am a Kubernetes org member.

* aws: allow setting max retries from AWS_MAX_ATTEMPTS env var

* Move PredicateChecker initialization before processors initialization

* allow setting kuberentes client burst and qps to avoid rate limiting

* State expectations around cloudprovider OWNERS

* Stop asking PR creators to name modified components

Rely on labels instead.

* Cap logs logged by HintingSimulator.

To avoid excesive logging of the HintingSimulator we should use klogx
to cap the number of logged lines depending on the verbosity level.

* Configurable difference ratios

* fixing linting pipeline

* fix: alicloud example yaml config

* Mark VPA KEP 4902 as deprecated in favor of using PDB feature

* Added the RBAC Permission to alicloud.

* update clusterapi readme with table of contents

this change will make navigating the readme easier for users.

* add a note to clusterapi readme about ignored labels

this change adds a section to the readme that provides advice for
clusterapi users about which labels they might want to ignore when using
the balance similar node groups flag on various cloud providers.

* remove clusterapi nodegroupset processor

as discussed with the cluster api community[0], the nodegroupset
processor is being removed from the clusterapi provider implementation
in favor of instructing our community on the use of the
--balancing-ignore-label flag. due to the wide variety of provider
infrastructures that clusterapi can be deployed on, we would prefer to
not encode all of these labels in the autoscaler itself. see the linked
recording for more information.

[0] https://www.youtube.com/watch?v=jbhca_9oPuQ

* gRPC expander: allow realistic server responses, and log errors

Expander requests' payloads can be rather heavy under upscale pressure,
as they're compounding all candidates options and unschedulable pods
that could fit each options. Expander responses are a subset of the
requests' payload items.

We're allowing ourself to send arbitrary payload sizes (gRPC
`defaultClientMaxSendMessageSize` is `math.MaxInt32`), but we're prone
to drop expander servers responses to the floor, due to the `4MiB`
`defaultClientMaxReceiveMessageSize`.

The arbitrary 128MiB value is meant to be huge (enough to support eg.
several dozen fat 1MiB pods) but not unlimited. Let me know if you'd
rather see that turned to be a command line flag, or an other value.

Also logging the possible gRPC call errors, as that of great help to
diagnose that kind of issues.

* Added link for RFC3986 in host.go

* add an extra note to clusterapi readme about gpus

this change adds a little more detail to ensure that users understand
how to use the GPU label feature.

* Rephrase error messages specific for GPUs.

Some error messages related to custom resources assumes that the custom
resource is always GPU. Rephrasing these error messages to clear up any
confusion.

* Add scale down candidates observer

* Add previous scale down candidate sorting

* OCI OKE autoscaller requires different set of permissions. Without this autoscaller is not working

* Add os parameters to MigOsInfo interface

* Cleanup version from osReserved calculator

* Cleanup Version() method for gce Mig

* Clean up DS utils: remove unused cluster snapshot and predicate checker

* Add filtering out DS pods from scale-up, refactor default pod list processor

* update vendor to v1.27.0-alpha.1

* return argument of framework.RunFilterPlugins has changed from a map[string]*Status to *Status.
The func returns after the first failed plugin.

Augmented schedulerbased.go to reflect this change.

* update vendor to 1.26.3

* update go.mod

---------

Co-authored-by: David Benque <david.benque@datadoghq.com>
Co-authored-by: Junwon Kwon <snsxhdz@naver.com>
Co-authored-by: jesse.millan <jesse.millan@oracle.com>
Co-authored-by: Michael Grosser <michael@grosser.it>
Co-authored-by: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>
Co-authored-by: Bartłomiej Wróblewski <bwroblewski@google.com>
Co-authored-by: Daniel Kłobuszewski <danielmk@google.com>
Co-authored-by: Daniel Gutowski <danielgutowski@google.com>
Co-authored-by: bsoghigian <bsoghigian@microsoft.com>
Co-authored-by: ZhenRan <zhenran.sz@gmail.com>
Co-authored-by: Joachim Bartosik <jbartosik@google.com>
Co-authored-by: shubham82 <shubham.kuchhal@india.nec.com>
Co-authored-by: michael mccune <msm@opbstudios.com>
Co-authored-by: Benjamin Pineau <benjamin.pineau@datadoghq.com>
Co-authored-by: Hakan Bostan <hbostan@google.com>
Co-authored-by: Yaroslava Serdiuk <yaroslava@google.com>
Co-authored-by: janis.orlovs <janis@cwise.eu>
Co-authored-by: Jayant Jain <jayantj@google.com>
rishabh-11 pushed a commit to rishabh-11/autoscaler that referenced this pull request May 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants