[Bug] Autoscaler doesn't scale CPU-only workloads to workers with GPU #20476

Closed
1 of 2 tasks
andras-kth opened this issue Nov 17, 2021 · 2 comments · Fixed by #31202
Labels
bug Something that is supposed to be working; but isn't docs An issue or change related to documentation infra autoscaler, ray client, kuberay, related issues P2 Important issue, but not time-critical
@andras-kth

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

Clusters where all nodes have a GPU fail to autoscale on CPU-only workloads.

The autoscaler could not find a node type to satisfy the request: [{'CPU': 1.0} ,...

Changing the resource definition of the node type (leaving everything else intact) allows the cluster to autoscale.

I'm guessing that this may, in fact, be the intended behavior, as a cost-saving "feature".
In which case, the right "fix" would be to define node types both with and without GPU.
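To illustrate the behavior reported here, the following is a minimal sketch, not Ray's actual autoscaler code. The helper `pick_node_type` and the node type names `gpu_worker` and `cpu_worker` are hypothetical. It assumes the autoscaler skips GPU node types whenever a request does not ask for a GPU, which would explain why a cluster with only GPU node types cannot satisfy `{'CPU': 1.0}`, and why adding a CPU-only node type works around it.

```python
from typing import Dict, Optional

Resources = Dict[str, float]


def fits(node_resources: Resources, request: Resources) -> bool:
    """True if a single node of this type has enough of every requested resource."""
    return all(node_resources.get(k, 0) >= v for k, v in request.items())


def pick_node_type(node_types: Dict[str, Resources], request: Resources) -> Optional[str]:
    """Hypothetical model of the reported behavior: GPU node types are
    skipped outright for requests that do not ask for a GPU."""
    wants_gpu = request.get("GPU", 0) > 0
    for name, resources in node_types.items():
        has_gpu = resources.get("GPU", 0) > 0
        if has_gpu and not wants_gpu:
            continue  # assumed cost-saving rule: never use GPU nodes for CPU-only work
        if fits(resources, request):
            return name
    return None


# Hypothetical node type definitions (resource dictionaries only).
gpu_only = {"gpu_worker": {"CPU": 8, "GPU": 1}}
mixed = {"gpu_worker": {"CPU": 8, "GPU": 1}, "cpu_worker": {"CPU": 8}}

print(pick_node_type(gpu_only, {"CPU": 1.0}))  # None: no node type satisfies the request
print(pick_node_type(mixed, {"CPU": 1.0}))     # cpu_worker
```

With only GPU node types defined, the CPU-only request has nowhere to go. With both kinds defined, the CPU-only type picks it up.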

Versions / Dependencies

Ray 1.8.0
Python 3.9

Reproduction script

N/A

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@andras-kth andras-kth added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 17, 2021
@DmitriGekhtman DmitriGekhtman added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 3, 2022
@wuisawesome wuisawesome added this to the Serverless Autoscaling milestone Jan 4, 2022
@AmeerHajAli AmeerHajAli added the infra autoscaler, ray client, kuberay, related issues label Mar 26, 2022
pang-wu added a commit to pang-wu/raydp that referenced this issue Jul 29, 2022
GPU auto scaling is a bug on Ray side. For more details, please see [this issue](ray-project/ray#20476).
carsonwang pushed a commit to oap-project/raydp that referenced this issue Aug 1, 2022
* Support fractional resource scheduling

* Fix java and scala code styling.

* Fix tests.

* Use marker to skip tests

* Refactor

* Use mock clusters.

Use mock cluster based on doc here: https://docs.ray.io/en/latest/ray-core/examples/testing-tips.html#tip-4-create-a-mini-cluster-with-ray-cluster-utils-cluster

* try to fix test by running the custom resource test separately.

* Remove GPU resource config.

GPU auto scaling is a bug on Ray side. For more details, please see [this issue](ray-project/ray#20476).
@DmitriGekhtman DmitriGekhtman added P1 Issue that should be fixed within a few weeks and removed P2 Important issue, but not time-critical labels Aug 31, 2022
@DmitriGekhtman DmitriGekhtman removed their assignment Nov 19, 2022
@hora-anyscale hora-anyscale added docs An issue or change related to documentation P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Dec 19, 2022
@hora-anyscale
Contributor

Per Triage Sync: Need to update docs to reflect this is intended behavior

@DmitriGekhtman
Contributor

DmitriGekhtman commented Dec 19, 2022

The correct behavior is to avoid adding GPU workers when possible, but to add GPU workers when needed to fulfill the workload.

The code for this is straightforward to implement (max 5 lines, plus a test)

DmitriGekhtman added a commit that referenced this issue Dec 21, 2022
…riority (#31202)

Closes #20476:
Instead of preventing GPU upscaling due to non-GPU tasks, prefer non-GPU nodes by assigning low utilization score to the GPU nodes.

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
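
The approach in the commit message above (prefer non-GPU nodes by scoring GPU nodes low, rather than excluding them) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from #31202. The functions `score` and `best_node_type`, the `(gpu_ok, utilization)` tuple, and the node type names are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Resources = Dict[str, float]


def score(node_resources: Resources, demands: List[Resources]) -> Optional[Tuple[bool, float]]:
    """Hypothetical scorer: None if no demand fits on the node,
    otherwise (gpu_ok, utilization). Higher tuples are preferred."""
    remaining = dict(node_resources)
    placed = False
    for demand in demands:
        if all(remaining.get(k, 0) >= v for k, v in demand.items()):
            for k, v in demand.items():
                remaining[k] = remaining.get(k, 0) - v
            placed = True
    if not placed:
        return None
    # Average fraction used across the resources the node offers.
    used = [1 - remaining[k] / total for k, total in node_resources.items() if total > 0]
    utilization = sum(used) / len(used)
    node_has_gpu = node_resources.get("GPU", 0) > 0
    any_gpu_demand = any(d.get("GPU", 0) > 0 for d in demands)
    # A GPU node serving only non-GPU demands gets gpu_ok=False, which sorts
    # below every gpu_ok=True candidate but still remains a candidate.
    gpu_ok = (not node_has_gpu) or any_gpu_demand
    return (gpu_ok, utilization)


def best_node_type(node_types: Dict[str, Resources], demands: List[Resources]) -> Optional[str]:
    """Pick the feasible node type with the highest score, or None if none fits."""
    scored = [(s, name) for name, res in node_types.items()
              if (s := score(res, demands)) is not None]
    return max(scored)[1] if scored else None


gpu_only = {"gpu_worker": {"CPU": 8, "GPU": 1}}
mixed = {"gpu_worker": {"CPU": 8, "GPU": 1}, "cpu_worker": {"CPU": 8}}

print(best_node_type(gpu_only, [{"CPU": 1.0}]))  # gpu_worker: used as a last resort
print(best_node_type(mixed, [{"CPU": 1.0}]))     # cpu_worker: preferred when available
```

Because the GPU flag is the first element of the tuple, it dominates the comparison: a CPU-only node type always wins for CPU-only demands, while a cluster that defines only GPU node types can still scale up.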
AmeerHajAli pushed a commit that referenced this issue Jan 12, 2023
…riority (#31202)

Closes #20476:
Instead of preventing GPU upscaling due to non-GPU tasks, prefer non-GPU nodes by assigning low utilization score to the GPU nodes.

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
tamohannes pushed a commit to ju2ez/ray that referenced this issue Jan 25, 2023
…riority (ray-project#31202)

Closes ray-project#20476:
Instead of preventing GPU upscaling due to non-GPU tasks, prefer non-GPU nodes by assigning low utilization score to the GPU nodes.

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>