Commit
[core][autoscaler] Fix incorrectly terminating nodes misclassified as idle in autoscaler v1 (ray-project#48519)

In autoscaler v1, nodes are incorrectly classified as idle based solely on their resource usage metrics. This misclassification can occur under the following conditions:

1. Tasks running on the node do not have assigned resources.
2. All tasks on the node are blocked on get or wait operations.

This leads to the incorrect termination of nodes during downscaling. To resolve this issue, use the `idle_duration_ms` reported by raylet instead, which already considers the aforementioned conditions.

ref: ray-project#39582

### Before: NodeDiedError

![image](https://github.com/user-attachments/assets/a126af98-7950-40c4-ad43-2448f4b0d71a)

### After

![image](https://github.com/user-attachments/assets/ae5f6c74-6b7a-4684-a126-66e9a562149c)

### Reproduction Script (on local fake nodes)

- Setting: head_nodes: < 10 cpus, worker nodes: 10 cpus
- Code:

```
import ray
import time
import os
import random

@ray.remote(max_retries=5, num_cpus=10)
def inside_ray_task_with_outside():
    print('start inside_ray_task_with_outside')
    sleep_time = 15
    start_time = time.perf_counter()
    while True:
        if(time.perf_counter() - start_time < sleep_time):
            time.sleep(0.001)
        else:
            break

@ray.remote(max_retries=5, num_cpus=10)
def inside_ray_task_without_outside():
    print('start inside_ray_task_without_outside task')
    sleep_time = 50
    start_time = time.perf_counter()
    while True:
        if(time.perf_counter() - start_time < sleep_time):
            time.sleep(0.001)
        else:
            break

@ray.remote(max_retries=0, num_cpus=10)
def outside_ray_task():
    print('start outside_ray_task task')
    future_list = [inside_ray_task_with_outside.remote(),
                   inside_ray_task_without_outside.remote()]
    ray.get(future_list)

if __name__ == '__main__':
    ray.init()
    ray.get(outside_ray_task.remote())
```

## Related issue number

Closes ray-project#46492

## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Mimi Liao <mimiliao2000@gmail.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
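
### Illustration of the two idle checks

The difference between the resource-based check and the `idle_duration_ms`-based check described in this commit message can be sketched roughly as follows. This is a minimal, self-contained illustration and not the autoscaler's actual code: `NodeReport`, `idle_by_resources`, `idle_by_duration`, and the 60-second timeout are hypothetical names and values chosen for the example. Only the `idle_duration_ms` field name comes from the message above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class NodeReport:
    """Hypothetical per-node snapshot; not Ray's real data structure."""
    total: Dict[str, float]       # resources the node owns
    available: Dict[str, float]   # resources currently unclaimed
    idle_duration_ms: int         # how long the node has been idle; 0 means busy


def idle_by_resources(report: NodeReport) -> bool:
    # Old-style check (assumed shape): idle if nothing is claimed.
    return report.available == report.total


def idle_by_duration(report: NodeReport, idle_timeout_ms: int) -> bool:
    # New-style check (assumed shape): trust the reported idle duration.
    return report.idle_duration_ms >= idle_timeout_ms


# A node whose only task is blocked in ray.get(): no resources are claimed,
# yet the node is still doing work, so its idle duration stays at 0.
blocked_node = NodeReport(
    total={"CPU": 10},
    available={"CPU": 10},
    idle_duration_ms=0,
)

print(idle_by_resources(blocked_node))          # resource check says idle
print(idle_by_duration(blocked_node, 60_000))   # duration check says busy
```

In this sketch, the resource-based check would mark the blocked node for termination, while the duration-based check would keep it alive, which matches the NodeDiedError behavior the reproduction script is meant to trigger.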