Commit eaf2af4
[Serve] Prioritize stopping most recently scaled-up replicas during downscaling (#52929)
## Why are these changes needed?
This PR improves the downscaling behavior in Ray Serve by modifying the
logic in `_get_replicas_to_stop()` within the default `DeploymentScheduler`.
Previously, the scheduler selected replicas to stop by traversing the
least loaded nodes in ascending order. This often resulted in stopping
replicas that had been scheduled earlier and placed optimally using the
`_best_fit_node()` strategy.
This led to several drawbacks:
- Long-lived replicas, which were scheduled on best-fit nodes, were
removed first, leading to inefficient reuse of resources.
- Recently scaled-up replicas, which were placed on less utilized nodes,
were kept longer despite being suboptimal.
- Cold-start overhead increased, as the replicas that remained were newer
and not yet fully warmed up.
This PR reverses the node traversal order during downscaling so that
**more recently added replicas are prioritized for termination**, *in
cases where other conditions (e.g., running state and number of replicas
per node) are equal*. These newer replicas are typically less optimal in
placement and not yet fully warmed up.
Preserving long-lived replicas improves performance stability and
reduces unnecessary resource fragmentation.
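
The ordering described above can be sketched as a sort key. This is a minimal illustration, not the code in `_get_replicas_to_stop()`: the `ReplicaInfo` record, the `scale_up_seq` field, and the `pick_replicas_to_stop()` helper are hypothetical names, and the sketch assumes that non-running replicas and replicas on nodes with fewer replicas are still considered first, with recency acting only as the tie-breaker.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReplicaInfo:
    """Hypothetical record for illustration; not a Ray Serve type."""

    replica_id: str
    node_id: str
    is_running: bool
    scale_up_seq: int  # hypothetical: a larger value means added more recently


def pick_replicas_to_stop(replicas: List[ReplicaInfo], num_to_stop: int) -> List[str]:
    """Return the IDs of the replicas to stop, highest priority first."""
    replicas_per_node = Counter(r.node_id for r in replicas)

    def stop_priority(r: ReplicaInfo):
        # Tuples sort element by element:
        # 1. non-running replicas first (False sorts before True),
        # 2. then replicas on nodes that hold fewer replicas,
        # 3. then, when the above are equal, the most recently added replica.
        return (r.is_running, replicas_per_node[r.node_id], -r.scale_up_seq)

    ranked = sorted(replicas, key=stop_priority)
    return [r.replica_id for r in ranked[:num_to_stop]]


demo = [
    ReplicaInfo("r0", "n1", True, 0),
    ReplicaInfo("r1", "n1", True, 1),
    ReplicaInfo("r2", "n2", True, 2),
    ReplicaInfo("r3", "n3", True, 3),
    ReplicaInfo("r4", "n4", False, 4),
]
print(pick_replicas_to_stop(demo, 3))
```

In this sketch `r2` and `r3` tie on running state and replicas per node, so the newer `r3` is chosen before `r2`, while the two long-lived replicas packed onto `n1` are kept until last.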
## Related issue number
N/A
## Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
---------
Signed-off-by: kitae <ryugitae777@gmail.com>

1 parent 0325fab commit eaf2af4
File tree
2 files changed (+21, -14) under `python/ray/serve`, in `_private` and `tests/unit`.

[Diff content not captured; only line numbers survive. First file: two hunks, covering original lines 735-752 (new lines 735-760) and original lines 760-774 (new lines 768-781). Second file: single-line replacements at lines 677, 740, and 864.]