
Commit 2abd476

Merge branch 'master' into serve-llm-cross-node-parallelism-docs
2 parents: 287a374 + 4e3039e

25 files changed (+1986, -174 lines)

.buildkite/data.rayci.yml

Lines changed: 8 additions & 2 deletions

@@ -40,7 +40,7 @@ steps:
     tags:
       - data
     instance_type: medium
-    parallelism: 2
+    parallelism: 8
     commands:
       - bazel run //ci/ray_ci:test_in_docker -- //python/ray/data/... //python/ray/air/... data
         --workers "$${BUILDKITE_PARALLEL_JOB_COUNT}"
@@ -54,8 +54,11 @@ steps:
       - data
       - data_non_parallel
     instance_type: medium
+    parallelism: 3
     commands:
       - bazel run //ci/ray_ci:test_in_docker -- //python/ray/data/... //python/ray/air/... data
+        --workers "$${BUILDKITE_PARALLEL_JOB_COUNT}"
+        --worker-id "$${BUILDKITE_PARALLEL_JOB}"
         --build-name data9build
         --only-tags data_non_parallel
     depends_on: data9build
@@ -65,7 +68,7 @@ steps:
       - python
       - data
     instance_type: medium
-    parallelism: 2
+    parallelism: 8
     commands:
       - bazel run //ci/ray_ci:test_in_docker -- //python/ray/data/... //python/ray/air/... data
         --workers "$${BUILDKITE_PARALLEL_JOB_COUNT}"
@@ -80,8 +83,11 @@ steps:
       - data
       - data_non_parallel
     instance_type: medium
+    parallelism: 3
     commands:
       - bazel run //ci/ray_ci:test_in_docker -- //python/ray/data/... //python/ray/air/... data
+        --workers "$${BUILDKITE_PARALLEL_JOB_COUNT}"
+        --worker-id "$${BUILDKITE_PARALLEL_JOB}"
         --build-name datalbuild
         --only-tags data_non_parallel
     depends_on: datalbuild

.buildkite/dependencies.rayci.yml

Lines changed: 1 addition & 0 deletions

@@ -29,6 +29,7 @@ steps:
       - cp ./python/deplocks/llm/* /artifact-mount/
     job_env: manylinux
     depends_on: manylinux
+    soft_fail: true

   - label: ":tapioca: build: raydepsets: compile ray img dependencies"
     key: raydepsets_compile_rayimg_dependencies

ci/lint/pre-push

Lines changed: 3 additions & 3 deletions

@@ -2,14 +2,14 @@

 echo "Linting changes as part of pre-push hook"
 echo ""
-echo "ci/lint/format.sh:"
-ci/lint/format.sh
+echo "pre-commit:"
+pre-commit run --from-ref master --to-ref HEAD

 lint_exit_status=$?
 if [ $lint_exit_status -ne 0 ]; then
     echo ""
     echo "Linting changes failed."
-    echo "Please make sure 'ci/lint/format.sh'"\
+    echo "Please make sure 'pre-commit'"\
         "runs with no errors before pushing."
     echo "If you want to ignore this and push anyways,"\
         "re-run with '--no-verify'."

ci/lint/pydoclint-baseline.txt

Lines changed: 0 additions & 4 deletions

@@ -1528,10 +1528,6 @@ python/ray/serve/_private/api.py
     DOC201: Function `serve_start` does not have a return section in docstring
 --------------------
 python/ray/serve/_private/application_state.py
-    DOC001: Method `__init__` Potential formatting errors in docstring. Error message: No specification for "Args": ""
-    DOC001: Function/method `__init__`: Potential formatting errors in docstring. Error message: No specification for "Args": "" (Note: DOC001 could trigger other unrelated violations under this function/method too. Please fix the docstring formatting first.)
-    DOC101: Method `ApplicationState.__init__`: Docstring contains fewer arguments than in function signature.
-    DOC103: Method `ApplicationState.__init__`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [deployment_state_manager: DeploymentStateManager, endpoint_state: EndpointState, logging_config: LoggingConfig, name: str].
     DOC103: Method `ApplicationStateManager.deploy_app`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [deployment_args: List[Dict]]. Arguments in the docstring but not in the function signature: [deployment_args_list: ].
     DOC102: Function `override_deployment_info`: Docstring contains more arguments than in function signature.
     DOC103: Function `override_deployment_info`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the docstring but not in the function signature: [app_name: ].

ci/raydepsets/workspace.py

Lines changed: 2 additions & 2 deletions

@@ -117,8 +117,8 @@ def get_configs_dir(self, configs_path: str) -> List[str]:
     def load_config(self, config_path: str) -> Config:
         with open(os.path.join(self.dir, config_path), "r") as f:
             data = yaml.safe_load(f)
-            config_name = os.path.basename(config_path)
-            config = Config.from_dict(data, config_name)
+        config_name = os.path.basename(config_path)
+        config = Config.from_dict(data, config_name)
         return config

     def merge_configs(self, configs: List[Config]) -> Config:

doc/source/ray-contribute/development.rst

Lines changed: 7 additions & 10 deletions

@@ -319,11 +319,12 @@ You can tweak the build with the following environment variables (when running `
 Installing additional dependencies for development
 --------------------------------------------------

-Dependencies for the linter (``scripts/format.sh``) can be installed with:
+Dependencies for the linter (``pre-commit``) can be installed with:

 .. code-block:: shell

-    pip install -c python/requirements_compiled.txt -r python/requirements/lint-requirements.txt
+    pip install -c python/requirements_compiled.txt pre-commit
+    pre-commit install

 Dependencies for running Ray unit tests under ``python/ray/tests`` can be installed with:

@@ -336,12 +337,9 @@ Requirement files for running Ray Data / ML library tests are under ``python/requ
 Pre-commit Hooks
 ----------------

-Ray is planning to replace the pre-push hooks that are invoked from ``scripts/format.sh`` with
-pre-commit hooks using `the pre-commit python package <https://pre-commit.com/>`_ in the future. At
-the moment, we have configured a ``.pre-commit-config.yaml`` which runs all the same checks done by
-``scripts/format.sh`` along with a few additional ones too. Currently this developer tooling is
-opt-in, with any formatting changes made by ``scripts/format.sh`` expected to be caught by
-``pre-commit`` as well. To start using ``pre-commit``:
+Ray uses pre-commit hooks with `the pre-commit python package <https://pre-commit.com/>`_.
+The ``.pre-commit-config.yaml`` file configures all the linting and formatting checks.
+To start using ``pre-commit``:

 .. code-block:: shell

@@ -356,8 +354,7 @@ you commit new code changes with git. To temporarily skip pre-commit checks, use

     git commit -n

-If you find that ``scripts/format.sh`` makes a change that is different from what ``pre-commit``
-does, please `report an issue here`_.
+If you encounter any issues with ``pre-commit``, please `report an issue here`_.

 .. _report an issue here: https://github.com/ray-project/ray/issues/new?template=bug-report.yml


doc/source/ray-contribute/docs.md

Lines changed: 1 addition & 1 deletion

@@ -100,7 +100,7 @@ It's considered good practice to check the output of your build to make sure eve

 Before committing any changes, make sure you run the
 [linter](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)
-with `../scripts/format.sh` from the `doc` folder,
+with `pre-commit run` from the `doc` folder,
 to make sure your changes are formatted correctly.

 ### Code completion and other developer tooling

doc/source/ray-contribute/getting-involved.rst

Lines changed: 1 addition & 1 deletion

@@ -260,7 +260,7 @@ An output like the following indicates failure:
  * branch master -> FETCH_HEAD
 python/ray/util/sgd/tf/tf_runner.py:4:1: F401 'numpy as np' imported but unused # Below is the failure

-In addition, there are other formatting and semantic checkers for components like the following (not included in ``scripts/format.sh``):
+In addition, there are other formatting and semantic checkers for components like the following (not included in ``pre-commit``):

 * Python README format:


doc/source/serve/advanced-guides/index.md

Lines changed: 2 additions & 0 deletions

@@ -11,6 +11,7 @@ dyn-req-batch
 inplace-updates
 dev-workflow
 grpc-guide
+replica-ranks
 managing-java-deployments
 deploy-vm
 multi-app-container
@@ -28,6 +29,7 @@ Use these advanced guides for more options and configurations:
 - [In-Place Updates for Serve](serve-inplace-updates)
 - [Development Workflow](serve-dev-workflow)
 - [gRPC Support](serve-set-up-grpc-service)
+- [Replica Ranks](serve-replica-ranks)
 - [Ray Serve Dashboard](dash-serve-view)
 - [Experimental Java API](serve-java-api)
 - [Run Applications in Different Containers](serve-container-runtime-env-guide)
Lines changed: 166 additions & 0 deletions

@@ -0,0 +1,166 @@
(serve-replica-ranks)=

# Replica ranks

:::{warning}
This API is experimental and may change between Ray minor versions.
:::

Replica ranks provide a unique identifier for **each replica within a deployment**. Each replica receives a **rank (an integer from 0 to N-1)** and **a world size (the total number of replicas)**.

## Access replica ranks

You can access the rank and world size from within a deployment through the replica context using [`serve.get_replica_context()`](../api/doc/ray.serve.get_replica_context.rst).

The following example shows how to access replica rank information:

```{literalinclude} ../doc_code/replica_rank.py
:start-after: __replica_rank_start__
:end-before: __replica_rank_end__
:language: python
```

```{literalinclude} ../doc_code/replica_rank.py
:start-after: __replica_rank_start_run_main__
:end-before: __replica_rank_end_run_main__
:language: python
```

The [`ReplicaContext`](../api/doc/ray.serve.context.ReplicaContext.rst) provides two key fields:

- `rank`: An integer from 0 to N-1 representing this replica's unique identifier.
- `world_size`: The target number of replicas for the deployment.

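The `doc_code/replica_rank.py` snippets referenced above aren't reproduced on this page, so here is a minimal sketch of the same idea. It assumes only the `rank` and `world_size` fields described above; the deployment name, replica count, and return payload are illustrative.

```python
from ray import serve


@serve.deployment(num_replicas=4)
class RankAware:
    async def __call__(self) -> dict:
        # Read the rank at request time; ranks can be reassigned after downscaling.
        ctx = serve.get_replica_context()
        return {"rank": ctx.rank, "world_size": ctx.world_size}


app = RankAware.bind()
```

Running this app with `serve.run(app)` and sending a few requests should show each replica reporting its own rank alongside the same world size.
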
## Handle rank changes with reconfigure

When a replica's rank changes (such as during downscaling), Ray Serve can automatically call the `reconfigure` method on your deployment class to notify it of the new rank. This allows you to update replica-specific state when ranks change.

The following example shows how to implement `reconfigure` to handle rank changes:

```{literalinclude} ../doc_code/replica_rank.py
:start-after: __reconfigure_rank_start__
:end-before: __reconfigure_rank_end__
:language: python
```

```{literalinclude} ../doc_code/replica_rank.py
:start-after: __reconfigure_rank_start_run_main__
:end-before: __reconfigure_rank_end_run_main__
:language: python
```

### When reconfigure is called

Ray Serve automatically calls your `reconfigure` method in the following situations:

1. **At replica startup:** When a replica starts, if your deployment has both a `reconfigure` method and a `user_config`, Ray Serve calls `reconfigure` after running `__init__`. This lets you initialize rank-aware state without duplicating code between `__init__` and `reconfigure`.
2. **When you update user_config:** When you redeploy with a new `user_config`, Ray Serve calls `reconfigure` on all running replicas. If your `reconfigure` method includes `rank` as a parameter, Ray Serve passes both the new `user_config` and the current rank.
3. **When a replica's rank changes:** During downscaling, ranks may be reassigned to maintain contiguity (0 to N-1). If your `reconfigure` method includes `rank` as a parameter and your deployment has a `user_config`, Ray Serve calls `reconfigure` with the existing `user_config` and the new rank.

:::{note}
**Requirements to receive rank updates:**

To get rank changes through `reconfigure`, your deployment needs:
- A class-based deployment (function deployments don't support `reconfigure`)
- A `reconfigure` method with `rank` as a parameter: `def reconfigure(self, user_config, rank: int)`
- A `user_config` in your deployment (even if it's just an empty dict: `user_config={}`)

Without a `user_config`, Ray Serve won't call `reconfigure` for rank changes.
:::

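To make these requirements concrete, the following is a minimal, hedged sketch of a rank-aware deployment. It isn't the committed `doc_code/replica_rank.py` example; the class name and the comments are illustrative.

```python
from ray import serve


# A user_config is required to receive rank updates, even if it's just an empty dict.
@serve.deployment(num_replicas=3, user_config={})
class ShardedModel:
    def __init__(self):
        self.rank = None

    def reconfigure(self, user_config: dict, rank: int):
        # Called after __init__, when the user_config changes, and when this
        # replica's rank changes (for example, after downscaling).
        self.rank = rank
        # Rank-specific setup would go here, such as loading the shard this replica owns.

    async def __call__(self) -> dict:
        ctx = serve.get_replica_context()
        return {"rank": self.rank, "world_size": ctx.world_size}


app = ShardedModel.bind()
```

Redeploying this app with a smaller `num_replicas` should eventually invoke `reconfigure` on any replica whose rank is reassigned to keep ranks contiguous.
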
:::{tip}
If you'd like different behavior for when `reconfigure` is called with rank changes, [open a GitHub issue](https://github.com/ray-project/ray/issues/new/choose) to discuss your use case with the Ray Serve team.
:::

## How replica ranks work

:::{note}
**Rank reassignment is eventually consistent**

When replicas are removed during downscaling, rank reassignment to maintain contiguity (0 to N-1) doesn't happen immediately. The controller performs rank consistency checks and reassignment only when the deployment reaches a `HEALTHY` state in its update loop. This means there can be a brief period after downscaling where ranks are non-contiguous before the controller reassigns them.

This design choice prevents rank reassignment from interfering with ongoing deployment updates and rollouts. If you need immediate rank reassignment or different behavior, [open a GitHub issue](https://github.com/ray-project/ray/issues/new/choose) to discuss your use case with the Ray Serve team.
:::

:::{note}
**Ranks don't influence scheduling or eviction decisions**

Replica ranks are independent of scheduling and eviction decisions. The deployment scheduler doesn't consider ranks when placing replicas on nodes, so there's no guarantee that replicas with contiguous ranks (such as rank 0 and rank 1) will be on the same node. Similarly, during downscaling, the autoscaler's eviction decisions don't take replica ranks into account—any replica can be chosen for removal regardless of its rank.

If you need rank-aware scheduling or eviction (for example, to colocate replicas with consecutive ranks), [open a GitHub issue](https://github.com/ray-project/ray/issues/new/choose) to discuss your requirements with the Ray Serve team.
:::

Ray Serve manages replica ranks automatically throughout the deployment lifecycle. The system maintains these invariants:

1. Ranks are contiguous integers from 0 to N-1.
2. Each running replica has exactly one rank.
3. No two replicas share the same rank.

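One way to observe these invariants is to fan requests out through the deployment handle and collect the ranks that respond. The following is a hedged sketch, not part of the committed example file; because requests aren't routed round-robin, a small sample may not reach every replica.

```python
from ray import serve


@serve.deployment(num_replicas=3)
class RankProbe:
    async def __call__(self) -> int:
        return serve.get_replica_context().rank


handle = serve.run(RankProbe.bind())

# Send enough requests that every replica is likely to answer at least once.
ranks = sorted({handle.remote().result() for _ in range(50)})
print(ranks)  # Expect unique ranks drawn from [0, 1, 2].
```
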
### Rank assignment lifecycle

The following table shows how ranks and world size behave during different events:

| Event | Local Rank | World Size |
|-------|------------|------------|
| Upscaling | No change for existing replicas | Increases to target count |
| Downscaling | Can change to maintain contiguity | Decreases to target count |
| Another replica dies (and is restarted) | No change | No change |
| This replica dies | No change | No change |

:::{note}
World size always reflects the target number of replicas configured for the deployment, not the current number of running replicas. During scaling operations, the world size updates immediately to the new target, even while replicas are still starting or stopping.
:::

### Rank lifecycle state machine

```
┌──────────────────────────────────────────────────┐
│               DEPLOYMENT LIFECYCLE                │
└──────────────────────────────────────────────────┘

Initial Deployment / Upscaling:
┌──────────┐      assign       ┌──────────┐
│ No Rank  │ ───────────────>  │ Rank: N-1│
└──────────┘                   └──────────┘
(Contiguous: 0, 1, 2, ..., N-1)

Replica Crash:
┌──────────┐      release      ┌──────────┐      assign       ┌──────────┐
│ Rank: K  │ ───────────────>  │ Released │ ───────────────>  │ Rank: K  │
│ (Dead)   │                   │          │                   │ (New)    │
└──────────┘                   └──────────┘                   └──────────┘
(K can be any rank from 0 to N-1)

:::{note}
When a replica crashes, Ray Serve automatically starts a replacement replica and assigns it the **same rank** as the crashed replica. This ensures rank contiguity is maintained without reassigning other replicas.
:::

Downscaling:
┌──────────┐      release      ┌──────────┐
│ Rank: K  │ ───────────────>  │ Released │
│ (Stopped)│                   │          │
└──────────┘                   └──────────┘

     └──> Remaining replicas may be reassigned to maintain
          contiguity: [0, 1, 2, ..., M-1] where M < N
(K can be any rank from 0 to N-1)

Controller Recovery:
┌──────────┐      recover      ┌──────────┐
│ Running  │ ───────────────>  │ Rank: N  │
│ Replicas │                   │(Restored)│
└──────────┘                   └──────────┘
(Controller queries replicas to reconstruct rank state)
```

### Detailed lifecycle events

1. **Rank assignment on startup**: Ranks are assigned when replicas start, such as during initial deployment, cold starts, or upscaling. The controller assigns ranks and propagates them to replicas during initialization. New replicas receive the lowest available rank.

2. **Rank release on shutdown**: Ranks are released only after a replica fully stops, which occurs during graceful shutdown or downscaling. Ray Serve preserves existing rank assignments as much as possible to minimize disruption.

3. **Handling replica crashes**: If a replica crashes unexpectedly, the system releases its rank and assigns the **same rank** to the replacement replica. This means that if the replica with rank 3 crashes, the new replacement replica also receives rank 3. The replacement receives its rank during initialization, and other replicas keep their existing ranks unchanged.

4. **Controller crash and recovery**: When the controller recovers from a crash, it reconstructs the rank state by querying all running replicas for their assigned ranks. Ranks aren't checkpointed; the system re-learns them directly from replicas during recovery.

5. **Maintaining rank contiguity**: After downscaling, the system may reassign ranks to remaining replicas to maintain contiguity (0 to N-1). Ray Serve minimizes reassignments by only changing ranks when necessary.
