
Commit f896cd6

Merge conflicts
Signed-off-by: joshlee <joshlee@anyscale.com>
2 parents 37278bb + 1622ff8

File tree: 56 files changed (+1179 −634 lines)


.github/CODEOWNERS

Lines changed: 1 addition & 1 deletion
@@ -71,14 +71,14 @@
 /python/ray/data/llm.py @ray-project/ray-llm
 /python/ray/dashboard/modules/metrics/dashboards/serve_llm_dashboard_panels.py @ray-project/ray-llm
 /python/ray/dashboard/modules/metrics/dashboards/serve_llm_grafana_dashboard_base.json @ray-project/ray-llm
-/doc/source/serve/llm/ @ray-project/ray-llm

 # Ray Serve
 /python/ray/serve/ @ray-project/ray-serve
 /java/serve/ @ray-project/ray-serve
 /src/ray/protobuf/serve.proto @ray-project/ray-serve
 /python/ray/dashboard/modules/serve/ @ray-project/ray-serve
 /doc/source/serve/ @ray-project/ray-serve @ray-project/ray-docs
+/doc/source/serve/llm/ @ray-project/ray-llm @ray-project/ray-docs

 # ML Docker Dependencies
 /python/requirements/ml/dl-cpu-requirements.txt @richardliaw @matthewdeng

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 10 additions & 40 deletions
@@ -1,45 +1,15 @@
-<!-- Thank you for contributing to Ray! 🚀 -->
-<!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. -->
-<!-- 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete -->
+> Thank you for contributing to Ray! 🚀
+> Please review the [Ray Contribution Guide](https://docs.ray.io/en/master/ray-contribute/getting-involved.html) before opening a pull request.

-## Description
-
-<!-- Briefly describe what this PR accomplishes and why it's needed -->
-
-## Related issues
-
-<!-- Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234" -->
-
-## Types of change
+> ⚠️ Remove these instructions before submitting your PR.

-- [ ] Bug fix 🐛
-- [ ] New feature ✨
-- [ ] Enhancement 🚀
-- [ ] Code refactoring 🔧
-- [ ] Documentation update 📖
-- [ ] Chore 🧹
-- [ ] Style 🎨
+> 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete.

-## Checklist
-
-**Does this PR introduce breaking changes?**
-- [ ] Yes ⚠️
-- [ ] No
-<!-- If yes, describe what breaks and how users should migrate -->
-
-**Testing:**
-- [ ] Added/updated tests for my changes
-- [ ] Tested the changes manually
-- [ ] This PR is not tested ❌ _(please explain why)_
-
-**Code Quality:**
-- [ ] Signed off every commit (`git commit -s`)
-- [ ] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
-
-**Documentation:**
-- [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
-- [ ] Added new APIs to `doc/source/` (if applicable)
+## Description
+> Briefly describe what this PR accomplishes and why it's needed.

-## Additional context
+## Related issues
+> Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

-<!-- Optional: Add screenshots, examples, performance impact, breaking change details -->
+## Additional information
+> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

doc/BUILD.bazel

Lines changed: 1 addition & 0 deletions
@@ -552,6 +552,7 @@ doctest(

 doctest(
     name = "doctest[core]",
+    size = "large",
     files = glob(
         include = [
             "source/ray-core/**/*.md",

doc/source/ray-overview/examples/llamafactory-llm-fine-tune/notebooks/dpo_qlora.ipynb

Lines changed: 2 additions & 0 deletions
@@ -134,6 +134,8 @@
 "\n",
 "### Configure LLaMA-Factory with Ray\n",
 "\n",
+"**Note**: To customize the training configuration, edit `train-configs/dpo_qlora.yaml`. \n",
+"\n",
 "```yaml\n",
 "# dpo_qlora.yaml\n",
 "\n",

doc/source/ray-overview/examples/llamafactory-llm-fine-tune/notebooks/kto_lora.ipynb

Lines changed: 2 additions & 0 deletions
@@ -145,6 +145,8 @@
 "\n",
 "### Configure LLaMA-Factory with Ray\n",
 "\n",
+"**Note**: To customize the training configuration, edit `train-configs/kto_lora.yaml`. \n",
+"\n",
 "```yaml\n",
 "# kto_lora.yaml\n",
 "\n",

doc/source/ray-overview/examples/llamafactory-llm-fine-tune/notebooks/sft_lora_deepspeed.ipynb

Lines changed: 4 additions & 2 deletions
@@ -160,7 +160,9 @@
 "- **Gated models:** If your base model has gated access (for example, Llama) on HuggingFace, set `HF_TOKEN` in the runtime environment.\n",
 "- **GPU selection:** The config sets `accelerator_type` to `L40S`, but you can switch to other GPUs such as `A100-40GB` or any other GPU with comparable or more VRAM, depending on your cloud availability.\n",
 "\n",
-"### LLaMA-Factory + Ray configuration\n",
+"### Configure LLaMA-Factory with Ray\n",
+"\n",
+"**Note**: To customize the training configuration, edit `train-configs/sft_lora_deepspeed.yaml`. \n",
 "\n",
 "```yaml\n",
 "# sft_lora_deepspeed.yaml\n",
@@ -209,7 +211,7 @@
 "### ray\n",
 "ray_run_name: qwen2.5_32b_lora_sft\n",
 "ray_storage_path: /mnt/cluster_storage/\n",
-"ray_num_workers: 4 # Number of GPUs to use.\n",
+"ray_num_workers: 4 # Number of GPUs to use\n",
 "resources_per_worker:\n",
 " GPU: 1\n",
 " accelerator_type:L40S: 0.001 # Use this to simply specify a GPU type (not guaranteed on the same node). You can use A100-40G if L40S is not available. \n",

doc/source/ray-overview/examples/llamafactory-llm-fine-tune/train-configs/sft_lora_deepspeed.yaml

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ ddp_timeout: 180000000
 ### ray
 ray_run_name: qwen2.5_32b_lora_sft
 ray_storage_path: /mnt/cluster_storage/
-ray_num_workers: 4 # Number of GPUs to use.
+ray_num_workers: 4 # Number of GPUs to use
 resources_per_worker:
   GPU: 1
   accelerator_type:L40S: 0.001 # Use this to simply specify a GPU type (not guaranteed on the same node). You can use A100-40G if L40S is not available.

doc/source/serve/advanced-guides/advanced-autoscaling.md

Lines changed: 41 additions & 1 deletion
@@ -669,4 +669,44 @@ In your policy, access custom metrics via:
 The number of data points stored for each replica depends on the [`look_back_period_s`](../api/doc/ray.serve.config.AutoscalingConfig.look_back_period_s.rst) (the sliding window size) and [`metrics_interval_s`](../api/doc/ray.serve.config.AutoscalingConfig.metrics_interval_s.rst) (the metric recording interval).
 * **`ctx.aggregated_metrics[metric_name]`** — A time-weighted average computed from the raw metric values for each replica.

-> Today, aggregation is a time-weighted average. In future releases, additional aggregation options may be supported.
+
+### Application level autoscaling
+
+By default, each deployment in Ray Serve autoscales independently. When you have multiple deployments that need to scale in a coordinated way—such as deployments that share backend resources, depend on each other, or need load-aware routing—you can define an **application-level autoscaling policy**. This policy makes scaling decisions for all deployments within an application simultaneously.
+
+#### Define an application level policy
+
+An application-level autoscaling policy is a function that takes a dictionary mapping each `DeploymentID` to an [`AutoscalingContext`](../api/doc/ray.serve.config.AutoscalingContext.rst) (one per deployment) and returns a tuple of `(decisions, policy_state)`. Each context contains metrics and bounds for one deployment, and the policy returns target replica counts for all deployments.
+
+The following example shows a policy that scales deployments based on their relative load, ensuring that downstream deployments have enough capacity for upstream traffic:
+
+```{literalinclude} ../doc_code/autoscaling_policy.py
+:language: python
+:start-after: __begin_application_level_autoscaling_policy__
+:end-before: __end_application_level_autoscaling_policy__
+```
+
+#### Configure application level autoscaling
+
+To use an application-level policy, first define your deployments:
+
+```{literalinclude} ../doc_code/application_level_autoscaling.py
+:language: python
+:start-after: __serve_example_begin__
+:end-before: __serve_example_end__
+```
+
+Then specify the application-level policy in your application config:
+
+```{literalinclude} ../doc_code/application_level_autoscaling.yaml
+:language: yaml
+:emphasize-lines: 4-5
+```
+
+:::{note}
+Programmatic configuration of application-level autoscaling policies through `serve.run()` will be supported in a future release.
+:::
+
+:::{note}
+When you specify both a deployment-level policy and an application-level policy, the application-level policy takes precedence. Ray Serve logs a warning if you configure both.
+:::
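The `autoscaling_policy.py` file referenced by the literalinclude above isn't shown in this view, so the following is only a rough sketch of what a `coordinated_scaling_policy` with the documented `Dict[DeploymentID, AutoscalingContext] -> (decisions, policy_state)` shape could look like. It assumes `decisions` is a mapping from `DeploymentID` to a target replica count; the context attribute names used here (`total_num_requests`, `target_ongoing_requests`, `capacity_adjusted_min_replicas`, `capacity_adjusted_max_replicas`) and the `DeploymentID.name` field are assumptions to verify against the `AutoscalingContext` API reference, not the code added by this commit.

```python
import math
from typing import Any, Dict, Tuple


def coordinated_scaling_policy(
    contexts: Dict["DeploymentID", "AutoscalingContext"],
) -> Tuple[Dict["DeploymentID", int], Dict[str, Any]]:
    """Hypothetical sketch: make one scaling decision for the whole application."""
    decisions: Dict["DeploymentID", int] = {}
    for deployment_id, ctx in contexts.items():
        # Size each deployment to its own load first (assumed context fields).
        desired = math.ceil(ctx.total_num_requests / max(ctx.target_ongoing_requests, 1))
        desired = max(
            ctx.capacity_adjusted_min_replicas,
            min(ctx.capacity_adjusted_max_replicas, desired),
        )
        decisions[deployment_id] = desired

    # Coordination step: keep the downstream Model at least as large as the
    # upstream Preprocessor so traffic never outruns downstream capacity
    # (illustrative rule only; assumes DeploymentID exposes a .name field).
    by_name = {dep_id.name: dep_id for dep_id in contexts}
    if "Preprocessor" in by_name and "Model" in by_name:
        decisions[by_name["Model"]] = max(
            decisions[by_name["Model"]], decisions[by_name["Preprocessor"]]
        )

    # The second element is policy state that Serve passes back on the next call.
    return decisions, {}
```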
doc/source/serve/doc_code/application_level_autoscaling.py

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+# __serve_example_begin__
+import time
+from ray import serve
+
+
+@serve.deployment
+class Preprocessor:
+    def __call__(self, input_data: str) -> str:
+        # Simulate preprocessing work
+        time.sleep(0.05)
+        return f"preprocessed_{input_data}"
+
+
+@serve.deployment
+class Model:
+    def __call__(self, preprocessed_data: str) -> str:
+        # Simulate model inference (takes longer than preprocessing)
+        time.sleep(0.1)
+        return f"result_{preprocessed_data}"
+
+
+@serve.deployment
+class Driver:
+    def __init__(self, preprocessor, model):
+        self._preprocessor = preprocessor
+        self._model = model
+
+    async def __call__(self, input_data: str) -> str:
+        # Coordinate preprocessing and model inference
+        preprocessed = await self._preprocessor.remote(input_data)
+        result = await self._model.remote(preprocessed)
+        return result
+
+
+app = Driver.bind(Preprocessor.bind(), Model.bind())
+# __serve_example_end__
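As a quick local sanity check of the example above (not part of the commit), the app can be run with `serve.run`; the module name `application_level_autoscaling` is assumed from the YAML config that follows, and this only exercises the deployment graph, not the application-level policy (which, per the note above, isn't configurable through `serve.run` yet).

```python
from ray import serve

# Assumes the example file is importable as `application_level_autoscaling`.
from application_level_autoscaling import app

handle = serve.run(app)
# DeploymentResponse.result() blocks until the Driver deployment responds.
print(handle.remote("input").result())  # -> "result_preprocessed_input"
serve.shutdown()
```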
doc/source/serve/doc_code/application_level_autoscaling.yaml

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+applications:
+- name: MyApp
+  import_path: application_level_autoscaling:app
+  autoscaling_policy:
+    policy_function: autoscaling_policy:coordinated_scaling_policy
+  deployments:
+  - name: Preprocessor
+    autoscaling_config:
+      min_replicas: 1
+      max_replicas: 10
+  - name: Model
+    autoscaling_config:
+      min_replicas: 2
+      max_replicas: 20
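A brief usage note, not part of the commit: a config like the one above follows the standard Serve config-file workflow (for example, `serve deploy <config>.yaml` against a running cluster), and the `policy_function` value uses the same `module:attribute` import-path format as `import_path`, so `autoscaling_policy.py` presumably needs to be importable on the worker's Python path just like the application module.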
