Commit 20a9603

wip clean up

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

1 parent 1823990 commit 20a9603

File tree: 2 files changed, +4 -65 lines changed
doc/source/serve/doc_code/cross_node_parallelism_example.py

Lines changed: 3 additions & 61 deletions
@@ -124,7 +124,7 @@
         max_model_len=8192,
     ),
     placement_group_config=dict(
-        bundles=[{"GPU": 1, "CPU": 2}] * 2,
+        bundles=[{"GPU": 1}] * 2,
         strategy="PACK",
     ),
 )
@@ -158,7 +158,7 @@
         max_model_len=8192,
     ),
     placement_group_config=dict(
-        bundles=[{"GPU": 1, "CPU": 2}] * 4,
+        bundles=[{"GPU": 1}] * 4,
         strategy="SPREAD",
     ),
 )
@@ -192,7 +192,7 @@
         max_model_len=8192,
     ),
     placement_group_config=dict(
-        bundles=[{"GPU": 1, "CPU": 2}] * 2,
+        bundles=[{"GPU": 1}] * 2,
         strategy="STRICT_PACK",
     ),
 )
@@ -201,61 +201,3 @@
 app = build_openai_app({"llm_configs": [llm_config]})
 serve.run(app, blocking=True)
 # __custom_placement_group_strict_pack_example_end__
-
-# __yaml_cross_node_tp_pp_example_start__
-# config.yaml
-# applications:
-# - args:
-#     llm_configs:
-#       - model_loading_config:
-#           model_id: llama-3.1-8b
-#           model_source: meta-llama/Llama-3.1-8B-Instruct
-#         accelerator_type: L4
-#         deployment_config:
-#           autoscaling_config:
-#             min_replicas: 1
-#             max_replicas: 1
-#         engine_kwargs:
-#           tensor_parallel_size: 2
-#           pipeline_parallel_size: 2
-#           distributed_executor_backend: ray
-#           max_model_len: 8192
-#           enable_chunked_prefill: true
-#           max_num_batched_tokens: 4096
-#   import_path: ray.serve.llm:build_openai_app
-#   name: llm_app
-#   route_prefix: "/"
-# __yaml_cross_node_tp_pp_example_end__
-
-# __yaml_custom_placement_group_example_start__
-# config.yaml
-# applications:
-# - args:
-#     llm_configs:
-#       - model_loading_config:
-#           model_id: llama-3.1-8b
-#           model_source: meta-llama/Llama-3.1-8B-Instruct
-#         accelerator_type: L4
-#         deployment_config:
-#           autoscaling_config:
-#             min_replicas: 1
-#             max_replicas: 1
-#         engine_kwargs:
-#           tensor_parallel_size: 4
-#           distributed_executor_backend: ray
-#           max_model_len: 8192
-#         placement_group_config:
-#           bundles:
-#             - GPU: 1
-#               CPU: 2
-#             - GPU: 1
-#               CPU: 2
-#             - GPU: 1
-#               CPU: 2
-#             - GPU: 1
-#               CPU: 2
-#           strategy: SPREAD
-#   import_path: ray.serve.llm:build_openai_app
-#   name: llm_app
-#   route_prefix: "/"
-# __yaml_custom_placement_group_example_end__
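
For context, a minimal sketch of how one of the edited `placement_group_config` blocks reads after this change. The bundle shape and strategy come from the hunks above; the model id, accelerator type, and engine kwargs are assumed from the YAML examples removed later in this file, so treat them as illustrative rather than the file's exact contents.

```python
# Sketch only: bundles and strategy match the diff; surrounding values are assumed.
from ray.serve.llm import LLMConfig

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    accelerator_type="L4",
    engine_kwargs=dict(
        tensor_parallel_size=2,
        max_model_len=8192,
    ),
    placement_group_config=dict(
        # GPU-only bundles: CPUs are no longer reserved explicitly.
        bundles=[{"GPU": 1}] * 2,
        strategy="PACK",  # the other hunks use "SPREAD" and "STRICT_PACK"
    ),
)
```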

doc/source/serve/llm/index.md

Lines changed: 1 addition & 4 deletions
@@ -11,8 +11,6 @@ Ray Serve LLM APIs allow users to deploy multiple LLM models together with a fam
 - 🔌 OpenAI compatible
 - 🔄 Multi-LoRA support with shared base models
 - 🚀 Engine agnostic architecture (i.e. vLLM, SGLang, etc)
-- 🔗 Cross-node tensor and pipeline parallelism
-- ⚙️ Custom :ref:`placement group strategies <pgroup-strategy>` for fine-grained resource control
 
 ## Requirements
 
@@ -50,10 +48,9 @@ The LLMConfig class specifies model details such as:
 
 - Model loading sources (HuggingFace or cloud storage)
 - Hardware requirements (accelerator type)
-- Engine arguments (e.g. vLLM engine kwargs, tensor/pipeline parallelism)
+- Engine arguments (e.g. vLLM engine kwargs)
 - LoRA multiplexing configuration
 - Serve auto-scaling parameters
-- Placement group configuration for multi-node deployments
 
 ```{toctree}
 :hidden:
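
As a rough illustration of how the bullets in that hunk map onto `LLMConfig` fields, here is a hedged sketch; the field names and values are taken from the YAML examples removed in the first file of this commit and are illustrative only.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    # Model loading source (HuggingFace model id and weights location)
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    # Hardware requirement
    accelerator_type="L4",
    # Serve auto-scaling parameters
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    # Engine arguments (vLLM engine kwargs)
    engine_kwargs=dict(max_model_len=8192),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```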
