Commit 539ff3e

update readme
1 parent daa3c4e commit 539ff3e

4 files changed: 8 additions, 3 deletions


components/backends/vllm/README.md

Lines changed: 3 additions & 0 deletions
@@ -112,6 +112,7 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director
 - `agg_router.yaml` - Aggregated serving with KV routing
 - `disagg.yaml` - Disaggregated serving
 - `disagg_router.yaml` - Disaggregated serving with KV routing
+- `disagg_planner.yaml` - Disaggregated serving with [SLA Planner](../../../docs/architecture/sla_planner.md). See [SLA Planner Deployment Guide](../../../docs/guides/dynamo_deploy/sla_planner_deployment.md) for more details.
 
 #### Prerequisites
 
@@ -124,6 +125,8 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director
 # Update the image references in the YAML files
 ```
 
+- **Pre-Deployment Profiling (if Using SLA Planner)**: Follow the [pre-deployment profiling guide](../../../docs/architecture/pre_deployment_profiling.md) to run pre-deployment profiling. The results will be saved to the `profiling-pvc` PVC and queried by the SLA Planner.
+
 - **Port Forwarding**: After deployment, forward the frontend service to access the API:
 ```bash
 kubectl port-forward deployment/vllm-v1-disagg-frontend-<pod-uuid-info> 8080:8000

docs/architecture/pre_deployment_profiling.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ The script will recommend the best TP size for prefill and decode, as well as th
 2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for decode kv cache utilization: 0.20/0.10
 ```
 
-After finding the best TP size for prefill and decode, the script will then interpolate the TTFT with ISL and ITL with active KV cache and decode context length. This is to provide a more accurate estimation of the performance when ISL and OSL changes and will be used in the sla-planner. The results will be saved to `<output_dir>/<decode/prefill>_tp<best_tp>_interpolation`.
+After finding the best TP size for prefill and decode, the script will then interpolate the TTFT with ISL and ITL with active KV cache and decode context length. This is to provide a more accurate estimation of the performance when ISL and OSL changes and will be used in the sla-planner. The results will be saved to `<output_dir>/<decode/prefill>_tp<best_tp>_interpolation`. Please change the prefill and decode TP size in the config file to match the best TP sizes obtained from the profiling script.
 
 ### Prefill Interpolation Data
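
The changed paragraph in `pre_deployment_profiling.md` says the script interpolates TTFT against ISL. The diff does not show how the script does this, so the following is only a minimal sketch of the general idea, assuming plain piecewise-linear interpolation over a few profiled points. The function name `estimate_ttft` and the sample numbers are hypothetical and do not come from the profiling script.

```python
from bisect import bisect_right

# Hypothetical profiling samples: (ISL in tokens, measured TTFT in ms).
# These values are made up for illustration; real values would come from
# the profiling output directory.
SAMPLES = [(1000, 50.0), (2000, 110.0), (4000, 250.0)]


def estimate_ttft(isl, samples=SAMPLES):
    """Linearly interpolate TTFT (ms) at the given ISL.

    Values outside the profiled ISL range are clamped to the nearest
    profiled point instead of being extrapolated.
    """
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    if isl <= xs[0]:
        return ys[0]
    if isl >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, isl)  # now xs[i - 1] <= isl < xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (isl - x0) / (x1 - x0)


if __name__ == "__main__":
    for isl in (500, 1500, 3000, 8000):
        print(isl, estimate_ttft(isl))
```

Clamping at the edges is a design choice for this sketch: it avoids extrapolating beyond what was actually profiled. The real script may handle out-of-range inputs differently.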

docs/architecture/sla_planner.md

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy
 > Currently, SLA-based planner only supports disaggregated setup.
 
 > [!WARNING]
-> Bare metal deployment with local connector is deprecated. The only option to deploy SLA-based planner is via k8s. We will update the examples in this document soon.
+> Bare metal deployment with local connector is deprecated. Please deploy the SLA planner in k8s.
 
 ## Features
 
@@ -115,4 +115,4 @@ kubectl apply -f disagg_planner.yaml -n {$NAMESPACE}
 ```
 
 > [!NOTE]
-> The SLA planner requires a frontend that reports metrics at `/metrics` HTTP endpoint with number of requests, ISL, OSL, TTFT, ITL in the correct format. The VLLM frontend provides these metrics automatically.
+> The SLA planner requires a frontend that reports metrics at `/metrics` HTTP endpoint with number of requests, ISL, OSL, TTFT, ITL in the correct format. The dynamo frontend provides these metrics automatically.
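
The note above says the planner reads request count, ISL, OSL, TTFT, and ITL from the frontend's `/metrics` endpoint, but the diff does not show the format. The sketch below assumes a Prometheus-style text payload with unlabeled `name value` lines, which is an assumption and not something the commit confirms. The metric names (`example_requests_total`, `example_ttft_seconds_sum`, `example_ttft_seconds_count`) are hypothetical placeholders, not the names the dynamo frontend actually exports.

```python
# Hypothetical payload in a Prometheus-style text format (assumed, not
# taken from the dynamo frontend).
SAMPLE = """\
# HELP example_requests_total Hypothetical request counter
example_requests_total 4
example_ttft_seconds_sum 0.8
example_ttft_seconds_count 4
"""


def parse_metrics(text):
    """Parse unlabeled 'name value' lines into a dict of floats.

    Comment lines (starting with '#') and blank lines are skipped.
    Labeled series and timestamps are not handled in this sketch.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, raw = line.rpartition(" ")
        values[name] = float(raw)
    return values


def mean_from_sum_count(values, prefix):
    """Return <prefix>_sum / <prefix>_count, or 0.0 when the count is zero."""
    count = values[prefix + "_count"]
    return values[prefix + "_sum"] / count if count else 0.0


if __name__ == "__main__":
    metrics = parse_metrics(SAMPLE)
    print("requests:", metrics["example_requests_total"])
    print("mean TTFT (s):", mean_from_sum_count(metrics, "example_ttft_seconds"))
```

In a real check, the payload would be fetched from the port-forwarded frontend instead of a hard-coded string, and the metric names would need to be read from the actual `/metrics` output.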

docs/guides/dynamo_deploy/sla_planner_deployment.md

Lines changed: 2 additions & 0 deletions
@@ -25,6 +25,8 @@ flowchart LR
 ## Prerequisites
 - Kubernetes cluster with GPU nodes
 - `hf-token-secret` created in target namespace
+- [Pre-Deployment Profiling](../../architecture/pre_deployment_profiling.md) results saved to `profiling-pvc` PVC.
+- Prefill and decode worker uses the best parallelization mapping suggested by the pre-deployment profiling script.
 
 ```bash
 export NAMESPACE=your-namespace
