Commit be8b847

Commit message: go

1 parent 2f38e10 commit be8b847

2 files changed (+4, -13 lines)

examples/sglang/README.md: 4 additions, 5 deletions

@@ -95,7 +95,7 @@ that get spawned depend upon the chosen graph.
 #### Aggregated
 
 ```bash
-cd /workspace/examples/sglang
+cd $DYNAMO_ROOT/examples/sglang
 ./launch/agg.sh
 ```
 
@@ -108,8 +108,7 @@ cd /workspace/examples/sglang
 > After these are in, the TODOs in `worker.py` will be resolved and the placeholder logic removed.
 
 ```bash
-cd /workspace/examples/sglang
-export PYTHONPATH=$PYTHONPATH:/workspace/examples/sglang/utils
+cd $DYNAMO_ROOT/examples/sglang
 ./launch/agg_router.sh
 ```
 
@@ -133,7 +132,7 @@ Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead
 > Disaggregated serving in SGLang currently requires each worker to have the same tensor parallel size [unless you are using an MLA based model](https://github.com/sgl-project/sglang/pull/5922)
 
 ```bash
-cd /workspace/examples/sglang
+cd $DYNAMO_ROOT/examples/sglang
 ./launch/disagg.sh
 ```
 
@@ -143,7 +142,7 @@ SGLang also supports DP attention for MoE models. We provide an example config f
 
 ```bash
 # note this will require 4 GPUs
-cd /workspace/examples/sglang
+cd $DYNAMO_ROOT/examples/sglang
 ./launch/disagg_dp_attn.sh
 ```
 
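All four hunks above replace the hard-coded container path with `$DYNAMO_ROOT`, so the launch commands now assume that variable is set in your shell. A minimal sketch of that setup; the checkout path below is an assumption, not something this commit defines:

```bash
# Assumption: DYNAMO_ROOT points at your local checkout of the Dynamo repo.
# Adjust the path to wherever you actually cloned it.
export DYNAMO_ROOT=$HOME/dynamo

# Any of the launch scripts touched above can then be run, for example:
cd $DYNAMO_ROOT/examples/sglang
./launch/agg.sh
```
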
examples/sglang/dsr1-wideep.md: 0 additions, 8 deletions

@@ -138,14 +138,6 @@ python3 components/decode_worker_inc.py \
 
 On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8
 
-8. Run the warmup script to warm up the model
-
-DeepGEMM kernels can sometimes take a while to warm up. Here we provide a small helper script that should help. You can run this as many times as you want before starting inference/benchmarking. You can exec into the head node and run this script standalone - it does not need a container.
-
-```bash
-./warmup.sh HEAD_PREFILL_NODE_IP
-```
-
 ## Benchmarking
 
 In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands:
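As an aside on the retained context line in this hunk ("run the same command but change `--node-rank`"), here is a hedged sketch of that fan-out across the remaining decode nodes; the hostnames, the use of ssh, and the omitted worker flags are illustrative assumptions, not part of this commit:

```bash
# Illustrative only: start decode workers on the other 8 nodes.
# Substitute your real hostnames, and pass the same flags used on the
# head decode node, changing only --node-rank per node.
for rank in 1 2 3 4 5 6 7 8; do
  ssh "decode-node-$rank" \
      "cd \$DYNAMO_ROOT/examples/sglang && \
       python3 components/decode_worker_inc.py --node-rank $rank" &
done
wait
```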
