Commit be8b847

Commit message: go

1 parent 2f38e10 commit be8b847

2 files changed (+4, -13 lines)

examples/sglang/README.md: 4 additions, 5 deletions

@@ -95,7 +95,7 @@ that get spawned depend upon the chosen graph.
 #### Aggregated
 
 ```bash
-cd /workspace/examples/sglang
+cd $DYNAMO_ROOT/examples/sglang
 ./launch/agg.sh
 ```
 
@@ -108,8 +108,7 @@ cd /workspace/examples/sglang
 > After these are in, the TODOs in `worker.py` will be resolved and the placeholder logic removed.
 
 ```bash
-cd /workspace/examples/sglang
-export PYTHONPATH=$PYTHONPATH:/workspace/examples/sglang/utils
+cd $DYNAMO_ROOT/examples/sglang
 ./launch/agg_router.sh
 ```
 
@@ -133,7 +132,7 @@ Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead
 > Disaggregated serving in SGLang currently requires each worker to have the same tensor parallel size [unless you are using an MLA based model](https://github.com/sgl-project/sglang/pull/5922)
 
 ```bash
-cd /workspace/examples/sglang
+cd $DYNAMO_ROOT/examples/sglang
 ./launch/disagg.sh
 ```
 
@@ -143,7 +142,7 @@ SGLang also supports DP attention for MoE models. We provide an example config f
 
 ```bash
 # note this will require 4 GPUs
-cd /workspace/examples/sglang
+cd $DYNAMO_ROOT/examples/sglang
 ./launch/disagg_dp_attn.sh
 ```
 
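All four hunks above replace the hard-coded container path with `$DYNAMO_ROOT`, so the launch commands now assume that variable is set in your shell. A minimal sketch of that setup; the checkout path below is an assumption, not something this commit defines:

```bash
# Assumption: DYNAMO_ROOT points at your local checkout of the Dynamo repo.
# Adjust the path to wherever you actually cloned it.
export DYNAMO_ROOT=$HOME/dynamo

# Any of the launch scripts touched above can then be run, for example:
cd $DYNAMO_ROOT/examples/sglang
./launch/agg.sh
```
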
examples/sglang/dsr1-wideep.md: 0 additions, 8 deletions

@@ -138,14 +138,6 @@ python3 components/decode_worker_inc.py \
 
 On the other decode nodes (this example has 9 total decode nodes), run the same command but change `--node-rank` to 1, 2, 3, 4, 5, 6, 7, and 8
 
-8. Run the warmup script to warm up the model
-
-DeepGEMM kernels can sometimes take a while to warm up. Here we provide a small helper script that should help. You can run this as many times as you want before starting inference/benchmarking. You can exec into the head node and run this script standalone - it does not need a container.
-
-```bash
-./warmup.sh HEAD_PREFILL_NODE_IP
-```
-
 ## Benchmarking
 
 In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands:
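As an aside on the retained context line in this hunk ("run the same command but change `--node-rank`"), here is a hedged sketch of that fan-out across the remaining decode nodes; the hostnames, the use of ssh, and the omitted worker flags are illustrative assumptions, not part of this commit:

```bash
# Illustrative only: start decode workers on the other 8 nodes.
# Substitute your real hostnames, and pass the same flags used on the
# head decode node, changing only --node-rank per node.
for rank in 1 2 3 4 5 6 7 8; do
  ssh "decode-node-$rank" \
      "cd \$DYNAMO_ROOT/examples/sglang && \
       python3 components/decode_worker_inc.py --node-rank $rank" &
done
wait
```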
