23 commits
a8dd326
chore: fix install (#2191)
ishandhanani Jul 30, 2025
79e6711
chore: fix QA bugs in documentation/readmes (#2199)
athreesh Jul 30, 2025
9320d68
fix(sglang): disagg yaml worker change and agg kv router fix (#2205)
ishandhanani Jul 31, 2025
1c9c7d3
chore: cleanup dead links (#2208)
nealvaidya Jul 31, 2025
a6d48bd
chore: Remove multimodal readme. (#2212) (#2234)
krishung5 Jul 31, 2025
44cbf88
fix: drop cuda graph bs (batch size) on dsr1 h100 sgl (#2235)
ishandhanani Aug 1, 2025
a57bade
fix: Locked triton==3.3.1 since triton 3.4.0 breaks tensorrt-llm 1.0.…
dmitry-tokarev-nv Aug 1, 2025
95c8b58
fix: sgl instructions point to new frontend (#2245)
ishandhanani Aug 1, 2025
bfe2808
fix: readme instruction (#2265)
ishandhanani Aug 4, 2025
e2552ed
docs: Backport: Dyn 591 (#2247) to 0.4.0 (#2251)
atchernych Aug 4, 2025
9af0a01
fix: trtllm container - ENV var used before declaration (#2277)
dmitry-tokarev-nv Aug 5, 2025
d60af96
docs: add instruction to deploy model with inference gateway #2257 (#…
biswapanda Aug 5, 2025
c948f1d
fix: fix broken doc links (#2308)
biswapanda Aug 5, 2025
add5fa8
fix: Copy cuda libraries from devel to runtime stage (#2298)
nv-tusharma Aug 5, 2025
f8b95fd
docs: update deploy readme (#2306)
atchernych Aug 5, 2025
3f7c7a7
fix: Add common and test dependencies to sglang runtime build (#2279)…
nv-tusharma Aug 5, 2025
741496e
fix: Backport/anish index rst into 0.4.0 - fix links in docs and more…
athreesh Aug 6, 2025
b4be3c2
docs: Final fixes to links reported by QA (#2334)
athreesh Aug 6, 2025
2ed36b8
docs: address sphinx build errors for docs.nvidia.com (#2346)
athreesh Aug 7, 2025
2846f9e
docs: Address vincent issue with trtllm symlink (#2351)
athreesh Aug 7, 2025
b4a3cb3
Pinned PyTorch version
krishung5 Aug 7, 2025
59a2005
Add model label to Component
tzulingk Aug 8, 2025
6c95b2b
Use ModelDeploymentCard.slug() for model name. ModelDeploymentCard.se…
tzulingk Aug 9, 2025
45 changes: 23 additions & 22 deletions README.md
@@ -21,15 +21,29 @@ limitations under the License.
[![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/D92uqZRjCZ)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/dynamo)

| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Support Matrix](docs/support_matrix.md)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |

# NVIDIA Dynamo

High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.

## Framework Support Matrix

| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|----------------------|----------------------------|----------------------------------------|
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |

To learn more about each framework and its capabilities, check out each framework's README and deploy it with Dynamo!
- **[vLLM](components/backends/vllm/README.md)**
- **[SGLang](components/backends/sglang/README.md)**
- **[TensorRT-LLM](components/backends/trtllm/README.md)**

Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.

## The Era of Multi-GPU, Multi-Node

@@ -51,24 +51,6 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
<img src="./docs/images/frontpage-architecture.png" alt="Dynamo architecture" width="600" />
</p>


# Installation

The following examples require a few system level packages.
@@ -171,10 +167,15 @@ To specify which GPUs to use, set environment variable `CUDA_VISIBLE_DEVICES`.

## SGLang


```bash
# Install libnuma-dev
apt install -y libnuma-dev

# Install flashinfer-python pre-release (required by sglang for optimized inference)
uv pip install "flashinfer-python==0.2.9rc2" --prerelease=allow

# Install ai-dynamo with sglang support
uv pip install ai-dynamo[sglang]
```

1 change: 0 additions & 1 deletion benchmarks/llm/README.md
@@ -12,4 +12,3 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)
2 changes: 1 addition & 1 deletion components/README.md
@@ -77,4 +77,4 @@ To get started with Dynamo components:
4. **Run deployment scripts** from the engine's launch directory
5. **Monitor performance** using the metrics component

For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../docs/).
2 changes: 1 addition & 1 deletion components/backends/llama_cpp/README.md
@@ -13,7 +13,7 @@ python -m dynamo.llama_cpp --model-path /data/models/Qwen3-0.6B-Q8_0.gguf [args]

## Request Migration

In a distributed system, a request may fail due to connectivity issues between the Frontend and the Backend. You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

```bash
python3 -m dynamo.llama_cpp ... --migration-limit=3
6 changes: 2 additions & 4 deletions components/backends/sglang/README.md
@@ -52,8 +52,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

## Quick Start

Below we provide a guide that lets you run all of the common deployment patterns on a single node.

### Start NATS and ETCD in the background

Start using [Docker Compose](../../../deploy/docker-compose.yml)
@@ -157,7 +156,7 @@ curl localhost:8000/v1/chat/completions \

## Request Migration

In a distributed system, a request may fail due to connectivity issues between the Frontend and the Backend. You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

```bash
python3 -m dynamo.sglang ... --migration-limit=3
@@ -174,7 +173,6 @@ Below we provide a selected list of advanced examples. Please open up an issue i

### Large scale P/D disaggregation with WideEP
- **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
- **[Run DeepSeek-R1 on GB200s](docs/dsr1-wideep-gb200.md)**

### Supporting SGLang's native endpoints via Dynamo
- **[HTTP Server for native SGLang endpoints](docs/sgl-http-server.md)**
32 changes: 29 additions & 3 deletions components/backends/sglang/deploy/README.md
@@ -74,7 +74,7 @@ extraPodSpec:

Before using these templates, ensure you have:

1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../docs/guides/dynamo_deploy/dynamo_cloud.md)
1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
2. **Kubernetes cluster with GPU support**
3. **Container registry access** for SGLang runtime images
4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
@@ -103,8 +103,34 @@ args:
```

### 3. Deploy

First, create a secret for the HuggingFace token.
```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${NAMESPACE}
```

Then, deploy the model using the deployment file.

```bash
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n ${NAMESPACE}
```

### 4. Using Custom Dynamo Frameworks Image for SGLang

To use a custom Dynamo frameworks image for SGLang, update the deployment file using `yq`:

```bash
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<sglang-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```
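For scripted pipelines, the same substitution can be sketched in Python with PyYAML — a minimal sketch, assuming the `spec.services.*.extraPodSpec.mainContainer.image` layout implied by the yq path above (the service names below are hypothetical):

```python
# Sketch of the yq image substitution in Python. The document structure is
# assumed from the yq path above; service names are hypothetical examples.
import yaml

AGG_YAML = """
spec:
  services:
    Frontend:
      extraPodSpec:
        mainContainer:
          image: old-image
    SGLangWorker:
      extraPodSpec:
        mainContainer:
          image: old-image
"""

def set_runtime_image(doc: str, image: str) -> str:
    """Set mainContainer.image on every service in the deployment doc."""
    data = yaml.safe_load(doc)
    for svc in data["spec"]["services"].values():
        svc["extraPodSpec"]["mainContainer"]["image"] = image
    return yaml.safe_dump(data)

updated = yaml.safe_load(set_runtime_image(AGG_YAML, "my-sglang-image"))
print(updated["spec"]["services"]["Frontend"]["extraPodSpec"]["mainContainer"]["image"])
# → my-sglang-image
```

This mirrors what the `yq` one-liner does, but lets you validate the document or edit several fields in one pass.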

## Model Configuration
@@ -133,4 +159,4 @@ Common issues and solutions:
3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
4. **Out of memory**: Increase memory limits or reduce model batch size

For additional support, refer to the [deployment troubleshooting guide](../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
10 changes: 4 additions & 6 deletions components/backends/sglang/docs/dsr1-wideep-h100.md
@@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0

# Running DeepSeek-R1 Disaggregated with WideEP on H100s

Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://www.nvidia.com/en-us/technologies/ai/deepseek-r1-large-scale-p-d-with-wide-expert-parallelism/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).
Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-05-05-large-scale-ep/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).

## Instructions

@@ -16,8 +16,6 @@ cd $DYNAMO_ROOT
docker build -f container/Dockerfile.sglang-wideep . -t dynamo-wideep --no-cache
```


2. You can run this container on each 8xH100 node using the following command.

> [!IMPORTANT]
@@ -44,7 +42,7 @@ In each container, you should be in the `/sgl-workspace/dynamo/components/backen
3. On the head prefill node, run the provided helper script to generate commands to start `nats-server` and `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.

```bash
./components/backends/sglang/src/dynamo/sglang/utils/gen_env_vars.sh
```

4. Run the ingress and prefill worker
@@ -53,7 +51,7 @@
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
python3 -m dynamo.sglang.utils.sgl_http_server --ns dynamo &
# run prefill worker
python3 -m dynamo.sglang.worker \
--model-path /model/ \
@@ -162,7 +160,7 @@ curl -X POST http://${HEAD_PREFILL_NODE_IP}:9001/flush_cache
```

2. **GenAI Perf to benchmark completions with custom dataset**
We provide a script that generates a JSONL file of the ShareGPT dataset and then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT - you can also use GenAI Perf's synthetic dataset setup But note you will have to use dynamic EPLB configurations or record your own as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.
We provide a script that generates a JSONL file of the ShareGPT dataset and then uses GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT, you can also use GenAI Perf's synthetic dataset setup. But note you will have to use dynamic EPLB configurations or record your own, as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.

Example usage:

2 changes: 1 addition & 1 deletion components/backends/sglang/docs/sgl-http-server.md
@@ -64,7 +64,7 @@ The server accepts the following command-line arguments:

Start the server:
```bash
python3 -m dynamo.sglang.utils.sgl_http_server --ns dynamo
```

The server will automatically discover all SGLang components in the specified namespace and provide HTTP endpoints for managing them.
1 change: 1 addition & 0 deletions components/backends/sglang/launch/agg_router.sh
@@ -14,6 +14,7 @@ trap cleanup EXIT INT TERM
# run clear_namespace
python3 -m dynamo.sglang.utils.clear_namespace --namespace dynamo

# run ingress
python -m dynamo.frontend --router-mode kv --http-port=8000 &
DYNAMO_PID=$!
154 changes: 1 addition & 153 deletions components/backends/sglang/slurm_jobs/README.md
@@ -1,153 +1 @@
# Example: Deploy Multi-node SGLang with Dynamo on SLURM

This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) on a SLURM cluster.

## Overview

The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) example, with separate nodes handling prefill and decode.
The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring capabilities to track performance during benchmarks.

## Scripts

- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks

## Logs Folder Structure

Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.

### Log File Structure

```
logs/
├── 3062824/ # Job ID directory
│ ├── log.out # Main job output (node allocation, IP addresses, launch commands)
│ ├── log.err # Main job errors
│ ├── node0197_prefill.out # Prefill node stdout (node0197)
│ ├── node0197_prefill.err # Prefill node stderr (node0197)
│ ├── node0200_prefill.out # Prefill node stdout (node0200)
│ ├── node0200_prefill.err # Prefill node stderr (node0200)
│ ├── node0201_decode.out # Decode node stdout (node0201)
│ ├── node0201_decode.err # Decode node stderr (node0201)
│ ├── node0204_decode.out # Decode node stdout (node0204)
│ ├── node0204_decode.err # Decode node stderr (node0204)
│ ├── node0197_prefill_gpu_utilization.log # GPU utilization monitoring (node0197)
│ ├── node0200_prefill_gpu_utilization.log # GPU utilization monitoring (node0200)
│ ├── node0201_decode_gpu_utilization.log # GPU utilization monitoring (node0201)
│ └── node0204_decode_gpu_utilization.log # GPU utilization monitoring (node0204)
├── 3063137/ # Another job ID directory
├── 3062689/ # Another job ID directory
└── ...
```
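The per-job layout above lends itself to simple glob patterns. Here is a small stdlib-only sketch (the job id and node names are hypothetical examples mirroring the tree above):

```python
# Sketch: locating per-node worker logs for one SLURM job under logs/<job_id>/.
# The job id and node names below are hypothetical examples.
import os
import tempfile
from glob import glob

def worker_logs(base: str, job_id: str, role: str, stream: str = "err"):
    """Return sorted log paths for all workers of a role (prefill/decode)."""
    return sorted(glob(os.path.join(base, job_id, f"*_{role}.{stream}")))

# Build a tiny fake logs/ tree mirroring the structure shown above.
base = tempfile.mkdtemp()
job = "3062824"
os.makedirs(os.path.join(base, job))
for name in ("node0197_prefill.err", "node0200_prefill.err", "node0201_decode.err"):
    open(os.path.join(base, job, name), "w").close()

prefill_errs = worker_logs(base, job, "prefill")
print([os.path.basename(p) for p in prefill_errs])
# → ['node0197_prefill.err', 'node0200_prefill.err']
```

The same `*_prefill.err` / `*_decode.err` wildcards are what the `tail -f` commands in the Usage section below rely on.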

## Setup

For simplicity of the example, we will make some assumptions about your SLURM cluster:

1. We assume you have access to a SLURM cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
testing, you should aim to allocate groups of nodes with fast interconnects,
such as those in an NVL72 setup.
2. We assume this SLURM cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
SPANK plugin setup. In particular, the `job_script_template.j2` template in this
example will use `srun` arguments like `--container-image`,
`--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
If your cluster supports similar container based plugins, you may be able to
modify the template to use that instead.
3. We assume you have already built a recent Dynamo+SGLang container image as
described [here](../dsr1-wideep.md#instructions).
This is the image that can be passed to the `--container-image` argument in later steps.

## Usage

> [!NOTE]
> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `srun`/`ip route`/`getent`/`awk` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions welcome.

1. **Submit a benchmark job**:

```bash
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account
```

**Required arguments**:

- `--template`: Path to Jinja2 template file
- `--model-dir`: Model directory path
- `--config-dir`: Config directory path
- `--container-image`: Container image URI (e.g., `registry/repository:tag`)
- `--account`: SLURM account

**Optional arguments**:

- `--prefill-nodes`: Number of prefill nodes (default: `2`)
- `--decode-nodes`: Number of decode nodes (default: `2`)
- `--gpus-per-node`: Number of GPUs per node (default: `8`)
- `--network-interface`: Network interface to use (default: `eth3`)
- `--job-name`: SLURM job name (default: `dynamo_setup`)
- `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)
- `--gpu-type`: GPU type to use, choices: `h100`, `gb200` (default: `h100`)
- `--use-sglang-commands`: Use SGLang commands instead of Dynamo (default: `false`)

**Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.

2. **Example with different GPU types**:

```bash
# For H100 with Dynamo (default)
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type h100

# For GB200 with SGLang
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type gb200 \
   --use-sglang-commands \
--gpus-per-node 4
```

3. **Monitor job progress**:

```bash
squeue -u $USER
```

4. **Check logs in real-time**:

```bash
tail -f logs/{JOB_ID}/log.out
```

You can view logs of all prefill or decode workers simultaneously by running:

```bash
# prefill workers err (or .out)
tail -f logs/{JOB_ID}/*_prefill.err

# decode workers err (or .out)
tail -f logs/{JOB_ID}/*_decode.err
```

5. **Monitor GPU utilization**:
```bash
tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log
```
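As a rough illustration of the note in step 1, the sketch below shows how a submission script might derive the total node count and render a SLURM preamble with Jinja2; the template fields are hypothetical stand-ins, not the real `job_script_template.j2`:

```python
# Sketch of deriving the node count and rendering a SLURM preamble.
# Field names are hypothetical; the real template lives in job_script_template.j2.
from jinja2 import Template

PREAMBLE = (
    "#SBATCH --job-name={{ job_name }}\n"
    "#SBATCH --nodes={{ total_nodes }}\n"
    "#SBATCH --time={{ time_limit }}"
)

def render_preamble(prefill_nodes: int = 2, decode_nodes: int = 2,
                    job_name: str = "dynamo_setup",
                    time_limit: str = "01:00:00") -> str:
    # Total allocation is the sum of the prefill and decode groups,
    # matching how --prefill-nodes/--decode-nodes are combined.
    total_nodes = prefill_nodes + decode_nodes
    return Template(PREAMBLE).render(job_name=job_name,
                                     total_nodes=total_nodes,
                                     time_limit=time_limit)

print(render_preamble())
```

With the defaults this prints a three-line preamble requesting 4 nodes; raising either group size grows the allocation accordingly.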

## Outputs

Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
Please refer to [Deploying Dynamo with SGLang on SLURM](../../../../docs/components/backends/sglang/slurm_jobs/README.md) for more details.