23 commits
a8dd326
chore: fix install (#2191)
ishandhanani Jul 30, 2025
79e6711
chore: fix QA bugs in documentation/readmes (#2199)
athreesh Jul 30, 2025
9320d68
fix(sglang): disagg yaml worker change and agg kv router fix (#2205)
ishandhanani Jul 31, 2025
1c9c7d3
chore: cleanup dead links (#2208)
nealvaidya Jul 31, 2025
a6d48bd
chore: Remove multimodal readme. (#2212) (#2234)
krishung5 Jul 31, 2025
44cbf88
fix: drop cuda graph bs (batch size) on dsr1 h100 sgl (#2235)
ishandhanani Aug 1, 2025
a57bade
fix: Locked triton==3.3.1 since triton 3.4.0 breaks tensorrt-llm 1.0.…
dmitry-tokarev-nv Aug 1, 2025
95c8b58
fix: sgl instructions point to new frontend (#2245)
ishandhanani Aug 1, 2025
bfe2808
fix: readme instruction (#2265)
ishandhanani Aug 4, 2025
e2552ed
docs: Backport: Dyn 591 (#2247) to 0.4.0 (#2251)
atchernych Aug 4, 2025
9af0a01
fix: trtllm container - ENV var used before declaration (#2277)
dmitry-tokarev-nv Aug 5, 2025
d60af96
docs: add instruction to deploy model with inference gateway #2257 (#…
biswapanda Aug 5, 2025
c948f1d
fix: fix broken doc links (#2308)
biswapanda Aug 5, 2025
add5fa8
fix: Copy cuda libraries from devel to runtime stage (#2298)
nv-tusharma Aug 5, 2025
f8b95fd
docs: update deploy readme (#2306)
atchernych Aug 5, 2025
3f7c7a7
fix: Add common and test dependencies to sglang runtime build (#2279)…
nv-tusharma Aug 5, 2025
741496e
fix: Backport/anish index rst into 0.4.0 - fix links in docs and more…
athreesh Aug 6, 2025
b4be3c2
docs: Final fixes to links reported by QA (#2334)
athreesh Aug 6, 2025
2ed36b8
docs: address sphinx build errors for docs.nvidia.com (#2346)
athreesh Aug 7, 2025
2846f9e
docs: Address vincent issue with trtllm symlink (#2351)
athreesh Aug 7, 2025
b4a3cb3
Pinned PyTorch version
krishung5 Aug 7, 2025
59a2005
Add model label to Component
tzulingk Aug 8, 2025
6c95b2b
Use ModelDeploymentCard.slug() for model name. ModelDeploymentCard.se…
tzulingk Aug 9, 2025
45 changes: 23 additions & 22 deletions README.md
@@ -21,15 +21,29 @@ limitations under the License.
[![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/D92uqZRjCZ)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/dynamo)

| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Support Matrix](docs/support_matrix.md)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |

# NVIDIA Dynamo

High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.

## Framework Support Matrix

| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|----------------------|----------------------------|----------------------------------------|
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |

To learn more about each framework and its capabilities, check out each framework's README and deploy it with Dynamo!
- **[vLLM](components/backends/vllm/README.md)**
- **[SGLang](components/backends/sglang/README.md)**
- **[TensorRT-LLM](components/backends/trtllm/README.md)**

Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.

## The Era of Multi-GPU, Multi-Node

@@ -51,24 +51,6 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
<img src="./docs/images/frontpage-architecture.png" alt="Dynamo architecture" width="600" />
</p>


# Installation

The following examples require a few system level packages.
@@ -171,10 +167,15 @@ To specify which GPUs to use, set environment variable `CUDA_VISIBLE_DEVICES`.

## SGLang


```bash
# Install libnuma-dev
apt install -y libnuma-dev

# Install flashinfer-python pre-release (required by sglang for optimized inference)
uv pip install "flashinfer-python==0.2.9rc2" --prerelease=allow

# Install ai-dynamo with sglang support
uv pip install ai-dynamo[sglang]
```

1 change: 0 additions & 1 deletion benchmarks/llm/README.md
@@ -12,4 +12,3 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)
2 changes: 1 addition & 1 deletion components/README.md
@@ -77,4 +77,4 @@ To get started with Dynamo components:
4. **Run deployment scripts** from the engine's launch directory
5. **Monitor performance** using the metrics component

For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../docs/).
2 changes: 1 addition & 1 deletion components/backends/llama_cpp/README.md
@@ -13,7 +13,7 @@ python -m dynamo.llama_cpp --model-path /data/models/Qwen3-0.6B-Q8_0.gguf [args]

## Request Migration

In a distributed system, a request may fail due to connectivity issues between the Frontend and the Backend. You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

```bash
python3 -m dynamo.llama_cpp ... --migration-limit=3
6 changes: 2 additions & 4 deletions components/backends/sglang/README.md
@@ -52,8 +52,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

## Quick Start

Below we provide a guide that lets you run all of the common deployment patterns on a single node.

### Start NATS and ETCD in the background

Start using [Docker Compose](../../../deploy/docker-compose.yml)
@@ -157,7 +156,7 @@ curl localhost:8000/v1/chat/completions \

## Request Migration

In a distributed system, a request may fail due to connectivity issues between the Frontend and the Backend. You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

```bash
python3 -m dynamo.sglang ... --migration-limit=3
@@ -174,7 +173,6 @@ Below we provide a selected list of advanced examples. Please open up an issue i

### Large scale P/D disaggregation with WideEP
- **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
- **[Run DeepSeek-R1 on GB200s](docs/dsr1-wideep-gb200.md)**

### Supporting SGLang's native endpoints via Dynamo
- **[HTTP Server for native SGLang endpoints](docs/sgl-http-server.md)**
32 changes: 29 additions & 3 deletions components/backends/sglang/deploy/README.md
@@ -74,7 +74,7 @@ extraPodSpec:

Before using these templates, ensure you have:

1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../docs/guides/dynamo_deploy/dynamo_cloud.md)
1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
2. **Kubernetes cluster with GPU support**
3. **Container registry access** for SGLang runtime images
4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
@@ -103,8 +103,34 @@ args:
```

### 3. Deploy

First, create a secret for the HuggingFace token.
```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${NAMESPACE}
```

Then, deploy the model using the deployment file.

```bash
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n ${NAMESPACE}
```

### 4. Using Custom Dynamo Frameworks Image for SGLang

To use a custom Dynamo frameworks image for SGLang, update the deployment file using `yq`:

```bash
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<sglang-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```
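For scripted pipelines, the same substitution can be sketched in Python with PyYAML — a minimal sketch, assuming the `spec.services.*.extraPodSpec.mainContainer.image` layout implied by the yq path above (the service names below are hypothetical):

```python
# Sketch of the yq image substitution in Python. The document structure is
# assumed from the yq path above; service names are hypothetical examples.
import yaml

AGG_YAML = """
spec:
  services:
    Frontend:
      extraPodSpec:
        mainContainer:
          image: old-image
    SGLangWorker:
      extraPodSpec:
        mainContainer:
          image: old-image
"""

def set_runtime_image(doc: str, image: str) -> str:
    """Set mainContainer.image on every service in the deployment doc."""
    data = yaml.safe_load(doc)
    for svc in data["spec"]["services"].values():
        svc["extraPodSpec"]["mainContainer"]["image"] = image
    return yaml.safe_dump(data)

updated = yaml.safe_load(set_runtime_image(AGG_YAML, "my-sglang-image"))
print(updated["spec"]["services"]["Frontend"]["extraPodSpec"]["mainContainer"]["image"])
# → my-sglang-image
```

This mirrors what the `yq` one-liner does, but lets you validate the document or edit several fields in one pass.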

## Model Configuration
@@ -133,4 +159,4 @@ Common issues and solutions:
3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
4. **Out of memory**: Increase memory limits or reduce model batch size

For additional support, refer to the [deployment troubleshooting guide](../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
10 changes: 4 additions & 6 deletions components/backends/sglang/docs/dsr1-wideep-h100.md
@@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0

# Running DeepSeek-R1 Disaggregated with WideEP on H100s

Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://www.nvidia.com/en-us/technologies/ai/deepseek-r1-large-scale-p-d-with-wide-expert-parallelism/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).
Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-05-05-large-scale-ep/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).

## Instructions

@@ -16,8 +16,6 @@ cd $DYNAMO_ROOT
docker build -f container/Dockerfile.sglang-wideep . -t dynamo-wideep --no-cache
```


2. You can run this container on each 8xH100 node using the following command.

> [!IMPORTANT]
@@ -44,7 +42,7 @@ In each container, you should be in the `/sgl-workspace/dynamo/components/backen
3. On the head prefill node, run the provided helper script to generate commands to start `nats-server` and `etcd`. This script will also tell you which environment variables to export on each node to make deployment easier.

```bash
./components/backends/sglang/src/dynamo/sglang/utils/gen_env_vars.sh
```

4. Run the ingress and prefill worker
@@ -53,7 +51,7 @@
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
python3 -m dynamo.sglang.utils.sgl_http_server --ns dynamo &
# run prefill worker
python3 -m dynamo.sglang.worker \
--model-path /model/ \
@@ -162,7 +160,7 @@ curl -X POST http://${HEAD_PREFILL_NODE_IP}:9001/flush_cache
```

2. **GenAI Perf to benchmark completions with custom dataset**
We provide a script that generates a JSONL file of the ShareGPT dataset and then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT - you can also use GenAI Perf's synthetic dataset setup But note you will have to use dynamic EPLB configurations or record your own as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.
We provide a script that generates a JSONL file of the ShareGPT dataset and then uses GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT, you can also use GenAI Perf's synthetic dataset setup. But note you will have to use dynamic EPLB configurations or record your own, as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.

Example usage:

2 changes: 1 addition & 1 deletion components/backends/sglang/docs/sgl-http-server.md
@@ -64,7 +64,7 @@ The server accepts the following command-line arguments:

Start the server:
```bash
python3 -m dynamo.sglang.utils.sgl_http_server --ns dynamo
```

The server will automatically discover all SGLang components in the specified namespace and provide HTTP endpoints for managing them.
1 change: 1 addition & 0 deletions components/backends/sglang/launch/agg_router.sh
@@ -14,6 +14,7 @@ trap cleanup EXIT INT TERM
# run clear_namespace
python3 -m dynamo.sglang.utils.clear_namespace --namespace dynamo

# run ingress
python -m dynamo.frontend --router-mode kv --http-port=8000 &
DYNAMO_PID=$!
154 changes: 1 addition & 153 deletions components/backends/sglang/slurm_jobs/README.md
@@ -1,153 +1 @@
# Example: Deploy Multi-node SGLang with Dynamo on SLURM

This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) on a SLURM cluster.

## Overview

The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) example, with separate nodes handling prefill and decode.
The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring capabilities to track performance during benchmarks.

## Scripts

- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks

## Logs Folder Structure

Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.

### Log File Structure

```
logs/
├── 3062824/ # Job ID directory
│ ├── log.out # Main job output (node allocation, IP addresses, launch commands)
│ ├── log.err # Main job errors
│ ├── node0197_prefill.out # Prefill node stdout (node0197)
│ ├── node0197_prefill.err # Prefill node stderr (node0197)
│ ├── node0200_prefill.out # Prefill node stdout (node0200)
│ ├── node0200_prefill.err # Prefill node stderr (node0200)
│ ├── node0201_decode.out # Decode node stdout (node0201)
│ ├── node0201_decode.err # Decode node stderr (node0201)
│ ├── node0204_decode.out # Decode node stdout (node0204)
│ ├── node0204_decode.err # Decode node stderr (node0204)
│ ├── node0197_prefill_gpu_utilization.log # GPU utilization monitoring (node0197)
│ ├── node0200_prefill_gpu_utilization.log # GPU utilization monitoring (node0200)
│ ├── node0201_decode_gpu_utilization.log # GPU utilization monitoring (node0201)
│ └── node0204_decode_gpu_utilization.log # GPU utilization monitoring (node0204)
├── 3063137/ # Another job ID directory
├── 3062689/ # Another job ID directory
└── ...
```
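The per-job layout above lends itself to simple glob patterns. Here is a small stdlib-only sketch (the job id and node names are hypothetical examples mirroring the tree above):

```python
# Sketch: locating per-node worker logs for one SLURM job under logs/<job_id>/.
# The job id and node names below are hypothetical examples.
import os
import tempfile
from glob import glob

def worker_logs(base: str, job_id: str, role: str, stream: str = "err"):
    """Return sorted log paths for all workers of a role (prefill/decode)."""
    return sorted(glob(os.path.join(base, job_id, f"*_{role}.{stream}")))

# Build a tiny fake logs/ tree mirroring the structure shown above.
base = tempfile.mkdtemp()
job = "3062824"
os.makedirs(os.path.join(base, job))
for name in ("node0197_prefill.err", "node0200_prefill.err", "node0201_decode.err"):
    open(os.path.join(base, job, name), "w").close()

prefill_errs = worker_logs(base, job, "prefill")
print([os.path.basename(p) for p in prefill_errs])
# → ['node0197_prefill.err', 'node0200_prefill.err']
```

The same `*_prefill.err` / `*_decode.err` wildcards are what the `tail -f` commands in the Usage section below rely on.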

## Setup

For simplicity of the example, we will make some assumptions about your SLURM cluster:

1. We assume you have access to a SLURM cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
testing, you should aim to allocate groups of nodes with fast interconnects,
such as those in an NVL72 setup.
2. We assume this SLURM cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
SPANK plugin setup. In particular, the `job_script_template.j2` template in this
example will use `srun` arguments like `--container-image`,
`--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
If your cluster supports similar container based plugins, you may be able to
modify the template to use that instead.
3. We assume you have already built a recent Dynamo+SGLang container image as
described [here](../dsr1-wideep.md#instructions).
This is the image that can be passed to the `--container-image` argument in later steps.

## Usage

> [!NOTE]
> The logic for finding prefill and decode node IPs in [`job_script_template.j2`](job_script_template.j2) is still a work in progress. You may need to tweak the `srun`/`ip route`/`getent`/`awk` bits for your cluster, especially if your networking or hostname conventions differ. PRs and suggestions welcome.

1. **Submit a benchmark job**:

```bash
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account
```

**Required arguments**:

- `--template`: Path to Jinja2 template file
- `--model-dir`: Model directory path
- `--config-dir`: Config directory path
- `--container-image`: Container image URI (e.g., `registry/repository:tag`)
- `--account`: SLURM account

**Optional arguments**:

- `--prefill-nodes`: Number of prefill nodes (default: `2`)
- `--decode-nodes`: Number of decode nodes (default: `2`)
- `--gpus-per-node`: Number of GPUs per node (default: `8`)
- `--network-interface`: Network interface to use (default: `eth3`)
- `--job-name`: SLURM job name (default: `dynamo_setup`)
- `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)
- `--gpu-type`: GPU type to use, choices: `h100`, `gb200` (default: `h100`)
- `--use-sglang-commands`: Use SGLang commands instead of Dynamo (default: `false`)

**Note**: The script automatically calculates the total number of nodes needed based on `--prefill-nodes` and `--decode-nodes` parameters.

2. **Example with different GPU types**:

```bash
# For H100 with Dynamo (default)
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type h100

# For GB200 with SGLang
python submit_job_script.py \
--template job_script_template.j2 \
--model-dir /path/to/model \
--config-dir /path/to/configs \
--container-image container-image-uri \
--account your-slurm-account \
--gpu-type gb200 \
   --use-sglang-commands \
--gpus-per-node 4
```

3. **Monitor job progress**:

```bash
squeue -u $USER
```

4. **Check logs in real-time**:

```bash
tail -f logs/{JOB_ID}/log.out
```

You can view logs of all prefill or decode workers simultaneously by running:

```bash
# prefill workers err (or .out)
tail -f logs/{JOB_ID}/*_prefill.err

# decode workers err (or .out)
tail -f logs/{JOB_ID}/*_decode.err
```

5. **Monitor GPU utilization**:
```bash
tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log
```
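As a rough illustration of the note in step 1, the sketch below shows how a submission script might derive the total node count and render a SLURM preamble with Jinja2; the template fields are hypothetical stand-ins, not the real `job_script_template.j2`:

```python
# Sketch of deriving the node count and rendering a SLURM preamble.
# Field names are hypothetical; the real template lives in job_script_template.j2.
from jinja2 import Template

PREAMBLE = (
    "#SBATCH --job-name={{ job_name }}\n"
    "#SBATCH --nodes={{ total_nodes }}\n"
    "#SBATCH --time={{ time_limit }}"
)

def render_preamble(prefill_nodes: int = 2, decode_nodes: int = 2,
                    job_name: str = "dynamo_setup",
                    time_limit: str = "01:00:00") -> str:
    # Total allocation is the sum of the prefill and decode groups,
    # matching how --prefill-nodes/--decode-nodes are combined.
    total_nodes = prefill_nodes + decode_nodes
    return Template(PREAMBLE).render(job_name=job_name,
                                     total_nodes=total_nodes,
                                     time_limit=time_limit)

print(render_preamble())
```

With the defaults this prints a three-line preamble requesting 4 nodes; raising either group size grows the allocation accordingly.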

## Outputs

Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.
Please refer to [Deploying Dynamo with SGLang on SLURM](../../../../docs/components/backends/sglang/slurm_jobs/README.md) for more details.