Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
43 commits
Select commit Hold shift + click to select a range
992adfb
fix: add better port logic (#2175) (#2192)
alec-flowers Jul 30, 2025
9a93f11
chore: fix install (#2191)
ishandhanani Jul 30, 2025
2a616da
chore: fix QA bugs in documentation/readmes (#2199)
athreesh Jul 30, 2025
d0de1a0
feat: Add trtllm deploy examples for k8s #2133 (#2207)
biswapanda Jul 31, 2025
edccbd5
fix(sglang): disagg yaml worker change and agg kv router fix (#2205)
ishandhanani Jul 31, 2025
54fbff3
fix: add curl and jq for health checks #2203 (#2209)
biswapanda Jul 31, 2025
a9b6b28
fix: Kprashanth/trtllm rc4 cherry pick (#2218)
KrishnanPrash Jul 31, 2025
65e89b3
chore: cleanup dead links (#2208)
nealvaidya Jul 31, 2025
c92dc98
chore: update nixl version to 0.4.1 (#2221) (#2228)
nv-anants Jul 31, 2025
eb58916
chore: Remove multimodal readme. (#2212) (#2234)
krishung5 Jul 31, 2025
e848cf5
fix: Cherry pick pr 2186 release 0.4.0 to fix docs/runtime/README.md …
keivenchang Aug 1, 2025
5e3586d
fix: drop cuda graph bs (batch size) on dsr1 h100 sgl (#2235)
ishandhanani Aug 1, 2025
4fbb4e5
fix: handle groveTerminationDelay and auto-detect grove installation …
julienmancuso Aug 1, 2025
dc13774
fix: Locked triton==3.3.1 since triton 3.4.0 breaks tensorrt-llm 1.0.…
dmitry-tokarev-nv Aug 1, 2025
e5e94ad
fix: sgl instructions point to new frontend (#2245)
ishandhanani Aug 1, 2025
92781d3
fix: Update disagg configs for trtllm 1.0.0rc4 changes (release/0.4.0…
rmccorm4 Aug 4, 2025
58ad4a2
fix: readme instruction (#2265)
ishandhanani Aug 4, 2025
039c061
fix: Update eagle_one configs with speculative_model_dir field (#2283)
rmccorm4 Aug 4, 2025
2a8e251
docs: Backport: Dyn 591 (#2247) to 0.4.0 (#2251)
atchernych Aug 4, 2025
2dc4a4b
fix: trtllm container - ENV var used before declaration (#2277)
dmitry-tokarev-nv Aug 5, 2025
85737ba
fix: Update the NIXL TRTLLM commit version to rc4 (#2285)
tanmayv25 Aug 5, 2025
27c8a97
docs: add instruction to deploy model with inference gateway #2257 (#…
biswapanda Aug 5, 2025
641e49d
fix: fix nil pointer deref in dynamo controller (#2293) (#2299)
mohammedabdulwahhab Aug 5, 2025
1b145bb
fix: fix broken doc links (#2308)
biswapanda Aug 5, 2025
4e4818f
fix: Copy cuda libraries from devel to runtime stage (#2298)
nv-tusharma Aug 5, 2025
c92c1f4
docs: update deploy readme (#2306)
atchernych Aug 5, 2025
6fce98a
fix: Add common and test dependencies to sglang runtime build (#2279)…
nv-tusharma Aug 5, 2025
035d6d8
fix: Revert the commit for DeepGEMM to fix vLLM WideEP (#2302) (#2325)
krishung5 Aug 6, 2025
167c793
fix: Backport/anish index rst into 0.4.0 - fix links in docs and more…
athreesh Aug 6, 2025
409aa9e
docs: Final fixes to links reported by QA (#2334)
athreesh Aug 6, 2025
71126c7
fix: nil pointer deref in dynamo controller (#2335)
mohammedabdulwahhab Aug 6, 2025
f342c30
docs: address sphinx build errors for docs.nvidia.com (#2346)
athreesh Aug 7, 2025
96d1f15
docs: Address vincent issue with trtllm symlink (#2351)
athreesh Aug 7, 2025
e8b37a6
fix: ARM Flashinfer Versioning for 0.4.0 Release (#2363)
zaristei Aug 8, 2025
b5c9278
fix: Pinned PyTorch version for vLLM container (#2356)
krishung5 Aug 8, 2025
b0c1a24
chore: ATTRIBUTIONS-Go.md (#2355)
dmitry-tokarev-nv Aug 8, 2025
0cf8041
Revert "adjust tag to accomodate flashinfer versioning typo" (#2364)
zaristei Aug 8, 2025
bd8e368
fix: use wheel files for installation in trtllm build (#2372) (#2375)
nv-anants Aug 8, 2025
73bcc3b
fix(build): Pin cuda-python>=12,<13 to avoid trtllm breakage (#2379)
rmccorm4 Aug 8, 2025
aa57c6b
fix: turn off kvbm for al2023 support (#2533)
saturley-hall Aug 21, 2025
3f0a725
docs: add trtllm known issue for al2023 (#2604) (#2612)
nv-anants Aug 21, 2025
d98a791
docs: update trtllm know issue message (#2639) (#2643)
nv-anants Aug 22, 2025
37fca1c
fix: prevent crash looping hello world (#2625)
biswapanda Aug 22, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
445 changes: 445 additions & 0 deletions ATTRIBUTIONS-Go.md

Large diffs are not rendered by default.

22 changes: 11 additions & 11 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ members = [
resolver = "3"

[workspace.package]
version = "0.4.0"
version = "0.4.0+post0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Workspace version bump may desync local crate versions

Bumping [workspace.package].version to 0.4.0+post0 while workspace.dependencies pin local crates at 0.4.0 can cause version mismatches if member crates inherit the workspace version or if any member’s package.version is updated without updating these dependency constraints.

Choose one consistent approach:

  • Option A (quick): keep workspace at 0.4.0 for now.
  • Option B (preferred): bump local crate versions and dependency pins to 0.4.0+post0.

If going with Option B, update the local workspace dependency versions:

 [workspace.dependencies]
 # Local crates
-dynamo-runtime = { path = "lib/runtime", version = "0.4.0" }
-dynamo-llm = { path = "lib/llm", version = "0.4.0" }
-dynamo-tokens = { path = "lib/tokens", version = "0.4.0" }
+dynamo-runtime = { path = "lib/runtime", version = "0.4.0+post0" }
+dynamo-llm = { path = "lib/llm", version = "0.4.0+post0" }
+dynamo-tokens = { path = "lib/tokens", version = "0.4.0+post0" }

I can also generate a script to scan member crates for version.workspace = true and align everything if helpful.


🏁 Script executed:

#!/bin/bash
set -eux

# Inspect root Cargo.toml for workspace.package version
echo "===== [workspace.package] block in root Cargo.toml ====="
sed -n -n '/^\[workspace.package\]/,/^\[/{p}' Cargo.toml || true

# Inspect root Cargo.toml for workspace.dependencies block
echo -e "\n===== [workspace.dependencies] block in root Cargo.toml ====="
sed -n -n '/^\[workspace.dependencies\]/,/^\[/{p}' Cargo.toml || true

# Inspect member crate Cargo.toml files
for crate in lib/runtime lib/llm lib/tokens; do
  echo -e "\n===== $crate/Cargo.toml ====="
  if [ -f "$crate/Cargo.toml" ]; then
    # Show [package] block
    echo "---- [package] block ----"
    sed -n -n '/^\[package\]/,/^\[/{p}' "$crate/Cargo.toml"
    # Show [dependencies] block
    echo "---- [dependencies] block ----"
    sed -n -n '/^\[dependencies\]/,/^\[/{p}' "$crate/Cargo.toml"
  else
    echo "$crate/Cargo.toml not found"
  fi
done

Length of output: 8765


Action Required: Align local crate dependency versions with the bumped workspace version

The root Cargo.toml now sets

[workspace.package]
version = "0.4.0+post0"

but under

[workspace.dependencies]

the local crates are still pinned at 0.4.0. Since each member crate uses version.workspace = true, their actual version becomes 0.4.0+post0, which will conflict with the pinned 0.4.0 in workspace.dependencies. You have two choices:

  • Option A: Revert the workspace version bump back to "0.4.0".
  • Option B (preferred): Update the local crate entries in [workspace.dependencies] to match 0.4.0+post0.

If you proceed with Option B, apply this diff in root Cargo.toml:

 [workspace.dependencies]
 # Local crates
-dynamo-runtime = { path = "lib/runtime", version = "0.4.0" }
-dynamo-llm      = { path = "lib/llm",     version = "0.4.0" }
-dynamo-tokens   = { path = "lib/tokens",  version = "0.4.0" }
+dynamo-runtime = { path = "lib/runtime", version = "0.4.0+post0" }
+dynamo-llm      = { path = "lib/llm",     version = "0.4.0+post0" }
+dynamo-tokens   = { path = "lib/tokens",  version = "0.4.0+post0" }

• Root manifest: Cargo.toml
• Lines: update the three dynamo-* entries under [workspace.dependencies]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
version = "0.4.0+post0"
[workspace.dependencies]
# Local crates
dynamo-runtime = { path = "lib/runtime", version = "0.4.0+post0" }
dynamo-llm = { path = "lib/llm", version = "0.4.0+post0" }
dynamo-tokens = { path = "lib/tokens", version = "0.4.0+post0" }
🤖 Prompt for AI Agents
In Cargo.toml around line 18, the workspace package version was bumped to
"0.4.0+post0" but the local crate entries under [workspace.dependencies] remain
pinned to "0.4.0", causing a version mismatch; update the three dynamo-* entries
under [workspace.dependencies] to use version "0.4.0+post0" (or alternatively
revert the workspace.package.version to "0.4.0" if you prefer Option A).

edition = "2021"
description = "Dynamo Inference Framework"
authors = ["NVIDIA Inc. <sw-dl-dynamo@nvidia.com>"]
Expand Down
49 changes: 27 additions & 22 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,12 +21,30 @@ limitations under the License.
[![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/D92uqZRjCZ)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/dynamo)

| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |
| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Support Matrix](docs/support_matrix.md)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |

# NVIDIA Dynamo

High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.

## Framework Support Matrix

| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|----------------------|----------------------------|----------------------------------------|
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |

To learn more about each framework and their capabilities, check out each framework's README and deploy them with Dynamo!
- **[vLLM](components/backends/vllm/README.md)**
- **[SGLang](components/backends/sglang/README.md)**
- **[TensorRT-LLM](components/backends/trtllm/README.md)**

Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.

## The Era of Multi-GPU, Multi-Node

<p align="center">
Expand All @@ -47,24 +65,6 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
<img src="./docs/images/frontpage-architecture.png" alt="Dynamo architecture" width="600" />
</p>

## Framework Support Matrix

| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|----------------------|----------------------------|----------------------------------------|
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |

To learn more about each framework and their capabilities, check out each framework's README!
- **[vLLM](components/backends/vllm/README.md)**
- **[SGLang](components/backends/sglang/README.md)**
- **[TensorRT-LLM](components/backends/trtllm/README.md)**

Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.

# Installation

The following examples require a few system level packages.
Expand Down Expand Up @@ -115,11 +115,11 @@ Dynamo provides a simple way to spin up a local set of inference components incl

```
# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router:
python -m dynamo.frontend [--http-port 8080]
python -m dynamo.frontend --http-port 8080

# Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
# both for the same model and for multiple models. The frontend node will discover them.
python -m dynamo.sglang.worker deepseek-ai/DeepSeek-R1-Distill-Llama-8B
python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --skip-tokenizer-init
```

#### Send a Request
Expand Down Expand Up @@ -167,10 +167,15 @@ To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.

## SGLang


```
# Install libnuma
# Install libnuma-dev
apt install -y libnuma-dev

# Install flashinfer-python pre-release (required by sglang for optimized inference)
uv pip install "flashinfer-python==0.2.9rc2" --prerelease=allow

# Install ai-dynamo with sglang support
uv pip install ai-dynamo[sglang]
```

Expand Down
1 change: 0 additions & 1 deletion benchmarks/llm/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,4 +12,3 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)
2 changes: 1 addition & 1 deletion components/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -77,4 +77,4 @@ To get started with Dynamo components:
4. **Run deployment scripts** from the engine's launch directory
5. **Monitor performance** using the metrics component

For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../../docs/).
For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../docs/).
2 changes: 1 addition & 1 deletion components/backends/llama_cpp/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ python -m dynamo.llama_cpp --model-path /data/models/Qwen3-0.6B-Q8_0.gguf [args]

## Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
In a Distributed System, a request may fail due to connectivity issues between the Frontend and the Backend.

The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.

Expand Down
28 changes: 13 additions & 15 deletions components/backends/sglang/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,26 +34,25 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

| Feature | SGLang | Notes |
|---------|--------|-------|
| [**Disaggregated Serving**](../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../docs/architecture/sla_planner.md) | ❌ | Planned |
| [**Load Based Planner**](../../docs/architecture/load_planner.md) | ❌ | Planned |
| [**KVBM**](../../docs/architecture/kvbm_architecture.md) | ❌ | Planned |
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ❌ | Planned |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | ❌ | Planned |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ❌ | Planned |

### Large Scale P/D and WideEP Features

| Feature | SGLang | Notes |
|--------------------|--------|-----------------------------------------------------------------------|
| **WideEP** | ✅/🚧 | Full support on H100s/GB200 WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
| **DP Rank Routing**| 🚧 | Direct routing supported. Process per DP rank is not supported |
| **GB200 Support** | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
| Feature | SGLang | Notes |
|---------------------|--------|--------------------------------------------------------------|
| **WideEP** | ✅ | Full support on H100s/GB200 |
| **DP Rank Routing** | 🚧 | Direct routing supported. Dynamo KV router does not router to DP worker |
| **GB200 Support** | ✅ | |

Comment on lines +46 to 51
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix grammar: “router” → “route” to DP worker.

Small but user‑visible in the Feature Matrix.

-| **DP Rank Routing** | 🚧     | Direct routing supported. Dynamo KV router does not router to DP worker |
+| **DP Rank Routing** | 🚧     | Direct routing supported. Dynamo KV router does not route to DP worker |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| Feature | SGLang | Notes |
|---------------------|--------|--------------------------------------------------------------|
| **WideEP** || Full support on H100s/GB200 |
| **DP Rank Routing** | 🚧 | Direct routing supported. Dynamo KV router does not router to DP worker |
| **GB200 Support** || |
| Feature | SGLang | Notes |
|---------------------|--------|--------------------------------------------------------------|
| **WideEP** || Full support on H100s/GB200 |
| **DP Rank Routing** | 🚧 | Direct routing supported. Dynamo KV router does not route to DP worker |
| **GB200 Support** || |
🧰 Tools
🪛 LanguageTool

[grammar] ~49-~49: There might be a mistake here.
Context: ...KV router does not router to DP worker | | GB200 Support | ✅ | ...

(QB_NEW_EN)

🤖 Prompt for AI Agents
In components/backends/sglang/README.md around lines 46 to 51, the
feature-matrix note uses the incorrect verb "router" in "Dynamo KV router does
not router to DP worker"; update the phrase to use the correct verb "route"
(e.g., "Dynamo KV router does not route to DP worker") and ensure surrounding
punctuation/capitalization remains consistent with the table style.


## Quick Start

Below we provide a guide that lets you run all of our the common deployment patterns on a single node. See our different [architectures](../llm/README.md#deployment-architectures) for a high level overview of each pattern and the architecture diagram for each.

Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
### Start NATS and ETCD in the background
Comment on lines +55 to 56
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix wording: extra “the” in Quick Start intro.

-Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
+Below we provide a guide that lets you run all of our common deployment patterns on a single node.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
### Start NATS and ETCD in the background
Below we provide a guide that lets you run all of our common deployment patterns on a single node.
### Start NATS and ETCD in the background
🧰 Tools
🪛 LanguageTool

[grammar] ~55-~55: There might be a mistake here.
Context: ...on deployment patterns on a single node. ### Start NATS and ETCD in the background S...

(QB_NEW_EN)

🤖 Prompt for AI Agents
components/backends/sglang/README.md around lines 55 to 56: The Quick Start
intro contains an extra definite article ("the the") — update the sentence
"Below we provide a guide that lets you run all of our the common deployment
patterns on a single node." to remove the duplicated "the" so it reads correctly
(e.g., "Below we provide a guide that lets you run all of our common deployment
patterns on a single node.").


Start using [Docker Compose](../../../deploy/docker-compose.yml)
Expand Down Expand Up @@ -141,7 +140,7 @@ cd $DYNAMO_ROOT/components/backends/sglang

## Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
In a Distributed System, a request may fail due to connectivity issues between the Frontend and the Backend.

The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.

Expand All @@ -164,7 +163,6 @@ Below we provide a selected list of advanced examples. Please open up an issue i

### Large scale P/D disaggregation with WideEP
- **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
- **[Run DeepSeek-R1 on GB200s](docs/dsr1-wideep-gb200.md)**

### Speculative Decoding
- **[Deploying DeepSeek-R1 with MTP - coming soon!](.)**
Expand Down
Loading