ai-dynamo · biswapanda · Jul 30, 2025 · Jul 30, 2025 · Jul 30, 2025 · Jul 31, 2025
diff --git a/ATTRIBUTIONS-Go.md b/ATTRIBUTIONS-Go.md
diff --git a/Cargo.lock b/Cargo.lock
@@ -15,7 +15,7 @@ members = [
 resolver = "3"
 
 [workspace.package]
-version = "0.4.0"
+version = "0.4.0+post0"
-version = "0.4.0+post0"
+[workspace.dependencies]
+# Local crates
+dynamo-runtime = { path = "lib/runtime", version = "0.4.0+post0" }
+dynamo-llm      = { path = "lib/llm",     version = "0.4.0+post0" }
+dynamo-tokens   = { path = "lib/tokens",  version = "0.4.0+post0" }
-version = "0.4.0+post0"
+[workspace.dependencies]
+# Local crates
+dynamo-runtime = { path = "lib/runtime", version = "0.4.0+post0" }
+dynamo-llm      = { path = "lib/llm",     version = "0.4.0+post0" }
+dynamo-tokens   = { path = "lib/tokens",  version = "0.4.0+post0" }
 edition = "2021"
 description = "Dynamo Inference Framework"
 authors = ["NVIDIA Inc. <sw-dl-dynamo@nvidia.com>"]

diff --git a/README.md b/README.md
@@ -21,12 +21,30 @@ limitations under the License.
 [![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/D92uqZRjCZ)
 [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/dynamo)
 
-| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |
+| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Support Matrix](docs/support_matrix.md)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |
 
 # NVIDIA Dynamo
 
 High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
 
+## Framework Support Matrix
+
+| Feature | vLLM | SGLang | TensorRT-LLM |
+|---------|----------------------|----------------------------|----------------------------------------|
+| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
+| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
+| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
+| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
+| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
+| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |
+
+To learn more about each framework and their capabilities, check out each framework's README and deploy them with Dynamo!
+- **[vLLM](components/backends/vllm/README.md)**
+- **[SGLang](components/backends/sglang/README.md)**
+- **[TensorRT-LLM](components/backends/trtllm/README.md)**
+
+Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.
+
 ## The Era of Multi-GPU, Multi-Node
 
 <p align="center">
@@ -47,24 +65,6 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
   <img src="./docs/images/frontpage-architecture.png" alt="Dynamo architecture" width="600" />
 </p>
 
-## Framework Support Matrix
-
-| Feature | vLLM | SGLang | TensorRT-LLM |
-|---------|----------------------|----------------------------|----------------------------------------|
-| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
-| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
-| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
-| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
-| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
-| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |
-
-To learn more about each framework and their capabilities, check out each framework's README!
-- **[vLLM](components/backends/vllm/README.md)**
-- **[SGLang](components/backends/sglang/README.md)**
-- **[TensorRT-LLM](components/backends/trtllm/README.md)**
-
-Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.
-
 # Installation
 
 The following examples require a few system level packages.
@@ -115,11 +115,11 @@ Dynamo provides a simple way to spin up a local set of inference components incl
 
 ```
 # Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router:
-python -m dynamo.frontend [--http-port 8080]
+python -m dynamo.frontend --http-port 8080
 
 # Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
 # both for the same model and for multiple models. The frontend node will discover them.
-python -m dynamo.sglang.worker deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --skip-tokenizer-init
 ```
 
 #### Send a Request
@@ -167,10 +167,15 @@ To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.
 
 ## SGLang
 
+
 ```
-# Install libnuma
+# Install libnuma-dev
 apt install -y libnuma-dev
 
+# Install flashinfer-python pre-release (required by sglang for optimized inference)
+uv pip install "flashinfer-python==0.2.9rc2" --prerelease=allow
+
+# Install ai-dynamo with sglang support
 uv pip install ai-dynamo[sglang]
 ```
 

diff --git a/benchmarks/llm/README.md b/benchmarks/llm/README.md
@@ -12,4 +12,3 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->
 
-[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)
diff --git a/components/README.md b/components/README.md
@@ -77,4 +77,4 @@ To get started with Dynamo components:
 4. **Run deployment scripts** from the engine's launch directory
 5. **Monitor performance** using the metrics component
 
-For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../../docs/).
+For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../docs/).
diff --git a/components/backends/llama_cpp/README.md b/components/backends/llama_cpp/README.md
@@ -13,7 +13,7 @@ python -m dynamo.llama_cpp --model-path /data/models/Qwen3-0.6B-Q8_0.gguf [args]
 
 ## Request Migration
 
-In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+In a Distributed System, a request may fail due to connectivity issues between the Frontend and the Backend.
 
 The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.
 

diff --git a/components/backends/sglang/README.md b/components/backends/sglang/README.md
@@ -34,26 +34,25 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 
 | Feature | SGLang | Notes |
 |---------|--------|-------|
-| [**Disaggregated Serving**](../../docs/architecture/disagg_serving.md) | ✅ |  |
-| [**Conditional Disaggregation**](../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
-| [**KV-Aware Routing**](../../docs/architecture/kv_cache_routing.md) | ✅ |  |
-| [**SLA-Based Planner**](../../docs/architecture/sla_planner.md) | ❌ | Planned |
-| [**Load Based Planner**](../../docs/architecture/load_planner.md) | ❌ | Planned |
-| [**KVBM**](../../docs/architecture/kvbm_architecture.md) | ❌ | Planned |
+| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ |  |
+| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
+| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ |  |
+| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ❌ | Planned |
+| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | ❌ | Planned |
+| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ❌ | Planned |
 
 ### Large Scale P/D and WideEP Features
 
-| Feature            | SGLang | Notes                                                                 |
-|--------------------|--------|-----------------------------------------------------------------------|
-| **WideEP**         | ✅/🚧 | Full support on H100s/GB200 WIP [PR](https://github.com/sgl-project/sglang/pull/7556)                                     |
-| **DP Rank Routing**| 🚧    | Direct routing supported. Process per DP rank is not supported        |
-| **GB200 Support**  | 🚧    | WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
+| Feature             | SGLang | Notes                                                        |
+|---------------------|--------|--------------------------------------------------------------|
+| **WideEP**          | ✅     | Full support on H100s/GB200                                  |
+| **DP Rank Routing** | 🚧     | Direct routing supported. Dynamo KV router does not router to DP worker |
+| **GB200 Support**   | ✅     |                                                              |
 
-| Feature             | SGLang | Notes                                                        |
-|---------------------|--------|--------------------------------------------------------------|
-| **WideEP**          | ✅     | Full support on H100s/GB200                                  |
-| **DP Rank Routing** | 🚧     | Direct routing supported. Dynamo KV router does not router to DP worker |
-| **GB200 Support**   | ✅     |                                                              |
+| Feature             | SGLang | Notes                                                        |
+|---------------------|--------|--------------------------------------------------------------|
+| **WideEP**          | ✅     | Full support on H100s/GB200                                  |
+| **DP Rank Routing** | 🚧     | Direct routing supported. Dynamo KV router does not route to DP worker |
+| **GB200 Support**   | ✅     |                                                              |
-| Feature             | SGLang | Notes                                                        |
-|---------------------|--------|--------------------------------------------------------------|
-| **WideEP**          | ✅     | Full support on H100s/GB200                                  |
-| **DP Rank Routing** | 🚧     | Direct routing supported. Dynamo KV router does not router to DP worker |
-| **GB200 Support**   | ✅     |                                                              |
+| Feature             | SGLang | Notes                                                        |
+|---------------------|--------|--------------------------------------------------------------|
+| **WideEP**          | ✅     | Full support on H100s/GB200                                  |
+| **DP Rank Routing** | 🚧     | Direct routing supported. Dynamo KV router does not route to DP worker |
+| **GB200 Support**   | ✅     |                                                              |
 
 ## Quick Start
 
-Below we provide a guide that lets you run all of our the common deployment patterns on a single node. See our different [architectures](../llm/README.md#deployment-architectures) for a high level overview of each pattern and the architecture diagram for each.
-
+Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
 ### Start NATS and ETCD in the background
-Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
-### Start NATS and ETCD in the background
+Below we provide a guide that lets you run all of our common deployment patterns on a single node.
+### Start NATS and ETCD in the background
-Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
-### Start NATS and ETCD in the background
+Below we provide a guide that lets you run all of our common deployment patterns on a single node.
+### Start NATS and ETCD in the background
 
 Start using [Docker Compose](../../../deploy/docker-compose.yml)
@@ -141,7 +140,7 @@ cd $DYNAMO_ROOT/components/backends/sglang
 
 ## Request Migration
 
-In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+In a Distributed System, a request may fail due to connectivity issues between the Frontend and the Backend.
 
 The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.
 
@@ -164,7 +163,6 @@ Below we provide a selected list of advanced examples. Please open up an issue i
 
 ### Large scale P/D disaggregation with WideEP
 - **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
-- **[Run DeepSeek-R1 on GB200s](docs/dsr1-wideep-gb200.md)**
 
 ### Speculative Decoding
 - **[Deploying DeepSeek-R1 with MTP - coming soon!](.)**
Original file line number	Diff line number	Diff line change
Expand Up		@@ -12,4 +12,3 @@ See the License for the specific language governing permissions and
		limitations under the License.
		-->

		[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)