2 changes: 1 addition & 1 deletion README.md
@@ -201,4 +201,4 @@ pip install ".[all]"
docker compose -f deploy/metrics/docker-compose.yml up -d
cd examples/llm
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
```
```
@@ -15,7 +15,7 @@

# Note this container is built from a local dockerfile
# Please see instructions in examples/sglang/README.md
FROM deepep:latest
FROM sgl-widepep:latest

# Add NIXL build dependencies
RUN apt-get update -y && \
121 changes: 75 additions & 46 deletions examples/sglang/README.md
@@ -31,28 +31,55 @@ You can find the latest release [here](https://github.com/ai-dynamo/dynamo/relea
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

## Deployment Architectures
---

See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture. SGLang currently supports aggregated and disaggregated serving. KV routing support is coming soon!
## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#run-single-node-examples)
- [Multi-Node and Advanced Examples](#advanced-examples)
- [Deploy on SLURM or Kubernetes](#deployment)

## Getting Started
## Feature Support Matrix

1. Choose a deployment architecture based on your requirements
2. Configure the components as needed
3. Deploy using the provided scripts
### Core Dynamo Features

### Prerequisites
| Feature | SGLang | Notes |
|---------|--------|-------|
| [**Disaggregated Serving**](../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../docs/architecture/sla_planner.md) | ❌ | Planned |
| [**Load Based Planner**](../../docs/architecture/load_planner.md) | ❌ | Planned |
| [**KVBM**](../../docs/architecture/kvbm_architecture.md) | ❌ | Planned |

Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml)
### Large Scale P/D and WideEP Features

| Feature | SGLang | Notes |
|--------------------|--------|-----------------------------------------------------------------------|
| **WideEP** | ✅/🚧 | Full support on H100s/GB200 WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
| **DP Rank Routing**| 🚧 | Direct routing supported. Process per DP rank is not supported |
| **GB200 Support** | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |


## Quick Start

Below is a guide that lets you run all of the common deployment patterns on a single node. See our different [architectures](../llm/README.md#deployment-architectures) for a high-level overview of each pattern and its architecture diagram.

### Start NATS and ETCD in the background

Start using [Docker Compose](../../deploy/metrics/docker-compose.yml)

```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
```

### Build docker
### Build container

```bash
# On an x86 machine - sglang does not support ARM yet
# pull our pre-built sglang runtime container
docker pull nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.3.2
# or build from source
./container/build.sh --framework sglang
```

@@ -62,44 +62,22 @@ docker compose -f deploy/metrics/docker-compose.yml up -d
./container/run.sh -it --framework sglang
```

## Run Deployment

This figure shows an overview of the major components to deploy:



```

+------+ +-----------+ +------------------+ +---------------+
| HTTP |----->| processor |----->| Worker |------------>| Prefill |
| |<-----| |<-----| |<------------| Worker |
+------+ +-----------+ +------------------+ +---------------+
| ^ |
query best | | return | publish kv events
worker | | worker_id v
| | +------------------+
| +---------| kv-router |
+------------->| |
+------------------+

```

Note: The above architecture illustrates all the components. The final components
that get spawned depend upon the chosen graph.

### Example architectures
## Run Single Node Examples

> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script simply runs `dynamo-run` to start up the ingress and uses `python3` to start up the workers. You can easily take each command and run it in a separate terminal.
> Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
>
> Additionally, because we use sglang's argument parser, you can pass any argument that sglang supports straight to the worker!

#### Aggregated

### Aggregated Serving

```bash
cd $DYNAMO_ROOT/examples/sglang
./launch/agg.sh
```

#### Aggregated serving with KV Routing
### Aggregated Serving with KV Routing

> [!NOTE]
> The current implementation of `examples/sglang/components/worker.py` publishes _placeholder_ engine metrics to keep the Dynamo KV-router happy. Real-time metrics will be surfaced directly from the SGLang engine once the following pull requests are merged:
@@ -112,10 +117,10 @@ cd $DYNAMO_ROOT/examples/sglang
./launch/agg_router.sh
```

#### Disaggregated serving
### Disaggregated serving

<details>
<summary>SGLang Load Balancer vs Dynamo Discovery</summary>
<summary>Under the hood: SGLang Load Balancer vs Dynamo Discovery</summary>

SGLang uses a mini load balancer to route requests when serving disaggregated deployments. The load balancer functions as follows:

@@ -136,18 +141,42 @@ cd $DYNAMO_ROOT/examples/sglang
./launch/disagg.sh
```

##### Disaggregated with MoE models and DP attention
### Disaggregated Serving with Mixture-of-Experts (MoE) models and DP attention

SGLang also supports DP attention for MoE models. We provide an example config for this in `configs/disagg-dp-attention.yaml` which is based on the [DeepSeek-R1-Small-2layers](https://huggingface.co/silence09/DeepSeek-R1-Small-2layers) model. You can use this configuration to test out disaggregated serving on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.
You can use this configuration to test out disaggregated serving with dp attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.

```bash
# note this will require 4 GPUs
cd $DYNAMO_ROOT/examples/sglang
./launch/disagg_dp_attn.sh
```

In order to scale to the full DeepSeek-R1 model, you can follow the instructions in the [multinode-examples.md](./multinode-examples.md) file.
## Advanced Examples

Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!

### Run on multi-node
- **[Run a multi-node model](docs/multinode-examples.md)**

### Large scale P/D disaggregation with WideEP
- **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
- **[Run DeepSeek-R1 on GB200s](docs/dsr1-wideep-gb200.md)**

### Speculative Decoding
- **[Deploying DeepSeek-R1 with MTP - coming soon!](.)**

### Structured Output and Tool Calling
- **[Tool calling with Dynamo - coming soon!](.)**

### Supporting SGLang's native endpoints via Dynamo
- **[HTTP Server for native SGLang endpoints](docs/sgl-http-server.md)**

## Deployment

We currently provide deployment examples for Kubernetes (coming soon!) and SLURM.

##### Disaggregated with WideEP
## Kubernetes
- **[Deploying Dynamo with SGLang on Kubernetes - coming soon!](.)**

Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can find detailed deployment and benchmarking instructions [here](./dsr1-wideep.md)
## SLURM
- **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**
@@ -15,7 +15,7 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

# Running DeepSeek-R1 Disaggregated with WideEP
# Running DeepSeek-R1 Disaggregated with WideEP on H100s

Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://www.nvidia.com/en-us/technologies/ai/deepseek-r1-large-scale-p-d-with-wide-expert-parallelism/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).

@@ -26,16 +26,16 @@ Dynamo supports SGLang's implementation of wide expert parallelism and large sca
```bash
git clone -b v0.4.8.post1 https://github.com/sgl-project/sglang.git
cd sglang/docker
docker build -f Dockerfile -t deepep .
docker build -f Dockerfile -t sgl-widepep .
```

You will now have a `deepep:latest` image
You will now have a `sgl-widepep:latest` image

2. Build the Dynamo container

```bash
cd $DYNAMO_ROOT
docker build -f container/Dockerfile.sglang-deepep . -t dynamo-deepep --no-cache
docker build -f container/Dockerfile.sglang-wideep . -t dynamo-wideep --no-cache
```

3. You can run this container on each 8xH100 node using the following command.
@@ -56,7 +56,7 @@ docker run \
--ulimit nofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
dynamo-deepep:latest
dynamo-wideep:latest
```

In each container, you should be in the `/sgl-workspace/dynamo/examples/sglang` directory.
@@ -1,3 +1,8 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Multinode Examples

## Multi-node sized models
92 changes: 92 additions & 0 deletions examples/sglang/docs/sgl-http-server.md
@@ -0,0 +1,92 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Supporting SGLang's native endpoints via HTTP Server

## Introduction

The SGLang HTTP server provides a REST API interface for managing and monitoring SGLang components running in a dynamo distributed environment. It leverages dynamo's service discovery mechanism to automatically find and communicate with SGLang workers across the cluster.

## Architecture Overview

The HTTP server (`sgl_http_server.py`) is built on FastAPI and integrates with dynamo's `DistributedRuntime` to discover and interact with SGLang components. It uses the following discovery flow:

1. **Service Discovery**: Queries dynamo's etcd instance to find components that expose specific endpoints
2. **Dynamic Targeting**: Automatically discovers all matching components across namespaces without requiring manual configuration
3. **Direct Communication**: Establishes direct connections to discovered component instances using dynamo's client infrastructure

## Discovery Mechanism

The server uses dynamo's hierarchical service discovery structure:

- **DistributedRuntime**: Maintains connections to etcd (service discovery) and NATS (messaging)
- **Namespace**: Logical grouping of components (default: "dynamo")
- **Component**: Individual SGLang workers or services
- **Endpoint**: Specific functionality exposed by each component

The discovery process queries etcd with the prefix `instances/` to find all registered components that expose the target endpoint. Components are identified by their namespace, component name, and endpoint, allowing the server to dynamically scale operations across multiple instances.
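As a sketch of that lookup, a small helper can split a key found under the `instances/` prefix back into its parts. The `instances/{namespace}/{component}/{endpoint}` layout is an assumption drawn from the description above, not a confirmed dynamo key format:

```python
# Hypothetical sketch: split a dynamo-style etcd instance key into its parts.
# The "instances/{namespace}/{component}/{endpoint}" layout is assumed from the
# prose above; the real key format may differ.

def parse_instance_key(key: str) -> dict:
    """Return the namespace, component, and endpoint encoded in an etcd key."""
    prefix = "instances/"
    if not key.startswith(prefix):
        raise ValueError(f"not an instance key: {key}")
    namespace, component, endpoint = key[len(prefix):].split("/", 2)
    return {"namespace": namespace, "component": component, "endpoint": endpoint}
```

A server could apply this to every key returned by an etcd prefix query to group instances by component before fanning out requests.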

## Supported Endpoints

### Current Endpoints

#### POST /flush_cache
Flushes the radix cache across all discovered SGLang components.

**Behavior:**
- Discovers all components in the specified namespace that expose the `flush_cache` endpoint
- Sends flush requests to all instances of each discovered component
- Returns success/failure status with details about the operation

**Response:**
```json
{
"message": "Cache flush initiated",
"success": true
}
```
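A minimal client sketch, assuming the server is reachable on its default port; the base URL and the helper names are illustrative, and only the response-shape check reflects behavior documented above:

```python
import json
from urllib import request

def flush_succeeded(body: dict) -> bool:
    """Interpret the documented /flush_cache response shape."""
    return bool(body.get("success"))

def flush_cache(base_url: str = "http://localhost:9001") -> bool:
    """POST /flush_cache on a running sgl_http_server and report the outcome."""
    req = request.Request(f"{base_url}/flush_cache", method="POST")
    with request.urlopen(req) as resp:
        return flush_succeeded(json.loads(resp.read()))
```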

### Upcoming Endpoints

The following endpoints will be supported in future releases:

#### POST /start_expert_distribution_record
Begins recording expert distribution metrics across SGLang components.

#### POST /stop_expert_distribution_record
Stops the expert distribution recording process.

#### GET /dump_expert_distribution_record
Retrieves the collected expert distribution data.

## Configuration

The server accepts the following command-line arguments:

- `--port`: HTTP server port (default: 9001)
- `--ns/--namespace`: Target dynamo namespace (default: "dynamo")
- `--comp/--component`: Specific component name to target (default: discover all)
- `--endpoint`: Endpoint name to discover (default: "flush_cache")
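The documented flags can be mirrored with an argparse sketch (hypothetical: the real `sgl_http_server.py` may wire its parser differently):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the documented sgl_http_server.py command-line surface."""
    parser = argparse.ArgumentParser(description="SGLang HTTP management server")
    parser.add_argument("--port", type=int, default=9001,
                        help="HTTP server port")
    parser.add_argument("--ns", "--namespace", dest="namespace", default="dynamo",
                        help="Target dynamo namespace")
    parser.add_argument("--comp", "--component", dest="component", default=None,
                        help="Specific component to target (default: discover all)")
    parser.add_argument("--endpoint", default="flush_cache",
                        help="Endpoint name to discover")
    return parser
```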

## Usage

Start the server:
```bash
python sgl_http_server.py --port 9001 --namespace dynamo
```

The server will automatically discover all SGLang components in the specified namespace and provide HTTP endpoints for managing them.