2 changes: 1 addition & 1 deletion README.md
@@ -201,4 +201,4 @@ pip install ".[all]"
docker compose -f deploy/metrics/docker-compose.yml up -d
cd examples/llm
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
```
```
@@ -15,7 +15,7 @@

# Note this container is built from a local dockerfile
# Please see instructions in examples/sglang/README.md
FROM deepep:latest
FROM sgl-widepep:latest

# Add NIXL build dependencies
RUN apt-get update -y && \
121 changes: 75 additions & 46 deletions examples/sglang/README.md
@@ -31,28 +31,55 @@ You can find the latest release [here](https://github.com/ai-dynamo/dynamo/relea
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

## Deployment Architectures
---

See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture. SGLang currently supports aggregated and disaggregated serving. KV routing support is coming soon!
## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#run-single-node-examples)
- [Multi-Node and Advanced Examples](#advanced-examples)
- [Deploy on SLURM or Kubernetes](#deployment)

## Getting Started
## Feature Support Matrix

1. Choose a deployment architecture based on your requirements
2. Configure the components as needed
3. Deploy using the provided scripts
### Core Dynamo Features

### Prerequisites
| Feature | SGLang | Notes |
|---------|--------|-------|
| [**Disaggregated Serving**](../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../docs/architecture/sla_planner.md) | ❌ | Planned |
| [**Load Based Planner**](../../docs/architecture/load_planner.md) | ❌ | Planned |
| [**KVBM**](../../docs/architecture/kvbm_architecture.md) | ❌ | Planned |

Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml)
### Large Scale P/D and WideEP Features

| Feature | SGLang | Notes |
|--------------------|--------|-----------------------------------------------------------------------|
| **WideEP** | ✅/🚧 | Full support on H100s/GB200 WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
| **DP Rank Routing**| 🚧 | Direct routing supported. Process per DP rank is not supported |
| **GB200 Support** | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |


## Quick Start

Below is a guide that lets you run all of the common deployment patterns on a single node. See our different [architectures](../llm/README.md#deployment-architectures) for a high-level overview of each pattern and its architecture diagram.

### Start NATS and ETCD in the background

Start using [Docker Compose](../../deploy/metrics/docker-compose.yml)

```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
```

### Build docker
### Build container

```bash
# On an x86 machine - sglang does not support ARM yet
# pull our pre-built sglang runtime container
docker pull nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.3.2
# or build from source
./container/build.sh --framework sglang
```

@@ -62,44 +62,22 @@ docker compose -f deploy/metrics/docker-compose.yml up -d
./container/run.sh -it --framework sglang
```

## Run Deployment

This figure shows an overview of the major components to deploy:



```

+------+ +-----------+ +------------------+ +---------------+
| HTTP |----->| processor |----->| Worker |------------>| Prefill |
| |<-----| |<-----| |<------------| Worker |
+------+ +-----------+ +------------------+ +---------------+
| ^ |
query best | | return | publish kv events
worker | | worker_id v
| | +------------------+
| +---------| kv-router |
+------------->| |
+------------------+

```

Note: The above architecture illustrates all the components. The final components
that get spawned depend upon the chosen graph.

### Example architectures
## Run Single Node Examples

> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script simply runs `dynamo-run` to start up the ingress and uses `python3` to start up the workers. You can easily take each command and run it in a separate terminal.
> Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
>
> Additionally, because we use sglang's argument parser, you can pass any argument that sglang supports straight to the worker!

#### Aggregated

### Aggregated Serving

```bash
cd $DYNAMO_ROOT/examples/sglang
./launch/agg.sh
```

#### Aggregated serving with KV Routing
### Aggregated Serving with KV Routing

> [!NOTE]
> The current implementation of `examples/sglang/components/worker.py` publishes _placeholder_ engine metrics to keep the Dynamo KV-router happy. Real-time metrics will be surfaced directly from the SGLang engine once the following pull requests are merged:
@@ -112,10 +117,10 @@ cd $DYNAMO_ROOT/examples/sglang
./launch/agg_router.sh
```

#### Disaggregated serving
### Disaggregated serving

<details>
<summary>SGLang Load Balancer vs Dynamo Discovery</summary>
<summary>Under the hood: SGLang Load Balancer vs Dynamo Discovery</summary>

SGLang uses a mini load balancer to route requests when serving disaggregated deployments. The load balancer functions as follows:

@@ -136,18 +141,42 @@ cd $DYNAMO_ROOT/examples/sglang
./launch/disagg.sh
```

##### Disaggregated with MoE models and DP attention
### Disaggregated Serving with Mixture-of-Experts (MoE) models and DP attention

SGLang also supports DP attention for MoE models. We provide an example config for this in `configs/disagg-dp-attention.yaml` which is based on the [DeepSeek-R1-Small-2layers](https://huggingface.co/silence09/DeepSeek-R1-Small-2layers) model. You can use this configuration to test out disaggregated serving on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.
You can use this configuration to test out disaggregated serving with dp attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.

```bash
# note this will require 4 GPUs
cd $DYNAMO_ROOT/examples/sglang
./launch/disagg_dp_attn.sh
```

In order to scale to the full DeepSeek-R1 model, you can follow the instructions in the [multinode-examples.md](./multinode-examples.md) file.
## Advanced Examples

Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!

### Run on multi-node
- **[Run a multi-node model](docs/multinode-examples.md)**

### Large scale P/D disaggregation with WideEP
- **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
- **[Run DeepSeek-R1 on GB200s](docs/dsr1-wideep-gb200.md)**

### Speculative Decoding
- **[Deploying DeepSeek-R1 with MTP - coming soon!](.)**

### Structured Output and Tool Calling
- **[Tool calling with Dynamo - coming soon!](.)**

### Supporting SGLang's native endpoints via Dynamo
- **[HTTP Server for native SGLang endpoints](docs/sgl-http-server.md)**

## Deployment

We currently provide deployment examples for Kubernetes (coming soon!) and SLURM.

##### Disaggregated with WideEP
## Kubernetes
- **[Deploying Dynamo with SGLang on Kubernetes - coming soon!](.)**

Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can find detailed deployment and benchmarking instructions [here](./dsr1-wideep.md)
## SLURM
- **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**
@@ -15,7 +15,7 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

# Running DeepSeek-R1 Disaggregated with WideEP
# Running DeepSeek-R1 Disaggregated with WideEP on H100s

Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://www.nvidia.com/en-us/technologies/ai/deepseek-r1-large-scale-p-d-with-wide-expert-parallelism/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).

@@ -26,16 +26,16 @@ Dynamo supports SGLang's implementation of wide expert parallelism and large sca
```bash
git clone -b v0.4.8.post1 https://github.com/sgl-project/sglang.git
cd sglang/docker
docker build -f Dockerfile -t deepep .
docker build -f Dockerfile -t sgl-widepep .
```

You will now have a `deepep:latest` image
You will now have a `sgl-widepep:latest` image

2. Build the Dynamo container

```bash
cd $DYNAMO_ROOT
docker build -f container/Dockerfile.sglang-deepep . -t dynamo-deepep --no-cache
docker build -f container/Dockerfile.sglang-wideep . -t dynamo-wideep --no-cache
```

3. You can run this container on each 8xH100 node using the following command.
@@ -56,7 +56,7 @@ docker run \
--ulimit nofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
dynamo-deepep:latest
dynamo-wideep:latest
```

In each container, you should be in the `/sgl-workspace/dynamo/examples/sglang` directory.
@@ -1,3 +1,8 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Multinode Examples

## Multi-node sized models
92 changes: 92 additions & 0 deletions examples/sglang/docs/sgl-http-server.md
@@ -0,0 +1,92 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Supporting SGLang's native endpoints via HTTP Server

## Introduction

The SGLang HTTP server provides a REST API interface for managing and monitoring SGLang components running in a dynamo distributed environment. It leverages dynamo's service discovery mechanism to automatically find and communicate with SGLang workers across the cluster.

## Architecture Overview

The HTTP server (`sgl_http_server.py`) is built on FastAPI and integrates with dynamo's `DistributedRuntime` to discover and interact with SGLang components. It uses the following discovery flow:

1. **Service Discovery**: Queries dynamo's etcd instance to find components that expose specific endpoints
2. **Dynamic Targeting**: Automatically discovers all matching components across namespaces without requiring manual configuration
3. **Direct Communication**: Establishes direct connections to discovered component instances using dynamo's client infrastructure

## Discovery Mechanism

The server uses dynamo's hierarchical service discovery structure:

- **DistributedRuntime**: Maintains connections to etcd (service discovery) and NATS (messaging)
- **Namespace**: Logical grouping of components (default: "dynamo")
- **Component**: Individual SGLang workers or services
- **Endpoint**: Specific functionality exposed by each component

The discovery process queries etcd with the prefix `instances/` to find all registered components that expose the target endpoint. Components are identified by their namespace, component name, and endpoint, allowing the server to dynamically scale operations across multiple instances.
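As a sketch of that lookup, a small helper can split a key found under the `instances/` prefix back into its parts. The `instances/{namespace}/{component}/{endpoint}` layout is an assumption drawn from the description above, not a confirmed dynamo key format:

```python
# Hypothetical sketch: split a dynamo-style etcd instance key into its parts.
# The "instances/{namespace}/{component}/{endpoint}" layout is assumed from the
# prose above; the real key format may differ.

def parse_instance_key(key: str) -> dict:
    """Return the namespace, component, and endpoint encoded in an etcd key."""
    prefix = "instances/"
    if not key.startswith(prefix):
        raise ValueError(f"not an instance key: {key}")
    namespace, component, endpoint = key[len(prefix):].split("/", 2)
    return {"namespace": namespace, "component": component, "endpoint": endpoint}
```

A server could apply this to every key returned by an etcd prefix query to group instances by component before fanning out requests.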

## Supported Endpoints

### Current Endpoints

#### POST /flush_cache
Flushes the radix cache across all discovered SGLang components.

**Behavior:**
- Discovers all components in the specified namespace that expose the `flush_cache` endpoint
- Sends flush requests to all instances of each discovered component
- Returns success/failure status with details about the operation

**Response:**
```json
{
"message": "Cache flush initiated",
"success": true
}
```
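A minimal client sketch, assuming the server is reachable on its default port; the base URL and the helper names are illustrative, and only the response-shape check reflects behavior documented above:

```python
import json
from urllib import request

def flush_succeeded(body: dict) -> bool:
    """Interpret the documented /flush_cache response shape."""
    return bool(body.get("success"))

def flush_cache(base_url: str = "http://localhost:9001") -> bool:
    """POST /flush_cache on a running sgl_http_server and report the outcome."""
    req = request.Request(f"{base_url}/flush_cache", method="POST")
    with request.urlopen(req) as resp:
        return flush_succeeded(json.loads(resp.read()))
```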

### Upcoming Endpoints

The following endpoints will be supported in future releases:

#### POST /start_expert_distribution_record
Begins recording expert distribution metrics across SGLang components.

#### POST /stop_expert_distribution_record
Stops the expert distribution recording process.

#### GET /dump_expert_distribution_record
Retrieves the collected expert distribution data.

## Configuration

The server accepts the following command-line arguments:

- `--port`: HTTP server port (default: 9001)
- `--ns/--namespace`: Target dynamo namespace (default: "dynamo")
- `--comp/--component`: Specific component name to target (default: discover all)
- `--endpoint`: Endpoint name to discover (default: "flush_cache")
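The documented flags can be mirrored with an argparse sketch (hypothetical: the real `sgl_http_server.py` may wire its parser differently):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the documented sgl_http_server.py command-line surface."""
    parser = argparse.ArgumentParser(description="SGLang HTTP management server")
    parser.add_argument("--port", type=int, default=9001,
                        help="HTTP server port")
    parser.add_argument("--ns", "--namespace", dest="namespace", default="dynamo",
                        help="Target dynamo namespace")
    parser.add_argument("--comp", "--component", dest="component", default=None,
                        help="Specific component to target (default: discover all)")
    parser.add_argument("--endpoint", default="flush_cache",
                        help="Endpoint name to discover")
    return parser
```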

## Usage

Start the server:
```bash
python sgl_http_server.py --port 9001 --namespace dynamo
```

The server will automatically discover all SGLang components in the specified namespace and provide HTTP endpoints for managing them.