Closed
24 commits
992adfb
fix: add better port logic (#2175) (#2192)
alec-flowers Jul 30, 2025
9a93f11
chore: fix install (#2191)
ishandhanani Jul 30, 2025
2a616da
chore: fix QA bugs in documentation/readmes (#2199)
athreesh Jul 30, 2025
d0de1a0
feat: Add trtllm deploy examples for k8s #2133 (#2207)
biswapanda Jul 31, 2025
edccbd5
fix(sglang): disagg yaml worker change and agg kv router fix (#2205)
ishandhanani Jul 31, 2025
54fbff3
fix: add curl and jq for health checks #2203 (#2209)
biswapanda Jul 31, 2025
a9b6b28
fix: Kprashanth/trtllm rc4 cherry pick (#2218)
KrishnanPrash Jul 31, 2025
65e89b3
chore: cleanup dead links (#2208)
nealvaidya Jul 31, 2025
c92dc98
chore: update nixl version to 0.4.1 (#2221) (#2228)
nv-anants Jul 31, 2025
eb58916
chore: Remove multimodal readme. (#2212) (#2234)
krishung5 Jul 31, 2025
e848cf5
fix: Cherry pick pr 2186 release 0.4.0 to fix docs/runtime/README.md …
keivenchang Aug 1, 2025
5e3586d
fix: drop cuda graph bs (batch size) on dsr1 h100 sgl (#2235)
ishandhanani Aug 1, 2025
4fbb4e5
fix: handle groveTerminationDelay and auto-detect grove installation …
julienmancuso Aug 1, 2025
dc13774
fix: Locked triton==3.3.1 since triton 3.4.0 breaks tensorrt-llm 1.0.…
dmitry-tokarev-nv Aug 1, 2025
e5e94ad
fix: sgl instructions point to new frontend (#2245)
ishandhanani Aug 1, 2025
92781d3
fix: Update disagg configs for trtllm 1.0.0rc4 changes (release/0.4.0…
rmccorm4 Aug 4, 2025
58ad4a2
fix: readme instruction (#2265)
ishandhanani Aug 4, 2025
039c061
fix: Update eagle_one configs with speculative_model_dir field (#2283)
rmccorm4 Aug 4, 2025
2a8e251
docs: Backport: Dyn 591 (#2247) to 0.4.0 (#2251)
atchernych Aug 4, 2025
2dc4a4b
fix: trtllm container - ENV var used before declaration (#2277)
dmitry-tokarev-nv Aug 5, 2025
85737ba
fix: Update the NIXL TRTLLM commit version to rc4 (#2285)
tanmayv25 Aug 5, 2025
27c8a97
docs: add instruction to deploy model with inference gateway #2257 (#…
biswapanda Aug 5, 2025
641e49d
fix: fix nil pointer deref in dynamo controller (#2293) (#2299)
mohammedabdulwahhab Aug 5, 2025
a871d03
update from main
atchernych Aug 5, 2025
4 changes: 2 additions & 2 deletions Cargo.lock


49 changes: 27 additions & 22 deletions README.md
@@ -21,12 +21,30 @@ limitations under the License.
[![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/D92uqZRjCZ)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/dynamo)

| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |
| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Support Matrix](docs/support_matrix.md)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |

# NVIDIA Dynamo

High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.

## Framework Support Matrix

| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|----------------------|----------------------------|----------------------------------------|
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |

To learn more about each framework and their capabilities, check out each framework's README and deploy them with Dynamo!
- **[vLLM](components/backends/vllm/README.md)**
- **[SGLang](components/backends/sglang/README.md)**
- **[TensorRT-LLM](components/backends/trtllm/README.md)**

Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.

## The Era of Multi-GPU, Multi-Node

<p align="center">
@@ -47,24 +65,6 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
<img src="./docs/images/frontpage-architecture.png" alt="Dynamo architecture" width="600" />
</p>

## Framework Support Matrix

| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|----------------------|----------------------------|----------------------------------------|
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |

To learn more about each framework and their capabilities, check out each framework's README!
- **[vLLM](components/backends/vllm/README.md)**
- **[SGLang](components/backends/sglang/README.md)**
- **[TensorRT-LLM](components/backends/trtllm/README.md)**

Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.

# Installation

The following examples require a few system level packages.
@@ -115,11 +115,11 @@ Dynamo provides a simple way to spin up a local set of inference components incl

```
# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router:
python -m dynamo.frontend [--http-port 8080]
python -m dynamo.frontend --http-port 8080

# Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
# both for the same model and for multiple models. The frontend node will discover them.
python -m dynamo.sglang.worker deepseek-ai/DeepSeek-R1-Distill-Llama-8B
python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --skip-tokenizer-init
```

#### Send a Request
@@ -167,10 +167,15 @@ To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.

## SGLang


```
# Install libnuma
# Install libnuma-dev
apt install -y libnuma-dev

# Install flashinfer-python pre-release (required by sglang for optimized inference)
uv pip install "flashinfer-python==0.2.9rc2" --prerelease=allow

# Install ai-dynamo with sglang support
uv pip install ai-dynamo[sglang]
```

1 change: 0 additions & 1 deletion benchmarks/llm/README.md
@@ -12,4 +12,3 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)
2 changes: 1 addition & 1 deletion components/README.md
@@ -77,4 +77,4 @@ To get started with Dynamo components:
4. **Run deployment scripts** from the engine's launch directory
5. **Monitor performance** using the metrics component

For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../../docs/).
For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../docs/).
2 changes: 1 addition & 1 deletion components/backends/llama_cpp/README.md
@@ -13,7 +13,7 @@ python -m dynamo.llama_cpp --model-path /data/models/Qwen3-0.6B-Q8_0.gguf [args]

## Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
In a Distributed System, a request may fail due to connectivity issues between the Frontend and the Backend.

The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.

18 changes: 8 additions & 10 deletions components/backends/sglang/README.md
@@ -34,12 +34,12 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

| Feature | SGLang | Notes |
|---------|--------|-------|
| [**Disaggregated Serving**](../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../docs/architecture/sla_planner.md) | ❌ | Planned |
| [**Load Based Planner**](../../docs/architecture/load_planner.md) | ❌ | Planned |
| [**KVBM**](../../docs/architecture/kvbm_architecture.md) | ❌ | Planned |
| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ❌ | Planned |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | ❌ | Planned |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | ❌ | Planned |

### Large Scale P/D and WideEP Features

@@ -52,8 +52,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

## Quick Start

Below we provide a guide that lets you run all of our the common deployment patterns on a single node. See our different [architectures](../llm/README.md#deployment-architectures) for a high level overview of each pattern and the architecture diagram for each.

Below we provide a guide that lets you run all of the common deployment patterns on a single node.
### Start NATS and ETCD in the background

Start using [Docker Compose](../../../deploy/docker-compose.yml)
@@ -141,7 +140,7 @@ cd $DYNAMO_ROOT/components/backends/sglang

## Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
In a Distributed System, a request may fail due to connectivity issues between the Frontend and the Backend.

The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.

@@ -164,7 +163,6 @@ Below we provide a selected list of advanced examples. Please open up an issue i

### Large scale P/D disaggregation with WideEP
- **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
- **[Run DeepSeek-R1 on GB200s](docs/dsr1-wideep-gb200.md)**

### Speculative Decoding
- **[Deploying DeepSeek-R1 with MTP - coming soon!](.)**
162 changes: 162 additions & 0 deletions components/backends/sglang/deploy/README.md
@@ -0,0 +1,162 @@
# SGLang Kubernetes Deployment Configurations

This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying SGLang inference graphs using the **DynamoGraphDeployment** resource.

## Available Deployment Patterns

### 1. **Aggregated Deployment** (`agg.yaml`)
Basic deployment pattern with frontend and a single decode worker.

**Architecture:**
- `Frontend`: OpenAI-compatible API server
- `SGLangDecodeWorker`: Single worker handling both prefill and decode

### 2. **Aggregated Router Deployment** (`agg_router.yaml`)
Enhanced aggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: OpenAI-compatible API server with router mode enabled (`--router-mode kv`)
- `SGLangDecodeWorker`: Single worker handling both prefill and decode

### 3. **Disaggregated Deployment** (`disagg.yaml`)
High-performance deployment with separated prefill and decode workers.

**Architecture:**
- `Frontend`: HTTP API server coordinating between workers
- `SGLangDecodeWorker`: Specialized decode-only worker (`--disaggregation-mode decode`)
- `SGLangPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
- Communication via NIXL transfer backend (`--disaggregation-transfer-backend nixl`)
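
As a rough illustration of how the two disaggregated workers differ, the sketch below prints the container command for each mode. The flags are the ones listed above and the decode module name comes from `disagg.yaml`; the prefill module name and the `worker_cmd` helper itself are assumptions made for illustration, so check them against the template before relying on them.

```bash
#!/usr/bin/env bash
# Default model used by the templates; override MODEL to try another one.
MODEL="${MODEL:-deepseek-ai/DeepSeek-R1-Distill-Llama-8B}"

# worker_cmd MODE: print the command line for MODE ("decode" or "prefill").
# Hypothetical helper, not part of Dynamo.
worker_cmd() {
  local mode="$1" module
  case "$mode" in
    decode)  module="dynamo.sglang.decode_worker" ;;  # module used by the decode worker in disagg.yaml
    prefill) module="dynamo.sglang.worker" ;;         # assumption: prefill uses the generic worker module
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  echo "python3 -m $module" \
    "--model-path $MODEL" \
    "--served-model-name $MODEL" \
    "--disaggregation-mode $mode" \
    "--disaggregation-transfer-backend nixl"
}

# Example:
#   worker_cmd decode
#   worker_cmd prefill
```

Only the module and the `--disaggregation-mode` value change between the two workers; both must agree on the transfer backend.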

## CRD Structure

All templates use the **DynamoGraphDeployment** CRD:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: <deployment-name>
spec:
  services:
    <ServiceName>:
      # Service configuration
```

### Key Configuration Options

**Resource Management:**
```yaml
resources:
  requests:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
  limits:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
```

**Container Configuration:**
```yaml
extraPodSpec:
  mainContainer:
    image: my-registry/sglang-runtime:my-tag
    workingDir: /workspace/components/backends/sglang
    args:
      - "python3"
      - "-m"
      - "dynamo.sglang.worker"
      # Model-specific arguments
```

## Prerequisites

Before using these templates, ensure you have:

1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
2. **Kubernetes cluster with GPU support**
3. **Container registry access** for SGLang runtime images
4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)

## Usage

### 1. Choose Your Template
Select the deployment pattern that matches your requirements:
- Use `agg.yaml` for development/testing
- Use `agg_router.yaml` for production with load balancing
- Use `disagg.yaml` for maximum performance

### 2. Customize Configuration
Edit the template to match your environment:

```yaml
# Update image registry and tag
image: your-registry/sglang-runtime:your-tag

# Configure your model
args:
  - "--model-path"
  - "your-org/your-model"
  - "--served-model-name"
  - "your-org/your-model"
```

### 3. Deploy

Deploying takes two steps: create the HuggingFace token secret, then apply the deployment file.

First, create a secret for the HuggingFace token.
```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

Then, deploy the model using the deployment file.

```bash
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n ${NAMESPACE}
```
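
To preview what will be sent to the cluster, the apply step can be wrapped in a small dry-run helper. This is a minimal sketch: `run`, `deploy_graph`, and the `DRY_RUN` variable are hypothetical names introduced here, while the `kubectl apply` and `kubectl get pods` invocations are standard.

```bash
#!/usr/bin/env bash
# run CMD...: execute the command, or only print it when DRY_RUN is non-empty.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# deploy_graph FILE NAMESPACE: apply the manifest, then list the pods in the namespace.
deploy_graph() {
  local file="$1" ns="$2"
  if [ -z "$file" ] || [ -z "$ns" ]; then
    echo "usage: deploy_graph FILE NAMESPACE" >&2
    return 2
  fi
  run kubectl apply -f "$file" -n "$ns" || return 1
  run kubectl get pods -n "$ns"
}

# Example (prints the commands without touching the cluster):
#   DRY_RUN=1 deploy_graph agg.yaml "$NAMESPACE"
```

Rejecting an empty namespace up front avoids silently applying the manifest to whatever namespace the current context defaults to.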

### 4. Using Custom Dynamo Frameworks Image for SGLang

To use a custom dynamo frameworks image for SGLang, you can update the deployment file using yq:

```bash
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<sglang-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```

## Model Configuration

All templates use **DeepSeek-R1-Distill-Llama-8B** as the default model, but you can pass any SGLang argument or configuration through the worker `args`.

## Monitoring and Health

- **Frontend health endpoint**: `http://<frontend-service>:8000/health`
- **Liveness probes**: Check process health every 60s
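
The health endpoint can be polled from a shell while the model loads. The sketch below assumes `curl` is available and that the frontend is reachable on port 8000; the Service name `frontend` in the usage comment and both helper functions are hypothetical.

```bash
#!/usr/bin/env bash
# health_url HOST [PORT]: print the frontend health URL (port 8000 by default).
health_url() {
  echo "http://$1:${2:-8000}/health"
}

# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]: poll until curl gets a successful response.
wait_for_health() {
  local url="$1" attempts="${2:-30}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s silences progress output.
    if curl -sf -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts: $url" >&2
  return 1
}

# Example, assuming a local port-forward to a Service named "frontend":
#   kubectl port-forward svc/frontend 8000:8000 -n "$NAMESPACE" &
#   wait_for_health "$(health_url localhost)"
```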

## Further Reading

- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)

## Troubleshooting

Common issues and solutions:

1. **Pod fails to start**: Check image registry access and HuggingFace token secret
2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits
3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
4. **Out of memory**: Increase memory limits or reduce model batch size

For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
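
A first pass at the issues above usually starts with the same three `kubectl` commands. The helper below is a sketch that bundles them; `run`, `diagnose_pod`, and `DRY_RUN` are hypothetical names, and the pod name in the example is a placeholder.

```bash
#!/usr/bin/env bash
# run CMD...: execute the command, or only print it when DRY_RUN is non-empty.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# diagnose_pod POD NAMESPACE: collect first-pass diagnostics for one pod.
diagnose_pod() {
  local pod="$1" ns="$2"
  run kubectl describe pod "$pod" -n "$ns"                  # scheduling, image pull, and GPU allocation events
  run kubectl logs "$pod" -n "$ns" --tail=100               # model loading progress and out-of-memory messages
  run kubectl get events -n "$ns" --sort-by=.lastTimestamp  # recent namespace events, oldest first
}

# Example:
#   diagnose_pod <pod-name> "$NAMESPACE"
```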
4 changes: 2 additions & 2 deletions components/backends/sglang/deploy/disagg.yaml
@@ -83,7 +83,7 @@ spec:
args:
- "python3"
- "-m"
- "dynamo.sglang.worker"
- "dynamo.sglang.decode_worker"
- "--model-path"
- "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
- "--served-model-name"
@@ -152,4 +152,4 @@ spec:
- "--disaggregation-mode"
- "prefill"
- "--disaggregation-transfer-backend"
- "nixl"
- "nixl"