High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
## The Era of Multi-GPU, Multi-Node
Large language models are quickly outgrowing the memory and compute budget of any single GPU. Tensor-parallelism solves the capacity problem by spreading each layer across many GPUs—and sometimes many servers—but it creates a new one: how do you coordinate those shards, route requests, and share KV cache fast enough to feel like one accelerator? This orchestration gap is exactly what NVIDIA Dynamo is built to close.
### Introducing NVIDIA Dynamo
NVIDIA Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. Dynamo is designed to be inference engine agnostic (supporting TRT-LLM, vLLM, SGLang, and others) and captures LLM-specific capabilities such as:
- **Accelerated data transfer** – Reduces inference response time using NIXL.
- **KV cache offloading** – Leverages multiple memory hierarchies for higher system throughput.
Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.
# Installation
The following examples require a few system-level packages. We recommend Ubuntu 24.04 with an x86_64 CPU. See [docs/support_matrix.md](docs/support_matrix.md).
## 1. Initial setup
The Dynamo team recommends the `uv` Python package manager, although any Python package manager will work. Install uv:
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```
### Install etcd and NATS (required)
To coordinate across a data center, Dynamo relies on etcd and NATS. To run Dynamo locally, these need to be available.
- [etcd](https://etcd.io/) can be run directly as `./etcd`.

To quickly set up etcd and NATS, you can also run:
```
# At the root of the repository:
docker compose -f deploy/docker-compose.yml up -d
```
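If you want to confirm that both services came up, a quick optional check is to list the containers started by the compose file above:

```
docker compose -f deploy/docker-compose.yml ps
```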
## 2. Select an engine
We publish Python wheels specialized for each of our supported engines: vllm, sglang, trtllm, and llama.cpp. The examples that follow use SGLang; continue reading for other engines.
```
uv venv venv
source venv/bin/activate
uv pip install pip
# Choose one
uv pip install "ai-dynamo[sglang]"  # replace with [vllm], [trtllm], etc.
```
## 3. Run Dynamo

### Running and Interacting with an LLM Locally

You can run a model and interact with it locally using the commands below.

#### Example Commands

```
python -m dynamo.frontend --interactive
python -m dynamo.sglang.worker Qwen/Qwen3-4B
```

```
✔ User · Hello, how are you?
Okay, so I'm trying to figure out how to respond to the user's greeting. They said, "Hello, how are you?" and then followed it with "Hello! I'm just a program, but thanks for asking." Hmm, I need to come up with a suitable reply. ...
```

If the model is not available locally, it will be downloaded from Hugging Face and cached.

You can also pass a local path: `python -m dynamo.sglang.worker --model-path ~/llms/Qwen3-0.6B`
### Running an LLM API server
Dynamo provides a simple way to spin up a local set of inference components:

```
# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router:
python -m dynamo.frontend [--http-port 8080]

# Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
# both for the same model and for multiple models. The frontend node will discover them.
python -m dynamo.sglang.worker Qwen/Qwen3-4B
```
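Once the frontend and worker are up, you can send a request. The payload below is a sketch: it assumes the default `--http-port 8080`, the Qwen model served above, and the OpenAI-style chat completions path exposed by the frontend; adjust the names to match your setup.

```
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "stream": false,
    "max_tokens": 64
  }'
```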
Rerun with `curl -N` and change `stream` in the request to `true` to get the responses as soon as the engine issues them.
### Deploying Dynamo
- Follow the [Quickstart Guide](docs/guides/dynamo_deploy/README.md) to deploy on Kubernetes.
- Check out [Backends](components/backends) to deploy various workflow configurations (e.g. SGLang with router, vLLM with disaggregated serving, etc.)
- Run some [Examples](examples) to learn about building components in Dynamo and explore various integrations.
# Engines
Dynamo is designed to be inference engine agnostic. To use any engine with Dynamo, NATS and etcd need to be installed, along with a Dynamo frontend (`python -m dynamo.frontend [--interactive]`).
## vLLM
```
uv pip install ai-dynamo[vllm]
```

Run the backend/worker like this:
```
python -m dynamo.vllm --help
```
vLLM attempts to allocate enough KV cache for the full context length at startup. If that does not fit in your available memory, pass `--context-length <value>`.
To specify which GPUs to use, set the environment variable `CUDA_VISIBLE_DEVICES`.
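As an illustration, the sketch below combines both settings; the GPU indices and the `--model` flag are assumptions, so check `python -m dynamo.vllm --help` for the exact argument names:

```
# Pin the worker to GPUs 0 and 1 and limit the context length (and KV cache allocation) to 16384 tokens
CUDA_VISIBLE_DEVICES=0,1 python -m dynamo.vllm --model Qwen/Qwen3-4B --context-length 16384
```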
## SGLang
```
uv pip install ai-dynamo[sglang]
```
Run the backend/worker like this:
```
python -m dynamo.sglang.worker --help  # note the '.worker' in the module path for SGLang
```
You can pass any SGLang flags directly to this worker; see https://docs.sglang.ai/backend/server_arguments.html for the full list, including how to use multiple GPUs.
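For example, a multi-GPU sketch; the `--tp 2` tensor-parallel flag and GPU count are assumptions taken from the SGLang server arguments page above, so verify them against your SGLang version:

```
# Shard the model across 2 GPUs using SGLang's tensor-parallel setting
python -m dynamo.sglang.worker Qwen/Qwen3-4B --tp 2
```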
## TensorRT-LLM
It is recommended to use [NGC PyTorch Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for running the TensorRT-LLM engine.
> [!Note]
> Ensure that you select a PyTorch container image version that matches the version of TensorRT-LLM you are using.
> [!Important]
> Launch container with the following additional settings `--shm-size=1g --ulimit memlock=-1`
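A minimal launch sketch for such a container follows; the image tag is an assumption, so pick whichever NGC PyTorch tag matches your TensorRT-LLM version:

```
# Example only: replace the tag with one that matches your TensorRT-LLM version
docker run --rm -it --gpus all \
  --shm-size=1g --ulimit memlock=-1 \
  -v "$(pwd)":/workspace -w /workspace \
  nvcr.io/nvidia/pytorch:25.01-py3
```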
### Install prerequisites
```
# Optional step: Only required for Blackwell and Grace Hopper
```
> You can learn more about these prerequisites and known issues with TensorRT-LLM pip-based installation [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html).
### After installing the prerequisites above, install Dynamo
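A sketch of this step, mirroring the wheel installs shown earlier (run inside the container's Python environment):

```
uv pip install "ai-dynamo[trtllm]"
```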
## llama.cpp

If you have multiple GPUs, llama.cpp does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.
# Local Development
## 1. Install libraries
**Ubuntu:**
**macOS:** to check that Metal is accessible, run:
```
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
## 2. Install Rust
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
## 3. Create a Python virtual env:
```
uv venv dynamo
source dynamo/bin/activate
```
## 4. Install build tools
```
uv pip install pip maturin
```
[Maturin](https://github.com/PyO3/maturin) is the Rust<->Python bindings build tool.
## 5. Build the Rust bindings
```
cd lib/bindings/python
maturin develop --uv
```
## 6. Install the wheel
```
cd $PROJECT_ROOT
```

Remember that NATS and etcd must be running (see earlier).
Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.
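Because `DYN_LOG` uses `RUST_LOG` syntax, per-target levels also work; the target name below is purely hypothetical and only illustrates the format:

```
# Global level info, debug for one (hypothetical) target
export DYN_LOG=info,dynamo_llm=debug
```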
If you use VS Code or Cursor, we have a .devcontainer folder built on [Microsoft's Dev Containers extension](https://code.visualstudio.com/docs/devcontainers/containers). See the [README](.devcontainer/README.md) for instructions.
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo Components
This directory contains the core components that make up the Dynamo inference framework. Each component serves a specific role in the distributed LLM serving architecture, enabling high-throughput, low-latency inference across multiple nodes and GPUs.
## Supported Inference Engines
Dynamo supports multiple inference engines (with a focus on SGLang, vLLM, and TensorRT-LLM), each with its own deployment configurations and capabilities:
- **[vLLM](backends/vllm/README.md)** - High-performance LLM inference with native KV cache events and NIXL-based transfer mechanisms
- **[SGLang](backends/sglang/README.md)** - Structured generation language framework with ZMQ-based communication
- **[TensorRT-LLM](backends/trtllm/README.md)** - NVIDIA's optimized LLM inference engine with TensorRT acceleration
Each engine provides launch scripts for different deployment patterns in their respective `/launch` & `/deploy` directories.
## Core Components
### [Backends](backends/)
The backends directory contains inference engine integrations and implementations, with a key focus on:
- **vLLM** - Full-featured vLLM integration with disaggregated serving, KV-aware routing, and SLA-based planning