Commit 6dddd12

Merge branch 'main' of github.com:ai-dynamo/dynamo into jacky-ft-complete-final
2 parents: 8ea92fb + bd0d67d


57 files changed (+68697 additions, -50742 deletions)

ATTRIBUTIONS-Go.md

Lines changed: 26845 additions & 23008 deletions
Large diffs are not rendered by default.

ATTRIBUTIONS-Python.md

Lines changed: 16904 additions & 8904 deletions
Large diffs are not rendered by default.

ATTRIBUTIONS-Rust.md

Lines changed: 20889 additions & 18385 deletions
Large diffs are not rendered by default.

Cargo.lock

Lines changed: 664 additions & 238 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ tokio-util = { version = "0.7", features = ["codec", "net"] }
 tracing = { version = "0.1" }
 tracing-subscriber = { version = "0.3", features = ["env-filter", "local-time", "json"] }
 validator = { version = "0.20.0", features = ["derive"] }
-uuid = { version = "1", features = ["v4", "serde"] }
+uuid = { version = "1.17", features = ["v4", "serde"] }
 url = {version = "2.5", features = ["serde"]}
 xxhash-rust = { version = "0.8", features = ["xxh3", "const_xxh3"] }
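For context, bumping a version requirement in Cargo.toml is typically followed by refreshing the lockfile, which is part of what the large Cargo.lock diff above reflects. A minimal local sketch; the `--precise 1.17.0` patch version is an assumption, not taken from the commit:

```bash
# Hypothetical local check of the uuid requirement bump.
cargo update -p uuid --precise 1.17.0   # refresh Cargo.lock against the new "1.17" constraint
cargo tree -i uuid                      # list which workspace crates pull in uuid
```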

container/build.sh

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ SGLANG_BASE_IMAGE_TAG="25.01-cuda12.8-devel-ubuntu24.04"
 VLLM_V1_BASE_IMAGE="nvcr.io/nvidia/cuda-dl-base"
 VLLM_V1_BASE_IMAGE_TAG="25.01-cuda12.8-devel-ubuntu24.04"

-NIXL_COMMIT=16348080f5bdeb9fe6058a23be140cec020ef3f3
+NIXL_COMMIT=3503658e71143b56f9d5b1b440d84a94b9c41af8
 NIXL_REPO=ai-dynamo/nixl.git

 NIXL_UCX_EFA_REF=7ec95b95e524a87e81cac92f5ca8523e3966b16b
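The NIXL dependency is pinned by commit rather than by tag. As a sketch of how such a pin is typically checked out (the clone URL construction and directory name are assumptions; only `NIXL_REPO` and `NIXL_COMMIT` come from build.sh above):

```bash
# Hypothetical manual checkout of the pinned NIXL revision.
NIXL_REPO=ai-dynamo/nixl.git
NIXL_COMMIT=3503658e71143b56f9d5b1b440d84a94b9c41af8

git clone "https://github.com/${NIXL_REPO}" nixl
git -C nixl checkout "${NIXL_COMMIT}"   # detached HEAD at the pinned commit
```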

deny.toml

Lines changed: 3 additions & 1 deletion
@@ -30,7 +30,9 @@ allow = [
     "OpenSSL",
     "Unicode-3.0",
     "BSL-1.0",
-    "MPL-2.0"
+    "MPL-2.0",
+    "CDLA-Permissive-2.0",
+    "Zlib"
 ]

 # TODO exceptions
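deny.toml is the cargo-deny configuration, so the two new entries (`CDLA-Permissive-2.0`, `Zlib`) extend the license allow-list. A quick way to verify the list still passes, assuming cargo-deny is installed and run from the workspace checkout:

```bash
# Run the license check that consumes the allow list above.
cargo install cargo-deny --locked   # once, if not already installed
cargo deny check licenses
```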

deploy/sdk/src/dynamo/sdk/cli/circus.py

Lines changed: 21 additions & 1 deletion
@@ -86,6 +86,23 @@ def create_circus_watcher(
     use_sockets: bool = True,
     **kwargs: Any,
 ) -> Watcher:
+    log_dir = os.environ.get("DYN_CIRCUS_LOG_DIR", None)
+    if log_dir is not None:
+        prefix = f"{log_dir}/{name}"
+        os.makedirs(prefix, exist_ok=True)
+        stdout_stream = {
+            "class": "FileStream",
+            "filename": f"{prefix}/output.log",
+            "backup_count": 10,
+        }
+        stderr_stream = {
+            "class": "FileStream",
+            "filename": f"{prefix}/error.log",
+            "backup_count": 10,
+        }
+    else:
+        stdout_stream = None
+        stderr_stream = None
     return Watcher(
         name=name,
         cmd=shlex.quote(cmd) if psutil.POSIX else cmd,
@@ -94,7 +111,10 @@ def create_circus_watcher(
         stop_children=True,
         use_sockets=use_sockets,
         graceful_timeout=86400,
-        respawn=False,  # TODO
+        respawn=os.environ.get("DYN_CIRCUS_RESPAWN", "false").lower()
+        in ("true", "1", "yes"),
+        stdout_stream=stdout_stream,
+        stderr_stream=stderr_stream,
         **kwargs,
     )
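Both additions are driven purely by environment variables, so deployments can opt in without code changes. A minimal sketch of exercising them; only `DYN_CIRCUS_LOG_DIR`, `DYN_CIRCUS_RESPAWN`, and the `output.log`/`error.log` layout come from the diff above, while the launch command and watcher name are illustrative assumptions:

```bash
# Persist per-watcher logs and opt in to auto-respawn via the new env vars.
export DYN_CIRCUS_LOG_DIR=/var/log/dynamo   # each watcher logs to <dir>/<watcher-name>/{output,error}.log
export DYN_CIRCUS_RESPAWN=true              # truthy values: "true", "1", "yes"

dynamo serve my_graph:Frontend              # hypothetical launch that creates circus watchers

tail -f /var/log/dynamo/*/output.log        # follow stdout of all watchers
```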

examples/tensorrt_llm/configs/deepseek_r1/engine_configs/prefill_config.yaml

Lines changed: 0 additions & 4 deletions
@@ -25,11 +25,7 @@ max_num_tokens: 8192
 max_seq_len: 8192

 kv_cache_config:
-  # With dp attention disabled: high free_gpu_memory_fraction is fine.
   free_gpu_memory_fraction: 0.75
-  # With dp attention enabled: large ISL at high concurrency may need
-  # free_gpu_memory_fraction low to have enough available memory.
-  # free_gpu_memory_fraction: 0.30

 # NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
 # NOTE: overlap_scheduler enabled by default since this commit and changed

examples/tensorrt_llm/configs/deepseek_r1/multinode/README.md

Lines changed: 90 additions & 26 deletions
@@ -10,8 +10,13 @@ the set of nodes need to be launched together in the same MPI world, such as
 via `mpirun` or `srun`. This is true regardless of whether the worker is
 aggregated, prefill-only, or decode-only.

-In this document we will demonstrate an example of launching a multi-node TP16/EP16
-aggregated worker on a slurm cluster with `srun`.
+In this document we will demonstrate two examples of launching multinode workers
+on a slurm cluster with `srun`:
+1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16
+   worker across 4 GB200 nodes
+2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node
+   TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode
+   worker (4 nodes) across a total of 8 GB200 nodes.

 NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and
 `start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or
@@ -25,7 +30,7 @@ For simplicity of the example, we will make some assumptions about your slurm cl
    testing, you should aim to allocate groups of nodes that are performantly
    inter-connected, such as those in an NVL72 setup.
 2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
-   SPANK plugin setup. In particular, the `srun_script.sh` script in this
+   SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this
    example will use `srun` arguments like `--container-image`,
    `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
    If your cluster supports similar container based plugins, you may be able to
@@ -34,10 +39,14 @@ For simplicity of the example, we will make some assumptions about your slurm cl
    described [here](https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker).
    This is the image that can be set to the `IMAGE` environment variable in later steps.
 4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
-   will allocate 4 nodes below as a reference command. This is technically not
-   a requirement, but makes iterations of testing/experimenting easier when
-   you have a reserved set of nodes for a period of time. Make sure to set your
-   `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
+   will allocate 8 nodes below as a reference command to have enough capacity
+   to run both examples. If you plan to only run the aggregated example, you
+   will only need 4 nodes. If you customize the configurations to require a
+   different number of nodes, you can adjust the number of allocated nodes
+   accordingly. Pre-allocating nodes is technically not a requirement,
+   but it makes iterations of testing/experimenting easier.
+
+   Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
 ```bash
 # Set partition manually based on your slurm cluster's partition names
 PARTITION=""
@@ -48,20 +57,21 @@ For simplicity of the example, we will make some assumptions about your slurm cl
   --account="${ACCOUNT}" \
   --job-name="${ACCOUNT}-dynamo.trtllm" \
   -t 05:00:00 \
-  --nodes 4
+  --nodes 8
 ```
 5. Lastly, we will assume you are inside an interactive shell on one of your allocated
-   nodes, which should be the default behavior after executing the `salloc` command above.
-   If not, then you should SSH into one of the allocated nodes.
+   nodes, which may be the default behavior after executing the `salloc` command above
+   depending on the cluster setup. If not, then you should SSH into one of the allocated nodes.

-## Launching Slurm Jobs
+### Environment Variable Setup

 This example aims to automate as much of the environment setup as possible,
 but all slurm clusters and environments are different, and you may need to
 dive into the scripts to make modifications based on your specific environment.

-Assuming you have already allocated at least 4 nodes via `salloc`, and are
-inside an interactive shell on one of the allocated nodes:
+Assuming you have already allocated your nodes via `salloc`, and are
+inside an interactive shell on one of the allocated nodes, set the
+following environment variables:
 ```bash
 # NOTE: IMAGE must be set manually for now
 # To build an image, see the steps here:
@@ -77,7 +87,7 @@ export IMAGE="<dynamo_trtllm_image>"
 #
 # NOTE: Currently, this example assumes that the local bash scripts and configs
 # referenced are mounted into /mnt inside the container. If you want to
-# customize the location of the scripts, make sure to modify `srun_script.sh`
+# customize the location of the scripts, make sure to modify `srun_aggregated.sh`
 # accordingly for the new locations of `start_frontend_services.sh` and
 # `start_trtllm_worker.sh`.
 #
@@ -105,28 +115,68 @@ export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
 # By default this is inferred from MODEL_PATH, but when using locally downloaded
 # model weights, it can be nice to have explicit control over the name.
 export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
+```
+
+## Aggregated WideEP
+
+Assuming you have at least 4 nodes allocated following the setup steps above,
+follow the steps below to launch an **aggregated** deployment across 4 nodes:

-# NOTE: This path assumes you have mounted the config file into /mnt inside
-# the container. See the MOUNTS variable in srun_script.sh
-export ENGINE_CONFIG="/mnt/agg_DEP16_dsr1.yaml"
+```bash
+# Default set in srun_aggregated.sh, but can customize here.
+# export ENGINE_CONFIG="/mnt/engine_configs/wide_ep_agg.yaml"

 # Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG
-# The produce of NUM_NODES*NUM_GPUS_PER_NODE should match the number of
+# The product of NUM_NODES*NUM_GPUS_PER_NODE should match the number of
 # total GPUs necessary to satisfy the requested parallelism. For example,
 # 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16.
-export NUM_NODES=4
+# export NUM_NODES=4
+
+# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
+# export NUM_GPUS_PER_NODE=4
+
+# Launches:
+# - frontend + etcd/nats on current (head) node
+# - one large aggregated trtllm worker across multiple nodes via MPI tasks
+./srun_aggregated.sh
+```
+
+## Disaggregated WideEP
+
+Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode)
+following the setup above, follow the steps below to launch a **disaggregated**
+deployment across 8 nodes:
+
+> [!Tip]
+> Make sure you have a fresh environment and don't have the aggregated
+> example above still deployed on the same set of nodes.
+
+```bash
+# Defaults set in srun_disaggregated.sh, but can customize here.
+# export PREFILL_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_prefill.yaml"
+# export DECODE_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_decode.yaml"
+
+# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG
+# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG
+# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and
+# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of
+# GPUs necessary to satisfy the requested parallelism in each config.
+# export NUM_PREFILL_NODES=4
+# export NUM_DECODE_NODES=4

 # GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
-export NUM_GPUS_PER_NODE=4
+# export NUM_GPUS_PER_NODE=4

-# Launches frontend + etcd/nats on current (head) node.
-# Launches one large trtllm worker across multiple nodes via MPI tasks.
-./srun_script.sh
+# Launches:
+# - frontend + etcd/nats on current (head) node.
+# - one large prefill trtllm worker across multiple nodes via MPI tasks
+# - one large decode trtllm worker across multiple nodes via MPI tasks
+./srun_disaggregated.sh
 ```

 ## Understanding the Output

-1. The `srun_script.sh` launches two `srun` jobs. The first launches
+1. The `srun_aggregated.sh` launches two `srun` jobs. The first launches
    etcd, NATS, and the OpenAI frontend on the head node only
    called "node1" in the example output below. The second launches
    a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, each node
@@ -168,7 +218,9 @@ export NUM_GPUS_PER_NODE=4
 ```
 5. At this point, with the worker fully initialized and detected by the frontend,
    it is now ready for inference.
-
+6. `srun_disaggregated.sh` follows a very similar flow, but launches three
+   srun jobs instead of two: one for the frontend, one for the prefill worker,
+   and one for the decode worker.

 ## Example Request

@@ -195,7 +247,8 @@ curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \

 ## Cleanup

-To cleanup background `srun` processes launched by `srun_script.sh`, you can run:
+To clean up background `srun` processes launched by `srun_aggregated.sh` or
+`srun_disaggregated.sh`, you can run:
 ```bash
 pkill srun
 ```
@@ -209,3 +262,14 @@ pkill srun
   serving example will be added in the near future.
 - WideEP configs in this directory are still being tested. A WideEP specific
   example with documentation will be added once ready.
+- There are known issues where WideEP workers may not cleanly shut down:
+  - This may lead to leftover shared memory files in `/dev/shm/moe_*`. For
+    now, you must manually clean these up before deploying again on the
+    same set of nodes.
+  - Similarly, there may be GPU memory left in use after killing the `srun`
+    jobs. After cleaning up any leftover shared memory files as described
+    above, the GPU memory may slowly be released. You can run `watch nvidia-smi`
+    to check on this behavior. If you don't free the GPU memory before the
+    next deployment, you may get a CUDA OOM error while loading the model.
+  - There is mention of this issue in the relevant TRT-LLM blog
+    [here](https://github.com/NVIDIA/TensorRT-LLM/blob/6021a439ab9c29f4c46f721eeb59f6b992c425ea/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md#miscellaneous).
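Based on the known shutdown issues listed above, a minimal per-node cleanup sketch to run between deployments; the `/dev/shm/moe_*` path and the `watch nvidia-smi` check come from the notes above, while everything else is an assumption about your environment:

```bash
# Remove leftover expert-parallel shared memory segments from a previous run.
rm -f /dev/shm/moe_*

# Confirm GPU memory is released before relaunching; deploying too early can OOM at model load.
watch -n 5 nvidia-smi
```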
