@@ -10,8 +10,13 @@ the set of nodes need to be launched together in the same MPI world, such as
 via `mpirun` or `srun`. This is true regardless of whether the worker is
 aggregated, prefill-only, or decode-only.
 
-In this document we will demonstrate an example of launching a multi-node TP16/EP16
-aggregated worker on a slurm cluster with `srun`.
+In this document we will demonstrate two examples of launching multi-node workers
+on a slurm cluster with `srun`:
+1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16
+   worker across 4 GB200 nodes.
+2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node
+   TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode
+   worker (4 nodes) across a total of 8 GB200 nodes.
 
 NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and
 `start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or
@@ -25,7 +30,7 @@ For simplicity of the example, we will make some assumptions about your slurm cl
    testing, you should aim to allocate groups of nodes that are performantly
    inter-connected, such as those in an NVL72 setup.
 2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
-   SPANK plugin setup. In particular, the `srun_script.sh` script in this
+   SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this
    example will use `srun` arguments like `--container-image`,
    `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
    If your cluster supports similar container based plugins, you may be able to
@@ -34,10 +39,14 @@ For simplicity of the example, we will make some assumptions about your slurm cl
    described [here](https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker).
    This is the image that can be set to the `IMAGE` environment variable in later steps.
 4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
-   will allocate 4 nodes below as a reference command. This is technically not
-   a requirement, but makes iterations of testing/experimenting easier when
-   you have a reserved set of nodes for a period of time. Make sure to set your
-   `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
+   will allocate 8 nodes below as a reference command to have enough capacity
+   to run both examples. If you plan to run only the aggregated example, you
+   will need only 4 nodes. If you customize the configurations to require a
+   different number of nodes, you can adjust the number of allocated nodes
+   accordingly. Pre-allocating nodes is technically not a requirement,
+   but it makes iterations of testing/experimenting easier.
+
+   Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
    ```bash
    # Set partition manually based on your slurm cluster's partition names
    PARTITION=""
@@ -48,20 +57,21 @@ For simplicity of the example, we will make some assumptions about your slurm cl
      --account="${ACCOUNT}" \
      --job-name="${ACCOUNT}-dynamo.trtllm" \
      -t 05:00:00 \
-     --nodes 4
+     --nodes 8
    ```
 5. Lastly, we will assume you are inside an interactive shell on one of your allocated
-   nodes, which should be the default behavior after executing the `salloc` command above.
-   If not, then you should SSH into one of the allocated nodes.
+   nodes, which may be the default behavior after executing the `salloc` command above,
+   depending on the cluster setup. If not, then you should SSH into one of the allocated
+   nodes. A quick sanity check of your allocation is sketched below.
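+
+   As a quick sanity check (a minimal sketch, not part of the scripts in this
+   example), you can list the nodes in your allocation and verify that the
+   Pyxis `--container-image` flag works. The `IMAGE` variable referenced here
+   is the one you will set in the next section:
+   ```bash
+   # List the hostnames of the nodes in the current allocation
+   scontrol show hostnames "${SLURM_JOB_NODELIST}"
+
+   # Optional: verify that Pyxis can launch the container image on one node
+   srun --nodes=1 --ntasks=1 --container-image="${IMAGE}" \
+     grep PRETTY_NAME /etc/os-release
+   ```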
 
-## Launching Slurm Jobs
+### Environment Variable Setup
 
 This example aims to automate as much of the environment setup as possible,
 but all slurm clusters and environments are different, and you may need to
 dive into the scripts to make modifications based on your specific environment.
 
-Assuming you have already allocated at least 4 nodes via `salloc`, and are
-inside an interactive shell on one of the allocated nodes:
+Assuming you have already allocated your nodes via `salloc` and are
+inside an interactive shell on one of the allocated nodes, set the
+following environment variables based on your environment:
6575` ` ` bash
6676# NOTE: IMAGE must be set manually for now
6777# To build an iamge, see the steps here:
@@ -77,7 +87,7 @@ export IMAGE="<dynamo_trtllm_image>"
 #
 # NOTE: Currently, this example assumes that the local bash scripts and configs
 # referenced are mounted into /mnt inside the container. If you want to
-# customize the location of the scripts, make sure to modify `srun_script.sh`
+# customize the location of the scripts, make sure to modify `srun_aggregated.sh`
 # accordingly for the new locations of `start_frontend_services.sh` and
 # `start_trtllm_worker.sh`.
 #
@@ -105,28 +115,68 @@ export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
 # By default this is inferred from MODEL_PATH, but when using locally downloaded
 # model weights, it can be nice to have explicit control over the name.
 export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
+```
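+
+As an optional guard (a small sketch, not part of the provided scripts), you can
+check that the variables you intend to use are actually set before launching
+anything, for example:
+```bash
+# Optional: warn early if any of these variables is unset or empty.
+for var in IMAGE MODEL_PATH SERVED_MODEL_NAME; do
+  [ -n "${!var:-}" ] || echo "WARNING: ${var} is not set"
+done
+```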
+
+## Aggregated WideEP
+
+Assuming you have at least 4 nodes allocated following the setup steps above,
+follow the steps below to launch an **aggregated** deployment across 4 nodes:
 
-# NOTE: This path assumes you have mounted the config file into /mnt inside
-# the container. See the MOUNTS variable in srun_script.sh
-export ENGINE_CONFIG="/mnt/agg_DEP16_dsr1.yaml"
+```bash
+# Default is set in srun_aggregated.sh, but it can be customized here.
+# export ENGINE_CONFIG="/mnt/engine_configs/wide_ep_agg.yaml"
 
 # Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG
-# The produce of NUM_NODES*NUM_GPUS_PER_NODE should match the number of
+# The product of NUM_NODES*NUM_GPUS_PER_NODE should match the number of
 # total GPUs necessary to satisfy the requested parallelism. For example,
 # 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16.
-export NUM_NODES=4
+# export NUM_NODES=4
+
+# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
+# export NUM_GPUS_PER_NODE=4
+
+# Launches:
+# - frontend + etcd/nats on current (head) node
+# - one large aggregated trtllm worker across multiple nodes via MPI tasks
+./srun_aggregated.sh
+```
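+
+Once the worker has finished loading, you can confirm it registered with the
+frontend by querying the standard OpenAI-compatible model listing route (a
+sketch, assuming the same `HOST` and `PORT` used in the Example Request
+section below):
+```bash
+# The served model name should appear in the returned model list.
+curl -s "${HOST}:${PORT}/v1/models"
+```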
+
+## Disaggregated WideEP
+
+Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode)
+following the setup above, follow the steps below to launch a **disaggregated**
+deployment across 8 nodes:
+
+> [!Tip]
+> Make sure you have a fresh environment and don't still have the aggregated
+> example above deployed on the same set of nodes.
+
+```bash
+# Defaults are set in srun_disaggregated.sh, but they can be customized here.
+# export PREFILL_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_prefill.yaml"
+# export DECODE_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_decode.yaml"
+
+# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG
+# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG
+# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and
+# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of
+# GPUs necessary to satisfy the requested parallelism in each config.
+# export NUM_PREFILL_NODES=4
+# export NUM_DECODE_NODES=4
 
 # GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
-export NUM_GPUS_PER_NODE=4
+# export NUM_GPUS_PER_NODE=4
 
-# Launches frontend + etcd/nats on current (head) node.
-# Launches one large trtllm worker across multiple nodes via MPI tasks.
-./srun_script.sh
+# Launches:
+# - frontend + etcd/nats on current (head) node.
+# - one large prefill trtllm worker across multiple nodes via MPI tasks
+# - one large decode trtllm worker across multiple nodes via MPI tasks
+./srun_disaggregated.sh
 ```
 
 ## Understanding the Output
 
-1. The `srun_script.sh` launches two `srun` jobs. The first launches
+1. The `srun_aggregated.sh` script launches two `srun` jobs. The first launches
    etcd, NATS, and the OpenAI frontend on the head node only,
    called "node1" in the example output below. The second launches
    a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, each node
@@ -168,7 +218,9 @@ export NUM_GPUS_PER_NODE=4
    ```
 5. At this point, with the worker fully initialized and detected by the frontend,
    it is now ready for inference.
-
+6. `srun_disaggregated.sh` follows a very similar flow, but launches three
+   `srun` jobs instead of two: one for the frontend, one for the prefill
+   worker, and one for the decode worker. You can list these job steps with
+   the sketch below.
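+
+   To see the individual `srun` job steps running inside your allocation
+   (frontend and worker launches), standard Slurm tooling can help; this is a
+   sketch, and the exact output depends on your cluster:
+   ```bash
+   # Show the job steps currently running under your jobs.
+   squeue -s -u "${USER}"
+   ```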
 
 ## Example Request
 
@@ -195,7 +247,8 @@ curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
 
 ## Cleanup
 
-To cleanup background `srun` processes launched by `srun_script.sh`, you can run:
+To clean up background `srun` processes launched by `srun_aggregated.sh` or
+`srun_disaggregated.sh`, you can run:
 ```bash
 pkill srun
 ```
@@ -209,3 +262,14 @@ pkill srun
   serving example will be added in the near future.
 - WideEP configs in this directory are still being tested. A WideEP-specific
   example with documentation will be added once ready.
+- There are known issues where WideEP workers may not shut down cleanly:
+  - This may lead to leftover shared memory files in `/dev/shm/moe_*`. For
+    now, you must manually clean these up before deploying again on the
+    same set of nodes (see the cleanup sketch below).
+  - Similarly, there may be GPU memory left in use after killing the `srun`
+    jobs. After cleaning up any leftover shared memory files as described
+    above, the GPU memory should gradually be freed. You can run `watch nvidia-smi`
+    to check on this behavior. If you don't free the GPU memory before the
+    next deployment, you may get a CUDA OOM error while loading the model.
+  - This issue is mentioned in the relevant TRT-LLM blog post
+    [here](https://github.com/NVIDIA/TensorRT-LLM/blob/6021a439ab9c29f4c46f721eeb59f6b992c425ea/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md#miscellaneous).
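+
+If you hit the leftover shared memory issue above, a cleanup pass like the
+following can help (a sketch, assuming your `salloc` allocation is still active
+so `srun` can reach every node, and that the container shares the host's
+`/dev/shm`):
+```bash
+# Remove leftover MoE shared-memory files on every node in the allocation.
+srun --nodes="${SLURM_JOB_NUM_NODES}" --ntasks-per-node=1 \
+  bash -c 'rm -f /dev/shm/moe_*'
+
+# Watch GPU memory being released before redeploying.
+watch nvidia-smi
+```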