@@ -10,8 +10,13 @@ the set of nodes need to be launched together in the same MPI world, such as
 via `mpirun` or `srun`. This is true regardless of whether the worker is
 aggregated, prefill-only, or decode-only.
 
-In this document we will demonstrate an example of launching a multi-node TP16/EP16
-aggregated worker on a slurm cluster with `srun`.
+In this document we will demonstrate two examples of launching multi-node workers
+on a slurm cluster with `srun`:
+1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16
+   worker across 4 GB200 nodes.
+2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node
+   TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode
+   worker (4 nodes) across a total of 8 GB200 nodes.
 
 NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and
 `start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or
@@ -25,7 +30,7 @@ For simplicity of the example, we will make some assumptions about your slurm cl
    testing, you should aim to allocate groups of nodes that are performantly
    inter-connected, such as those in an NVL72 setup.
 2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
-   SPANK plugin setup. In particular, the `srun_script.sh` script in this
+   SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this
    example will use `srun` arguments like `--container-image`,
    `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
    If your cluster supports similar container based plugins, you may be able to
@@ -34,10 +39,14 @@ For simplicity of the example, we will make some assumptions about your slurm cl
    described [here](https://github.com/ai-dynamo/dynamo/tree/main/examples/tensorrt_llm#build-docker).
    This is the image that can be set to the `IMAGE` environment variable in later steps.
 4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
-   will allocate 4 nodes below as a reference command. This is technically not
-   a requirement, but makes iterations of testing/experimenting easier when
-   you have a reserved set of nodes for a period of time. Make sure to set your
-   `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
+   will allocate 8 nodes below as a reference command to have enough capacity
+   to run both examples. If you plan to run only the aggregated example, you
+   will need only 4 nodes. If you customize the configurations to require a
+   different number of nodes, you can adjust the number of allocated nodes
+   accordingly. Pre-allocating nodes is technically not a requirement,
+   but it makes iterations of testing/experimenting easier.
+
+   Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
    ```bash
    # Set partition manually based on your slurm cluster's partition names
    PARTITION=""
@@ -48,20 +57,21 @@ For simplicity of the example, we will make some assumptions about your slurm cl
      --account="${ACCOUNT}" \
      --job-name="${ACCOUNT}-dynamo.trtllm" \
      -t 05:00:00 \
-     --nodes 4
+     --nodes 8
    ```
 5. Lastly, we will assume you are inside an interactive shell on one of your allocated
-   nodes, which should be the default behavior after executing the `salloc` command above.
-   If not, then you should SSH into one of the allocated nodes.
+   nodes, which may be the default behavior after executing the `salloc` command above,
+   depending on the cluster setup. If not, then you should SSH into one of the allocated
+   nodes. A quick sanity check of your allocation is sketched below.
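+
+   As a quick sanity check (a minimal sketch, not part of the scripts in this
+   example), you can list the nodes in your allocation and verify that the
+   Pyxis `--container-image` flag works. The `IMAGE` variable referenced here
+   is the one you will set in the next section:
+   ```bash
+   # List the hostnames of the nodes in the current allocation
+   scontrol show hostnames "${SLURM_JOB_NODELIST}"
+
+   # Optional: verify that Pyxis can launch the container image on one node
+   srun --nodes=1 --ntasks=1 --container-image="${IMAGE}" \
+     grep PRETTY_NAME /etc/os-release
+   ```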
 
-## Launching Slurm Jobs
+### Environment Variable Setup
 
 This example aims to automate as much of the environment setup as possible,
 but all slurm clusters and environments are different, and you may need to
 dive into the scripts to make modifications based on your specific environment.
 
-Assuming you have already allocated at least 4 nodes via `salloc`, and are
-inside an interactive shell on one of the allocated nodes:
+Assuming you have already allocated your nodes via `salloc` and are
+inside an interactive shell on one of the allocated nodes, set the
+following environment variables based on your environment:
6575` ` ` bash
6676# NOTE: IMAGE must be set manually for now
6777# To build an iamge, see the steps here:
@@ -77,7 +87,7 @@ export IMAGE="<dynamo_trtllm_image>"
 #
 # NOTE: Currently, this example assumes that the local bash scripts and configs
 # referenced are mounted into /mnt inside the container. If you want to
-# customize the location of the scripts, make sure to modify `srun_script.sh`
+# customize the location of the scripts, make sure to modify `srun_aggregated.sh`
 # accordingly for the new locations of `start_frontend_services.sh` and
 # `start_trtllm_worker.sh`.
 #
@@ -105,28 +115,68 @@ export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
 # By default this is inferred from MODEL_PATH, but when using locally downloaded
 # model weights, it can be nice to have explicit control over the name.
 export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
+```
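+
+As an optional guard (a small sketch, not part of the provided scripts), you can
+check that the variables you intend to use are actually set before launching
+anything, for example:
+```bash
+# Optional: warn early if any of these variables is unset or empty.
+for var in IMAGE MODEL_PATH SERVED_MODEL_NAME; do
+  [ -n "${!var:-}" ] || echo "WARNING: ${var} is not set"
+done
+```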
+
+## Aggregated WideEP
+
+Assuming you have at least 4 nodes allocated following the setup steps above,
+follow the steps below to launch an **aggregated** deployment across 4 nodes:
 
-# NOTE: This path assumes you have mounted the config file into /mnt inside
-# the container. See the MOUNTS variable in srun_script.sh
-export ENGINE_CONFIG="/mnt/agg_DEP16_dsr1.yaml"
+```bash
+# Default is set in srun_aggregated.sh, but it can be customized here.
+# export ENGINE_CONFIG="/mnt/engine_configs/wide_ep_agg.yaml"
 
 # Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG
-# The produce of NUM_NODES*NUM_GPUS_PER_NODE should match the number of
+# The product of NUM_NODES*NUM_GPUS_PER_NODE should match the number of
 # total GPUs necessary to satisfy the requested parallelism. For example,
 # 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16.
-export NUM_NODES=4
+# export NUM_NODES=4
+
+# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
+# export NUM_GPUS_PER_NODE=4
+
+# Launches:
+# - frontend + etcd/nats on current (head) node
+# - one large aggregated trtllm worker across multiple nodes via MPI tasks
+./srun_aggregated.sh
+```
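+
+Once the worker has finished loading, you can confirm it registered with the
+frontend by querying the standard OpenAI-compatible model listing route (a
+sketch, assuming the same `HOST` and `PORT` used in the Example Request
+section below):
+```bash
+# The served model name should appear in the returned model list.
+curl -s "${HOST}:${PORT}/v1/models"
+```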
+
+## Disaggregated WideEP
+
+Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode)
+following the setup above, follow the steps below to launch a **disaggregated**
+deployment across 8 nodes:
+
+> [!Tip]
+> Make sure you have a fresh environment and don't still have the aggregated
+> example above deployed on the same set of nodes.
+
+```bash
+# Defaults are set in srun_disaggregated.sh, but they can be customized here.
+# export PREFILL_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_prefill.yaml"
+# export DECODE_ENGINE_CONFIG="/mnt/engine_configs/wide_ep_decode.yaml"
+
+# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG
+# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG
+# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and
+# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of
+# GPUs necessary to satisfy the requested parallelism in each config.
+# export NUM_PREFILL_NODES=4
+# export NUM_DECODE_NODES=4
 
 # GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
-export NUM_GPUS_PER_NODE=4
+# export NUM_GPUS_PER_NODE=4
 
-# Launches frontend + etcd/nats on current (head) node.
-# Launches one large trtllm worker across multiple nodes via MPI tasks.
-./srun_script.sh
+# Launches:
+# - frontend + etcd/nats on current (head) node.
+# - one large prefill trtllm worker across multiple nodes via MPI tasks
+# - one large decode trtllm worker across multiple nodes via MPI tasks
+./srun_disaggregated.sh
 ```
 
 ## Understanding the Output
 
-1. The `srun_script.sh` launches two `srun` jobs. The first launches
+1. The `srun_aggregated.sh` script launches two `srun` jobs. The first launches
    etcd, NATS, and the OpenAI frontend on the head node only,
    called "node1" in the example output below. The second launches
    a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, each node
@@ -168,7 +218,9 @@ export NUM_GPUS_PER_NODE=4
    ```
 5. At this point, with the worker fully initialized and detected by the frontend,
    it is now ready for inference.
-
+6. `srun_disaggregated.sh` follows a very similar flow, but launches three
+   `srun` jobs instead of two: one for the frontend, one for the prefill
+   worker, and one for the decode worker. You can list these job steps with
+   the sketch below.
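+
+   To see the individual `srun` job steps running inside your allocation
+   (frontend and worker launches), standard Slurm tooling can help; this is a
+   sketch, and the exact output depends on your cluster:
+   ```bash
+   # Show the job steps currently running under your jobs.
+   squeue -s -u "${USER}"
+   ```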
 
 ## Example Request
 
@@ -195,7 +247,8 @@ curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
 
 ## Cleanup
 
-To cleanup background `srun` processes launched by `srun_script.sh`, you can run:
+To clean up background `srun` processes launched by `srun_aggregated.sh` or
+`srun_disaggregated.sh`, you can run:
 ```bash
 pkill srun
 ```
@@ -209,3 +262,14 @@ pkill srun
   serving example will be added in the near future.
 - WideEP configs in this directory are still being tested. A WideEP-specific
   example with documentation will be added once ready.
+- There are known issues where WideEP workers may not shut down cleanly:
+  - This may lead to leftover shared memory files in `/dev/shm/moe_*`. For
+    now, you must manually clean these up before deploying again on the
+    same set of nodes (see the cleanup sketch below).
+  - Similarly, there may be GPU memory left in use after killing the `srun`
+    jobs. After cleaning up any leftover shared memory files as described
+    above, the GPU memory should gradually be freed. You can run `watch nvidia-smi`
+    to check on this behavior. If you don't free the GPU memory before the
+    next deployment, you may get a CUDA OOM error while loading the model.
+  - This issue is mentioned in the relevant TRT-LLM blog post
+    [here](https://github.com/NVIDIA/TensorRT-LLM/blob/6021a439ab9c29f4c46f721eeb59f6b992c425ea/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md#miscellaneous).
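+
+If you hit the leftover shared memory issue above, a cleanup pass like the
+following can help (a sketch, assuming your `salloc` allocation is still active
+so `srun` can reach every node, and that the container shares the host's
+`/dev/shm`):
+```bash
+# Remove leftover MoE shared-memory files on every node in the allocation.
+srun --nodes="${SLURM_JOB_NUM_NODES}" --ntasks-per-node=1 \
+  bash -c 'rm -f /dev/shm/moe_*'
+
+# Watch GPU memory being released before redeploying.
+watch nvidia-smi
+```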