Commit 1327e3b

feat: Add performance sweeps for DeepSeek R1 on GB200 (#2387)
1 parent 7e4eec2 commit 1327e3b

File tree

14 files changed: +2497 -0 lines changed

components/backends/trtllm/README.md

Lines changed: 5 additions & 0 deletions
@@ -43,6 +43,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 - [Client](#client)
 - [Benchmarking](#benchmarking)
 - [Multimodal Support](#multimodal-support)
+- [Performance Sweep](#performance-sweep)
 
 ## Feature Support Matrix
 
@@ -420,3 +421,7 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
 ### Supported Multimodal Models
 
 Multimodal models listed [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/inputs/utils.py#L221) are supported by Dynamo.
+
+## Performance Sweep
+
+For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](./performance_sweeps/README.md). That guide covers recommended benchmarking setups, usage of the provided scripts, and best practices for evaluating system performance.
components/backends/trtllm/performance_sweeps/README.md

Lines changed: 153 additions & 0 deletions
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# TensorRT-LLM Benchmark Scripts for DeepSeek R1 model

This directory contains scripts for benchmarking TensorRT-LLM performance with Dynamo using the SLURM job scheduler.

## ⚠️ DISCLAIMER

**These scripts are currently not QA'ed and are provided for demonstration purposes only.**

Please note that:

- These scripts have not undergone formal quality assurance testing
- They were executed on GB200 systems
- They are intended for demonstration and educational purposes
- Use at your own risk in production environments
- Always review and test the scripts thoroughly before running them in your specific environment
- We are actively working on refining the configuration sweeps

## Scripts Overview

### Core Scripts

1. `submit.sh` - Main entry point for submitting benchmark jobs for disaggregated configurations, including WideEP optimization for DEP >= 16.
2. `submit_agg.sh` - Main entry point for submitting benchmark jobs for aggregated configurations.
3. `post_process.py` - Scans the genai-perf results and produces a JSON file with an entry for each configuration point.
4. `plot_performance_comparison.py` - Takes the JSON result files for the disaggregated and/or aggregated configuration sweeps and plots a Pareto line for easier visualization.

For finer-grained details on how to launch TRTLLM backend workers with DeepSeek R1 on a GB200 SLURM cluster, please refer to [multinode-examples.md](../multinode/multinode-examples.md). This guide shares similar assumptions with the multinode examples guide.

## Usage

### Prerequisites

Before running the scripts, ensure you have:

1. Access to a SLURM cluster
2. A container image of Dynamo with TensorRT-LLM, built using the instructions from [here](https://github.com/ai-dynamo/dynamo/tree/main/components/backends/trtllm#build-docker)
3. Model files accessible on the cluster
4. The required environment variables set

### Setup

Within the login node of the cluster, set the following variables:

```bash
# Set partition manually based on your SLURM cluster's partition names
export SLURM_PARTITION=""

# Set account manually if this command doesn't work on your cluster
export SLURM_ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"

# Set a job name for your benchmarking runs
export SLURM_JOB_NAME=""

# NOTE: IMAGE must be set manually for now.
# To build an image, see the steps here:
# https://github.com/ai-dynamo/dynamo/tree/main/components/backends/trtllm#build-docker
export IMAGE="<dynamo_trtllm_image>"

# NOTE: In general, DeepSeek R1 is very large, so it is recommended to
# pre-download the model weights, save them in some shared location
# (NFS storage, HF_CACHE, etc.), and modify the `--model-path` below
# to reuse the pre-downloaded weights instead.
#
# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights:
# https://huggingface.co/nvidia/DeepSeek-R1-FP4
#
# On Hopper systems, FP4 isn't supported, so you'll need to use the default weights:
# https://huggingface.co/deepseek-ai/DeepSeek-R1
export MODEL_PATH="<path_to_model_weights>"

# The name the model will be served/queried under, matching what's
# returned by the /v1/models endpoint.
#
# By default this is inferred from MODEL_PATH, but when using locally downloaded
# model weights, it can be nice to have explicit control over the name.
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
```
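
Since the weights only need to be fetched once, it can save time to download them to shared storage before queueing any jobs. A minimal sketch using the Hugging Face CLI (the `/shared/models` path is an assumption for this example; substitute any location visible to all compute nodes):

```bash
# Hypothetical pre-download of the FP4 weights to shared storage so that
# MODEL_PATH can point at them (the path below is an assumption).
pip install -U "huggingface_hub[cli]"
huggingface-cli download nvidia/DeepSeek-R1-FP4 --local-dir /shared/models/DeepSeek-R1-FP4
export MODEL_PATH=/shared/models/DeepSeek-R1-FP4
```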

## Launching benchmarking sweeps for different configurations

### Aggregated

```bash
# Queues the SLURM jobs for aggregated configurations for DeepSeek R1.
./submit_agg.sh
```

### Disaggregated (Includes WideEP) - MTP off

```bash
# Queues the SLURM jobs for disaggregated configurations for DeepSeek R1 without MTP.
./submit.sh mtp=off all
```

### Disaggregated (Includes WideEP) - MTP on

```bash
# Queues the SLURM jobs for disaggregated configurations for DeepSeek R1 with MTP.
./submit.sh mtp=on all
```
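
Each invocation queues one SLURM job per configuration point, so it can be useful to watch the queue while the sweep drains. A small sketch using standard SLURM tooling (assuming `SLURM_JOB_NAME` is still set from the Setup step):

```bash
# List your queued and running benchmark jobs with their states and elapsed time.
squeue -u "$(whoami)" --name="${SLURM_JOB_NAME}" --format="%.10i %.24j %.8T %.10M %R"
```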

## Post-Processing Results

The above jobs use the genai-perf tool to benchmark each configuration point across different concurrency values. The results are stored in `dynamo_disagg-bm-8150-1024/<config-setup>/genai_perf_artifacts` and `dynamo_agg-bm-8150-1024/<config-setup>/genai_perf_artifacts` for the disaggregated and aggregated sweeps, respectively.

After your benchmarking jobs have completed, you can use the `post_process.py` script to aggregate and summarize the results from the generated genai_perf_artifacts.

To run the post-processing script, use:

### Aggregated

```bash
python3 post_process.py dynamo_agg-bm-8150-1024 --output-file agg_result.json
```

### Disaggregated

```bash
python3 post_process.py dynamo_disagg-bm-8150-1024 --output-file disagg_result.json
```
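
Before plotting, it can be worth a quick sanity check that the JSON files are non-empty and well formed. A minimal sketch using only the Python standard library (the schema is whatever `post_process.py` emits, so field names may vary):

```bash
# Pretty-print the start of the summarized results for a quick sanity check.
python3 -m json.tool agg_result.json | head -n 40
```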

## Plotting Performance

You can now use the `plot_performance_comparison.py` script as shown below to visualize the performance.

```bash
python3 plot_performance_comparison.py dynamo_agg-bm-8150-1024/agg_result.json dynamo_disagg-bm-8150-1024/disagg_result.json -o performance_plot.png
```

This script produces a scatter plot of all the configuration points at each concurrency, with output throughput per GPU plotted against output throughput per user. It also includes the roofline Pareto line for both the aggregated and disaggregated setups.

Refer to [Beyond the Buzz: A Pragmatic Take on Inference Disaggregation](https://arxiv.org/html/2506.05508v1) to learn how to interpret these plots.

## Known Issues

- Some jobs may time out if genai-perf requires more time to complete all concurrency levels.
- Workers may encounter out-of-memory (OOM) errors during inference, especially with larger configurations.
- Configurations affected by these issues will result in missing data points on the performance plot.
Lines changed: 221 additions & 0 deletions
```bash
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
MULTI_ROUND="${MULTI_ROUND:-8}"

# Set MOUNT_DIR (defaults to the current working directory).
MOUNT_DIR="${MOUNT_DIR:-${PWD}}"
CONTAINER_NAME=disaggr-test

STREAMING=true
CTX_GPU_FRAC=0.75
CACHE_TRANSCEIVER_MAX_NUM_TOKENS=8448

# Positional arguments (supplied by the submit scripts).
num_ctx_servers=$1
ctx_tp_size=$2
ctx_batch_size=$3
ctx_max_num_tokens=$4
ctx_enable_attention_dp=$5
num_gen_servers=$6
gen_tp_size=$7
gen_batch_size=$8
gen_max_num_tokens=$9
gen_enable_attention_dp=${10}
gen_gpu_memory_fraction=${11}
eplb_num_slots=${12}
mtp_size=${13}
concurrency_list=${14}
gen_nodes=${15}
kind=${16}
model_path=${17}
served_model_name=${18}
image=${19}
isl=${20}
osl=${21}

# Allow 203 tokens of headroom beyond the raw input/output lengths.
ctx_max_seq_len=$((${isl} + 203))
gen_max_seq_len=$((${isl} + ${osl} + 203))

WORK_DIR=${MOUNT_DIR}
LOG_DIR=$WORK_DIR/${kind}-bm-${isl}-${osl}
SCRIPTS_DIR=${WORK_DIR}/
set_clock_cmd="bash ${SCRIPTS_DIR}/set_clock.sh"
mkdir -p ${LOG_DIR}
echo "trying to submit job"

# Sub-directory name uses "dep" when attention DP is enabled on the
# generation side; it is switched to "tep" below when DP is disabled.
sub_dir=${LOG_DIR}/ctx${num_ctx_servers}_gen${num_gen_servers}_dep${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}

echo "concurrency_list: ${concurrency_list}"

ctx_gpus=$((num_ctx_servers * ctx_tp_size))
gen_gpus=$((num_gen_servers * gen_tp_size))

echo "enable_attention_dp: ${ctx_enable_attention_dp}, ${gen_enable_attention_dp}, gpu_memory_fraction: ${gen_gpu_memory_fraction}"

# PDL is only enabled when attention DP is off on the generation side.
enable_pdl=false
if [ "${gen_enable_attention_dp}" = "false" ]; then
    enable_pdl=true
    echo "enable_pdl: ${enable_pdl}"
    sub_dir=${LOG_DIR}/ctx${num_ctx_servers}_gen${num_gen_servers}_tep${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}
fi

full_logdir=${sub_dir}
artifacts_dir=${full_logdir}/genai_perf_artifacts
mkdir -p ${artifacts_dir}

# Set clock on all allocated nodes.
srun ${set_clock_cmd}

container_mounts=${MOUNT_DIR}:${MOUNT_DIR},${model_path}:${model_path}

# Start the container: running a trivial echo under srun instantiates the
# named container on the allocated nodes so later steps can reuse it.
srun -l --container-image=${image} \
    --container-name=${CONTAINER_NAME} \
    --container-mounts=${container_mounts} \
    --mpi=pmix \
    echo "Container up."

# Generate the YAML config files (config.yaml plus the per-worker configs
# consumed below).
srun -l --container-name=${CONTAINER_NAME} \
    --container-mounts=${container_mounts} \
    --mpi=pmix --overlap \
    -n 1 -N 1 \
    python3 ${SCRIPTS_DIR}/scripts/gen_yaml.py --config ${full_logdir}/config.yaml \
        --model ${model_path} \
        --num_ctx_servers ${num_ctx_servers} \
        --ctx_tp_size ${ctx_tp_size} \
        --ctx_batch_size ${ctx_batch_size} \
        --ctx_max_num_tokens ${ctx_max_num_tokens} \
        --ctx_max_seq_len ${ctx_max_seq_len} \
        --ctx_free_gpu_memory_fraction ${CTX_GPU_FRAC} \
        --cache_transceiver_max_num_tokens ${CACHE_TRANSCEIVER_MAX_NUM_TOKENS} \
        --num_gen_servers ${num_gen_servers} \
        --gen_tp_size ${gen_tp_size} \
        --gen_batch_size ${gen_batch_size} \
        --gen_max_num_tokens ${gen_max_num_tokens} \
        --gen_max_seq_len ${gen_max_seq_len} \
        --gen_gpu_memory_fraction ${gen_gpu_memory_fraction} \
        --eplb_num_slots ${eplb_num_slots} \
        $(if [ "${gen_enable_attention_dp}" = "true" ]; then echo "--gen_enable_attention_dp"; fi) \
        $(if [ "${ctx_enable_attention_dp}" = "true" ]; then echo "--ctx_enable_attention_dp"; fi) \
        $(if [ "${mtp_size}" -gt 0 ]; then echo "--mtp_size ${mtp_size}"; fi)

echo "YAML file generated."

# nsys profiling output location; empty disables profiling.
nsys_on=""
# nsys_on=${full_logdir}

nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))

# The first node in the allocation hosts the frontend; etcd and NATS are
# expected at the head node's IP.
export HEAD_NODE="${nodes[0]}"
export HEAD_NODE_IP="$(hostname -i)"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"

# Create a temporary file to store PIDs of background processes so the
# EXIT trap can tear them down.
PID_FILE=$(mktemp)
trap 'cleanup_and_exit' EXIT

cleanup_and_exit() {
    if [ -f "$PID_FILE" ]; then
        echo "Cleaning up spawned processes..."
        while read -r pid; do
            if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
                echo "Sending TERM to process $pid"
                kill -TERM "$pid" 2>/dev/null
                sleep 2
                if kill -0 "$pid" 2>/dev/null; then
                    echo "Process $pid still running, sending KILL"
                    kill -KILL "$pid" 2>/dev/null
                fi
            fi
        done < "$PID_FILE"
        rm -f "$PID_FILE"
    fi
}

# Start the frontend server on the head node.
srun -l --container-name=${CONTAINER_NAME} \
    --container-mounts=${container_mounts} \
    --mpi=pmix --overlap -N 1 -n 1 \
    --oversubscribe \
    --container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE \
    -w ${nodes[0]} \
    bash ${SCRIPTS_DIR}/scripts/start_frontend.sh &> ${full_logdir}/output_server.log &
SERVER_PID=$!
echo "$SERVER_PID" >> "$PID_FILE"

# Wait for the server to start.
sleep 10

# Read the number of prefill workers from the generated instance config.
PREFILL_COUNT=$(grep 'prefill_count:' "${full_logdir}/instance_config.yaml" | awk '{print $2}')
if [ -z "$PREFILL_COUNT" ]; then
    echo "Error: Failed to extract prefill_count from instance_config.yaml"
    exit 1
fi
echo "Prefill Count: $PREFILL_COUNT"

# Start the prefill workers: one node (4 ranks) per worker, placed on the
# first PREFILL_COUNT nodes of the allocation.
prefill_pids=()
for ((i=1; i<=PREFILL_COUNT; i++)); do
    echo "Running Prefill Worker: ${i}"
    node_idx=$((i-1))
    echo "Running Prefill Nodes: ${nodes[node_idx]}"
    srun -l --container-name=${CONTAINER_NAME} \
        --container-mounts=${container_mounts} \
        --mpi=pmix --overlap -w ${nodes[node_idx]} \
        --oversubscribe \
        --ntasks 4 \
        --nodes 1 \
        bash ${SCRIPTS_DIR}/scripts/start_worker.sh ${full_logdir}/prefill_config.yaml "${enable_pdl}" ${ctx_gpus} ${nsys_on} ${served_model_name} ${model_path} 'prefill' &> ${full_logdir}/output_workers.log &
    prefill_pids+=($!)
    echo "$!" >> "$PID_FILE"
done

# Read the number of decode workers from the generated instance config.
DECODE_COUNT=$(grep 'decode_count:' "${full_logdir}/instance_config.yaml" | awk '{print $2}')
if [ -z "$DECODE_COUNT" ]; then
    echo "Error: Failed to extract decode_count from instance_config.yaml"
    exit 1
fi
echo "Decode Count: $DECODE_COUNT"

# Start the decode workers: each spans num_gen_nodes nodes, carved out of
# the allocation after the prefill nodes.
num_gen_nodes=$((gen_nodes/num_gen_servers))
decode_start_idx=$PREFILL_COUNT
for ((i=1; i<=DECODE_COUNT; i++)); do
    echo "Running Decode Worker: ${i}"
    decode_node_list=()
    for ((j=0; j<num_gen_nodes; j++)); do
        node_idx=$((decode_start_idx + (i-1)*num_gen_nodes + j))
        decode_node_list+=("${nodes[node_idx]}")
    done
    decode_nodes_csv=$(IFS=, ; echo "${decode_node_list[*]}")
    echo "Running Decode Nodes: ${decode_nodes_csv}"
    srun -l --container-name=${CONTAINER_NAME} \
        --container-mounts=${container_mounts} \
        --mpi=pmix \
        -w ${decode_nodes_csv} \
        --nodes ${num_gen_nodes} \
        --ntasks $gen_tp_size \
        --oversubscribe \
        --overlap \
        bash ${SCRIPTS_DIR}/scripts/start_worker.sh ${full_logdir}/decode_config.yaml "${enable_pdl}" ${ctx_gpus} ${nsys_on} ${served_model_name} ${model_path} 'decode' &> ${full_logdir}/output_workers.log &
    echo "$!" >> "$PID_FILE"
done

total_gpus=$((ctx_gpus + gen_gpus))

# Start the load generator (bench.sh drives genai-perf) on the head node;
# this srun is not backgrounded, so it blocks until the sweep completes.
srun -l --container-name=${CONTAINER_NAME} \
    --container-mounts=${container_mounts},${artifacts_dir}:${artifacts_dir} \
    --mpi=pmix --overlap -N 1 -n 1 \
    -w ${nodes[0]} \
    bash ${SCRIPTS_DIR}/scripts/bench.sh ${served_model_name} ${MULTI_ROUND} ${num_gen_servers} "${concurrency_list}" ${STREAMING} ${full_logdir} ${total_gpus} ${artifacts_dir} ${model_path} ${isl} ${osl} ${kind} > ${full_logdir}/bench.log 2>&1

# Wait for all background processes to complete
wait

# Cleanup will be handled by the EXIT trap
```
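
For reference, the script consumes 21 positional arguments in the order parsed at the top of the file. Below is a hypothetical invocation for a small disaggregated point; the script's real file name is not shown in this view, every value is illustrative only, and the submit scripts normally construct this call:

```bash
# Hypothetical invocation (illustrative values only; <launch_script>.sh is a
# placeholder for this file's actual name):
#   args 1-5:   num_ctx_servers ctx_tp_size ctx_batch_size ctx_max_num_tokens ctx_enable_attention_dp
#   args 6-10:  num_gen_servers gen_tp_size gen_batch_size gen_max_num_tokens gen_enable_attention_dp
#   args 11-15: gen_gpu_memory_fraction eplb_num_slots mtp_size concurrency_list gen_nodes
#   args 16-21: kind model_path served_model_name image isl osl
bash <launch_script>.sh \
    1 4 4 4608 true \
    1 8 256 256 true \
    0.8 0 0 "1 2 4 8" 2 \
    dynamo_disagg "$MODEL_PATH" "$SERVED_MODEL_NAME" "$IMAGE" 8150 1024
```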
