
Conversation

Contributor

@vllmellm vllmellm commented Aug 21, 2025

Purpose

Integrate the AITER custom all-reduce into the CUDA communicator, which boosts model performance.
This PR was tested against commit 5ee37dc of the aiter package.

Benchmark Results

deepseek-ai/DeepSeek-V3 tp8

| Metric | With AITER CustomAllreduce | Without AITER CustomAllreduce |
|---|---|---|
| Request Throughput (req/s) | 1.82 | 1.73 |
| Output Token Thpt (tok/s) | 494.16 | 484.04 |
| Total Token Thpt (tok/s) | 2311.82 | 2212.63 |
| Mean TTFT (ms) | 50.73 | 381.81 |
| Median TTFT (ms) | 48.39 | 389.49 |
| P99 TTFT (ms) | 64.95 | 709.10 |
| Mean TPOT (ms) | 22.98 | 117.79 |
| Median TPOT (ms) | 22.99 | 101.26 |
| P99 TPOT (ms) | 25.47 | 434.76 |
| Mean ITL (ms) | 22.92 | 25.73 |
| Median ITL (ms) | 22.98 | 23.73 |
| P99 ITL (ms) | 25.43 | 151.28 |

meta-llama/Llama-4-Scout-17B-16E-Instruct tp8

| Metric | With AITER CustomAllreduce | Without AITER CustomAllreduce |
|---|---|---|
| Request Throughput (req/s) | 3.05 | 2.98 |
| Output Token Thpt (tok/s) | 913.81 | 859.67 |
| Total Token Thpt (tok/s) | 3935.91 | 3814.10 |
| Mean TTFT (ms) | 122.87 | 150.23 |
| Median TTFT (ms) | 108.24 | 148.77 |
| P99 TTFT (ms) | 190.81 | 227.81 |
| Mean TPOT (ms) | 18.20 | 20.51 |
| Median TPOT (ms) | 16.82 | 18.56 |
| P99 TPOT (ms) | 27.95 | 34.66 |
| Mean ITL (ms) | 15.93 | 16.87 |
| Median ITL (ms) | 12.88 | 13.30 |
| P99 ITL (ms) | 79.62 | 90.90 |

Qwen/Qwen3-235B-A22B-FP8 tp4

| Metric | With AITER CustomAllreduce | Without AITER CustomAllreduce |
|---|---|---|
| Request Throughput (req/s) | 1.69 | 1.61 |
| Output Token Thpt (tok/s) | 1079.88 | 1116.64 |
| Total Token Thpt (tok/s) | 2761.42 | 2723.95 |
| Mean TTFT (ms) | 543.23 | 590.96 |
| Median TTFT (ms) | 550.70 | 600.72 |
| P99 TTFT (ms) | 990.83 | 1012.52 |
| Mean TPOT (ms) | 48.90 | 36.36 |
| Median TPOT (ms) | 27.36 | 28.08 |
| P99 TPOT (ms) | 428.00 | 143.51 |
| Mean ITL (ms) | 27.16 | 28.30 |
| Median ITL (ms) | 23.75 | 26.43 |
| P99 ITL (ms) | 131.04 | 29.65 |

Benchmark Setting

```bash
python vllm/benchmarks/benchmark_serving.py --backend vllm --model "$model_name" --dataset-name random --num-prompts 50 --request-rate 10 --random-input-len 1000 --random-output-len 1000
```

Test Plan

Test the models affected by this change using lm_eval on the gsm8k dataset.

Environment Setting

Step 1: run vllm serve

```bash
VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_CUSTOM_ALL_REDUCE=1 SAFETENSORS_FAST_GPU=1 \
vllm serve $MODEL_NAME --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' --trust-remote-code --max-model-len 32768 -tp 8 --block-size 1 --swap-space 16 --distributed-executor-backend mp
```

Step 2: run lm_eval

```bash
lm_eval --model local-completions --tasks gsm8k --model_args model=$MODEL_NAME,base_url=http://localhost:8000/v1/completions --trust_remote_code --num_fewshot 5 --batch_size 100
```

Test Results

deepseek-ai/DeepSeek-V3 tp8

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9477 | ± 0.0061 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.9469 | ± 0.0062 |

zejunchen-zejun and others added 3 commits August 20, 2025 05:00
control it by the env flag VLLM_ROCM_USE_AITER_CUSTOM_ALL_REDUCE
(default: True)

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
@mergify mergify bot added the rocm Related to AMD ROCm label Aug 21, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@vllmellm vllmellm marked this pull request as ready for review August 21, 2025 11:20
Collaborator

@ProExpertProg ProExpertProg left a comment


Can you refactor this to extend the current dispatching instead of using conditional imports?
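
For context, a minimal sketch of what "extending the dispatching" could look like, as opposed to branching on imports inside the using module; the registry and all names below are hypothetical, not the actual vLLM mechanism:

```python
# Hypothetical registry-based dispatch: backends register their allreduce
# class once, and call sites resolve through a single entry point instead of
# scattering conditional imports.
_ALLREDUCE_IMPLS: dict[str, type] = {}


def register_allreduce_impl(name: str):
    """Decorator that records an allreduce implementation under a key."""
    def wrap(cls: type) -> type:
        _ALLREDUCE_IMPLS[name] = cls
        return cls
    return wrap


def resolve_allreduce_impl(name: str) -> type:
    """Look up the implementation chosen for the current platform/config."""
    return _ALLREDUCE_IMPLS[name]


@register_allreduce_impl("cuda")
class DefaultCustomAllreduce:
    ...


@register_allreduce_impl("rocm-aiter")
class AiterCustomAllreduce:
    ...
```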

@vllmellm
Contributor Author

> Can you refactor this to extend the current dispatching instead of using conditional imports?

@ProExpertProg The requested modification has been applied.

…y and fix pre-commit error

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
@ilmarkov
Contributor

@vllmellm Could you also benchmark against QuickReduce in vLLM? It is another alternative to custom allreduce for ROCm with good speedup numbers. It can be enabled by this env variable.

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Contributor

@SageMoore SageMoore left a comment


Looks reasonable. Just nits.

"""Dispatch the custom allreduce implementation based on the platform."""
if is_rocm_aiter_custom_allreduce_enabled():
from aiter.dist.custom_all_reduce import CustomAllreduce
logger.info("Using aiter.dist.custom_all_reduce for ROCm platform")
Contributor


Nit: info_once
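
i.e., using the once-per-process logging variant the reviewer refers to (assuming the usual vLLM logger helper signature):

```python
logger.info_once("Using aiter.dist.custom_all_reduce for ROCm platform")
```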

```diff
 )

-self.ca_comm: Optional[CustomAllreduce] = None
+self.ca_comm: Optional[CustomAllreduce] = None  # type: ignore
```
Contributor


Nit: I think if you add a __call__ method to the CustomAllreduceProtocol, you can get rid of the # type: ignore.
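
For readers unfamiliar with the approach, a rough sketch of a structural Protocol for this attribute; the members shown are illustrative, not the actual vLLM definitions:

```python
from typing import Optional, Protocol

import torch


class CustomAllreduceProtocol(Protocol):
    """Structural type both vLLM's and aiter's CustomAllreduce would match."""

    disabled: bool

    def custom_all_reduce(self, inp: torch.Tensor) -> Optional[torch.Tensor]:
        ...

    def close(self) -> None:
        ...


# The attribute can then be annotated with the protocol instead of a
# concrete class, avoiding the "# type: ignore":
ca_comm: Optional[CustomAllreduceProtocol] = None
```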

Contributor Author


@SageMoore Unfortunately, this solution didn't work out on its own. To get rid of # type: ignore, the type had to be changed to CustomAllreduceProtocol, and the additional methods and attributes required by the CustomAllreduce class had to be implemented in the protocol; with this, mypy now passes.

@ilmarkov
Contributor

@vllmellm @SageMoore I would suggest adding a new aiter_comm, not as a complete replacement of the current custom allreduce but as an alternative to it, enabled/disabled by an env variable or config.

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
@vllmellm
Contributor Author

> @vllmellm @SageMoore I would suggest adding a new aiter_comm, not as a complete replacement of the current custom allreduce but as an alternative to it, enabled/disabled by an env variable or config.

@ilmarkov Thank you for the suggestion. Indeed, having a separate module such as aiter_comm is a cleaner solution. However, we are bound by the implementation in the aiter package, where CustomAllreduce is an exact duplicate of the one in the vLLM framework with some modifications for enhancements on the ROCm platform. Thus, the intention is to replace this module rather than extend the current one in the vLLM framework. It would be difficult to extend on top of vLLM's existing module unless we moved the enhancement logic from the aiter package into vLLM, which doesn't sound optimal in terms of maintenance and code readability.

Based on your comment, I have added back a separate environment flag for enabling/disabling this.

@ilmarkov
Contributor

@vllmellm Sorry if I shared the idea unclearly. I am suggesting adding aiter_comm independently of the existing ca_comm, not extending the CustomAllreduce class: just have a new comm next to ca_comm in CudaCommunicator, similar to the existing pynccl_comm and qr_comm.
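
A rough sketch of the suggested layout; the attribute names follow the discussion, but the class body is hypothetical, not the actual CudaCommunicator:

```python
from typing import Optional

import torch


class CudaCommunicator:
    """Illustrative only: a dedicated aiter comm next to the existing comms."""

    def __init__(self) -> None:
        self.pynccl_comm = None  # existing NCCL comm
        self.qr_comm = None      # existing QuickReduce comm
        self.ca_comm = None      # existing custom-allreduce comm
        self.aiter_comm = None   # new: aiter custom allreduce, env-gated

    def all_reduce(self, input_: torch.Tensor) -> torch.Tensor:
        # Try the aiter comm first, then the stock custom allreduce; each
        # comm returns None when it cannot handle the input.
        for comm in (self.aiter_comm, self.ca_comm):
            if comm is not None and not comm.disabled:
                out = comm.custom_all_reduce(input_)
                if out is not None:
                    return out
        raise NotImplementedError("remaining fallback paths omitted")
```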

@vllmellm
Contributor Author

vllmellm commented Aug 29, 2025

> @vllmellm Sorry if I shared the idea unclearly. I am suggesting adding aiter_comm independently of the existing ca_comm, not extending the CustomAllreduce class: just have a new comm next to ca_comm in CudaCommunicator, similar to the existing pynccl_comm and qr_comm.

@ilmarkov I see what you mean; thank you for elaborating. I understand your point of view; however, the CustomAllreduce from aiter is literally a duplicate of the one in vLLM, and wherever ca_comm is referenced we would need to check whether to use aiter and then use aiter_comm instead, which would clutter the code with if/else statements.

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
@ilmarkov
Contributor

ilmarkov commented Sep 8, 2025

@vllmellm Isn't there a scenario where we could use aiter_comm for a certain range of input sizes and ca_comm for the rest, e.g., for better performance?
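
For illustration, hypothetical size-based routing along those lines; the threshold, constant, and attribute access are made up for the example:

```python
import torch

AITER_MAX_BYTES = 8 * 1024 * 1024  # assumed tuning knob, not a real constant


def pick_allreduce_comm(communicator, input_: torch.Tensor):
    """Route small tensors to aiter_comm and the rest to ca_comm."""
    nbytes = input_.numel() * input_.element_size()
    if communicator.aiter_comm is not None and nbytes <= AITER_MAX_BYTES:
        return communicator.aiter_comm
    return communicator.ca_comm
```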

@tjtanaa
Contributor

tjtanaa commented Sep 8, 2025

@ilmarkov Could you guide us on how to evaluate the speed of QuickReduce while also taking into account the quantization error it introduces? How should we set the quantization level (VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=[NONE|FP|INT8|INT6|INT4])?
Or should we only compare against VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=FP?

