Conversation

@wenscarl (Contributor) commented Aug 29, 2025

Purpose

Fixes #22916

Test Plan

Accuracy:

VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND="latency" \
/home/shuw/.local/bin/lm_eval --model vllm \
  --model_args pretrained=nvidia/DeepSeek-R1-FP4,quantization=modelopt_fp4,data_parallel_size=4,enable_expert_parallel=True,tensor_parallel_size=1,max_model_len=2048 \
  --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto

Perf:

VLLM_ALL2ALL_BACKEND="allgather_reducescatter" \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
VLLM_USE_STANDALONE_COMPILE=0 \
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND="latency" \
  /home/shuw/.local/bin/vllm serve nvidia/DeepSeek-R1-FP4 \
    --quantization="modelopt_fp4" \
    --trust-remote-code \
    --max-model-len=2048 \
    --block-size=128 \
    --max-num-seqs=256 \
    --enable-expert-parallel \
    --gpu_memory_utilization=0.8 \
    --tensor-parallel-size 1 \
    --data-parallel-size 4
    
 python benchmarks/benchmark_serving.py \
  --model nvidia/DeepSeek-R1-FP4 \
  --dataset-name random \
  --ignore-eos \
  --num-prompts 256 \
  --max-concurrency 256 \
  --random-input-len 128 \
  --random-output-len 1024 
    
vs. the naive baseline, i.e. the same commands with:
 VLLM_ALL2ALL_BACKEND="naive" \
 ...

Test Results

Accuracy:

vllm (pretrained=nvidia/DeepSeek-R1-FP4,quantization=modelopt_fp4,data_parallel_size=4,enable_expert_parallel=True,tensor_parallel_size=1,enforce_eager=False,max_model_len=2048,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9401|±  |0.0065|
|     |       |strict-match    |     5|exact_match|↑  |0.9386|±  |0.0066|

Perf:

Allgather-ReduceScatter:
Successful requests:                     256       
Benchmark duration (s):                  120.41    
Total input tokens:                      32512     
Total generated tokens:                  262144    
Request throughput (req/s):              2.13      
Output token throughput (tok/s):         2177.05   
Total Token throughput (tok/s):          2447.06   
---------------Time to First Token----------------
Mean TTFT (ms):                          1598.62   
Median TTFT (ms):                        1599.74   
P99 TTFT (ms):                           1628.07   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          116.10    
Median TPOT (ms):                        116.10    
P99 TPOT (ms):                           116.11    
---------------Inter-token Latency----------------
Mean ITL (ms):                           116.10    
Median ITL (ms):                         115.87    
P99 ITL (ms):                            124.38    

Naive:
Successful requests:                     256       
Benchmark duration (s):                  236.75    
Total input tokens:                      32512     
Total generated tokens:                  262144    
Request throughput (req/s):              1.08      
Output token throughput (tok/s):         1107.26   
Total Token throughput (tok/s):          1244.58   
---------------Time to First Token----------------
Mean TTFT (ms):                          2660.87   
Median TTFT (ms):                        2665.06   
P99 TTFT (ms):                           2692.93   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          228.78    
Median TPOT (ms):                        228.78    
P99 TPOT (ms):                           228.79    
---------------Inter-token Latency----------------
Mean ITL (ms):                           228.78    
Median ITL (ms):                         224.91    
P99 ITL (ms):                            271.71    


Signed-off-by: Shu Wang. <shuw@nvidia.com>
Comment on lines +1035 to +1036
# - "allgather_reducescatter": all2all implementation based on allgather and
# reducescatter
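
For context, here is a minimal PyTorch sketch of the dispatch/combine pattern this backend describes (illustrative only; the function and tensor names are hypothetical, and this is not vLLM's implementation):

import torch
import torch.distributed as dist

def moe_allgather_reducescatter(
    hidden: torch.Tensor,  # [tokens_per_rank, hidden_dim], this rank's tokens
    local_experts_fn,      # applies this rank's experts to the gathered tokens
) -> torch.Tensor:
    world = dist.get_world_size()

    # Dispatch: all-gather so every rank sees all tokens and can run the
    # subset routed to its local experts.
    gathered = hidden.new_empty((world * hidden.shape[0], hidden.shape[1]))
    dist.all_gather_into_tensor(gathered, hidden)

    # Partial outputs: this rank's experts' weighted contribution for every
    # token, zeros for tokens that selected none of its experts.
    partial = local_experts_fn(gathered)

    # Combine: reduce-scatter sums the partials across ranks and returns
    # each rank only its own tokens.
    out = torch.empty_like(hidden)
    dist.reduce_scatter_tensor(out, partial)
    return out

Because both collectives have static shapes and a fixed communication pattern, this approach also composes well with CUDA graph capture (see the discussion at the end of the thread).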

Collaborator:
@tlrmchlsmth wdyt if we turn this on as the default?

@varun-sundar-rabindranath (Contributor) commented Sep 9, 2025

I think this is a better default.
[edit] I'd recommend making this switch in a followup PR with some testing

Member:
yeah, we should make this the default

@varun-sundar-rabindranath (Contributor) left a comment:
Nice and clean change. Thanks @wenscarl

@tlrmchlsmth (Member) left a comment:
Don't we need to add a prepare/finalize implementation that uses this backend?

@wenscarl (Contributor, Author) replied:

> Don't we need to add a prepare/finalize implementation that uses this backend?

Not at this moment, since the trtllm-moe kernel's API is quite different from what the modular kernel supports.
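
For readers following along, a rough sketch of the kind of prepare/finalize split being discussed (hypothetical and heavily simplified; not vLLM's actual modular-kernel API):

import torch
from abc import ABC, abstractmethod

class PrepareAndFinalize(ABC):
    """Hypothetical interface: dispatch tokens before expert compute,
    combine partial outputs afterwards."""

    @abstractmethod
    def prepare(self, hidden: torch.Tensor, topk_ids: torch.Tensor) -> torch.Tensor:
        """Route this rank's tokens to the ranks owning their experts."""

    @abstractmethod
    def finalize(self, expert_out: torch.Tensor) -> torch.Tensor:
        """Sum partial expert outputs and return this rank's own tokens."""

If, as the comment above suggests, the trtllm-moe kernel exposes a fused end-to-end API, it would not slot cleanly into a separate prepare/finalize pair like this.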

@mgoin mgoin added ready ONLY add when PR is ready to merge/full CI is needed performance Performance-related issues moe labels Sep 11, 2025
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

@tlrmchlsmth (Member) left a comment:
I didn't realize how this was hooked up. Tried it and LGTM!

I also switched the default All2All backend to this one.

@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) September 11, 2025 21:31

@simon-mo (Collaborator):

@tlrmchlsmth the test failures should be unrelated? Can you confirm?

@tlrmchlsmth (Member):

The failures might actually be related, as we changed the default All2All backend. I'm on my phone, so I can't read the logs well right now.

tlrmchlsmth and others added 2 commits September 12, 2025 16:02
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

@nvpohanh (Contributor):

If we can confirm that the performance is good, could we enable it by default so that users do not need to add the VLLM_ALL2ALL_BACKEND="allgather_reducescatter" env var?

cc @weireweire

@tlrmchlsmth (Member) commented Sep 15, 2025

> If we can confirm that the performance is good, could we enable it by default so that users do not need to add the VLLM_ALL2ALL_BACKEND="allgather_reducescatter" env var?
>
> cc @weireweire

Yes, I had this enabled by default but backed out that change in case it was breaking the CI. The distributed tests are still red, though, so I'm not sure what the problem is now. I'll try to land this PR with the default set to allgather_reducescatter.

mergify bot commented Sep 17, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @wenscarl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 17, 2025

@tlrmchlsmth (Member):

@wenscarl Could the failure be related? The distributed tests are green on a very recent nightly https://buildkite.com/vllm/ci/builds/30655#01994662-1fcb-4cd8-aba7-ee88e6f8608a

Signed-off-by: Shu Wang <shuw@nvidia.com>
auto-merge was automatically disabled September 18, 2025 13:30

Head branch was pushed to by a user without write access

@wenscarl (Contributor, Author) replied:

> @wenscarl Could the failure be related? The distributed tests are green on a very recent nightly https://buildkite.com/vllm/ci/builds/30655#01994662-1fcb-4cd8-aba7-ee88e6f8608a

I didn't see any significant failures. Did I miss anything?

@mergify mergify bot removed the needs-rebase label Sep 18, 2025
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) September 18, 2025 15:46

@tlrmchlsmth (Member):

🤞 that the distributed tests are green

@tlrmchlsmth tlrmchlsmth added this to the v0.11.0 milestone Sep 18, 2025
@tlrmchlsmth tlrmchlsmth merged commit 2ea50e9 into vllm-project:main Sep 18, 2025
49 checks passed

Commits referencing this pull request were later pushed to debroy-rh/vllm (Sep 19, 2025), FeiDaLI/vllm (Sep 25, 2025), ROCm/vllm (Sep 25, 2025), xuebwang-amd/vllm (Oct 10 and Oct 24, 2025), Tandemn-Labs/vllm (Oct 11, 2025), and lywa1998/vllm (Oct 20, 2025).

@zejunchen-zejun (Contributor):

Hi @wenscarl,
Wonderful work! May I know whether this backend is compatible with CUDA graphs? The dispatch (all-gather) and combine (reduce-scatter) can be captured by a CUDA graph, right?
Thank you!

@nvpohanh (Contributor):

> Hi @wenscarl, Wonderful work! May I know whether this backend is compatible with CUDA graphs? The dispatch (all-gather) and combine (reduce-scatter) can be captured by a CUDA graph, right? Thank you!

We have been running with CUDA graphs and didn't see any issues with allgather_reducescatter.
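
For illustration, NCCL collectives such as all-gather can be captured into CUDA graphs in PyTorch. A minimal standalone sketch (hypothetical names, not vLLM code; assumes torch.distributed is already initialized with the NCCL backend):

import torch
import torch.distributed as dist

def capture_all_gather(inp: torch.Tensor):
    world = dist.get_world_size()
    gathered = inp.new_empty((world * inp.shape[0],) + tuple(inp.shape[1:]))

    # CUDA graphs require a warm-up run on a side stream before capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        dist.all_gather_into_tensor(gathered, inp)
    torch.cuda.current_stream().wait_stream(s)

    # Capture the collective; replaying the graph re-runs it on the same
    # input/output buffers.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        dist.all_gather_into_tensor(gathered, inp)
    return g, gathered  # copy new data into `inp`, then call g.replay()

The same pattern applies to reduce_scatter_tensor for the combine step; both collectives use fixed buffer shapes across decode iterations, which is what graph capture requires.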


Labels

moe, performance, ready

Development

Successfully merging this pull request may close these issues.

[Feature][Kernel][B200]: FI MoE LL does not use allgatherv and reduce-scatterv for dispatch and combine

7 participants