Conversation


@bnellnm bnellnm commented Sep 9, 2025

Purpose

Fix shared expert overlap with the naive all2all backend. Fixes #24530. Previously, the shared experts were run on the dispatched hidden states instead of the original hidden states.
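To illustrate the bug, here is a minimal sketch of the ordering issue. The function and argument names are hypothetical stand-ins, not the actual vLLM modular-kernel API; the point is only which tensor the shared experts consume:

```python
import torch

def moe_forward_buggy(hidden_states, dispatch, routed_experts, shared_experts):
    # BUG: dispatch runs first, so the shared experts see the
    # dispatched (permuted) tokens instead of the original ones.
    dispatched = dispatch(hidden_states)
    shared_out = shared_experts(dispatched)      # wrong input
    routed_out = routed_experts(dispatched)
    return routed_out + shared_out

def moe_forward_fixed(hidden_states, dispatch, routed_experts, shared_experts):
    # FIX: shared experts consume the original hidden states;
    # only the routed experts operate on the dispatched tokens.
    shared_out = shared_experts(hidden_states)   # original input
    dispatched = dispatch(hidden_states)
    routed_out = routed_experts(dispatched)
    return routed_out + shared_out
```

With the naive backend, dispatch reorders tokens across ranks, so feeding its output to the shared experts silently corrupts their input, which matches the accuracy collapse reported in #24530.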

Test Plan

Ran DeepSeek with VLLM_ALL2ALL_BACKEND=naive

Test Result

Works: DeepSeek produces correct output with the naive backend.

cc @yewentao256 , @tlrmchlsmth , @simon-mo

Signed-off-by: Bill Nell <bnell@redhat.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug concerning shared expert overlap when using the naive all-to-all backend. The core of the fix involves reordering the dispatch operation to execute after the shared expert computation, which correctly ensures that shared experts operate on the original hidden_states. Additionally, a do_combine flag has been introduced to the reduce_output function, allowing the combine operation to be skipped for shared expert outputs, which is appropriate since they are not part of the dispatch-combine communication flow. The changes are logical, well-implemented, and include a new assertion for improved robustness.
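The `do_combine` behavior the review describes can be sketched as follows. This is a hypothetical simplification (the dummy `all2all` object and signature are assumptions, not the real vLLM interface), showing why shared-expert outputs skip the combine step:

```python
import torch

def reduce_output(all2all, output: torch.Tensor, do_combine: bool = True) -> torch.Tensor:
    # Routed-expert outputs were produced from dispatched tokens, so
    # they must go through the all-to-all combine to return to their
    # home ranks. Shared-expert outputs never left the rank (they were
    # computed on the original hidden states), so combining them would
    # be incorrect; do_combine=False skips that path.
    if do_combine:
        output = all2all.combine(output)
    return output
```

In the fixed flow, `reduce_output(..., do_combine=True)` is used for the routed-expert result and `do_combine=False` for the shared-expert result, which are then summed.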

@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 9, 2025

@yewentao256 yewentao256 left a comment


|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|| 0.96|±  |0.0197|
|     |       |strict-match    |     5|exact_match|| 0.96|±  |0.0197|

Verified that this fixes the issue, thanks for the work!

@yewentao256 yewentao256 enabled auto-merge (squash) September 9, 2025 23:55
@simon-mo simon-mo merged commit b23fb78 into vllm-project:main Sep 10, 2025
51 of 53 checks passed
@yewentao256 yewentao256 deleted the fix-24530 branch September 10, 2025 14:26
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025


Development

Successfully merging this pull request may close these issues.

[Bug]: R1 accuracy 0 issue when all 2 all kernel is "naive"
