[XPU] Fix MOE DP accuracy issue on XPU #25465
Conversation
Signed-off-by: Fanli Lin <fanli.lin@intel.com>
Code Review
This pull request addresses an accuracy issue with Mixture-of-Experts (MoE) models on XPU devices when using data parallelism. The core of the fix is implementing the `dispatch` and `combine` communication primitives in `XpuCommunicator`, which are essential for expert parallelism; the changes correctly delegate these operations to the `all2all_manager`. Additionally, the PR makes the all2all backend on XPU robust by defaulting to the `naive` implementation, the only one currently supported, and warns the user if a different backend is configured. The modification to the data parallelism example script that makes `enable_expert_parallel` a configurable argument is also a good improvement for flexibility. The provided test results clearly demonstrate the effectiveness of the fix. The changes are well implemented and follow existing patterns in the codebase. Overall, this is a solid contribution to improving XPU support.
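The delegation described above likely mirrors the pattern used by vLLM's other device communicators. A minimal sketch, assuming `XpuCommunicator` derives from the common `DeviceCommunicatorBase` and holds an `all2all_manager` the way the CUDA communicator does (the exact diff and module path may differ):

```python
import torch

from vllm.distributed.device_communicators.base_device_communicator import (
    DeviceCommunicatorBase)


class XpuCommunicator(DeviceCommunicatorBase):
    # ... existing XPU collectives (all_reduce, gather, ...) omitted ...

    def dispatch(
        self, hidden_states: torch.Tensor, router_logits: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Scatter tokens and router logits to the ranks that own the
        # selected experts, via the configured all2all manager.
        assert self.all2all_manager is not None
        hidden_states, router_logits = self.all2all_manager.dispatch(
            hidden_states, router_logits)
        return hidden_states, router_logits

    def combine(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Gather the expert outputs back to their originating ranks.
        assert self.all2all_manager is not None
        return self.all2all_manager.combine(hidden_states)
```

With these methods in place, the MoE expert-parallel path can call the same communicator API on XPU as it does on other devices instead of silently skipping the exchange.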
Signed-off-by: Fanli Lin <fanli.lin@intel.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: gaojc <1055866782@qq.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
- Add `dispatch` and `combine` methods in `XpuCommunicator` to fix the MOE model accuracy issue on XPU
- Make `naive` the default `all2all_backend` on XPU
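The backend fallback could look roughly like the following. This is a sketch only: `resolve_xpu_all2all_backend` is a hypothetical helper name, and where the check actually lives (e.g. in the XPU platform's config validation) is an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_xpu_all2all_backend(configured_backend: str) -> str:
    """Return the all2all backend to use on XPU.

    Only the 'naive' backend is currently supported on XPU, so any other
    setting is overridden with a warning instead of producing wrong
    results later.
    """
    if configured_backend != "naive":
        logger.warning(
            "all2all backend %r is not supported on XPU; "
            "falling back to 'naive'.", configured_backend)
    return "naive"
```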
Test Plan

`VLLM_WORKER_MULTIPROC_METHOD=spawn python examples/offline_inference/data_parallel.py --enforce-eager --model="ibm-research/PowerMoE-3b" --dp-size=2 --tp-size=2 --disable-expert-parallel`

Before:
After:
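For completeness, the example-script change mentioned in the review (making expert parallelism configurable, which is what the `--disable-expert-parallel` flag in the command above relies on) might be wired up roughly as follows. Only the flag name comes from the command above; the helper name and surrounding structure are assumptions.

```python
import argparse


def add_expert_parallel_arg(parser: argparse.ArgumentParser) -> None:
    # Hypothetical helper: expert parallelism stays enabled by default,
    # and --disable-expert-parallel switches it off.
    parser.add_argument(
        "--disable-expert-parallel",
        dest="enable_expert_parallel",
        action="store_false",
        help="Disable expert parallelism (enabled by default).")


# The resulting args.enable_expert_parallel value would then be forwarded
# to the engine, e.g.
#   LLM(..., enable_expert_parallel=args.enable_expert_parallel)
```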