tlrmchlsmth (Member) commented Sep 16, 2025

Purpose

Prior to this PR, in many cases, using TP attention and EP MoEs with `--tensor-parallel-size N --data-parallel-size M --enable-expert-parallel` would result in a factor-of-N redundancy in the MoE layers: every TP rank dispatched the full set of tokens to the experts.

This PR extends #24134 to other models, and to the naive and allgather_reducescatter All2All backends.
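For intuition, here is a minimal, runnable sketch of the redundancy being removed (the `moe` stub and tensor shapes are hypothetical, not vLLM's actual code): with TP attention, every TP rank holds the full sequence of hidden states after attention, so a naive EP dispatch runs the MoE over all tokens on each of the N ranks, while the sequence-parallel path shards tokens across TP ranks so each token is processed once.

```python
import torch

def moe(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for the expert-parallel MoE computation.
    return 2 * x

def naive_tp_moe(hidden: torch.Tensor, tp_size: int) -> torch.Tensor:
    # Before: after TP attention every TP rank holds the full set of tokens,
    # so all tp_size ranks compute the MoE over ALL tokens (a factor of
    # tp_size of redundant work).
    per_rank_out = [moe(hidden) for _ in range(tp_size)]
    return per_rank_out[0]  # identical on every rank

def sequence_parallel_moe(hidden: torch.Tensor, tp_size: int) -> torch.Tensor:
    # After: shard the tokens across TP ranks before the MoE, so each token
    # is dispatched to the experts exactly once, then re-gather.
    shards = hidden.chunk(tp_size, dim=0)  # 1/tp_size of the tokens per rank
    return torch.cat([moe(s) for s in shards], dim=0)

hidden = torch.randn(8, 16)  # (num_tokens, hidden_size)
assert torch.allclose(naive_tp_moe(hidden, 2), sequence_parallel_moe(hidden, 2))
```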

Test Plan

```
vllm serve {{MODEL}} -tp 2 -dp 2 --enable-expert-parallel --port 8192
```

```
lm_eval --model local-completions --tasks gsm8k --model_args model={{MODEL}},base_url={{BASE_URL}}/v1/completions,num_concurrent=50,max_retries=3,tokenized_requests=False --limit 100
```

Test Result

Qwen/Qwen3-30B-A3B-FP8:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.88|±  |0.0327|
|     |       |strict-match    |     5|exact_match|↑  | 0.94|±  |0.0239|

Qwen/Qwen3-Next-80B-A3B-Instruct (with --enforce-eager due to #25437):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.80|±  |0.0402|
|     |       |strict-match    |     5|exact_match|↑  | 0.74|±  |0.0441|

meta-llama/Llama-4-Scout-17B-16E:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.82|±  |0.0386|
|     |       |strict-match    |     5|exact_match|↑  | 0.82|±  |0.0386|

ibm-granite/granite-4.0-tiny-preview (with --enforce-eager due to #25437 (comment)):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.58|±  |0.0496|
|     |       |strict-match    |     5|exact_match|↑  | 0.55|±  |0.0500|

openai/gpt-oss-20b (results on main at TP4 are nearly identical):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3685|±  |0.0133|
|     |       |strict-match    |     5|exact_match|↑  |0.2365|±  |0.0117|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
@mergify mergify bot added deepseek Related to DeepSeek models qwen Related to Qwen models labels Sep 16, 2025
@mergify
Copy link

mergify bot commented Sep 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 17, 2025
@tlrmchlsmth tlrmchlsmth added this to the v0.11.0 milestone Sep 18, 2025
@mergify mergify bot added llama Related to Llama models speculative-decoding labels Sep 21, 2025
@mergify mergify bot removed the needs-rebase label Sep 21, 2025
Runs but wrong answer in this case

tlrmchlsmth pushed a commit that referenced this pull request Sep 28, 2025
Signed-off-by: Roger Wang <hey@rogerw.io>
simon-mo pushed a commit that referenced this pull request Sep 28, 2025
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
simon-mo pushed a commit that referenced this pull request Sep 28, 2025
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: simon-mo <simon.mo@hey.com>
baonudesifeizhai pushed a commit to baonudesifeizhai/vllm that referenced this pull request Sep 28, 2025
…t#25814)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
xuechendi pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Sep 30, 2025
After vllm-project/vllm#24982 merged, sequence-parallel MoE is turned on when
`enable_expert_parallel=True`, `tp_size > 1`, and `dp_size > 1`. Since Gaudi has
no choice of `VLLM_ALL2ALL_BACKEND`, we cannot easily bypass it, so this PR adds
support for the feature.

```python
class ParallelConfig:

    @property
    def use_sequence_parallel_moe(self) -> bool:
        return (envs.VLLM_ALL2ALL_BACKEND
                in ("allgather_reducescatter", "naive",
                    "deepep_high_throughput", "deepep_low_latency")
                and self.enable_expert_parallel
                and self.tensor_parallel_size > 1
                and self.data_parallel_size > 1)
```

Update:
No hard requirement on vllm-project/vllm#25828

---------

Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com>
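For reference, the gate quoted in the commit message above can be exercised standalone. Below is a simplified, self-contained re-implementation (the config class and the environment-variable default are assumptions for illustration, not the plugin's or vLLM's actual code, which reads `envs.VLLM_ALL2ALL_BACKEND` on the real `ParallelConfig`):

```python
import os

class ParallelConfigSketch:
    """Simplified stand-in for vLLM's ParallelConfig, for illustration."""

    def __init__(self, enable_expert_parallel: bool,
                 tensor_parallel_size: int, data_parallel_size: int):
        self.enable_expert_parallel = enable_expert_parallel
        self.tensor_parallel_size = tensor_parallel_size
        self.data_parallel_size = data_parallel_size

    @property
    def use_sequence_parallel_moe(self) -> bool:
        # Assumed default backend name; the real default lives in vllm.envs.
        backend = os.environ.get("VLLM_ALL2ALL_BACKEND",
                                 "allgather_reducescatter")
        return (backend in ("allgather_reducescatter", "naive",
                            "deepep_high_throughput", "deepep_low_latency")
                and self.enable_expert_parallel
                and self.tensor_parallel_size > 1
                and self.data_parallel_size > 1)

# TP=2, DP=2 with EP enabled engages the sequence-parallel MoE path.
assert ParallelConfigSketch(True, 2, 2).use_sequence_parallel_moe
# Without DP (dp_size == 1) the gate stays off.
assert not ParallelConfigSketch(True, 2, 1).use_sequence_parallel_moe
```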
iboiko-habana pushed a commit to iboiko-habana/vllm-gaudi that referenced this pull request Oct 2, 2025
Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
…ject#24982)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…ject#24982)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…t#25814)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
…ject#24982)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
…t#25814)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: simon-mo <simon.mo@hey.com>
shyeh25 pushed a commit to shyeh25/vllm that referenced this pull request Oct 14, 2025
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
shyeh25 pushed a commit to shyeh25/vllm that referenced this pull request Oct 14, 2025
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: simon-mo <simon.mo@hey.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…ject#24982)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…ject#24982)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…ject#24982)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…t#25814)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>