
Conversation

@sighingnow
Collaborator

No description provided.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the Qwen3-Next model, which is a hybrid attention model with multi-token prediction (MTP) capabilities. The changes are extensive, touching core components like model configuration, scheduling, memory management, and introducing new model implementations and a custom attention backend. Overall, the implementation seems robust. I've identified a critical bug in the dummy run logic that could lead to an IndexError and a high-severity maintainability issue due to code duplication in the speculative decoding configuration. Addressing these will improve the stability and maintainability of the codebase.

Comment on lines +2566 to +2571
Contributor


critical

This logic for calculating num_reqs and num_scheduled_tokens_list can lead to an IndexError. When num_tokens < max_query_len, num_reqs will be 0 and num_scheduled_tokens_list will be an empty list, so the subsequent access num_scheduled_tokens_list[-1] raises an IndexError. The previous logic using cdiv was more robust; consider reverting to similar logic to handle this edge case correctly.

Suggested change

```diff
-num_reqs = num_tokens // max_query_len
-assert num_reqs <= max_num_reqs, \
-    "Do not capture num_reqs > max_num_reqs for uniform batch"
-num_scheduled_tokens_list = [max_query_len] * num_reqs
-if num_tokens % max_query_len != 0:
-    num_scheduled_tokens_list[-1] = num_tokens % max_query_len
+num_reqs = cdiv(num_tokens, max_query_len)
+assert num_reqs <= max_num_reqs, \
+    "Do not capture num_reqs > max_num_reqs for uniform batch"
+num_scheduled_tokens_list = [max_query_len] * num_reqs
+if num_tokens % max_query_len != 0:
+    num_scheduled_tokens_list[-1] = num_tokens % max_query_len
```
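A small worked example of the edge case described above, assuming `cdiv` is the usual ceiling-division helper used elsewhere in vLLM:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, as commonly defined in vLLM's utils."""
    return -(a // -b)

# A batch smaller than max_query_len triggers the edge case.
num_tokens, max_query_len = 3, 8

# Floor division yields zero requests and an empty list, so the later
# num_scheduled_tokens_list[-1] assignment would raise IndexError.
assert num_tokens // max_query_len == 0

# Ceiling division always yields at least one request for num_tokens > 0.
num_reqs = cdiv(num_tokens, max_query_len)               # 1
num_scheduled_tokens_list = [max_query_len] * num_reqs   # [8]
if num_tokens % max_query_len != 0:
    num_scheduled_tokens_list[-1] = num_tokens % max_query_len
assert num_scheduled_tokens_list == [3]
```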

Comment on lines +2242 to +2250
Contributor


high

This elif block for qwen3_next_mtp is nearly identical to the preceding blocks for ernie_mtp and deepseek_mtp. This code duplication makes the code harder to maintain and increases the risk of bugs if one block is updated but the others are not.

Consider refactoring these blocks to reduce duplication. For example:

```python
MTP_MODELS = {
    "deepseek_mtp": ("deepseek_mtp", "Deepseek MTP"),
    "mimo_mtp": ("deepseek_mtp", "Deepseek MTP"),
    "glm4_moe_mtp": ("deepseek_mtp", "Deepseek MTP"),
    "ernie_mtp": ("ernie_mtp", "Ernie MTP"),
    "qwen3_next_mtp": ("qwen3_next_mtp", "Qwen3Next MTP"),
}
model_type = self.draft_model_config.hf_config.model_type
if model_type in MTP_MODELS:
    method, model_name = MTP_MODELS[model_type]
    self.method = method
    if self.num_speculative_tokens > 1:
        logger.warning(
            f"All {model_name} models only have "
            "one layer. Might need some code changes "
            "to support multiple layers."
        )
```

This would replace lines 2221-2250 and make the code more scalable and maintainable for future MTP models.

@heheda12345 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 9, 2025
@heheda12345
Collaborator

Add ready label to try ci.

Member

@tdoublep tdoublep left a comment


One question: do you have anything like a "tiny dev" version of this model that we could add to the hybrid models test in CI?

Comment on lines +79 to +80
Member


I don't know if you need/want it, but there is also the mamba_ssm_cache_dtype parameter, which gives the option to set the dtype of the temporal state separately from that of the conv state. Some models need this, but if you want to keep them the same, that's also fine.

Member

@tlrmchlsmth tlrmchlsmth left a comment


I am hitting:

(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]   File "/home/tms/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]   File "<eval_with_key>.2", line 5, in forward
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     qwen3_next_linear_attention = torch.ops.vllm.qwen3_next_linear_attention(x_3, self_attention_output, 'model.layers.0.linear_attn');  x_3 = self_attention_output = qwen3_next_linear_attention = None
(Worker_TP3 pid=737416) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     return self._op(*args, **kwargs)
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]   File "/home/tms/vllm/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     return self._op(*args, **kwargs)
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]   File "/home/tms/vllm/vllm/model_executor/models/qwen3_next.py", line 1230, in qwen3_next_linear_attention
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     self._forward(hidden_states=hidden_states, output=output)
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]   File "/home/tms/vllm/vllm/model_executor/models/qwen3_next.py", line 493, in _forward
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     mixed_qkv_non_spec = causal_conv1d_update(
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]                          ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]   File "/home/tms/vllm/vllm/model_executor/layers/mamba/ops/causal_conv1d.py", line 909, in causal_conv1d_update
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     assert (batch, ) == conv_state_indices.shape

This happens when serving with -tp 4 on an H100 system.
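For context on the failing check, here is a minimal sketch of the invariant behind that assertion (the shapes below are illustrative assumptions, not taken from the kernel): causal_conv1d_update expects one conv-state slot index per batch entry.

```python
import torch

# Illustrative shapes only (assumptions): the decode/update path passes one
# row per request, and conv_state_indices must be a 1-D tensor with exactly
# one cache-slot index per row, i.e. shape (batch,).
batch, dim = 8, 256
mixed_qkv = torch.randn(batch, dim)        # one row per request
conv_state_indices = torch.arange(batch)   # shape (8,) -> satisfies the check

assert (batch,) == conv_state_indices.shape
# The -tp 4 failure above means the two sides disagreed: the batch implied by
# the input tensor did not match the number of state indices passed in.
```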

@heheda12345
Collaborator

The basic model test failure is related:
models/test_registry.py::test_registry_imports[Qwen3NextMTP]
You need to add Qwen3NextMTP to tests/models/registry.py and set min_transformers_version of both Qwen3NextMTP and Qwen3NextForCausalLM to the transformers release that includes this model (maybe a future release).
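A hedged sketch of what such entries could look like (the HF model id and the version string are assumptions; _HfExamplesInfo is the helper already used throughout tests/models/registry.py, and these lines would go inside its existing example-model dicts rather than stand alone):

```python
# Fragment for tests/models/registry.py (placement and exact values are
# assumptions, shown for illustration only):
"Qwen3NextForCausalLM": _HfExamplesInfo(
    "Qwen/Qwen3-Next-80B-A3B-Instruct",       # assumed checkpoint name
    min_transformers_version="4.57.0.dev0"),  # assumed future release
"Qwen3NextMTP": _HfExamplesInfo(
    "Qwen/Qwen3-Next-80B-A3B-Instruct",       # MTP weights ship in the same repo
    min_transformers_version="4.57.0.dev0"),
```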

Member

@tdoublep tdoublep left a comment


LGTM - will continue debugging any remaining IMAs (illegal memory accesses) as a follow-up

@heheda12345
Collaborator

v1-test-others failure seems related. I'm debugging.

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
@heheda12345
Collaborator

heheda12345 commented Sep 11, 2025

v1-test-others failure seems related. I'm debugging.

It is a flaky test; I'm fixing it in #24640. Feel free to merge this Qwen3-Next PR if the same failure happens again.

@youkaichao youkaichao merged commit e93f4cc into vllm-project:main Sep 11, 2025
7 of 14 checks passed
@sighingnow sighingnow deleted the dev/qwen3-next branch September 11, 2025 07:32
@tdoublep
Member

@sighingnow

With latest main the MTP run now completes successfully:

lm_eval --model local-completions --tasks gsm8k \
     --model_args model=$MODEL_PATH,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=50,max_retries=3,tokenized_requests=False
vllm serve $MODEL_PATH --tensor-parallel-size 4 \
    --speculative-config '{"method": "qwen3_next_mtp","model": "$MODEL_PATH","num_speculative_tokens": 1}'

produces:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8620|±  |0.0095|
|     |       |strict-match    |     5|exact_match|↑  |0.8431|±  |0.0100|

@tdoublep
Member

tdoublep commented Sep 11, 2025

Also looks good for num_speculative_tokens=2:

vllm serve $MODEL_PATH --tensor-parallel-size 4 \
    --speculative-config '{"method": "qwen3_next_mtp","model": "$MODEL_PATH","num_speculative_tokens": 2}'
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8673|±  |0.0093|
|     |       |strict-match    |     5|exact_match|↑  |0.8469|±  |0.0099|

@sighingnow
Collaborator Author

Thanks for the feedback, @tdoublep.

@vadiklyutiy
Contributor

When I tried MTP with random input from vllm bench serve, I got the following failure:

 vllm serve $MODEL -tp 4 --served-model-name qwen3-next --tokenizer-mode auto --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
vllm bench serve   --backend vllm   --model $MODEL  --served-model-name qwen3-next  --endpoint /v1/completions   --dataset-name random   --random-input 2048   --random-output 1024   --max-concurrency 256   --num-prompt 256
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]   File "/home/scratch.vgimpelson_ent/vllm_qwen/vllm/config/__init__.py", line 3380, in pad_for_cudagraph
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]     return self.compilation_config.bs_to_padded_graph_size[batch_size]
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654] IndexError: list index out of range

@vadiklyutiy
Contributor

When I tried MTP with random input from vllm bench serve, I got the following failure:

 vllm serve $MODEL -tp 4 --served-model-name qwen3-next --tokenizer-mode auto --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
vllm bench serve   --backend vllm   --model $MODEL  --served-model-name qwen3-next  --endpoint /v1/completions   --dataset-name random   --random-input 2048   --random-output 1024   --max-concurrency 256   --num-prompt 256
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]   File "/home/scratch.vgimpelson_ent/vllm_qwen/vllm/config/__init__.py", line 3380, in pad_for_cudagraph
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]     return self.compilation_config.bs_to_padded_graph_size[batch_size]
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654] IndexError: list index out of range

The problem is somewhere here:

https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/gdn_attn.py#L211-L215

        if (self.use_full_cuda_graph and num_prefills == 0 and num_decodes == 0
                and num_spec_decodes <= self.decode_cudagraph_max_bs):
            num_total_tokens = self.vllm_config.pad_for_cudagraph(
                m.num_actual_tokens)
            batch_size = num_total_tokens // (self.num_spec + 1)

During the failure:
num_spec_decodes = 228
m.num_actual_tokens = 228*3
self.decode_cudagraph_max_bs = 512

The if passed because 228 < 512, but self.vllm_config.pad_for_cudagraph fails because 228*3 is greater than cudagraph_max_bs.
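Plugging the reported values into that guard makes the mismatch concrete (a worked illustration of the failure, not a proposed fix):

```python
# Values reported above for the failing run.
num_spec_decodes = 228
num_spec = 2                     # num_speculative_tokens = 2 in the config
decode_cudagraph_max_bs = 512

num_actual_tokens = num_spec_decodes * (num_spec + 1)   # 684

# The guard only compares the request count against the captured maximum...
assert num_spec_decodes <= decode_cudagraph_max_bs       # 228 <= 512, passes

# ...but pad_for_cudagraph is handed the token count, which exceeds it, so the
# lookup into bs_to_padded_graph_size goes out of range (IndexError).
assert num_actual_tokens > decode_cudagraph_max_bs       # 684 > 512
```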

@vadiklyutiy
Contributor

Moved the above to issue #24660.

@xli

xli commented Sep 11, 2025

Hi, this change breaks the test kernels/mamba/test_causal_conv1d.py::test_causal_conv1d_update_with_batch_gather for the case seqlen=3.

skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
…llm-project#24526)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
…llm-project#24526)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…llm-project#24526)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…llm-project#24526)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…llm-project#24526)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

documentation, new-model, qwen, ready, speculative-decoding, v1
