
Conversation

@sighingnow
Collaborator

No description provided.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the Qwen3-Next model, which is a hybrid attention model with multi-token prediction (MTP) capabilities. The changes are extensive, touching core components like model configuration, scheduling, memory management, and introducing new model implementations and a custom attention backend. Overall, the implementation seems robust. I've identified a critical bug in the dummy run logic that could lead to an IndexError and a high-severity maintainability issue due to code duplication in the speculative decoding configuration. Addressing these will improve the stability and maintainability of the codebase.

Comment on lines +2566 to +2571
Contributor


critical

This logic for calculating num_reqs and num_scheduled_tokens_list can lead to an IndexError. When num_tokens < max_query_len, num_reqs will be 0 and num_scheduled_tokens_list will be an empty list, so the subsequent access num_scheduled_tokens_list[-1] raises an IndexError. The previous logic using cdiv was more robust; consider reverting to similar logic to handle this edge case correctly.

Suggested change

```diff
-num_reqs = num_tokens // max_query_len
-assert num_reqs <= max_num_reqs, \
-    "Do not capture num_reqs > max_num_reqs for uniform batch"
-num_scheduled_tokens_list = [max_query_len] * num_reqs
-if num_tokens % max_query_len != 0:
-    num_scheduled_tokens_list[-1] = num_tokens % max_query_len
+num_reqs = cdiv(num_tokens, max_query_len)
+assert num_reqs <= max_num_reqs, \
+    "Do not capture num_reqs > max_num_reqs for uniform batch"
+num_scheduled_tokens_list = [max_query_len] * num_reqs
+if num_tokens % max_query_len != 0:
+    num_scheduled_tokens_list[-1] = num_tokens % max_query_len
```
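A small worked example of the edge case described above, assuming `cdiv` is the usual ceiling-division helper used elsewhere in vLLM:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, as commonly defined in vLLM's utils."""
    return -(a // -b)

# A batch smaller than max_query_len triggers the edge case.
num_tokens, max_query_len = 3, 8

# Floor division yields zero requests and an empty list, so the later
# num_scheduled_tokens_list[-1] assignment would raise IndexError.
assert num_tokens // max_query_len == 0

# Ceiling division always yields at least one request for num_tokens > 0.
num_reqs = cdiv(num_tokens, max_query_len)               # 1
num_scheduled_tokens_list = [max_query_len] * num_reqs   # [8]
if num_tokens % max_query_len != 0:
    num_scheduled_tokens_list[-1] = num_tokens % max_query_len
assert num_scheduled_tokens_list == [3]
```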

Comment on lines +2242 to +2250
Contributor


high

This elif block for qwen3_next_mtp is nearly identical to the preceding blocks for ernie_mtp and deepseek_mtp. This code duplication makes the code harder to maintain and increases the risk of bugs if one block is updated but the others are not.

Consider refactoring these blocks to reduce duplication. For example:

```python
MTP_MODELS = {
    "deepseek_mtp": ("deepseek_mtp", "Deepseek MTP"),
    "mimo_mtp": ("deepseek_mtp", "Deepseek MTP"),
    "glm4_moe_mtp": ("deepseek_mtp", "Deepseek MTP"),
    "ernie_mtp": ("ernie_mtp", "Ernie MTP"),
    "qwen3_next_mtp": ("qwen3_next_mtp", "Qwen3Next MTP"),
}
model_type = self.draft_model_config.hf_config.model_type
if model_type in MTP_MODELS:
    method, model_name = MTP_MODELS[model_type]
    self.method = method
    if self.num_speculative_tokens > 1:
        logger.warning(
            f"All {model_name} models only have "
            "one layer. Might need some code changes "
            "to support multiple layers."
        )
```

This would replace lines 2221-2250 and make the code more scalable and maintainable for future MTP models.

@heheda12345 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 9, 2025
@heheda12345
Collaborator

Add ready label to try ci.

Member

@tdoublep tdoublep left a comment


One question: do you have anything like a "tiny dev" version of this model that we could add to the hybrid models test in CI?

Comment on lines +79 to +80
Member


I don't know if you need/want it, but there is also the mamba_ssm_cache_dtype parameter, which gives the option to set the dtype of the temporal state separately from that of the conv state. Some models need this, but if you want to keep them the same, that's also fine.

Member

@tlrmchlsmth tlrmchlsmth left a comment


I am hitting:

(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]   File "/home/tms/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]   File "<eval_with_key>.2", line 5, in forward
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     qwen3_next_linear_attention = torch.ops.vllm.qwen3_next_linear_attention(x_3, self_attention_output, 'model.layers.0.linear_attn');  x_3 = self_attention_output = qwen3_next_linear_attention = None
(Worker_TP3 pid=737416) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     return self._op(*args, **kwargs)
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]   File "/home/tms/vllm/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     return self._op(*args, **kwargs)
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]   File "/home/tms/vllm/vllm/model_executor/models/qwen3_next.py", line 1230, in qwen3_next_linear_attention
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     self._forward(hidden_states=hidden_states, output=output)
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]   File "/home/tms/vllm/vllm/model_executor/models/qwen3_next.py", line 493, in _forward
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     mixed_qkv_non_spec = causal_conv1d_update(
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]                          ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]   File "/home/tms/vllm/vllm/model_executor/layers/mamba/ops/causal_conv1d.py", line 909, in causal_conv1d_update
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654]     assert (batch, ) == conv_state_indices.shape

This happens when serving with -tp 4 on an H100 system.
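For context on the failing check, here is a minimal sketch of the invariant behind that assertion (the shapes below are illustrative assumptions, not taken from the kernel): causal_conv1d_update expects one conv-state slot index per batch entry.

```python
import torch

# Illustrative shapes only (assumptions): the decode/update path passes one
# row per request, and conv_state_indices must be a 1-D tensor with exactly
# one cache-slot index per row, i.e. shape (batch,).
batch, dim = 8, 256
mixed_qkv = torch.randn(batch, dim)        # one row per request
conv_state_indices = torch.arange(batch)   # shape (8,) -> satisfies the check

assert (batch,) == conv_state_indices.shape
# The -tp 4 failure above means the two sides disagreed: the batch implied by
# the input tensor did not match the number of state indices passed in.
```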

@heheda12345
Collaborator

The basic model test failure is related:
models/test_registry.py::test_registry_imports[Qwen3NextMTP]
You need to add Qwen3NextMTP to tests/models/registry.py and set min_transformers_version of both Qwen3NextMTP and Qwen3NextForCausalLM to the transformers release that includes this model (maybe a future release).
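A hedged sketch of what such entries could look like (the HF model id and the version string are assumptions; _HfExamplesInfo is the helper already used throughout tests/models/registry.py, and these lines would go inside its existing example-model dicts rather than stand alone):

```python
# Fragment for tests/models/registry.py (placement and exact values are
# assumptions, shown for illustration only):
"Qwen3NextForCausalLM": _HfExamplesInfo(
    "Qwen/Qwen3-Next-80B-A3B-Instruct",       # assumed checkpoint name
    min_transformers_version="4.57.0.dev0"),  # assumed future release
"Qwen3NextMTP": _HfExamplesInfo(
    "Qwen/Qwen3-Next-80B-A3B-Instruct",       # MTP weights ship in the same repo
    min_transformers_version="4.57.0.dev0"),
```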

Member

@tdoublep tdoublep left a comment


LGTM - will continue debugging any remaining IMAs (illegal memory accesses) as a follow-up

@heheda12345
Collaborator

v1-test-others failure seems related. I'm debugging.

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
@heheda12345
Collaborator

heheda12345 commented Sep 11, 2025

v1-test-others failure seems related. I'm debugging.

It is a flaky test; I'm fixing it in #24640. Feel free to merge this Qwen3-Next PR if the same failure happens again.

@youkaichao youkaichao merged commit e93f4cc into vllm-project:main Sep 11, 2025
7 of 14 checks passed
@sighingnow sighingnow deleted the dev/qwen3-next branch September 11, 2025 07:32
@tdoublep
Member

@sighingnow

With latest main the MTP run now completes successfully:

lm_eval --model local-completions --tasks gsm8k \
     --model_args model=$MODEL_PATH,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=50,max_retries=3,tokenized_requests=False
vllm serve $MODEL_PATH --tensor-parallel-size 4 \
    --speculative-config '{"method": "qwen3_next_mtp","model": "$MODEL_PATH","num_speculative_tokens": 1}'

produces:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8620|±  |0.0095|
|     |       |strict-match    |     5|exact_match|↑  |0.8431|±  |0.0100|

@tdoublep
Member

tdoublep commented Sep 11, 2025

Also looks good for num_speculative_tokens=2:

vllm serve $MODEL_PATH --tensor-parallel-size 4 \
    --speculative-config '{"method": "qwen3_next_mtp","model": "$MODEL_PATH","num_speculative_tokens": 2}'
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8673|±  |0.0093|
|     |       |strict-match    |     5|exact_match|↑  |0.8469|±  |0.0099|

@sighingnow
Collaborator Author

Thanks for the feedback, @tdoublep.

@vadiklyutiy
Contributor

When I tried MTP with random input from vllm bench serve, I got the following failure:

 vllm serve $MODEL -tp 4 --served-model-name qwen3-next --tokenizer-mode auto --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
vllm bench serve   --backend vllm   --model $MODEL  --served-model-name qwen3-next  --endpoint /v1/completions   --dataset-name random   --random-input 2048   --random-output 1024   --max-concurrency 256   --num-prompt 256
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]   File "/home/scratch.vgimpelson_ent/vllm_qwen/vllm/config/__init__.py", line 3380, in pad_for_cudagraph
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]     return self.compilation_config.bs_to_padded_graph_size[batch_size]
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654] IndexError: list index out of range

@vadiklyutiy
Contributor

When I tried MTP with random input from vllm bench serve, I got the following failure:

 vllm serve $MODEL -tp 4 --served-model-name qwen3-next --tokenizer-mode auto --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
vllm bench serve   --backend vllm   --model $MODEL  --served-model-name qwen3-next  --endpoint /v1/completions   --dataset-name random   --random-input 2048   --random-output 1024   --max-concurrency 256   --num-prompt 256
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]   File "/home/scratch.vgimpelson_ent/vllm_qwen/vllm/config/__init__.py", line 3380, in pad_for_cudagraph
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]     return self.compilation_config.bs_to_padded_graph_size[batch_size]
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654]            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
(Worker_TP3 pid=2417047) ERROR 09-11 13:34:02 [multiproc_executor.py:654] IndexError: list index out of range

The problem is somewhere here:

https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/backends/gdn_attn.py#L211-L215

        if (self.use_full_cuda_graph and num_prefills == 0 and num_decodes == 0
                and num_spec_decodes <= self.decode_cudagraph_max_bs):
            num_total_tokens = self.vllm_config.pad_for_cudagraph(
                m.num_actual_tokens)
            batch_size = num_total_tokens // (self.num_spec + 1)

During the failure:
num_spec_decodes = 228
m.num_actual_tokens = 228*3
self.decode_cudagraph_max_bs = 512

The if passed because 228 < 512, but self.vllm_config.pad_for_cudagraph fails because 228*3 is greater than cudagraph_max_bs.
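Plugging the reported values into that guard makes the mismatch concrete (a worked illustration of the failure, not a proposed fix):

```python
# Values reported above for the failing run.
num_spec_decodes = 228
num_spec = 2                     # num_speculative_tokens = 2 in the config
decode_cudagraph_max_bs = 512

num_actual_tokens = num_spec_decodes * (num_spec + 1)   # 684

# The guard only compares the request count against the captured maximum...
assert num_spec_decodes <= decode_cudagraph_max_bs       # 228 <= 512, passes

# ...but pad_for_cudagraph is handed the token count, which exceeds it, so the
# lookup into bs_to_padded_graph_size goes out of range (IndexError).
assert num_actual_tokens > decode_cudagraph_max_bs       # 684 > 512
```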

@vadiklyutiy
Contributor

Moved the above to issue #24660.

@xli

xli commented Sep 11, 2025

Hi, this change breaks the test kernels/mamba/test_causal_conv1d.py::test_causal_conv1d_update_with_batch_gather for the case seqlen=3.

skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
…llm-project#24526)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
…llm-project#24526)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…llm-project#24526)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…llm-project#24526)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…llm-project#24526)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

documentation, new-model, qwen, ready, speculative-decoding, v1
