Add support for the Qwen3-Next model (a hybrid attention model) #24526
Conversation
Code Review
This pull request adds support for the Qwen3-Next model, which is a hybrid attention model with multi-token prediction (MTP) capabilities. The changes are extensive, touching core components like model configuration, scheduling, memory management, and introducing new model implementations and a custom attention backend. Overall, the implementation seems robust. I've identified a critical bug in the dummy run logic that could lead to an IndexError and a high-severity maintainability issue due to code duplication in the speculative decoding configuration. Addressing these will improve the stability and maintainability of the codebase.
vllm/v1/worker/gpu_model_runner.py (outdated)
This logic for calculating num_reqs and num_scheduled_tokens_list can lead to an IndexError. When num_tokens < max_query_len, num_reqs will be 0, and num_scheduled_tokens_list will be an empty list. The subsequent access num_scheduled_tokens_list[-1] will raise an IndexError. The previous logic using cdiv was more robust. Consider reverting to a similar logic to handle this edge case correctly.
Suggested change:

    -num_reqs = num_tokens // max_query_len
    -assert num_reqs <= max_num_reqs, \
    -    "Do not capture num_reqs > max_num_reqs for uniform batch"
    -num_scheduled_tokens_list = [max_query_len] * num_reqs
    -if num_tokens % max_query_len != 0:
    -    num_scheduled_tokens_list[-1] = num_tokens % max_query_len
    -    num_scheduled_tokens_list[-1] += num_tokens % max_query_len
    +num_reqs = cdiv(num_tokens, max_query_len)
    +assert num_reqs <= max_num_reqs, \
    +    "Do not capture num_reqs > max_num_reqs for uniform batch"
    +num_scheduled_tokens_list = [max_query_len] * num_reqs
    +if num_tokens % max_query_len != 0:
    +    num_scheduled_tokens_list[-1] = num_tokens % max_query_len
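For clarity, a minimal standalone sketch of the edge case described above; the values num_tokens=3 and max_query_len=8 are illustrative assumptions, and cdiv is redefined locally rather than imported from vLLM:

```python
from math import ceil

def cdiv(a: int, b: int) -> int:
    # Ceiling division, as used in the suggested fix.
    return ceil(a / b)

# Illustrative values only: the failure mode needs num_tokens < max_query_len.
num_tokens, max_query_len = 3, 8

# Floor division: num_reqs == 0, the list is empty, and [-1] raises IndexError.
num_reqs = num_tokens // max_query_len
num_scheduled_tokens_list = [max_query_len] * num_reqs
try:
    num_scheduled_tokens_list[-1] = num_tokens % max_query_len
except IndexError:
    print("IndexError with floor division")

# Ceiling division: num_reqs == 1 and the last entry holds the remainder.
num_reqs = cdiv(num_tokens, max_query_len)
num_scheduled_tokens_list = [max_query_len] * num_reqs
if num_tokens % max_query_len != 0:
    num_scheduled_tokens_list[-1] = num_tokens % max_query_len
print(num_scheduled_tokens_list)  # [3]
```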
vllm/config/__init__.py (outdated)
This elif block for qwen3_next_mtp is nearly identical to the preceding blocks for ernie_mtp and deepseek_mtp. This code duplication makes the code harder to maintain and increases the risk of bugs if one block is updated but the others are not.
Consider refactoring these blocks to reduce duplication. For example:
    MTP_MODELS = {
        "deepseek_mtp": ("deepseek_mtp", "Deepseek MTP"),
        "mimo_mtp": ("deepseek_mtp", "Deepseek MTP"),
        "glm4_moe_mtp": ("deepseek_mtp", "Deepseek MTP"),
        "ernie_mtp": ("ernie_mtp", "Ernie MTP"),
        "qwen3_next_mtp": ("qwen3_next_mtp", "Qwen3Next MTP"),
    }

    model_type = self.draft_model_config.hf_config.model_type
    if model_type in MTP_MODELS:
        method, model_name = MTP_MODELS[model_type]
        self.method = method
        if self.num_speculative_tokens > 1:
            logger.warning(
                f"All {model_name} models only have "
                "one layer. Might need some code changes "
                "to support multiple layers.")

This would replace lines 2221-2250 and make the code more scalable and maintainable for future MTP models.
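For illustration, a self-contained sketch of the table-driven dispatch the suggestion describes; resolve_mtp_method and the warnings call are hypothetical stand-ins for the config attributes and logger.warning in vLLM's actual code:

```python
import warnings

# Hypothetical stand-in for the dispatch table suggested above.
MTP_MODELS = {
    "deepseek_mtp": ("deepseek_mtp", "Deepseek MTP"),
    "mimo_mtp": ("deepseek_mtp", "Deepseek MTP"),
    "glm4_moe_mtp": ("deepseek_mtp", "Deepseek MTP"),
    "ernie_mtp": ("ernie_mtp", "Ernie MTP"),
    "qwen3_next_mtp": ("qwen3_next_mtp", "Qwen3Next MTP"),
}

def resolve_mtp_method(model_type: str, num_speculative_tokens: int) -> str:
    """Map a draft model type to its speculative-decoding method name."""
    method, model_name = MTP_MODELS[model_type]
    if num_speculative_tokens > 1:
        warnings.warn(f"All {model_name} models only have one layer; "
                      "supporting multiple layers may need code changes.")
    return method

# Adding a future MTP model then only requires one new dictionary entry.
print(resolve_mtp_method("qwen3_next_mtp", 1))  # -> qwen3_next_mtp
```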
Add ready label to try CI.
One question: do you have anything like a "tiny dev" version of this model that we could add to hybrid models test in CI?
I don't know if you need/want it, but there is also the mamba_ssm_cache_dtype parameter, which gives the option to set the dtype of the temporal state separately from that of the conv state. Some models need this, but if you want to keep them the same, that's also fine.
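For reference, a minimal sketch of how this could look via the offline API; mamba_ssm_cache_dtype is the parameter named above, while the mamba_cache_dtype counterpart, the accepted values ("auto", "float32"), and the model name are assumptions, not verified against this PR:

```python
from vllm import LLM

# Sketch only: keep the conv state in the default dtype ("auto") while
# upcasting the temporal/SSM state to float32 (values are assumptions).
llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # illustrative model name
    mamba_cache_dtype="auto",
    mamba_ssm_cache_dtype="float32",
)
```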
I am hitting:
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] File "/home/tms/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] return forward_call(*args, **kwargs)
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] File "<eval_with_key>.2", line 5, in forward
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] qwen3_next_linear_attention = torch.ops.vllm.qwen3_next_linear_attention(x_3, self_attention_output, 'model.layers.0.linear_attn'); x_3 = self_attention_output = qwen3_next_linear_attention = None
(Worker_TP3 pid=737416) ERROR 09-09 20:18:12 [multiproc_executor.py:654] return self._op(*args, **kwargs)
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] File "/home/tms/vllm/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] return self._op(*args, **kwargs)
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] File "/home/tms/vllm/vllm/model_executor/models/qwen3_next.py", line 1230, in qwen3_next_linear_attention
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] self._forward(hidden_states=hidden_states, output=output)
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] File "/home/tms/vllm/vllm/model_executor/models/qwen3_next.py", line 493, in _forward
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] mixed_qkv_non_spec = causal_conv1d_update(
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] File "/home/tms/vllm/vllm/model_executor/layers/mamba/ops/causal_conv1d.py", line 909, in causal_conv1d_update
(Worker_TP1 pid=737414) ERROR 09-09 20:18:12 [multiproc_executor.py:654] assert (batch, ) == conv_state_indices.shape
This happens when serving with -tp 4 on an H100 system.
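For context, the assertion that fires checks that conv_state_indices supplies exactly one cache-slot index per sequence in the batch; a toy illustration of that invariant (the shapes here are assumptions, not the real kernel inputs):

```python
import torch

batch = 4                             # sequences being decoded this step
conv_state_indices = torch.arange(3)  # mismatched: only 3 cache-slot indices

# This is the shape check from causal_conv1d_update that fails in the log above.
print((batch,) == conv_state_indices.shape)  # False -> the assert would fire
```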
The basic model test failure is related:
LGTM - will continue debugging any remaining IMAs as follow-up
The v1-test-others failure seems related. I'm debugging.
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
It is a flaky test; I'm fixing it in #24640. Feel free to merge this Qwen3-Next PR if the same failure happens again.
With the latest main, the MTP run now completes successfully and produces:
Also looks good for
Thanks for the feedback, @tdoublep.
When I tried MTP with random input from
The problem is somewhere around here: during the failure, the if check passed because 228 < 512, but self.vllm_config.pad_for_cudagraph fails because 228*3 is larger than that.
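A hedged reading of the numbers in that comment (228, 512, and the factor of 3 are taken from the text; interpreting 512 as the cudagraph capture limit is an assumption):

```python
# Assumed interpretation: the guard compares 228 against the capture limit,
# but padding is applied to 228 * 3 tokens, which exceeds it.
max_capture_size = 512   # assumption: the "512" in the comment above
num_reqs = 228
tokens_per_req = 3       # assumption: e.g. one token plus two speculative ones

print(num_reqs < max_capture_size)                     # True  -> the if passes
print(num_reqs * tokens_per_req <= max_capture_size)   # False -> padding fails
```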
Moved the above to issue #24660.
Hi, this change breaks tests: kernels/mamba/test_causal_conv1d.py::test_causal_conv1d_update_with_batch_gather |
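One way to reproduce this locally (the test node id is taken from the comment above; whether a tests/ prefix is needed depends on the working directory):

```python
import pytest

# Run just the failing test node; prepend "tests/" to the path if running
# from the repository root rather than from within the tests directory.
pytest.main([
    "kernels/mamba/test_causal_conv1d.py::test_causal_conv1d_update_with_batch_gather",
    "-x",
])
```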