Conversation

@drslark
Contributor

@drslark drslark commented Oct 31, 2025

What this PR does / why we need it?

Adapts the MTP (Multi-Token Prediction) function to Qwen3-Next.

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

Run the code below.

# run with Qwen3-next

from vllm import LLM, SamplingParams

prompts = [
    "Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=128)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
          tensor_parallel_size=4,
          enforce_eager=True,
          distributed_executor_backend="mp",
          gpu_memory_utilization=0.7,
          max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

outputs:

Prompt: 'Who are you?', Generated text: ' I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to answer questions, create text such as stories, official documents, emails, scripts, and more, as well as perform logical reasoning, programming, and other tasks. If you have any questions or need assistance, feel free to let me know anytime!'
# run with Qwen3-next-mtp

from vllm import LLM, SamplingParams

prompts = [
    "Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=128)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
          tensor_parallel_size=4,
          enforce_eager=True,
          distributed_executor_backend="mp",
          gpu_memory_utilization=0.7,
          speculative_config={
              "method": "qwen3_next_mtp",
              "num_speculative_tokens": 1,
          },
          max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

outputs:

Prompt: 'Who are you?', Generated text: ' I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to answer questions, create text such as stories, official documents, emails, scripts, and more, as well as perform logical reasoning, programming, and other tasks. If you have any questions or need assistance, feel free to let me know anytime!'

Qwen3-Next and Qwen3-Next-MTP now produce the same results.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description, to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adapts the Multi-Token Prediction (MTP) speculative decoding function for the Qwen3-next model. The changes involve registering the new MTP model variant, customizing the Qwen3-next model implementation for MTP, and adjusting the model runner logic. My review has identified two main issues. First, in mtp_proposer.py, a hardcoded layer index is used to access attention metadata, which is brittle and should be replaced with a dynamic lookup. Second, and more critically, the logic for determining the attention state in model_runner_v1.py has been inverted, which is a breaking change for other speculative decoding methods. I have provided suggestions to address both issues to improve maintainability and correctness.

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@drslark drslark force-pushed the main branch 2 times, most recently from 845019f to 678d684 on November 1, 2025 at 11:31
@drslark drslark force-pushed the main branch 4 times, most recently from 334e475 to 810d097 on November 2, 2025 at 06:34
@drslark drslark force-pushed the main branch 2 times, most recently from 0a49317 to f2664cb on November 2, 2025 at 07:18
self.model = DeepSeekMTP(
    vllm_config=self.vllm_config).to(target_device)

architecture = self.vllm_config.model_config.architecture
Collaborator

It would be better to maintain a global dictionary that maps model architectures to their specific module classes, for better scalability.

Contributor Author

Good advice! But I need to merge MTP for Qwen3-Next quickly, so I don't have time to fully test other model architectures.
Also, I think a function is a good choice here. I will revisit and modify this in the next PR.

Contributor Author

Oops, as I commented, the lazy import is crucial.
A global map would cause a patch error!
So it must be a function here.
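
For readers following the thread, a minimal sketch of such a lazy-import helper might look like the code below; the function name and the vllm_ascend import path are illustrative assumptions, not the code merged in this PR.

def _get_mtp_model_class(architecture: str):
    # Resolving the class lazily, inside the function, means any patches
    # applied before model construction are still picked up; a module-level
    # dict of classes would import them too early and defeat the patching.
    if architecture == "Qwen3NextForCausalLM":
        # Import path assumed for illustration only.
        from vllm_ascend.models.qwen3_next_mtp import CustomQwen3NextMTP
        return CustomQwen3NextMTP
    from vllm.model_executor.models.deepseek_mtp import DeepSeekMTP
    return DeepSeekMTP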

Contributor Author

Yes, I have now added two maps in vllm_ascend/spec_decode/mtp_proposer.py.

attn_metadata = attn_metadata['model.layers.0.self_attn.attn']
architecture = self.vllm_config.model_config.architecture
if architecture == "Qwen3NextForCausalLM":
    attn_metadata = attn_metadata['model.layers.3.self_attn.attn']
Collaborator

Ditto here. Don't hard-code this; instead, the global dict can map model architectures to both their corresponding module classes and attention layer names.

Contributor Author

@drslark drslark Nov 3, 2025

You are right!
But I think a function is a better choice.

Contributor Author

Yes, I have now added two maps in vllm_ascend/spec_decode/mtp_proposer.py.
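
As a rough illustration of architecture-keyed maps like the ones mentioned here, a sketch could look as follows; the map, helper, and fallback key names are assumptions based on the snippet quoted above, not the merged code.

# Sketch only: map each model architecture to the attention layer whose
# metadata the MTP proposer should read.
_MTP_ATTN_LAYER_BY_ARCH = {
    "Qwen3NextForCausalLM": "model.layers.3.self_attn.attn",
}

# Fallback mirrors the DeepSeek-style layer-0 key from the quoted snippet.
_DEFAULT_MTP_ATTN_LAYER = "model.layers.0.self_attn.attn"


def _select_attn_metadata(attn_metadata: dict, architecture: str):
    # Look up the architecture-specific layer name, falling back to the
    # default key when the architecture is not listed.
    layer_name = _MTP_ATTN_LAYER_BY_ARCH.get(architecture,
                                             _DEFAULT_MTP_ATTN_LAYER)
    return attn_metadata[layer_name]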

Comment on lines 32 to 41
with VllmRunner("Qwen/Qwen3-Next-80B-A3B-Instruct",
                tensor_parallel_size=4,
                max_model_len=4096,
                gpu_memory_utilization=0.8,
                distributed_executor_backend="mp",
                speculative_config={
                    "method": "qwen3_next_mtp",
                    "num_speculative_tokens": 1
                },
                enforce_eager=True) as vllm_model:
Collaborator

Have we tested graph mode yet? That is, without enforce_eager.

Contributor Author

Yes, I removed enforce_eager=True and it works fine.
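
For reference, a sketch of the same test run in graph mode simply drops enforce_eager=True; the VllmRunner arguments mirror the quoted test, while example_prompts and max_tokens are placeholder values, not the PR's actual test fixtures.

from tests.e2e.conftest import VllmRunner

example_prompts = ["Who are you?"]  # placeholder prompts for illustration
max_tokens = 64                     # placeholder token budget for illustration

with VllmRunner("Qwen/Qwen3-Next-80B-A3B-Instruct",
                tensor_parallel_size=4,
                max_model_len=4096,
                gpu_memory_utilization=0.8,
                distributed_executor_backend="mp",
                speculative_config={
                    "method": "qwen3_next_mtp",
                    "num_speculative_tokens": 1
                }) as vllm_model:
    # Without enforce_eager=True the model is compiled and runs in graph mode.
    vllm_model.generate_greedy(example_prompts, max_tokens)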

Comment on lines +76 to +85
@support_torch_compile
class CustomQwen3NextMTP(Qwen3NextMTP, SupportsPP):
    packed_modules_mapping = {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": ["up_proj", "down_proj"]
    }
Collaborator

I see you've added support_torch_compile for both classes. Assuming you copied them from vLLM, all the more reason to test graph mode now.

Contributor Author

Yes, I removed enforce_eager=True in test_qwen3_next.py and it works fine.

from tests.e2e.conftest import VllmRunner


def test_models_distributed_Qwen3_NEXT_MTP_TP4():
Contributor Author

I deleted test_qwen3_next_mtp.py and moved the tests into test_qwen3_next.py.

"num_speculative_tokens": 1
},
enforce_eager=True) as vllm_model:
vllm_model.generate_greedy(example_prompts, max_tokens)
Contributor Author

I have added a third test, test_models_distributed_Qwen3_NEXT_MTP_TP4_SIMILARITY, to do that.
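
A rough sketch of what such a similarity test could look like is shown below; the prompts, token budget, and the equality assertion are assumptions based on this thread, not the PR's actual test body.

from tests.e2e.conftest import VllmRunner


def test_models_distributed_Qwen3_NEXT_MTP_TP4_SIMILARITY():
    example_prompts = ["Who are you?"]  # placeholder prompts for illustration
    max_tokens = 64                     # placeholder token budget for illustration

    # Baseline: Qwen3-Next without speculative decoding.
    with VllmRunner("Qwen/Qwen3-Next-80B-A3B-Instruct",
                    tensor_parallel_size=4,
                    max_model_len=4096,
                    gpu_memory_utilization=0.8,
                    distributed_executor_backend="mp") as ref_model:
        ref_outputs = ref_model.generate_greedy(example_prompts, max_tokens)

    # Same model with the qwen3_next_mtp speculative config enabled.
    with VllmRunner("Qwen/Qwen3-Next-80B-A3B-Instruct",
                    tensor_parallel_size=4,
                    max_model_len=4096,
                    gpu_memory_utilization=0.8,
                    distributed_executor_backend="mp",
                    speculative_config={
                        "method": "qwen3_next_mtp",
                        "num_speculative_tokens": 1
                    }) as mtp_model:
        mtp_outputs = mtp_model.generate_greedy(example_prompts, max_tokens)

    # With greedy decoding the MTP path should reproduce the baseline outputs.
    assert ref_outputs == mtp_outputs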

Signed-off-by: drslark <slarksblood@qq.com>
@github-actions

github-actions bot commented Nov 4, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.
