Conversation

@drslark
Contributor

@drslark drslark commented Oct 31, 2025

What this PR does / why we need it?

Adapts the MTP (Multi-Token Prediction) function to Qwen3-Next.

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

Run the code below.

# run with Qwen3-next

from vllm import LLM, SamplingParams

prompts = [
    "Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=128)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
          tensor_parallel_size=4,
          enforce_eager=True,
          distributed_executor_backend="mp",
          gpu_memory_utilization=0.7,
          max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

outputs:

Prompt: 'Who are you?', Generated text: ' I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to answer questions, create text such as stories, official documents, emails, scripts, and more, as well as perform logical reasoning, programming, and other tasks. If you have any questions or need assistance, feel free to let me know anytime!'
# run with Qwen3-next-mtp

from vllm import LLM, SamplingParams

prompts = [
    "Who are you?",
]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95, top_k=40, max_tokens=128)
llm = LLM(model="/home/model/Qwen3-Next-80B-A3B-Instruct",
          tensor_parallel_size=4,
          enforce_eager=True,
          distributed_executor_backend="mp",
          gpu_memory_utilization=0.7,
          speculative_config={
              "method": "qwen3_next_mtp",
              "num_speculative_tokens": 1,
          },
          max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

outputs:

Prompt: 'Who are you?', Generated text: ' I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am designed to answer questions, create text such as stories, official documents, emails, scripts, and more, as well as perform logical reasoning, programming, and other tasks. If you have any questions or need assistance, feel free to let me know anytime!'

Qwen3-Next and Qwen3-Next-MTP now produce the same results.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description, to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adapts the Multi-Token Prediction (MTP) speculative decoding function for the Qwen3-next model. The changes involve registering the new MTP model variant, customizing the Qwen3-next model implementation for MTP, and adjusting the model runner logic. My review has identified two main issues. First, in mtp_proposer.py, a hardcoded layer index is used to access attention metadata, which is brittle and should be replaced with a dynamic lookup. Second, and more critically, the logic for determining the attention state in model_runner_v1.py has been inverted, which is a breaking change for other speculative decoding methods. I have provided suggestions to address both issues to improve maintainability and correctness.

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@drslark drslark force-pushed the main branch 2 times, most recently from 845019f to 678d684 on November 1, 2025 at 11:31
@drslark drslark force-pushed the main branch 4 times, most recently from 334e475 to 810d097 on November 2, 2025 at 06:34
@drslark drslark force-pushed the main branch 2 times, most recently from 0a49317 to f2664cb on November 2, 2025 at 07:18
self.model = DeepSeekMTP(
    vllm_config=self.vllm_config).to(target_device)

architecture = self.vllm_config.model_config.architecture
Collaborator

It would be better to maintain a global dictionary that maps model architectures to their specific module classes, for better scalability.

Contributor Author

Good advice! But I need to merge MTP for Qwen3-Next quickly, so I don't have time to fully test other model architectures.
Also, I think a function is a good choice here. I will revisit and modify this in the next PR.

Contributor Author

Oops, as I commented, the lazy import is crucial.
A global map would cause a patch error!
So it must be a function here.
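
For readers following the thread, a minimal sketch of such a lazy-import helper might look like the code below; the function name and the vllm_ascend import path are illustrative assumptions, not the code merged in this PR.

def _get_mtp_model_class(architecture: str):
    # Resolving the class lazily, inside the function, means any patches
    # applied before model construction are still picked up; a module-level
    # dict of classes would import them too early and defeat the patching.
    if architecture == "Qwen3NextForCausalLM":
        # Import path assumed for illustration only.
        from vllm_ascend.models.qwen3_next_mtp import CustomQwen3NextMTP
        return CustomQwen3NextMTP
    from vllm.model_executor.models.deepseek_mtp import DeepSeekMTP
    return DeepSeekMTP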

Contributor Author

Yes, I have now added two maps in vllm_ascend/spec_decode/mtp_proposer.py.

attn_metadata = attn_metadata['model.layers.0.self_attn.attn']
architecture = self.vllm_config.model_config.architecture
if architecture == "Qwen3NextForCausalLM":
    attn_metadata = attn_metadata['model.layers.3.self_attn.attn']
Collaborator

Ditto here. Don't hard-code this; instead, the global dict can map model architectures to both their corresponding module classes and attention layer names.

Contributor Author

@drslark drslark Nov 3, 2025

You are right!
But I think a function is a better choice.

Contributor Author

Yes, I have now added two maps in vllm_ascend/spec_decode/mtp_proposer.py.
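
As a rough illustration of architecture-keyed maps like the ones mentioned here, a sketch could look as follows; the map, helper, and fallback key names are assumptions based on the snippet quoted above, not the merged code.

# Sketch only: map each model architecture to the attention layer whose
# metadata the MTP proposer should read.
_MTP_ATTN_LAYER_BY_ARCH = {
    "Qwen3NextForCausalLM": "model.layers.3.self_attn.attn",
}

# Fallback mirrors the DeepSeek-style layer-0 key from the quoted snippet.
_DEFAULT_MTP_ATTN_LAYER = "model.layers.0.self_attn.attn"


def _select_attn_metadata(attn_metadata: dict, architecture: str):
    # Look up the architecture-specific layer name, falling back to the
    # default key when the architecture is not listed.
    layer_name = _MTP_ATTN_LAYER_BY_ARCH.get(architecture,
                                             _DEFAULT_MTP_ATTN_LAYER)
    return attn_metadata[layer_name]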

Comment on lines 32 to 41
with VllmRunner("Qwen/Qwen3-Next-80B-A3B-Instruct",
                tensor_parallel_size=4,
                max_model_len=4096,
                gpu_memory_utilization=0.8,
                distributed_executor_backend="mp",
                speculative_config={
                    "method": "qwen3_next_mtp",
                    "num_speculative_tokens": 1
                },
                enforce_eager=True) as vllm_model:
Collaborator

Have we tested graph mode yet? That is, without enforce_eager.

Contributor Author

Yes, I removed enforce_eager=True and it works fine.
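
For reference, a sketch of the same test run in graph mode simply drops enforce_eager=True; the VllmRunner arguments mirror the quoted test, while example_prompts and max_tokens are placeholder values, not the PR's actual test fixtures.

from tests.e2e.conftest import VllmRunner

example_prompts = ["Who are you?"]  # placeholder prompts for illustration
max_tokens = 64                     # placeholder token budget for illustration

with VllmRunner("Qwen/Qwen3-Next-80B-A3B-Instruct",
                tensor_parallel_size=4,
                max_model_len=4096,
                gpu_memory_utilization=0.8,
                distributed_executor_backend="mp",
                speculative_config={
                    "method": "qwen3_next_mtp",
                    "num_speculative_tokens": 1
                }) as vllm_model:
    # Without enforce_eager=True the model is compiled and runs in graph mode.
    vllm_model.generate_greedy(example_prompts, max_tokens)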

Comment on lines +76 to +85
@support_torch_compile
class CustomQwen3NextMTP(Qwen3NextMTP, SupportsPP):
    packed_modules_mapping = {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": ["up_proj", "down_proj"]
    }
Collaborator

I see you've added support_torch_compile for both classes. Assuming you copied them from vLLM, all the more reason to test graph mode now.

Contributor Author

Yes, I removed enforce_eager=True in test_qwen3_next.py and it works fine.

from tests.e2e.conftest import VllmRunner


def test_models_distributed_Qwen3_NEXT_MTP_TP4():
Contributor Author

I deleted test_qwen3_next_mtp.py and moved the tests into test_qwen3_next.py.

"num_speculative_tokens": 1
},
enforce_eager=True) as vllm_model:
vllm_model.generate_greedy(example_prompts, max_tokens)
Contributor Author

I have added a third test, test_models_distributed_Qwen3_NEXT_MTP_TP4_SIMILARITY, to do that.
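
A rough sketch of what such a similarity test could look like is shown below; the prompts, token budget, and the equality assertion are assumptions based on this thread, not the PR's actual test body.

from tests.e2e.conftest import VllmRunner


def test_models_distributed_Qwen3_NEXT_MTP_TP4_SIMILARITY():
    example_prompts = ["Who are you?"]  # placeholder prompts for illustration
    max_tokens = 64                     # placeholder token budget for illustration

    # Baseline: Qwen3-Next without speculative decoding.
    with VllmRunner("Qwen/Qwen3-Next-80B-A3B-Instruct",
                    tensor_parallel_size=4,
                    max_model_len=4096,
                    gpu_memory_utilization=0.8,
                    distributed_executor_backend="mp") as ref_model:
        ref_outputs = ref_model.generate_greedy(example_prompts, max_tokens)

    # Same model with the qwen3_next_mtp speculative config enabled.
    with VllmRunner("Qwen/Qwen3-Next-80B-A3B-Instruct",
                    tensor_parallel_size=4,
                    max_model_len=4096,
                    gpu_memory_utilization=0.8,
                    distributed_executor_backend="mp",
                    speculative_config={
                        "method": "qwen3_next_mtp",
                        "num_speculative_tokens": 1
                    }) as mtp_model:
        mtp_outputs = mtp_model.generate_greedy(example_prompts, max_tokens)

    # With greedy decoding the MTP path should reproduce the baseline outputs.
    assert ref_outputs == mtp_outputs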

Signed-off-by: drslark <slarksblood@qq.com>
@github-actions

github-actions bot commented Nov 4, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.
