
Conversation

@mengwei805 (Collaborator) commented Apr 10, 2025

What this PR does / why we need it?

Add support for MTP in the DeepSeek w8a8 quantized model.

Does this PR introduce any user-facing change?

  1. The MTP layer of DeepSeek is not quantized by the current NPU msmodelslim tool, so the MTP layer in the DeepSeek w8a8 quantized weights is still in bfloat16 format.
  2. The description file generated by the current msmodelslim tool does not contain MTP layer information. Please add it to `quantization_config` in `config.json` manually and set the value to FLOAT (see the sketch below this list).
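
For illustration, here is a minimal sketch of that manual edit. The layer key name (`model.layers.61` below) is a placeholder and not taken from this PR; use whatever module name matches the MTP layer in your own weights.

```python
# Sketch only: mark the (unquantized) MTP layer as FLOAT inside quantization_config,
# so the loader keeps it in bfloat16. The key name below is a placeholder.
import json

config_path = "path/to/DeepSeek-w8a8/config.json"

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

quant_cfg = config.setdefault("quantization_config", {})
quant_cfg["model.layers.61"] = "FLOAT"  # placeholder MTP-layer key

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)
```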

How was this patch tested?

Tested locally.

Signed-off-by: mengwei805 <mengwei25@huawei.com>
inputs_embeds = self.embed_tokens(input_ids)
assert inputs_embeds is not None
# masking inputs at position 0, as not needed by MTP
inputs_embeds = torch.where((positions == 0).unsqueeze(-1),
                            torch.zeros_like(inputs_embeds),
                            inputs_embeds)
A collaborator commented:
I noticed that this masking is done as follows in vLLM. Why do we use torch.where here? Is there any benefit?

inputs_embeds[positions == 0] = 0

@mengwei805 (Collaborator, Author) replied Apr 11, 2025:
The original vLLM approach cannot be used with torchair, so it is replaced with an equivalent one. In addition, the in-place form `inputs_embeds[positions == 0] = 0` performs poorly on Ascend devices.
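
For reference, a small stand-alone sketch (with assumed tensor shapes, not the PR's model code) showing that the two masking forms compute the same result:

```python
import torch

batch, hidden = 4, 8
positions = torch.tensor([0, 1, 2, 0])      # token positions within the batch
inputs_embeds = torch.randn(batch, hidden)

# vLLM-style in-place indexing (reported above to perform poorly on Ascend/torchair):
ref = inputs_embeds.clone()
ref[positions == 0] = 0

# torch.where form used in this PR: broadcast the position mask over the hidden dim.
out = torch.where((positions == 0).unsqueeze(-1),
                  torch.zeros_like(inputs_embeds),
                  inputs_embeds)

assert torch.equal(out, ref)  # both zero out the embeddings at position 0
```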

A collaborator replied:
Got it, thanks for this explanation

@MengqingCao (Collaborator) commented:

Overall LGTM, thanks for this! Please cherry-pick this PR to main once all the review comments are addressed.

@mengwei805 (Collaborator, Author) commented Apr 11, 2025

> Overall LGTM, thanks for this! Please cherry-pick this PR to main once all the review comments are addressed.

Update (2025-04-21): since #429 is too big, I submitted a separate PR after #429 was merged; it is #593.
Original note: this change was expected to be synchronized to the main branch in #429.

@wangxiyuan merged commit 0be96f6 into vllm-project:v0.7.3-dev on Apr 11, 2025 (17 of 19 checks passed).
wangxiyuan pushed a commit that referenced this pull request Aug 18, 2025
### What this PR does / why we need it?
This PR adopts Mooncake TransferEngine for KV cache registration and a pull_blocks-style disaggregated prefill implementation.

### Does this PR introduce any user-facing change?
No

### Dependencies
1. CANN dependencies
Using Mooncake TransferEngine with Ascend Transport requires CANN version 8.2.RC1 or higher (see Mooncake [#502](kvcache-ai/Mooncake#502) for details).

2. vllm-ascend
This PR depends on changes introduced by #950 (modifications to `model_runner_v1`) and #1361 (updates to `schedule`), both of which have been merged into the `v0.9.1-dev` branch and are expected to land in `main` shortly.

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@1c859a1

---------

Signed-off-by: leichao.lc <leichao139636@163.com>
Co-authored-by: jianzs <zheng.shoujian@outlook.com>
Co-authored-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: chris668899 <15105191595@126.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
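
For context, a schematic sketch of the pull_blocks-style flow described in the commit above: the prefill node registers its KV cache blocks, and the decode node pulls them on demand. All names here (`KVTransferStub`, `register_kv_blocks`, `pull_blocks`) are hypothetical illustrations of the flow, not the Mooncake TransferEngine API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BlockHandle:
    """Descriptor a remote peer can use to read one KV cache block."""
    addr: int
    length: int


class KVTransferStub:
    """Hypothetical stand-in for a transfer engine; illustrates the flow only."""

    def __init__(self) -> None:
        self._registry: Dict[int, BlockHandle] = {}

    def register_kv_blocks(self, blocks: Dict[int, bytes]) -> Dict[int, BlockHandle]:
        # Prefill side: expose finished KV blocks so decode peers can read them later.
        handles = {bid: BlockHandle(addr=id(buf), length=len(buf))
                   for bid, buf in blocks.items()}
        self._registry.update(handles)
        return handles

    def pull_blocks(self, block_ids: List[int]) -> List[BlockHandle]:
        # Decode side: pull only the blocks it needs, instead of having them pushed.
        return [self._registry[bid] for bid in block_ids]


engine = KVTransferStub()
engine.register_kv_blocks({0: b"kv-block-0", 1: b"kv-block-1"})
print(engine.pull_blocks([0, 1]))
```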
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025