
Conversation

@mengwei805 (Collaborator) commented Apr 10, 2025

What this PR does / why we need it?

Add support for MTP in the DeepSeek w8a8 quantized model.

Does this PR introduce any user-facing change?

  1. The MTP layer of DeepSeek is not quantized by the current NPU msmodelslim tool, so the MTP layer in the DeepSeek w8a8 quantized weights is still in bfloat16 format.
  2. The description file generated by the current msmodelslim tool does not contain MTP layer information. Please add it to `quantization_config` in `config.json` manually and set the value to FLOAT (see the sketch below this list).
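
For illustration, here is a minimal sketch of that manual edit. The layer key name (`model.layers.61` below) is a placeholder and not taken from this PR; use whatever module name matches the MTP layer in your own weights.

```python
# Sketch only: mark the (unquantized) MTP layer as FLOAT inside quantization_config,
# so the loader keeps it in bfloat16. The key name below is a placeholder.
import json

config_path = "path/to/DeepSeek-w8a8/config.json"

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

quant_cfg = config.setdefault("quantization_config", {})
quant_cfg["model.layers.61"] = "FLOAT"  # placeholder MTP-layer key

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)
```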

How was this patch tested?

Tested locally.

Signed-off-by: mengwei805 <mengwei25@huawei.com>
inputs_embeds = self.embed_tokens(input_ids)
assert inputs_embeds is not None
# masking inputs at position 0, as not needed by MTP
inputs_embeds = torch.where((positions == 0).unsqueeze(-1),
                            torch.zeros_like(inputs_embeds),
                            inputs_embeds)
A collaborator commented:
I noticed that this masking is done as follows in vLLM. Why do we use torch.where here? Is there any benefit?

inputs_embeds[positions == 0] = 0

@mengwei805 (Collaborator, Author) replied Apr 11, 2025:
The original vLLM approach cannot be used with torchair, so it is replaced with an equivalent one. In addition, the in-place form `inputs_embeds[positions == 0] = 0` performs poorly on Ascend devices.
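
For reference, a small stand-alone sketch (with assumed tensor shapes, not the PR's model code) showing that the two masking forms compute the same result:

```python
import torch

batch, hidden = 4, 8
positions = torch.tensor([0, 1, 2, 0])      # token positions within the batch
inputs_embeds = torch.randn(batch, hidden)

# vLLM-style in-place indexing (reported above to perform poorly on Ascend/torchair):
ref = inputs_embeds.clone()
ref[positions == 0] = 0

# torch.where form used in this PR: broadcast the position mask over the hidden dim.
out = torch.where((positions == 0).unsqueeze(-1),
                  torch.zeros_like(inputs_embeds),
                  inputs_embeds)

assert torch.equal(out, ref)  # both zero out the embeddings at position 0
```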

A collaborator replied:
Got it, thanks for this explanation

@MengqingCao (Collaborator) commented:

Overall LGTM, thanks for this! Please cherry-pick this PR to main once all the review comments are addressed.

@mengwei805 (Collaborator, Author) commented Apr 11, 2025

> Overall LGTM, thanks for this! Please cherry-pick this PR to main once all the review comments are addressed.

Update (2025-04-21): since #429 is too big, I submitted a separate PR after #429 was merged; it is #593.
Original note: this change was expected to be synchronized to the main branch in #429.

@wangxiyuan merged commit 0be96f6 into vllm-project:v0.7.3-dev on Apr 11, 2025 (17 of 19 checks passed).
wangxiyuan pushed a commit that referenced this pull request Aug 18, 2025
### What this PR does / why we need it?
This PR adopts Mooncake TransferEngine for KV cache registration and a pull_blocks-style disaggregated prefill implementation.

### Does this PR introduce any user-facing change?
No

### Dependencies
1. CANN dependencies
Using Mooncake TransferEngine with Ascend Transport requires CANN version 8.2.RC1 or higher (see Mooncake [#502](kvcache-ai/Mooncake#502) for details).

2. vllm-ascend
This PR depends on changes introduced by #950 (modifications to `model_runner_v1`) and #1361 (updates to `schedule`), both of which have been merged into the `v0.9.1-dev` branch and are expected to land in `main` shortly.

### How was this patch tested?


- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@1c859a1

---------

Signed-off-by: leichao.lc <leichao139636@163.com>
Co-authored-by: jianzs <zheng.shoujian@outlook.com>
Co-authored-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: chris668899 <15105191595@126.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>
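
For context, a schematic sketch of the pull_blocks-style flow described in the commit above: the prefill node registers its KV cache blocks, and the decode node pulls them on demand. All names here (`KVTransferStub`, `register_kv_blocks`, `pull_blocks`) are hypothetical illustrations of the flow, not the Mooncake TransferEngine API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BlockHandle:
    """Descriptor a remote peer can use to read one KV cache block."""
    addr: int
    length: int


class KVTransferStub:
    """Hypothetical stand-in for a transfer engine; illustrates the flow only."""

    def __init__(self) -> None:
        self._registry: Dict[int, BlockHandle] = {}

    def register_kv_blocks(self, blocks: Dict[int, bytes]) -> Dict[int, BlockHandle]:
        # Prefill side: expose finished KV blocks so decode peers can read them later.
        handles = {bid: BlockHandle(addr=id(buf), length=len(buf))
                   for bid, buf in blocks.items()}
        self._registry.update(handles)
        return handles

    def pull_blocks(self, block_ids: List[int]) -> List[BlockHandle]:
        # Decode side: pull only the blocks it needs, instead of having them pushed.
        return [self._registry[bid] for bid in block_ids]


engine = KVTransferStub()
engine.register_kv_blocks({0: b"kv-block-0", 1: b"kv-block-1"})
print(engine.pull_blocks([0, 1]))
```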
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025