[0.9.1]support deepseek w4a8 quantization #1320

pichangping · 2025-06-20T09:30:52Z

What this PR does / why we need it?

Adds support for DeepSeek-R1 w4a8 quantization.
DeepSeek-R1 w4a8 is a mixed-quantization scheme in which only the MoE layers use w4a8_dynamic quantization, so this PR adds the file w4a8_dynamic.py, which contains the AscendW4A8DynamicFusedMoEMethod class.

Does this PR introduce any user-facing change?

No. Passing --quantization=ascend is enough.

How was this patch tested?

1. How to get weights using Modelslim

Installation steps

Use the master branch at commit 298e175d69b3b855111a1e09bbe2fcd12fdb4e24:
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh

The required transformers environment

pip install transformers==4.48.2

Generate w4a8 weights

cd /example/DeepSeek
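Command reference: msmodelslim/example/DeepSeek/README.md. Follow the pre-check section (https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#运行前必检) and the DeepSeek-R1 w4a8 mixed quantization section (https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-混合量化前三层-mlpw8a8-dynamic-量化mla共享专家w8a8量化路由专家w4a8-dynamic量化).
Reference command: python3 quant_deepseek_w4a8.py --model_path {original weight path} --save_path {generated weight path} --mindie_format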

Adapt to vllm-ascend

Because --mindie_format produces weights in MindIE format, a few adaptations are needed before vllm-ascend can use them:
Rename quant_model_description_w8a8_dynamic.json to quant_model_description.json, and change "group_size": 0 to "group_size": 256.
In config.json, change "model_type":deepseekv2 to "model_type":deepseek_v3 and remove quantization_config.
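The adaptation steps above can be scripted. The following is a minimal sketch, not part of this PR: the function name `adapt_mindie_weights` is hypothetical, and it assumes that `group_size` is a top-level key of the description file and that `model_type` and `quantization_config` are top-level keys of `config.json`.

```python
import json
from pathlib import Path


def adapt_mindie_weights(weight_dir: str, group_size: int = 256) -> None:
    """Adapt MindIE-format w4a8 weights in place for vllm-ascend (hypothetical helper)."""
    root = Path(weight_dir)

    # Step 1: rename the quantization description file.
    src = root / "quant_model_description_w8a8_dynamic.json"
    dst = root / "quant_model_description.json"
    if src.exists():
        src.rename(dst)

    # Step 2: set group_size (assumed to be a top-level key).
    desc = json.loads(dst.read_text(encoding="utf-8"))
    desc["group_size"] = group_size
    dst.write_text(json.dumps(desc, indent=2, ensure_ascii=False), encoding="utf-8")

    # Step 3: fix model_type and drop quantization_config in config.json.
    cfg_path = root / "config.json"
    cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
    cfg["model_type"] = "deepseek_v3"
    cfg.pop("quantization_config", None)
    cfg_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False), encoding="utf-8")
```

Calling `adapt_mindie_weights("{Generate weight path}")` on the directory passed as `--save_path` applies all three changes. Back up the directory first, since the files are rewritten in place.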

2. How to run w4a8

TP + EP:
python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 --enable_expert_parallel --quantization ascend --port $3 --max-model-len $4 --max-num-seqs $5 --enforce-eager
e.g.: python -m vllm.entrypoints.openai.api_server --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 2048 --max-num-seqs 128 --enforce-eager
DP + TP + EP:
python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6 --enforce-eager
e.g.: python -m vllm.entrypoints.openai.api_server --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 2 -dp 2 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 2048 --max-num-seqs 128 --enforce-eager

3. Usage constraints

export VLLM_USE_V1=1 # use the vLLM V1 engine
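The run commands and the V1 constraint can be combined in a small launcher. This is only a sketch under assumptions: the helper names `build_server_command` and `launch` are hypothetical, and the flags are copied from the commands listed in this description rather than checked against a particular vLLM release.

```python
import os
import subprocess
import sys


def build_server_command(model, tp, port, max_model_len, max_num_seqs, dp=None):
    """Build the api_server command; pass dp to get the DP + TP + EP variant."""
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        f"--model={model}",
        "--trust-remote-code",
        "-tp", str(tp),
    ]
    if dp is not None:
        cmd += ["-dp", str(dp)]
    cmd += [
        "--enable_expert_parallel",
        "--quantization", "ascend",
        "--port", str(port),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--enforce-eager",
    ]
    return cmd


def launch(cmd):
    """Start the server with VLLM_USE_V1=1 set in its environment."""
    env = dict(os.environ, VLLM_USE_V1="1")
    return subprocess.Popen(cmd, env=env)


if __name__ == "__main__":
    # Only prints the command; call launch(cmd) to actually start the server.
    cmd = build_server_command("/weightpath/w4a8_4_layer", tp=4, port=8002,
                               max_model_len=2048, max_num_seqs=128)
    print(" ".join(cmd))
```

Keeping command construction separate from `launch` makes it possible to inspect or log the exact command before any NPU resources are claimed.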

Signed-off-by: pichangping <1337510399@qq.com>

wangxiyuan · 2025-06-20T13:22:16Z

vllm_ascend/models/deepseek_v2.py

+    def load_weights(self, weights: Iterable[tuple[str,
+                                                   torch.Tensor]]) -> set[str]:
+        weights = filter(lambda x: ".module." not in x[0], weights)
+        # weights = ((name, data) for name, data in weights if ".module." not in name)


Remove the commented-out code.
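For context, here is a minimal sketch of what the retained `filter` line does, using placeholder integers instead of tensors. The function name `drop_module_weights` and the weight names are hypothetical and chosen only for illustration.

```python
from typing import Any, Iterable, Iterator


def drop_module_weights(weights: Iterable[tuple[str, Any]]) -> Iterator[tuple[str, Any]]:
    """Skip every (name, data) pair whose name contains ".module."."""
    return filter(lambda x: ".module." not in x[0], weights)


# Hypothetical weight names; the values stand in for torch.Tensor objects.
sample = [
    ("model.layers.0.mlp.gate_proj.weight", 0),
    ("model.layers.0.mlp.module.extra", 1),
    ("model.layers.1.self_attn.q_proj.weight", 2),
]

kept = list(drop_module_weights(sample))
# The generator expression from the commented-out line gives the same result.
same = [(name, data) for name, data in sample if ".module." not in name]
assert kept == same
```

Because both forms select the same entries, keeping the second one as a comment adds nothing.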

wangxiyuan · 2025-06-20T13:27:44Z

Please refactor the w8a8 and w4a8 code together in the future to keep the code clean. Thanks.


support deepseek w4a8 quantization

8429b65

Signed-off-by: pichangping <1337510399@qq.com>

github-actions bot added module:tests module:quantization labels Jun 20, 2025

pichangping added 3 commits June 20, 2025 17:50

support deepseek w4a8

72b9387

Signed-off-by: pichangping <1337510399@qq.com>

support deepseek w4a8 quantization

f55f7af

Signed-off-by: pichangping <1337510399@qq.com>

support deepseek w4a8 quantization

682c4fb

Signed-off-by: pichangping <1337510399@qq.com>

wangxiyuan changed the title ~~support deepseek w4a8 quantization~~ [0.9.1]support deepseek w4a8 quantization Jun 20, 2025

wangxiyuan mentioned this pull request Jun 20, 2025

[release] 0.9.1rc1 release checklist #1315

Closed

29 tasks

wangxiyuan approved these changes Jun 20, 2025


wangxiyuan merged commit 53ce4a0 into vllm-project:v0.9.1-dev Jun 20, 2025
16 checks passed

Yikun added the no-main label Jul 7, 2025
