@pichangping pichangping commented Jun 20, 2025

What this PR does / why we need it?

Supports DeepSeek-R1 w4a8 quantization.
Since R1 w4a8 uses mixed quantization and only the MoE layers use w4a8_dynamic quantization, we added the w4a8_dynamic.py file, which includes the AscendW4A8DynamicFusedMoEMethod class.
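For readers unfamiliar with the scheme: w4a8 here means 4-bit weights with 8-bit, dynamically quantized activations, with one weight scale per group of 256 input channels (see the `group_size` setting below). The following is a minimal, illustrative sketch of group-wise int4 weight quantization only; it is not the AscendW4A8DynamicFusedMoEMethod implementation, and all names in it are made up for the example:

```python
import torch

def quantize_w4_groupwise(weight: torch.Tensor, group_size: int = 256):
    """Symmetric per-group int4 weight quantization (illustration only).

    The real w4a8_dynamic path on Ascend additionally quantizes
    activations to int8 at runtime and runs fused MoE kernels.
    """
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.reshape(out_features, in_features // group_size, group_size)
    # One scale per group; the symmetric int4 range is [-8, 7].
    scales = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    return q.reshape(out_features, in_features), scales.squeeze(-1)

def dequantize_w4_groupwise(q: torch.Tensor, scales: torch.Tensor, group_size: int = 256):
    out_features, in_features = q.shape
    qg = q.reshape(out_features, in_features // group_size, group_size).float()
    return (qg * scales.unsqueeze(-1)).reshape(out_features, in_features)

# Round-trip check on a random weight matrix.
w = torch.randn(128, 512)
q, s = quantize_w4_groupwise(w)
max_err = (dequantize_w4_groupwise(q, s) - w).abs().max()
```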

Does this PR introduce any user-facing change?

No, using `--quantization=ascend` is enough.

How was this patch tested?

1. How to get weights using ModelSlim

Installation steps

Use the master branch at commit 298e175d69b3b855111a1e09bbe2fcd12fdb4e24:
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh

The required transformers environment

pip install transformers==4.48.2

Generate w4a8 weights

cd /example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md. Follow the [pre-check](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#运行前必检) and [DeepSeek-R1 w4a8 mix quantization](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-混合量化前三层-mlpw8a8-dynamic-量化mla共享专家w8a8量化路由专家w4a8-dynamic量化) chapters.
Reference command: python3 quant_deepseek_w4a8.py --model_path {original weight path} --save_path {generated weight path} --mindie_format

Adapt to vllm-ascend

Since --mindie_format saves the weights in MindIE format, a few adaptations are needed before vllm-ascend can use them (a sketch of these edits follows below):
- Rename `quant_model_description_w8a8_dynamic.json` to `quant_model_description.json`, and change `"group_size": 0` to `"group_size": 256`.
- In `config.json`, change `"model_type": "deepseekv2"` to `"model_type": "deepseek_v3"` and remove `quantization_config`.
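A minimal sketch of the two edits as a Python helper, assuming the quantized weights live in a hypothetical directory /weightpath/w4a8 and that group_size is a top-level field of the description file (adjust if your generated JSON nests it differently):

```python
import json
import os

weight_dir = "/weightpath/w4a8"  # hypothetical path produced by quant_deepseek_w4a8.py

# 1. Rename the description file and bump group_size from 0 to 256.
src = os.path.join(weight_dir, "quant_model_description_w8a8_dynamic.json")
dst = os.path.join(weight_dir, "quant_model_description.json")
os.rename(src, dst)
with open(dst) as f:
    desc = json.load(f)
desc["group_size"] = 256  # assumes group_size is a top-level key
with open(dst, "w") as f:
    json.dump(desc, f, indent=2)

# 2. Fix model_type and drop quantization_config in config.json.
cfg_path = os.path.join(weight_dir, "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["model_type"] = "deepseek_v3"
cfg.pop("quantization_config", None)
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```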

2.How to run w4a8

TP + EP:
python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 --enable_expert_parallel --quantization ascend --port $3 --max-model-len $4 --max-num-seqs $5 --enforce-eager
e.g.: python -m vllm.entrypoints.openai.api_server --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 2048 --max-num-seqs 128 --enforce-eager
DP + TP + EP:
python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6 --enforce-eager
e.g.: python -m vllm.entrypoints.openai.api_server --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 2 -dp 2 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 2048 --max-num-seqs 128 --enforce-eager

3. Usage constraints

export VLLM_USE_V1=1  # the vLLM V1 engine is required

Signed-off-by: pichangping <1337510399@qq.com>
@wangxiyuan wangxiyuan changed the title support deepseek w4a8 quantization [0.9.1]support deepseek w4a8 quantization Jun 20, 2025
def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
    weights = filter(lambda x: ".module." not in x[0], weights)
    # weights = ((name, data) for name, data in weights if ".module." not in name)
remove the comment code

@wangxiyuan wangxiyuan merged commit 53ce4a0 into vllm-project:v0.9.1-dev Jun 20, 2025
16 checks passed
@wangxiyuan

Please refactor the w8a8 and w4a8 code together in the future to keep the code clean. Thanks

NNUCJ pushed a commit to NNUCJ/vllm-ascend that referenced this pull request Jun 23, 2025
@Yikun Yikun added the no-main label Jul 7, 2025