🚀 The feature, motivation and pitch
Purpose: Int8 weight-only quantization can squeeze DeepSeek-V2-Lite-Chat down to about 16 GB, fitting on a single 910B4 device, which is very useful for daily development. Because MoE inference is heavily memory-bound, W8A16 can in theory be as efficient as W8A8, as shown in the MARLIN paper.
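As a rough sanity check on the 16 GB figure, here is a back-of-the-envelope sketch assuming DeepSeek-V2-Lite's roughly 15.7B total parameters:

```python
# Int8 weight-only: one byte per weight, activations stay in fp16/bf16.
total_params = 15.7e9                 # approximate DeepSeek-V2-Lite total parameter count
weight_bytes = total_params * 1       # 1 byte per int8 weight
print(f"~{weight_bytes / 1e9:.1f} GB of weights")   # ~15.7 GB -> fits on one 910B4
```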
GPU equivalent: this corresponds to vLLM's GPU-side GPTQModel or AutoAWQ backends.
Code changes needed
torch_npu.npu_grouped_matmul accepts torch.int8 weights and an additional antiquant_scale parameter for fused dequantization + matmul.
The MoE adaptor already uses npu_grouped_matmul; it only needs to pass the extra antiquant_scale (see the sketch after the snippet below):
vllm-ascend/vllm_ascend/ops/fused_moe.py, lines 124 to 131 at 5fa70b6:

```python
gate_up_out_list = torch_npu.npu_grouped_matmul(
    x=[sorted_hidden_states],
    weight=[w1],
    split_item=2,
    group_list_type=0,
    group_type=0,
    group_list=expert_tokens,
)
```
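A minimal sketch of the change, assuming w1 is stored as torch.int8 and a per-expert scale tensor (the name w1_scale here is hypothetical) is loaded from the quantized checkpoint; antiquant_scale is passed as described above:

```python
# Sketch only: w1 holds torch.int8 weights, w1_scale (hypothetical name) holds
# the dequantization scales loaded alongside the quantized weights.
gate_up_out_list = torch_npu.npu_grouped_matmul(
    x=[sorted_hidden_states],
    weight=[w1],                    # int8 weights
    antiquant_scale=[w1_scale],     # fused dequant inside the grouped matmul
    split_item=2,
    group_list_type=0,
    group_type=0,
    group_list=expert_tokens,
)
```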
On GPU, the equivalent kernel is the fused dequant + matmul Marlin kernel, but it still has a bug for V2-Lite's shape: vllm-project/vllm#7075
Obtain model weights
- A DeepSeek-V2 W8A16 quantization script is available in msmodelslim (runs on CPU).
- Alternatively, quantize DeepSeek with the GPTQModel package (runs on GPU); see the sketch after this list.
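For the GPTQModel route, a minimal sketch following GPTQModel's documented quantize/save workflow (exact API may vary by version); the model id, output path, calibration text, and group size are placeholders, and bits=8 gives an int8 weight-only (W8A16) checkpoint:

```python
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"   # placeholder model id
save_path = "DeepSeek-V2-Lite-Chat-int8-gptq"    # placeholder output dir

# 8-bit weights, activations stay fp16/bf16; group_size is a common choice,
# adjust to whatever the NPU kernel expects.
quant_config = QuantizeConfig(bits=8, group_size=128)

# A real run needs a few hundred calibration samples; the two strings here
# only illustrate the expected input type.
calibration_data = [
    "DeepSeek-V2-Lite is a Mixture-of-Experts language model.",
    "Weight-only int8 quantization keeps activations in fp16.",
]

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_data)
model.save(save_path)
```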