
[Feature]: Supporting W8A16 and W4A16 weight-only quantization #524

@learning-chip

Description


🚀 The feature, motivation and pitch

Purpose: Int8 weight-only quantization can squeeze DeepSeek-V2-Lite-Chat down to roughly 16 GB, fitting on a single 910B4 device, which is very useful for daily development. Because MoE inference is heavily memory-bound, W8A16 can in theory be as efficient as W8A8, as shown in the MARLIN paper.
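
As a rough sanity check on the ~16 GB figure, here is a back-of-envelope sketch; the ~15.7B total parameter count for DeepSeek-V2-Lite is an assumption taken from the model card, and activation / KV-cache overhead is ignored:

# Back-of-envelope weight memory for DeepSeek-V2-Lite (assumed ~15.7B total params).
total_params = 15.7e9
fp16_gib = total_params * 2 / 1024**3   # ~29 GiB in fp16/bf16
int8_gib = total_params * 1 / 1024**3   # ~15 GiB with int8 weight-only quant
print(f"fp16 weights: {fp16_gib:.1f} GiB, int8 weights: {int8_gib:.1f} GiB")

The int8 figure plus scales and runtime overhead lands close to the ~16 GB quoted above.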

GPU equivalent: this corresponds to vLLM's GPTQModel or AutoAWQ backend on GPU.

Code changes needed

torch_npu.npu_grouped_matmul accepts torch.int8 weights together with an additional antiquant_scale parameter for fused dequantization + matmul.

The MoE adaptor already uses npu_grouped_matmul; it only needs to pass the extra antiquant_scale. The current call looks like this, with a sketch of the extended call after it:

gate_up_out_list = torch_npu.npu_grouped_matmul(
    x=[sorted_hidden_states],
    weight=[w1],
    split_item=2,
    group_list_type=0,
    group_type=0,
    group_list=expert_tokens,
)
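
A minimal sketch of how that call could be extended for W8A16, assuming the weight is stored as torch.int8 and a per-output-channel dequant scale is available; w1_int8 and w1_scale are hypothetical names, and the exact dtype/shape requirements of antiquant_scale (and antiquant_offset for asymmetric quantization) should be checked against the torch_npu documentation:

# Hedged sketch: same grouped matmul, but with int8 weights plus antiquant
# scales so the kernel performs fused dequant + matmul (W8A16).
# w1_int8 / w1_scale are hypothetical names for the quantized expert weights
# and their per-channel scales, e.g. as produced by the msmodelslim w8a16 script.
gate_up_out_list = torch_npu.npu_grouped_matmul(
    x=[sorted_hidden_states],        # activations stay in fp16/bf16 (A16)
    weight=[w1_int8],                # torch.int8 expert weights (W8)
    antiquant_scale=[w1_scale],      # dequant scale applied inside the kernel
    split_item=2,
    group_list_type=0,
    group_type=0,
    group_list=expert_tokens,
)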

On GPU, the equivalent kernel is the fused dequant + matmul Marlin kernel, though there is still a bug for V2-Lite's shapes: vllm-project/vllm#7075

Obtain model weights

A DeepSeek-V2 w8a16 quantization script is available in msmodelslim (runs on CPU).

Alternatively, quantize DeepSeek with the GPTQModel package (runs on GPU); see the sketch below.
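
A minimal sketch of the GPTQModel route, assuming the GPTQModel.load / quantize / save workflow from that package's README; the model ID, calibration set, output path, and batch size are placeholders, and bits=8 with group_size=128 is one way to target W8A16 (verify the options against the GPTQModel docs):

# Hedged sketch: int8 weight-only GPTQ quantization of DeepSeek-V2-Lite-Chat.
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"    # placeholder model ID
quant_path = "DeepSeek-V2-Lite-Chat-w8a16-gptq"   # placeholder output directory

# Small calibration set; any representative text corpus works.
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(512))["text"]

# bits=8 -> int8 weight-only (W8A16).
quant_config = QuantizeConfig(bits=8, group_size=128)

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration, batch_size=2)   # runs on GPU
model.save(quant_path)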
