
Conversation

@TmacAaron TmacAaron commented Nov 28, 2025

What this PR does / why we need it?

Related to #4267.

Does this PR introduce any user-facing change?

Yes, W8A16 quantization is now supported for dense models.

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for W8A16 quantization for dense models. The changes include a new quantization method AscendW8A16LinearMethod and its integration into the existing quantization framework. The implementation is straightforward and looks correct. I have one suggestion to refactor the apply method in vllm_ascend/quantization/w8a16.py to reduce code duplication, which will improve readability and maintainability.
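For reference, the weight-only W8A16 scheme that `npu_weight_quant_batchmatmul` fuses can be sketched in plain NumPy: activations stay in floating point, while the int8 weight is dequantized per output channel using the antiquant scale and offset before an ordinary matmul. This is a minimal sketch, not the PR's implementation; the `(q + offset) * scale` dequantization convention used here is an assumption, so check the Ascend operator documentation for the exact sign and placement of the offset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)            # fp activations (A16)
w_int8 = rng.integers(-128, 128, size=(8, 16), dtype=np.int8)  # int8 weight (W8)
scale = rng.uniform(0.01, 0.1, size=(16,)).astype(np.float32)  # per-output-channel antiquant scale
offset = rng.uniform(-1.0, 1.0, size=(16,)).astype(np.float32) # per-output-channel antiquant offset

# Dequantize the weight (assumed convention: (q + offset) * scale),
# then perform a regular floating-point matmul.
w_fp = (w_int8.astype(np.float32) + offset) * scale
output = x @ w_fp

print(output.shape)  # (4, 16)
```

On the NPU, the dequantize and matmul are fused into a single kernel, so the floating-point weight is never materialized; the sketch above only illustrates the math.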

Comment on lines 77 to 95
if is_310p():
    # On the 300I Duo platform, we need to transpose again if
    # using nz. This transpose can be skipped in torchair.
    output = torch_npu.npu_weight_quant_batchmatmul(
        x=x,
        weight=layer.weight.data.transpose(0, 1),
        antiquant_scale=layer.weight_scale,
        antiquant_offset=layer.weight_offset,
        bias=bias
    )
else:
    output = torch_npu.npu_weight_quant_batchmatmul(
        x=x,
        weight=layer.weight,
        antiquant_scale=layer.weight_scale,
        antiquant_offset=layer.weight_offset,
        bias=bias
    )
return output
Contributor


Severity: high

The if/else block contains duplicated calls to torch_npu.npu_weight_quant_batchmatmul. This can be refactored to improve readability and maintainability by determining the weight tensor first and then making a single call to the function.

        weight = layer.weight
        if is_310p():
            # On the 300I Duo platform, we need to transpose again if
            # using nz. This transpose can be skipped in torchair.
            weight = layer.weight.data.transpose(0, 1)

        output = torch_npu.npu_weight_quant_batchmatmul(
            x=x,
            weight=weight,
            antiquant_scale=layer.weight_scale,
            antiquant_offset=layer.weight_offset,
            bias=bias
        )
        return output

@realliujiaxu
Contributor

Signed-off-by: yyt <yangyit139@gmail.com>
@TmacAaron TmacAaron force-pushed the w8a16 branch 2 times, most recently from 8b8a945 to c3c2b2f on November 29, 2025 03:50
Signed-off-by: yyt <yangyit139@gmail.com>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.
