[quantization] Add w8a16 quantization support #4541
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces support for W8A16 quantization for dense models. The changes include a new quantization method AscendW8A16LinearMethod and its integration into the existing quantization framework. The implementation is straightforward and looks correct. I have one suggestion to refactor the apply method in vllm_ascend/quantization/w8a16.py to reduce code duplication, which will improve readability and maintainability.
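For context on what the new method computes: W8A16 means int8 weights with fp16/bf16 activations, where the weight is dequantized on the fly (via an antiquant scale and offset) before the matmul. A minimal NumPy sketch of that arithmetic, for illustration only — the names mirror the `antiquant_scale`/`antiquant_offset` arguments of `npu_weight_quant_batchmatmul`, but this is host-side reference math, not the NPU kernel:

```python
import numpy as np

def w8a16_matmul(x, weight_i8, antiquant_scale, antiquant_offset, bias=None):
    """Weight-only int8 matmul: dequantize the int8 weight to the
    activation dtype, then run a standard matmul. Assumes the common
    antiquant convention w_fp = (w_i8 + offset) * scale, applied
    per output channel."""
    w = (weight_i8.astype(x.dtype) + antiquant_offset) * antiquant_scale
    out = x @ w
    if bias is not None:
        out = out + bias
    return out

# Toy shapes: [batch, in_features] @ [in_features, out_features]
x = np.random.randn(2, 4).astype(np.float32)
w_i8 = np.random.randint(-128, 127, size=(4, 3), dtype=np.int8)
scale = np.full((3,), 0.05, dtype=np.float32)   # per-output-channel scale
offset = np.zeros((3,), dtype=np.float32)
y = w8a16_matmul(x, w_i8, scale, offset)
assert y.shape == (2, 3)
```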
vllm_ascend/quantization/w8a16.py
if is_310p():
    # On 300I Duo platform, we need transpose again if
    # using nz. This transpose can be skipped in torchair.
    output = torch_npu.npu_weight_quant_batchmatmul(
        x=x,
        weight=layer.weight.data.transpose(0, 1),
        antiquant_scale=layer.weight_scale,
        antiquant_offset=layer.weight_offset,
        bias=bias
    )
else:
    output = torch_npu.npu_weight_quant_batchmatmul(
        x=x,
        weight=layer.weight,
        antiquant_scale=layer.weight_scale,
        antiquant_offset=layer.weight_offset,
        bias=bias
    )
return output
The if/else block contains duplicated calls to torch_npu.npu_weight_quant_batchmatmul. This can be refactored to improve readability and maintainability by determining the weight tensor first and then making a single call to the function.
weight = layer.weight
if is_310p():
# On 300I Duo platform, we need transpose again if
# using nz. This transpose can be skipped in torchair.
weight = layer.weight.data.transpose(0, 1)
output = torch_npu.npu_weight_quant_batchmatmul(
x=x,
weight=weight,
antiquant_scale=layer.weight_scale,
antiquant_offset=layer.weight_offset,
bias=bias
)
return output
Please fix DCO and lint; refer to https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/contribution/index.html
Signed-off-by: yyt <yangyit139@gmail.com>
Force-pushed from 8b8a945 to c3c2b2f
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
related to #4267
Does this PR introduce any user-facing change?
support w8a16 quantization now
How was this patch tested?