Fix INT8 quantization error on Blackwell GPUs (SM100+) #25935
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run the full CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request correctly addresses a runtime error when using INT8 quantization on Blackwell (SM100+) GPUs by adding checks to fail early with informative error messages. The changes are implemented across Python validation logic, C++ kernel error reporting, and documentation, providing a comprehensive fix. The logic is sound and the new error messages are helpful for users. I have one suggestion to improve code consistency and robustness in the Python validation code.
```python
major, _ = capability
compute_cap = major * 10 + (_ if _ < 10 else 0)
```
For consistency and clarity, it's better to use the existing capability.to_int() method to get the compute capability as an integer. This method is used elsewhere in the codebase (e.g., in w8a8_utils.py) and includes an assertion for the minor version, which makes the code more robust. The current implementation re-implements this logic and handles minor versions >= 10 in a way that might hide issues.
Suggested change:

```diff
-major, _ = capability
-compute_cap = major * 10 + (_ if _ < 10 else 0)
+compute_cap = capability.to_int()
```
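For context, a minimal sketch (assumptions only, not the PR's actual diff) of how such an early check could look once `capability.to_int()` is used; the helper name and error text are illustrative.

```python
# Minimal sketch of an early SM100+ rejection for INT8, using the
# capability.to_int() approach suggested above. The helper name and the
# exact message are illustrative, not taken from the PR.
from typing import Optional, Tuple

from vllm.platforms import current_platform


def _int8_cutlass_supported() -> Tuple[bool, Optional[str]]:
    capability = current_platform.get_device_capability()
    if capability is None:
        # Capability unknown (e.g. no GPU visible); defer to the kernel check.
        return True, None
    compute_cap = capability.to_int()  # e.g. (9, 0) -> 90, (10, 0) -> 100
    if compute_cap >= 100:
        return False, (
            "CUTLASS INT8 kernels are not supported on Blackwell (SM100+) "
            "GPUs; use FP8 quantization or an older GPU architecture."
        )
    return True, None
```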
Blackwell architecture (SM100+) doesn't support INT8 quantization. Added validation in can_implement() to catch this early with a clear error message directing users to FP8 or older GPU architectures. Also updated error message in scaled_mm_helper.hpp for consistency and added docs warning about the limitation. Fixes vllm-project#20221 Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Use the existing to_int() method instead of manually calculating compute capability. More consistent with rest of codebase. Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
- Remove weight_dtype check (attribute doesn't exist in config)
- Fix markdown trailing space
- Error handling is done at kernel level instead

Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Config doesn't have weight_dtype attribute. Error checking is properly handled at kernel level in scaled_mm_helper.hpp. Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Force-pushed from 8fc9260 to dd0a7b0
yewentao256 left a comment
This gives a better log for INT8, but it doesn't actually fix the problem. Is there any chance you can thoroughly fix it?
Yeah, I think the best approach is to auto-convert INT8 to FP8 on Blackwell GPUs. The hardware doesn't have INT8 tensor cores anyway (NVIDIA is moving everyone to FP8), and FP8 actually gives similar or better accuracy with great performance. The conversion would happen automatically at model load with a warning, so users don't need to do anything special; their models just work. The main tradeoff is that it's not "true" INT8 anymore. But given that Blackwell doesn't support INT8, this seems like the most practical fix. I can also add a flag if you prefer users to opt in rather than having it happen automatically...
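For illustration only, a rough sketch of what that fallback (with an opt-in flag) could look like; none of this was merged, and the function name, string dtypes, and flag are hypothetical.

```python
# Hypothetical sketch of the auto-fallback idea discussed above (not merged).
# Function name, string dtypes, and the allow_fallback flag are assumptions.
import logging

logger = logging.getLogger(__name__)


def resolve_quant_dtype(requested: str, compute_cap: int,
                        allow_fallback: bool = True) -> str:
    """Fall back from INT8 to FP8 on SM100+ GPUs, emitting a warning."""
    if requested == "int8" and compute_cap >= 100:
        if not allow_fallback:
            raise ValueError("INT8 quantization is not supported on SM100+ GPUs")
        logger.warning(
            "INT8 quantization is not supported on SM%d; falling back to FP8.",
            compute_cap,
        )
        return "fp8"
    return requested
```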
Thanks for the work! I'm thinking "auto-convert INT8 to FP8" is not an ideal way to realize it; perhaps we can have another issue discussing it.
Force-pushed from 0d4702e to dd0a7b0
Thanks for the feedback. You are right! Auto-conversion would add complexity and should be discussed separately. I've simplified this PR to focus just on improving the error messages for now. In this PR, I removed the auto-conversion logic.
yewentao256 left a comment
LGTM, thanks for the work!
mgoin left a comment
Thanks for the improvement
Description
Fixes #20221
This PR addresses a runtime error when running models with INT8 quantization (e.g., Qwen-2.5-VL-72B-Instruct-FP8-Dynamic) on Blackwell architecture GPUs (SM100+, like RTX 6000 Blackwell).
Root cause: Blackwell GPUs don't support INT8 operations in CUTLASS kernels - only FP8 is supported. The error manifests as:
```text
RuntimeError: Expected a.dtype() == torch::kInt8 to be true, but got false (at torch.ops._C.cutlass_scaled_mm)
```
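As a quick way to tell whether a machine is affected, the compute capability can be checked with standard PyTorch APIs (a convenience snippet, not part of the PR):

```python
# Standalone check: is the current GPU Blackwell-class (SM100+)?
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    if major * 10 + minor >= 100:
        print("SM100+ GPU detected: CUTLASS INT8 kernels are unavailable; use FP8.")
    else:
        print(f"Compute capability {major}.{minor}: INT8 kernels should be usable.")
```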
Changes

- Early validation in `CutlassScaledMMLinearKernel.can_implement()`
- Improved error message in `scaled_mm_helper.hpp`
- Documentation update in `docs/features/quantization/int8.md`

Testing
Validated that the error messages are triggered correctly on SM100+ when INT8 quantization is attempted.
Impact