
Conversation

@certainly-param (Contributor) commented Sep 30, 2025

Description

Fixes #20221

This PR addresses a runtime error when running models with INT8 quantization (e.g., Qwen-2.5-VL-72B-Instruct-FP8-Dynamic) on Blackwell architecture GPUs (SM100+, like RTX 6000 Blackwell).

Root cause: Blackwell GPUs don't support INT8 operations in CUTLASS kernels - only FP8 is supported. The error manifests as:

RuntimeError: Expected a.dtype() == torch::kInt8 to be true, but got false at torch.ops._C.cutlass_scaled_mm

Changes

  1. Early validation in CutlassScaledMMLinearKernel.can_implement():

    • Detects INT8 quantization on SM100+ GPUs before model loading (a sketch follows this list)
    • Returns a clear error message with actionable guidance
  2. Improved error message in scaled_mm_helper.hpp:

    • Shows the SM version number in the error
    • Suggests FP8 quantization or older GPU architectures
  3. Documentation update in docs/features/quantization/int8.md:

    • Added a warning about the Blackwell GPU limitation
    • Directs users to the FP8 alternative
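
For reference, a minimal standalone sketch of the early check described in item 1. The is_int8_weights flag is hypothetical, and the sketch assumes vLLM's current_platform.get_device_capability() helper, whose result exposes a to_int() method (e.g. 100 for SM100); the exact config attributes and error wording in the merged change differ, and the final PR keeps the check in the C++ helper instead.

# Illustrative sketch only -- not the merged implementation.
# `is_int8_weights` is a hypothetical flag standing in for the kernel config.
from typing import Optional, Tuple

from vllm.platforms import current_platform


def can_implement_int8(is_int8_weights: bool) -> Tuple[bool, Optional[str]]:
    capability = current_platform.get_device_capability()
    if capability is None or not is_int8_weights:
        return True, None
    compute_cap = capability.to_int()  # e.g. 100 for SM100 (Blackwell)
    if compute_cap >= 100:
        return False, (
            f"INT8 CUTLASS kernels are not supported on SM{compute_cap}. "
            "Use FP8 quantization instead, or run the INT8 checkpoint on an "
            "older GPU architecture (SM90 or below).")
    return True, None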

Testing

Validated that the error messages are triggered correctly on SM100+ when INT8 quantization is attempted.

Impact

  • Users get clear, actionable errors early in the process
  • Documentation prevents confusion about INT8 support on newer GPUs
  • No impact on existing functionality for older architectures

@github-actions (bot) commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the documentation label (Improvements or additions to documentation) on Sep 30, 2025
@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request correctly addresses a runtime error when using INT8 quantization on Blackwell (SM100+) GPUs by adding checks to fail early with informative error messages. The changes are implemented across Python validation logic, C++ kernel error reporting, and documentation, providing a comprehensive fix. The logic is sound and the new error messages are helpful for users. I have one suggestion to improve code consistency and robustness in the Python validation code.

Comment on lines 34 to 35
major, _ = capability
compute_cap = major * 10 + (_ if _ < 10 else 0)
Severity: high

For consistency and clarity, it's better to use the existing capability.to_int() method to get the compute capability as an integer. This method is used elsewhere in the codebase (e.g., in w8a8_utils.py) and includes an assertion for the minor version, which makes the code more robust. The current implementation re-implements this logic and handles minor versions >= 10 in a way that might hide issues.

Suggested change
major, _ = capability
compute_cap = major * 10 + (_ if _ < 10 else 0)
compute_cap = capability.to_int()
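
For context on the suggestion above: DeviceCapability.to_int() folds the (major, minor) pair into the familiar SM number and asserts that the minor version is a single digit. A quick illustration (the import path vllm.platforms.interface is an assumption here):

# Illustrative example of the to_int() semantics referenced in the suggestion.
from vllm.platforms.interface import DeviceCapability

assert DeviceCapability(major=9, minor=0).to_int() == 90    # Hopper, SM90
assert DeviceCapability(major=10, minor=0).to_int() == 100  # Blackwell, SM100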

Commits added to this PR:

Blackwell architecture (SM100+) doesn't support INT8 quantization.
Added validation in can_implement() to catch this early with a clear
error message directing users to FP8 or older GPU architectures.

Also updated error message in scaled_mm_helper.hpp for consistency
and added docs warning about the limitation.

Fixes vllm-project#20221

Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Use the existing to_int() method instead of manually calculating
compute capability. More consistent with rest of codebase.

Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
- Remove weight_dtype check (attribute doesn't exist in config)
- Fix markdown trailing space
- Error handling is done at kernel level instead

Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Config doesn't have weight_dtype attribute. Error checking
is properly handled at kernel level in scaled_mm_helper.hpp

Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
@yewentao256 (Member) left a comment

This gives a better log for INT8, but it doesn't actually fix the problem. Is there any chance you can thoroughly fix it?

@certainly-param (Contributor, Author) commented

thoroughly

Yeah, I think the best approach is to auto-convert INT8 to FP8 on Blackwell GPUs. The hardware doesn't have INT8 tensor cores anyway (NVIDIA is moving everyone to FP8), and FP8 gives similar or better accuracy with strong performance. The conversion would happen automatically at model load with a warning, so users wouldn't need to do anything special; their models would just work. The main tradeoff is that it's not "true" INT8 anymore, but given that Blackwell doesn't support INT8, this seems like the most practical fix. I can also add a flag if you prefer users to opt in rather than having it happen automatically...
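
Purely for illustration of the proposal above (this logic was not merged, and the function below is not part of vLLM): re-expressing an INT8-quantized weight as FP8 (e4m3) at load time could look roughly like this.

# Hypothetical sketch of INT8 -> FP8 weight conversion; not merged into vLLM.
import torch


def int8_weight_to_fp8(w_int8: torch.Tensor, scale: torch.Tensor):
    # Dequantize back to the original floating-point values.
    w_fp = w_int8.to(torch.float32) * scale
    # Re-quantize to FP8 e4m3: pick a scale so the largest value fits e4m3's range.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # ~448
    new_scale = w_fp.abs().amax().clamp(min=1e-12) / fp8_max
    w_fp8 = (w_fp / new_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return w_fp8, new_scale

As discussed below, this approach was set aside for a separate issue.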

@yewentao256 (Member) commented

Thanks for the work! I am thinking "auto-convert INT8 to FP8" is not an ideal way to realize it; perhaps we can open another issue to discuss it.
For this PR, maybe change the title to "better log for int8" or something, and keep it simple.

@certainly-param (Contributor, Author) commented

Thanks for the work! I am thinking "auto-convert INT8 to FP8" is not an ideal way to realize it; perhaps we can open another issue to discuss it. For this PR, maybe change the title to "better log for int8" or something, and keep it simple.

Thanks for the feedback. You are right! Auto-conversion would add complexity and should be discussed separately. I've simplified this PR to focus just on improving the error messages for now.

In this PR:

  1. Better error message in the C++ kernel that shows the SM version and suggests alternatives
  2. Documentation update explaining the Blackwell limitation

I removed the auto-conversion logic.

@yewentao256 (Member) left a comment

LGTM, thanks for the work!

@yewentao256 yewentao256 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Sep 30, 2025
@mgoin (Member) left a comment

Thanks for the improvement

@mgoin mgoin enabled auto-merge (squash) September 30, 2025 19:37
@vllm-bot vllm-bot merged commit 99028fd into vllm-project:main Oct 1, 2025
80 of 84 checks passed
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
…25935)

Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request Oct 6, 2025
…25935)

Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…25935)

Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…25935)

Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…25935)

Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…25935)

Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
…25935)

Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Labels

documentation (Improvements or additions to documentation) · ready (ONLY add when PR is ready to merge/full CI is needed)
