Fix INT8 quantization error on Blackwell GPUs (SM100+) #25935
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run the full CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request correctly addresses a runtime error when using INT8 quantization on Blackwell (SM100+) GPUs by adding checks to fail early with informative error messages. The changes are implemented across Python validation logic, C++ kernel error reporting, and documentation, providing a comprehensive fix. The logic is sound and the new error messages are helpful for users. I have one suggestion to improve code consistency and robustness in the Python validation code.
```python
major, _ = capability
compute_cap = major * 10 + (_ if _ < 10 else 0)
```
For consistency and clarity, it's better to use the existing capability.to_int() method to get the compute capability as an integer. This method is used elsewhere in the codebase (e.g., in w8a8_utils.py) and includes an assertion for the minor version, which makes the code more robust. The current implementation re-implements this logic and handles minor versions >= 10 in a way that might hide issues.
Suggested change:

```diff
-major, _ = capability
-compute_cap = major * 10 + (_ if _ < 10 else 0)
+compute_cap = capability.to_int()
```
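For context, a minimal sketch (assumptions only, not the PR's actual diff) of how such an early check could look once `capability.to_int()` is used; the helper name and error text are illustrative.

```python
# Minimal sketch of an early SM100+ rejection for INT8, using the
# capability.to_int() approach suggested above. The helper name and the
# exact message are illustrative, not taken from the PR.
from typing import Optional, Tuple

from vllm.platforms import current_platform


def _int8_cutlass_supported() -> Tuple[bool, Optional[str]]:
    capability = current_platform.get_device_capability()
    if capability is None:
        # Capability unknown (e.g. no GPU visible); defer to the kernel check.
        return True, None
    compute_cap = capability.to_int()  # e.g. (9, 0) -> 90, (10, 0) -> 100
    if compute_cap >= 100:
        return False, (
            "CUTLASS INT8 kernels are not supported on Blackwell (SM100+) "
            "GPUs; use FP8 quantization or an older GPU architecture."
        )
    return True, None
```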
Blackwell architecture (SM100+) doesn't support INT8 quantization. Added validation in can_implement() to catch this early with a clear error message directing users to FP8 or older GPU architectures. Also updated error message in scaled_mm_helper.hpp for consistency and added docs warning about the limitation. Fixes vllm-project#20221 Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Use the existing to_int() method instead of manually calculating compute capability. More consistent with rest of codebase. Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
- Remove weight_dtype check (attribute doesn't exist in config)
- Fix markdown trailing space
- Error handling is done at kernel level instead

Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Config doesn't have weight_dtype attribute. Error checking is properly handled at kernel level in scaled_mm_helper.hpp. Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Signed-off-by: padg9912 <phone.and.desktop@gmail.com>
Force-pushed from 8fc9260 to dd0a7b0
yewentao256 left a comment
This gives a better log for INT8, but it doesn't actually fix the problem. Is there any chance you can thoroughly fix it?
Yeah, I think the best approach is to auto-convert INT8 to FP8 on Blackwell GPUs. The hardware doesn't have INT8 tensor cores anyway (NVIDIA is moving everyone to FP8), and FP8 actually gives similar or better accuracy with great performance. The conversion would happen automatically at model load with a warning, so users don't need to do anything special; their models just work. The main tradeoff is that it's not "true" INT8 anymore. But given that Blackwell doesn't support INT8, this seems like the most practical fix. I can also add a flag if you prefer users to opt in rather than having it happen automatically...
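For illustration only, a rough sketch of what that fallback (with an opt-in flag) could look like; none of this was merged, and the function name, string dtypes, and flag are hypothetical.

```python
# Hypothetical sketch of the auto-fallback idea discussed above (not merged).
# Function name, string dtypes, and the allow_fallback flag are assumptions.
import logging

logger = logging.getLogger(__name__)


def resolve_quant_dtype(requested: str, compute_cap: int,
                        allow_fallback: bool = True) -> str:
    """Fall back from INT8 to FP8 on SM100+ GPUs, emitting a warning."""
    if requested == "int8" and compute_cap >= 100:
        if not allow_fallback:
            raise ValueError("INT8 quantization is not supported on SM100+ GPUs")
        logger.warning(
            "INT8 quantization is not supported on SM%d; falling back to FP8.",
            compute_cap,
        )
        return "fp8"
    return requested
```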
Thanks for the work! I'm thinking "auto-convert INT8 to FP8" is not an ideal way to realize it; perhaps we can have another issue discussing it.
Force-pushed from 0d4702e to dd0a7b0
Thanks for the feedback. You are right! Auto-conversion would add complexity and should be discussed separately. I've simplified this PR to focus just on improving the error messages for now. In this PR, I removed the auto-conversion logic.
yewentao256 left a comment
LGTM, thanks for the work!
mgoin left a comment
Thanks for the improvement
Description
Fixes #20221
This PR addresses a runtime error when running models with INT8 quantization (e.g., Qwen-2.5-VL-72B-Instruct-FP8-Dynamic) on Blackwell architecture GPUs (SM100+, like RTX 6000 Blackwell).
Root cause: Blackwell GPUs don't support INT8 operations in CUTLASS kernels - only FP8 is supported. The error manifests as:
```text
RuntimeError: Expected a.dtype() == torch::kInt8 to be true, but got false (at torch.ops._C.cutlass_scaled_mm)
```
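As a quick way to tell whether a machine is affected, the compute capability can be checked with standard PyTorch APIs (a convenience snippet, not part of the PR):

```python
# Standalone check: is the current GPU Blackwell-class (SM100+)?
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    if major * 10 + minor >= 100:
        print("SM100+ GPU detected: CUTLASS INT8 kernels are unavailable; use FP8.")
    else:
        print(f"Compute capability {major}.{minor}: INT8 kernels should be usable.")
```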
Changes

- Early validation in `CutlassScaledMMLinearKernel.can_implement()`
- Improved error message in `scaled_mm_helper.hpp`
- Documentation update in `docs/features/quantization/int8.md`

Testing
Validated that the error messages are triggered correctly on SM100+ when INT8 quantization is attempted.
Impact