[Bug] Fix Weight Loading for Block FP8 Cutlass SM90 #25909

yewentao256 · 2025-09-29T22:14:00Z

Purpose

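Fix weight loading for block FP8 on the Cutlass SM90 path, which previously failed with "RuntimeError: b_scale_group_shape must be [128, 128]" (see the log under Original).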

Test Plan

vllm serve deepseek-ai/DeepSeek-V3.2-Exp -tp 8 --max-num-seqs 128 --load-format dummy --enforce_eager

Original

16:37:34 [multiproc_executor.py:671]     output = w8a8_blockscale_func(q_input, weight, x_scale, weight_scale,
(APIServer pid=1) (EngineCore_DP0 pid=293) (Worker_TP3 pid=305) ERROR 09-29 16:37:34 [multiproc_executor.py:671]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) (EngineCore_DP0 pid=293) (Worker_TP3 pid=305) ERROR 09-29 16:37:34 [multiproc_executor.py:671]   File "/opt/vllm-source/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 46, in cutlass_scaled_mm
(APIServer pid=1) (EngineCore_DP0 pid=293) (Worker_TP3 pid=305) ERROR 09-29 16:37:34 [multiproc_executor.py:671]     return ops.cutlass_scaled_mm(
(APIServer pid=1) (EngineCore_DP0 pid=293) (Worker_TP3 pid=305) ERROR 09-29 16:37:34 [multiproc_executor.py:671]            ^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) (EngineCore_DP0 pid=293) (Worker_TP3 pid=305) ERROR 09-29 16:37:34 [multiproc_executor.py:671]   File "/opt/vllm-source/vllm/_custom_ops.py", line 667, in cutlass_scaled_mm
(APIServer pid=1) (EngineCore_DP0 pid=293) (Worker_TP3 pid=305) ERROR 09-29 16:37:34 [multiproc_executor.py:671]     torch.ops._C.cutlass_scaled_mm(out, a, b, scale_a, scale_b, bias)
(APIServer pid=1) (EngineCore_DP0 pid=293) (Worker_TP3 pid=305) ERROR 09-29 16:37:34 [multiproc_executor.py:671]   File "/opt/vllm/lib64/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
(APIServer pid=1) (EngineCore_DP0 pid=293) (Worker_TP3 pid=305) ERROR 09-29 16:37:34 [multiproc_executor.py:671]     return self._op(*args, **kwargs)
(APIServer pid=1) (EngineCore_DP0 pid=293) (Worker_TP3 pid=305) ERROR 09-29 16:37:34 [multiproc_executor.py:671]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) (EngineCore_DP0 pid=293) (Worker_TP3 pid=305) ERROR 09-29 16:37:34 [multiproc_executor.py:671] RuntimeError: b_scale_group_shape must be [128, 128].
(APIServer pid=1) (EngineCore_DP0 pid=293) (Worker_TP3 pid=305) ERROR 09-29 16:37:34 [multiproc_executor.py:671]
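The error above says the weight scale does not describe 128 x 128 blocks. As a rough illustration of that invariant only (not the check the CUTLASS kernel or vLLM actually performs), the sketch below computes the scale shape a 2-D weight would need under 128 x 128 block quantization. The function names and the example weight shape are hypothetical and chosen for illustration.

```python
import math

# Block size named in the error message; everything else here is an assumption.
BLOCK = (128, 128)


def expected_scale_shape(weight_shape, block=BLOCK):
    """Return the scale-tensor shape for a 2-D weight with one scale per block."""
    n, k = weight_shape
    return (math.ceil(n / block[0]), math.ceil(k / block[1]))


def check_block_scale(weight_shape, scale_shape, block=BLOCK):
    """Raise ValueError if the scale shape does not match the block layout."""
    expected = expected_scale_shape(weight_shape, block)
    if tuple(scale_shape) != expected:
        raise ValueError(
            f"scale shape {tuple(scale_shape)} does not match {expected} "
            f"for weight {tuple(weight_shape)} with block {block}"
        )


if __name__ == "__main__":
    # Hypothetical weight shape, used only to exercise the helper.
    check_block_scale((7168, 2048), (56, 16))  # consistent layout, no error
    try:
        check_block_scale((7168, 2048), (16, 56))  # transposed scale layout
    except ValueError as exc:
        print(exc)
```

A scale tensor that is transposed or otherwise reshaped during weight loading would fail a check of this kind, which is one plausible way to end up with a group-shape mismatch at GEMM time.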

Now

(APIServer pid=3511336) INFO 09-29 22:03:11 [launcher.py:42] Route: /invocations, Methods: POST
(APIServer pid=3511336) INFO 09-29 22:03:11 [launcher.py:42] Route: /metrics, Methods: GET
(APIServer pid=3511336) INFO:     Started server process [3511336]
(APIServer pid=3511336) INFO:     Waiting for application startup.
(APIServer pid=3511336) INFO:     Application startup complete.

Signed-off-by: yewentao256 <zhyanwentao@126.com>

gemini-code-assist

Code Review

This pull request correctly fixes a bug in the weight loading logic for FP8 block-quantized models on SM90 GPUs, which could cause a runtime error. The refactoring also improves code clarity. However, I've identified a related issue where torch.bfloat16 is hardcoded when checking if DeepGEMM should be used. This could lead to incorrect behavior for models using other data types like float16. I've suggested a fix to use the layer's original data type instead.
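The dtype concern in this review can be sketched in isolation. The example below is a simplified illustration, not vLLM's code: the function names are hypothetical, dtypes are represented as plain strings, and the set of dtypes the fast path accepts is an assumption made for the example.

```python
# Assumption for illustration: the fast path only supports bfloat16 outputs.
ASSUMED_FAST_PATH_DTYPES = frozenset({"bfloat16"})


def use_fast_path_hardcoded():
    """Problematic pattern: the check always asks about bfloat16."""
    output_dtype = "bfloat16"  # hardcoded, ignores what the layer really uses
    return output_dtype in ASSUMED_FAST_PATH_DTYPES


def use_fast_path(layer_dtype):
    """Suggested pattern: ask about the layer's original dtype instead."""
    return layer_dtype in ASSUMED_FAST_PATH_DTYPES


if __name__ == "__main__":
    # A float16 layer is wrongly routed to the fast path by the hardcoded check.
    print(use_fast_path_hardcoded())   # True regardless of the layer
    print(use_fast_path("float16"))    # False
    print(use_fast_path("bfloat16"))   # True
```

Passing the layer's dtype keeps the kernel-selection decision consistent with the data the layer will actually produce.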

vllm/model_executor/layers/quantization/utils/fp8_utils.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

mgoin

Thanks for fixing this, makes sense why it failed

youkaichao

thanks for fixing it!

Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: simon-mo <simon.mo@hey.com>

Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>

Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>


fix weight loading for cutlass sm90

d381fc9

Signed-off-by: yewentao256 <zhyanwentao@126.com>

yewentao256 requested review from mgoin, robertgshaw2-redhat and tlrmchlsmth as code owners September 29, 2025 22:14

gemini-code-assist bot reviewed Sep 29, 2025

vllm/model_executor/layers/quantization/utils/fp8_utils.py (outdated)

Update vllm/model_executor/layers/quantization/utils/fp8_utils.py

6398ba9

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

heheda12345 added this to the v0.11.0 Cherry Picks milestone Sep 29, 2025

mgoin approved these changes Sep 29, 2025

mgoin added the bug, ready, and deepseek labels Sep 29, 2025

youkaichao approved these changes Sep 29, 2025

mgoin changed the title from "[Bug] Fix Weight Loading for Cutlass SM90" to "[Bug] Fix Weight Loading for Block FP8 Cutlass SM90" Sep 29, 2025

youkaichao merged commit 89e4050 into main Sep 30, 2025
54 checks passed

youkaichao deleted the wentao-fix-weight-loading branch September 30, 2025 01:15

youkaichao mentioned this pull request Oct 5, 2025

RuntimeError: Worker failed with error 'b_scale_group_shape must be [128, 128] vllm-project/recipes#73

Closed

4 participants
