[Perf] Apply torch.compile for per_block_cast_to_fp8
#24611
Conversation
Code Review
This pull request introduces a high-performance Triton kernel for per_block_cast_to_fp8, delivering impressive speedups as shown in the benchmarks. The implementation is solid and the inclusion of a benchmark script is appreciated. I've identified a critical bug in the benchmark script that causes incorrect reporting and a minor inefficiency in the Triton kernel itself. Addressing these will improve the quality and correctness of this contribution.
Is this comparing to a torch compiled kernel or just eager torch?

Compared to the eager torch. Details at
What is the result of making this function faster? It seems to only be used in tests and in vllm/model_executor/layers/quantization/utils/fp8_utils.py (lines 789 to 791 at commit cc99baf).

If there isn't a measurable difference, then I'd prefer to stick to torch and at most torch.compile it.
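For context, per_block_cast_to_fp8 quantizes a 2D tensor to fp8 with one scale per 128x128 block. Below is a minimal sketch of the eager PyTorch version, modeled on the DeepGEMM-style reference implementation; the actual code in fp8_utils.py may differ, and the UE8M0 path (which rounds scales to powers of two) is omitted:

```python
import torch

def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

def per_block_cast_to_fp8(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Quantize a 2D tensor to fp8 e4m3 with one scale per 128x128 block."""
    assert x.dim() == 2
    m, n = x.shape
    # Zero-pad both dims up to multiples of the 128x128 block size.
    x_padded = torch.zeros(
        (ceil_div(m, 128) * 128, ceil_div(n, 128) * 128),
        dtype=x.dtype, device=x.device,
    )
    x_padded[:m, :n] = x
    # View as (row_blocks, 128, col_blocks, 128) and reduce per block.
    x_view = x_padded.view(-1, 128, x_padded.size(1) // 128, 128)
    x_amax = x_view.abs().float().amax(dim=(1, 3), keepdim=True).clamp(1e-4)
    # Scale each block so its amax maps to the e4m3 max representable (448).
    x_fp8 = (x_view * (448.0 / x_amax)).to(torch.float8_e4m3fn)
    scales = (x_amax / 448.0).view(x_view.size(0), x_view.size(2))
    return x_fp8.view_as(x_padded)[:m, :n].contiguous(), scales
```

Everything here is pointwise work plus a simple reduction, which is exactly the kind of chain torch.compile fuses well.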
@mgoin The main benefit of doing this is decreasing the model loading time.

This branch:
```
(EngineCore_DP5 pid=4102557) INFO 09-11 11:46:47 [gpu_model_runner.py:2286] Model loading took 95.9683 GiB and 28.218114 seconds
(EngineCore_DP2 pid=4102554) INFO 09-11 11:46:47 [gpu_model_runner.py:2286] Model loading took 95.9683 GiB and 29.514149 seconds
(EngineCore_DP6 pid=4102558) INFO 09-11 11:46:47 [gpu_model_runner.py:2286] Model loading took 95.9683 GiB and 28.131574 seconds
(EngineCore_DP4 pid=4102556) INFO 09-11 11:46:47 [gpu_model_runner.py:2286] Model loading took 95.9683 GiB and 28.061149 seconds
(EngineCore_DP7 pid=4102559) INFO 09-11 11:46:47 [gpu_model_runner.py:2286] Model loading took 95.9683 GiB and 28.636173 seconds
(EngineCore_DP1 pid=4102553) INFO 09-11 11:46:47 [gpu_model_runner.py:2286] Model loading took 95.9683 GiB and 28.506170 seconds
(EngineCore_DP3 pid=4102555) INFO 09-11 11:46:47 [gpu_model_runner.py:2286] Model loading took 95.9683 GiB and 29.540192 seconds
(EngineCore_DP0 pid=4102552) INFO 09-11 11:46:47 [gpu_model_runner.py:2286] Model loading took 95.9683 GiB and 28.072259 seconds
```

Main:
```
(EngineCore_DP1 pid=3243946) INFO 09-11 07:48:19 [gpu_model_runner.py:2251] Model loading took 95.9683 GiB and 27.890547 seconds
(EngineCore_DP2 pid=3243947) INFO 09-11 07:48:20 [gpu_model_runner.py:2251] Model loading took 95.9683 GiB and 28.551232 seconds
(EngineCore_DP4 pid=3243949) INFO 09-11 07:48:20 [gpu_model_runner.py:2251] Model loading took 95.9683 GiB and 29.112749 seconds
(EngineCore_DP7 pid=3243952) INFO 09-11 07:48:20 [gpu_model_runner.py:2251] Model loading took 95.9683 GiB and 29.540662 seconds
(EngineCore_DP3 pid=3243948) INFO 09-11 07:48:21 [gpu_model_runner.py:2251] Model loading took 95.9683 GiB and 29.962358 seconds
(EngineCore_DP0 pid=3243945) INFO 09-11 07:48:21 [gpu_model_runner.py:2251] Model loading took 95.9683 GiB and 29.903493 seconds
(EngineCore_DP5 pid=3243950) INFO 09-11 07:48:21 [gpu_model_runner.py:2251] Model loading took 95.9683 GiB and 30.417099 seconds
(EngineCore_DP6 pid=3243951) INFO 09-11 07:48:22 [gpu_model_runner.py:2251] Model loading took 95.9683 GiB and 31.128308 seconds
```
Looks like there is not a noticeable benefit to model loading time, so my feeling is still to stick to torch.
We have too much complexity in vLLM, so we should take simplicity and portable logic when we can get it. Thank you for the work, but we should focus on more impactful problems.
@mgoin OK, let's keep it simple. I removed all of the Triton code and simply added @torch.compile to per_block_cast_to_fp8, which is now 6x faster.
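In sketch form (reusing the eager function sketched earlier; the exact decorator arguments in the merged change may differ), the fix is just to wrap the function with torch.compile so the abs/amax/scale/cast chain fuses into a handful of kernels:

```python
# Compile the eager implementation; dynamic=True is an assumption here,
# chosen because weight shapes differ from layer to layer.
per_block_cast_to_fp8_compiled = torch.compile(per_block_cast_to_fp8, dynamic=True)

x = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")
y_fp8, scales = per_block_cast_to_fp8_compiled(x)  # first call triggers compilation
```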
Purpose
Developed for #24607
Test Plan
Unit test & benchmark
python benchmark_per_block_cast_to_fp8.py --use-ue8m0

GPU: NVIDIA B200, dtype=bf16, UE8M0=True, block=(128x128)
```
Shape       | Baseline (ms) | Triton (ms) | Speedup | Y equal | Y maxdiff | S equal | S maxdiff
------------------------------------------------------------------------------------------------
128x128     | 0.090         | 0.028       | 3.20    | True    | 0.00e+00  | True    | 0.00e+00
1024x1024   | 0.087         | 0.028       | 3.12    | True    | 0.00e+00  | True    | 0.00e+00
2048x4096   | 0.157         | 0.028       | 5.57    | True    | 0.00e+00  | True    | 0.00e+00
4096x4096   | 0.252         | 0.037       | 6.81    | True    | 0.00e+00  | True    | 0.00e+00
4096x8192   | 0.464         | 0.068       | 6.84    | True    | 0.00e+00  | True    | 0.00e+00
8192x4096   | 0.464         | 0.068       | 6.80    | True    | 0.00e+00  | True    | 0.00e+00
3000x4097   | 0.255         | 0.053       | 4.85    | True    | 0.00e+00  | True    | 0.00e+00
7168x7168   | 0.679         | 0.098       | 6.91    | True    | 0.00e+00  | True    | 0.00e+00
16384x32768 | 6.294         | 0.952       | 6.61    | True    | 0.00e+00  | True    | 0.00e+00
```

python benchmark_per_block_cast_to_fp8.py

GPU: NVIDIA B200, dtype=bf16, UE8M0=False, block=(128x128)
```
Shape       | Baseline (ms) | Triton (ms) | Speedup | Y equal | Y maxdiff | S equal | S maxdiff
------------------------------------------------------------------------------------------------
128x128     | 0.064         | 0.028       | 2.26    | True    | 0.00e+00  | True    | 0.00e+00
1024x1024   | 0.064         | 0.027       | 2.35    | True    | 0.00e+00  | True    | 0.00e+00
2048x4096   | 0.134         | 0.029       | 4.59    | True    | 0.00e+00  | True    | 0.00e+00
4096x4096   | 0.229         | 0.037       | 6.13    | False   | 1.95e-03  | True    | 0.00e+00
4096x8192   | 0.439         | 0.068       | 6.51    | False   | 1.95e-03  | True    | 0.00e+00
8192x4096   | 0.439         | 0.068       | 6.44    | False   | 1.95e-03  | True    | 0.00e+00
3000x4097   | 0.231         | 0.032       | 7.33    | True    | 0.00e+00  | True    | 0.00e+00
7168x7168   | 0.654         | 0.098       | 6.64    | False   | 1.95e-03  | True    | 0.00e+00
16384x32768 | 6.269         | 0.969       | 6.47    | False   | 1.95e-03  | True    | 0.00e+00
```
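The benchmark script itself is part of the PR and not reproduced in this thread; below is a minimal sketch of the kind of CUDA-event timing loop such a comparison could use, reusing the eager and compiled variants sketched above (shapes and iteration counts are illustrative, not the actual benchmark_per_block_cast_to_fp8.py):

```python
import torch

def bench_ms(fn, x: torch.Tensor, iters: int = 100) -> float:
    """Mean milliseconds per call, timed with CUDA events."""
    for _ in range(10):              # warmup; also triggers compilation
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

for m, n in [(128, 128), (4096, 4096), (16384, 32768)]:
    x = torch.randn(m, n, dtype=torch.bfloat16, device="cuda")
    base = bench_ms(per_block_cast_to_fp8, x)           # eager baseline
    fast = bench_ms(per_block_cast_to_fp8_compiled, x)  # compiled variant
    print(f"{m}x{n}: {base:.3f} ms -> {fast:.3f} ms ({base / fast:.2f}x)")
```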
Accuracy

```
VLLM_USE_DEEP_GEMM=1 lm_eval --model vllm --model_args "pretrained=Qwen/Qwen3-30B-A3B-FP8,max_model_len=32768,enforce_eager=True" --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
```