[Refactor] Remove Duplicate `per_block_cast_to_fp8`, Remove Dependencies of DeepGEMM #21787

yewentao256 · 2025-07-28T21:21:42Z

Purpose

- Remove the duplicate `per_block_cast_to_fp8` implementations.
- Remove the dependencies on DeepGEMM.
- Fix the bug in the current benchmark script.
  - The benchmark results can also guide further GEMM optimization.
- This is the first step toward optimizing weight quantization with a Triton/CUDA kernel.

Test

On B200 with DeepGEMM installed.

Before:

Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
Traceback (most recent call last):
  File "/data/vllm-community-homes/vllm-user-6/vllm/benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm.py", line 10, in <module>
    from deep_gemm import calc_diff, ceil_div, get_col_major_tma_aligned_tensor
ImportError: cannot import name 'calc_diff' from 'deep_gemm' (/data/vllm-community-homes/vllm-user-6/.venv/lib/python3.12/site-packages/deep_gemm/__init__.py)
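
The traceback shows that the benchmark imported small helpers (`calc_diff`, `ceil_div`) from `deep_gemm`, and that the installed version no longer exports `calc_diff`. A minimal sketch of local stand-ins follows. The definitions are assumptions for illustration (a ceiling division and a similarity-based difference, 1 - 2<x, y> / (|x|^2 + |y|^2)); they are written in pure Python over flat lists and are not the code this PR adds to vLLM.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up (assumed semantics of the missing helper)."""
    return (a + b - 1) // b


def calc_diff(x: list[float], y: list[float]) -> float:
    """Hypothetical stand-in: 0.0 for identical inputs, larger as they diverge.

    Computes 1 - 2*<x, y> / (<x, x> + <y, y>) over two equal-length flat lists.
    """
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y)
    return 1.0 - 2.0 * dot / denom


if __name__ == "__main__":
    print(ceil_div(130, 128))                          # number of 128-wide blocks
    print(calc_diff([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # identical inputs
```

Keeping helpers like these inside the project means the benchmark no longer breaks when the third-party package changes its exports.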

Now:

~/vllm/benchmarks/kernels/deepgemm$ python benchmark_fp8_block_dense_gemm.py
===== PERFORMANCE COMPARISON =====

DeepGEMM Implementation:
+------+-------+-------+-----------+--------+--------+
| m    | n     | k     | Time (μs) | TFLOPS | GB/s   |
+------+-------+-------+-----------+--------+--------+
|   64 | 24576 |  1536 | 37.5      | 128.9  | 1093.6 |
|   64 | 32768 |   512 | 37.2      | 57.7   | 564.1  |
|   64 |  7168 | 16384 | 58.6      | 256.5  | 2037.7 |
|   64 |  4096 |  7168 | 37.5      | 100.2  | 808.7  |
|   64 |  7168 |  2048 | 37.7      | 49.8   | 417.0  |
|  128 | 24576 |  1536 | 37.7      | 256.4  | 1173.9 |
|  128 | 32768 |   512 | 37.6      | 114.3  | 671.3  |
|  128 |  7168 | 16384 | 59.4      | 506.3  | 2044.0 |
|  128 |  4096 |  7168 | 37.6      | 199.8  | 832.7  |
|  128 |  7168 |  2048 | 37.8      | 99.4   | 443.7  |
| 4096 | 24576 |  1536 | 188.7     | 1638.4 | 1300.0 |
| 4096 | 32768 |   512 | 144.2     | 953.3  | 1992.8 |
| 4096 |  7168 | 16384 | 508.5     | 1891.9 | 478.4  |
| 4096 |  4096 |  7168 | 135.3     | 1778.2 | 682.2  |
| 4096 |  7168 |  2048 | 74.9      | 1606.3 | 1092.5 |
+------+-------+-------+-----------+--------+--------+

vLLM Triton Implementation:
+------+-------+-------+-----------+--------+--------+--------------+
| m    | n     | k     | Time (μs) | TFLOPS | GB/s   | vs DeepGEMM  |
+------+-------+-------+-----------+--------+--------+--------------+
|   64 | 24576 |  1536 | 30.3      | 159.3  | 1351.2 | 1.24x faster |
|   64 | 32768 |   512 | 33.1      | 64.9   | 635.1  | 1.13x faster |
|   64 |  7168 | 16384 | 78.7      | 190.9  | 1516.5 | 0.74x slower |
|   64 |  4096 |  7168 | 39.9      | 94.1   | 759.9  | 0.94x slower |
|   64 |  7168 |  2048 | 32.9      | 57.1   | 477.8  | 1.15x faster |
|  128 | 24576 |  1536 | 29.8      | 323.8  | 1482.4 | 1.26x faster |
|  128 | 32768 |   512 | 32.7      | 131.2  | 770.5  | 1.15x faster |
|  128 |  7168 | 16384 | 110.6     | 271.8  | 1097.1 | 0.54x slower |
|  128 |  4096 |  7168 | 39.8      | 189.0  | 787.6  | 0.95x slower |
|  128 |  7168 |  2048 | 33.1      | 113.4  | 506.1  | 1.14x faster |
| 4096 | 24576 |  1536 | 588.9     | 525.1  | 416.7  | 0.32x slower |
| 4096 | 32768 |   512 | 308.4     | 445.6  | 931.5  | 0.47x slower |
| 4096 |  7168 | 16384 | 1335.5    | 720.4  | 182.2  | 0.38x slower |
| 4096 |  4096 |  7168 | 421.2     | 571.1  | 219.1  | 0.32x slower |
| 4096 |  7168 |  2048 | 189.1     | 636.1  | 432.6  | 0.40x slower |
+------+-------+-------+-----------+--------+--------+--------------+

vLLM CUTLASS Implementation:
+------+-------+-------+-----------+--------+--------+--------------+--------------+
| m    | n     | k     | Time (μs) | TFLOPS | GB/s   | vs DeepGEMM  | vs Triton    |
+------+-------+-------+-----------+--------+--------+--------------+--------------+
|   64 | 24576 |  1536 | 19.0      | 254.6  | 2160.3 | 1.98x faster | 1.60x faster |
|   64 | 32768 |   512 | 14.3      | 149.8  | 1465.1 | 2.60x faster | 2.31x faster |
|   64 |  7168 | 16384 | 37.8      | 398.0  | 3161.3 | 1.55x faster | 2.08x faster |
|   64 |  4096 |  7168 | 18.4      | 203.8  | 1645.3 | 2.03x faster | 2.17x faster |
|   64 |  7168 |  2048 | 10.8      | 174.8  | 1462.9 | 3.51x faster | 3.06x faster |
|  128 | 24576 |  1536 | 18.4      | 524.5  | 2400.9 | 2.05x faster | 1.62x faster |
|  128 | 32768 |   512 | 14.3      | 299.5  | 1759.5 | 2.62x faster | 2.28x faster |
|  128 |  7168 | 16384 | 44.5      | 675.0  | 2725.0 | 1.33x faster | 2.48x faster |
|  128 |  4096 |  7168 | 18.4      | 407.5  | 1698.5 | 2.04x faster | 2.16x faster |
|  128 |  7168 |  2048 | 10.8      | 349.0  | 1558.0 | 3.51x faster | 3.08x faster |
| 4096 | 24576 |  1536 | 270.1     | 1144.8 | 908.4  | 0.70x slower | 2.18x faster |
| 4096 | 32768 |   512 | 150.8     | 911.7  | 1905.9 | 0.96x slower | 2.05x faster |
| 4096 |  7168 | 16384 | 724.9     | 1327.3 | 335.6  | 0.70x slower | 1.84x faster |
| 4096 |  4096 |  7168 | 162.7     | 1478.5 | 567.2  | 0.83x slower | 2.59x faster |
| 4096 |  7168 |  2048 | 92.9      | 1293.9 | 880.0  | 0.81x slower | 2.03x faster |
+------+-------+-------+-----------+--------+--------+--------------+--------------+

===== AVERAGE PERFORMANCE =====
+----------------+------------+----------+---------------+
| Implementation | Avg TFLOPS | Avg GB/s | Avg Time (ms) |
+----------------+------------+----------+---------------+
| DeepGEMM       | 642.50     | 1042.18  | 0.10          |
| vLLM Triton    | 299.58     | 771.09   | 0.22          |
| vLLM CUTLASS   | 639.51     | 1642.26  | 0.11          |
+----------------+------------+----------+---------------+

===== AVERAGE SPEEDUPS =====
+-----------------------------+--------------+
| Comparison                  | Speedup      |
+-----------------------------+--------------+
| DeepGEMM vs vLLM Triton     | 1.60x faster |
| DeepGEMM vs vLLM CUTLASS    | 0.74x slower |
| vLLM CUTLASS vs vLLM Triton | 2.24x faster |
+-----------------------------+--------------+

===== ACCURACY COMPARISON =====
+----------------+-----------------------+
| Implementation | Avg Diff vs Reference |
+----------------+-----------------------+
| DeepGEMM       | 0.000714              |
| vLLM Triton    | 0.000714              |
| vLLM CUTLASS   | 0.000714              |
+----------------+-----------------------+
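
The TFLOPS, GB/s, and speedup columns above can be reproduced from the shapes and timings. The sketch below is inferred from the reported numbers rather than taken from the benchmark script: it assumes 2·m·n·k floating-point operations per GEMM, one byte per FP8 input element, two bytes per output element, and a speedup ratio of baseline time divided by measured time. The function names are hypothetical.

```python
def gemm_metrics(m: int, n: int, k: int, time_us: float) -> tuple[float, float]:
    """Return (TFLOPS, GB/s) for an m x k by k x n GEMM that took time_us."""
    flops = 2 * m * n * k                     # one multiply and one add per term
    bytes_moved = m * k + n * k + 2 * m * n   # assumed: 1-byte inputs, 2-byte output
    seconds = time_us * 1e-6
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9


def speedup_label(baseline_time_us: float, time_us: float) -> str:
    """Format baseline/measured time the way the tables above appear to."""
    ratio = baseline_time_us / time_us
    return f"{ratio:.2f}x {'faster' if ratio > 1 else 'slower'}"


if __name__ == "__main__":
    # DeepGEMM row: m=4096, n=7168, k=16384, 508.5 us
    tflops, gb_per_s = gemm_metrics(4096, 7168, 16384, 508.5)
    print(f"{tflops:.1f} TFLOPS, {gb_per_s:.1f} GB/s")
    # Triton vs DeepGEMM for m=64, n=7168, k=16384
    print(speedup_label(58.6, 78.7))
```

Under this reading, a label such as "0.74x slower" means the ratio is below 1: the implementation takes about 1/0.74 ≈ 1.35 times as long as the baseline. Small differences from the tables are expected because the printed times are rounded.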

Signed-off-by: yewentao256 <zhyanwentao@126.com>

github-actions · 2025-07-28T21:21:53Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This pull request refactors the code by centralizing the per_block_cast_to_fp8 utility and removing its dependencies on the deep_gemm library, which improves maintainability. The changes are consistent across the codebase. There is one issue in test_deepgemm.py where the block_size was not being passed to the new utility function, which could lead to incorrect test behavior. A suggestion has been provided to fix it.
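
To illustrate why the `block_size` argument matters, here is a simplified, pure-Python sketch of the scaling step in a per-block FP8 cast. It is an assumption-laden illustration, not the vLLM utility: `per_block_scales` is a hypothetical name, the default block of 128 x 128 and the 1e-4 clamp are assumed, and the actual rounding to FP8 is omitted. The only hard fact used is that 448 is the largest finite value of the FP8 E4M3 format.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def per_block_scales(matrix, block_size=(128, 128)):
    """Return one scale (block amax / 448) per tile of a 2-D list of floats."""
    block_m, block_n = block_size
    rows, cols = len(matrix), len(matrix[0])
    scales = []
    for r0 in range(0, rows, block_m):
        row_scales = []
        for c0 in range(0, cols, block_n):
            # Largest absolute value inside this tile (edge tiles may be smaller).
            amax = max(
                abs(matrix[i][j])
                for i in range(r0, min(r0 + block_m, rows))
                for j in range(c0, min(c0 + block_n, cols))
            )
            amax = max(amax, 1e-4)  # assumed clamp to avoid a zero scale
            row_scales.append(amax / FP8_E4M3_MAX)
        scales.append(row_scales)
    return scales


if __name__ == "__main__":
    w = [[1.0, -2.0, 3.0, 4.0],
         [0.5, 0.0, -8.0, 1.0]]
    print(per_block_scales(w, block_size=(2, 2)))  # two tiles, two scales
    print(per_block_scales(w))                     # default block: a single scale
```

With an explicit 2 x 2 block the matrix yields two scales, while the default block collapses it to one. A test that forgets to pass `block_size` would therefore quantize with a different scale grid than the one it intends to check.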

tests/kernels/moe/test_deepgemm.py

…ies of DeepGEMM (vllm-project#21787) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>

…ies of DeepGEMM (vllm-project#21787) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Noam Gat <noamgat@gmail.com>

…ies of DeepGEMM (vllm-project#21787) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>

…ies of DeepGEMM (vllm-project#21787) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>

refactor per block cast to fp8

7f1a55a

Signed-off-by: yewentao256 <zhyanwentao@126.com>

yewentao256 requested review from WoosukKwon and tlrmchlsmth as code owners July 28, 2025 21:21

mergify bot added the performance label (Performance-related issues) Jul 28, 2025

gemini-code-assist bot reviewed Jul 28, 2025

tests/kernels/moe/test_deepgemm.py (outdated, resolved)

yewentao256 and others added 2 commits July 28, 2025 14:26

fix through gemini

2a813db

Signed-off-by: yewentao256 <zhyanwentao@126.com>

Merge branch 'vllm-project:main' into wye-refactor-per-block-cast-to-fp8

9d95d09

mgoin approved these changes Jul 31, 2025

mgoin enabled auto-merge (squash) July 31, 2025 02:10

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jul 31, 2025

yewentao256 mentioned this pull request Jul 31, 2025

[Feature][Kernel] Blocked FP8 CUTLASS MoE for Hopper #19983

Open

Merge branch 'vllm-project:main' into wye-refactor-per-block-cast-to-fp8

e0b3a9d

mgoin merged commit 3700642 into vllm-project:main Aug 1, 2025
44 checks passed

npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025

[Refactor] Remove Duplicate per_block_cast_to_fp8, Remove Dependenc…

fc0ced4

…ies of DeepGEMM (vllm-project#21787) Signed-off-by: yewentao256 <zhyanwentao@126.com>

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025

[Refactor] Remove Duplicate per_block_cast_to_fp8, Remove Dependenc…

b7dcc19

…ies of DeepGEMM (vllm-project#21787) Signed-off-by: yewentao256 <zhyanwentao@126.com>

zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025

[Refactor] Remove Duplicate per_block_cast_to_fp8, Remove Dependenc…

fac550d

…ies of DeepGEMM (vllm-project#21787) Signed-off-by: yewentao256 <zhyanwentao@126.com>
