@yewentao256 yewentao256 commented Jul 28, 2025

Purpose

  • Remove the duplicate per_block_cast_to_fp8 implementations
  • Remove dependencies on DeepGEMM
  • Fix the bug in the current benchmark script
    • The benchmark results can then guide GEMM optimization
  • This is the first step toward optimizing weight quantization with a Triton/CUDA kernel

Test

On B200 with DeepGEMM installed.

Before:

Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
Traceback (most recent call last):
  File "/data/vllm-community-homes/vllm-user-6/vllm/benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm.py", line 10, in <module>
    from deep_gemm import calc_diff, ceil_div, get_col_major_tma_aligned_tensor
ImportError: cannot import name 'calc_diff' from 'deep_gemm' (/data/vllm-community-homes/vllm-user-6/.venv/lib/python3.12/site-packages/deep_gemm/__init__.py)
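
The failing import pulls small helpers out of deep_gemm, so the script breaks whenever that package stops exporting them. A minimal sketch of local stand-ins for two of the names in the traceback is shown here. It is not the code from this PR: the formula used for calc_diff (one minus a normalized dot-product similarity) is an assumption about what that helper computed, and this version works on flat Python sequences rather than tensors.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, e.g. the number of blocks covering `a` items."""
    return (a + b - 1) // b


def calc_diff(x, y) -> float:
    """Assumed metric: 1 - 2*sum(x*y) / sum(x*x + y*y).

    `x` and `y` are equal-length flat sequences of numbers.
    Returns 0.0 for identical inputs; larger values mean more disagreement.
    """
    numerator = 2.0 * sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a + b * b for a, b in zip(x, y))
    if denominator == 0:
        # Both inputs are all zeros, so they agree exactly.
        return 0.0
    return 1.0 - numerator / denominator


print(ceil_div(129, 128))
print(calc_diff([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

Keeping helpers like these next to the benchmark means the script only needs DeepGEMM for the kernel under test.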

Now:

~/vllm/benchmarks/kernels/deepgemm$ python benchmark_fp8_block_dense_gemm.py
===== PERFORMANCE COMPARISON =====

DeepGEMM Implementation:
+------+-------+-------+-----------+--------+--------+
| m    | n     | k     | Time (μs) | TFLOPS | GB/s   |
+------+-------+-------+-----------+--------+--------+
|   64 | 24576 |  1536 | 37.5      | 128.9  | 1093.6 |
|   64 | 32768 |   512 | 37.2      | 57.7   | 564.1  |
|   64 |  7168 | 16384 | 58.6      | 256.5  | 2037.7 |
|   64 |  4096 |  7168 | 37.5      | 100.2  | 808.7  |
|   64 |  7168 |  2048 | 37.7      | 49.8   | 417.0  |
|  128 | 24576 |  1536 | 37.7      | 256.4  | 1173.9 |
|  128 | 32768 |   512 | 37.6      | 114.3  | 671.3  |
|  128 |  7168 | 16384 | 59.4      | 506.3  | 2044.0 |
|  128 |  4096 |  7168 | 37.6      | 199.8  | 832.7  |
|  128 |  7168 |  2048 | 37.8      | 99.4   | 443.7  |
| 4096 | 24576 |  1536 | 188.7     | 1638.4 | 1300.0 |
| 4096 | 32768 |   512 | 144.2     | 953.3  | 1992.8 |
| 4096 |  7168 | 16384 | 508.5     | 1891.9 | 478.4  |
| 4096 |  4096 |  7168 | 135.3     | 1778.2 | 682.2  |
| 4096 |  7168 |  2048 | 74.9      | 1606.3 | 1092.5 |
+------+-------+-------+-----------+--------+--------+

vLLM Triton Implementation:
+------+-------+-------+-----------+--------+--------+--------------+
| m    | n     | k     | Time (μs) | TFLOPS | GB/s   | vs DeepGEMM  |
+------+-------+-------+-----------+--------+--------+--------------+
|   64 | 24576 |  1536 | 30.3      | 159.3  | 1351.2 | 1.24x faster |
|   64 | 32768 |   512 | 33.1      | 64.9   | 635.1  | 1.13x faster |
|   64 |  7168 | 16384 | 78.7      | 190.9  | 1516.5 | 0.74x slower |
|   64 |  4096 |  7168 | 39.9      | 94.1   | 759.9  | 0.94x slower |
|   64 |  7168 |  2048 | 32.9      | 57.1   | 477.8  | 1.15x faster |
|  128 | 24576 |  1536 | 29.8      | 323.8  | 1482.4 | 1.26x faster |
|  128 | 32768 |   512 | 32.7      | 131.2  | 770.5  | 1.15x faster |
|  128 |  7168 | 16384 | 110.6     | 271.8  | 1097.1 | 0.54x slower |
|  128 |  4096 |  7168 | 39.8      | 189.0  | 787.6  | 0.95x slower |
|  128 |  7168 |  2048 | 33.1      | 113.4  | 506.1  | 1.14x faster |
| 4096 | 24576 |  1536 | 588.9     | 525.1  | 416.7  | 0.32x slower |
| 4096 | 32768 |   512 | 308.4     | 445.6  | 931.5  | 0.47x slower |
| 4096 |  7168 | 16384 | 1335.5    | 720.4  | 182.2  | 0.38x slower |
| 4096 |  4096 |  7168 | 421.2     | 571.1  | 219.1  | 0.32x slower |
| 4096 |  7168 |  2048 | 189.1     | 636.1  | 432.6  | 0.40x slower |
+------+-------+-------+-----------+--------+--------+--------------+

vLLM CUTLASS Implementation:
+------+-------+-------+-----------+--------+--------+--------------+--------------+
| m    | n     | k     | Time (μs) | TFLOPS | GB/s   | vs DeepGEMM  | vs Triton    |
+------+-------+-------+-----------+--------+--------+--------------+--------------+
|   64 | 24576 |  1536 | 19.0      | 254.6  | 2160.3 | 1.98x faster | 1.60x faster |
|   64 | 32768 |   512 | 14.3      | 149.8  | 1465.1 | 2.60x faster | 2.31x faster |
|   64 |  7168 | 16384 | 37.8      | 398.0  | 3161.3 | 1.55x faster | 2.08x faster |
|   64 |  4096 |  7168 | 18.4      | 203.8  | 1645.3 | 2.03x faster | 2.17x faster |
|   64 |  7168 |  2048 | 10.8      | 174.8  | 1462.9 | 3.51x faster | 3.06x faster |
|  128 | 24576 |  1536 | 18.4      | 524.5  | 2400.9 | 2.05x faster | 1.62x faster |
|  128 | 32768 |   512 | 14.3      | 299.5  | 1759.5 | 2.62x faster | 2.28x faster |
|  128 |  7168 | 16384 | 44.5      | 675.0  | 2725.0 | 1.33x faster | 2.48x faster |
|  128 |  4096 |  7168 | 18.4      | 407.5  | 1698.5 | 2.04x faster | 2.16x faster |
|  128 |  7168 |  2048 | 10.8      | 349.0  | 1558.0 | 3.51x faster | 3.08x faster |
| 4096 | 24576 |  1536 | 270.1     | 1144.8 | 908.4  | 0.70x slower | 2.18x faster |
| 4096 | 32768 |   512 | 150.8     | 911.7  | 1905.9 | 0.96x slower | 2.05x faster |
| 4096 |  7168 | 16384 | 724.9     | 1327.3 | 335.6  | 0.70x slower | 1.84x faster |
| 4096 |  4096 |  7168 | 162.7     | 1478.5 | 567.2  | 0.83x slower | 2.59x faster |
| 4096 |  7168 |  2048 | 92.9      | 1293.9 | 880.0  | 0.81x slower | 2.03x faster |
+------+-------+-------+-----------+--------+--------+--------------+--------------+

===== AVERAGE PERFORMANCE =====
+----------------+------------+----------+---------------+
| Implementation | Avg TFLOPS | Avg GB/s | Avg Time (ms) |
+----------------+------------+----------+---------------+
| DeepGEMM       | 642.50     | 1042.18  | 0.10          |
| vLLM Triton    | 299.58     | 771.09   | 0.22          |
| vLLM CUTLASS   | 639.51     | 1642.26  | 0.11          |
+----------------+------------+----------+---------------+

===== AVERAGE SPEEDUPS =====
+-----------------------------+--------------+
| Comparison                  | Speedup      |
+-----------------------------+--------------+
| DeepGEMM vs vLLM Triton     | 1.60x faster |
| DeepGEMM vs vLLM CUTLASS    | 0.74x slower |
| vLLM CUTLASS vs vLLM Triton | 2.24x faster |
+-----------------------------+--------------+

===== ACCURACY COMPARISON =====
+----------------+-----------------------+
| Implementation | Avg Diff vs Reference |
+----------------+-----------------------+
| DeepGEMM       | 0.000714              |
| vLLM Triton    | 0.000714              |
| vLLM CUTLASS   | 0.000714              |
+----------------+-----------------------+
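
For readers checking the tables above, the TFLOPS, GB/s, and "vs" columns can be reproduced from m, n, k, and the measured time. The sketch below is a reconstruction from the reported numbers, not code taken from the benchmark script. It assumes 1-byte (FP8) inputs, a 2-byte output, and that scale-factor traffic is ignored; the function names are made up for illustration.

```python
def gemm_metrics(m, n, k, time_us, in_bytes=1, out_bytes=2):
    """Return (TFLOPS, GB/s) for an (m x k) @ (k x n) GEMM taking `time_us` microseconds.

    in_bytes/out_bytes are assumptions: FP8 inputs and a 2-byte output.
    """
    seconds = time_us * 1e-6
    flops = 2 * m * n * k  # one multiply and one add per accumulated product
    bytes_moved = (m * k + n * k) * in_bytes + m * n * out_bytes
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9


def speedup_label(baseline_us, candidate_us):
    """Format a time ratio the way the 'vs ...' columns do."""
    ratio = baseline_us / candidate_us
    return f"{ratio:.2f}x {'faster' if ratio > 1 else 'slower'}"


# DeepGEMM row m=4096, n=7168, k=16384 with 508.5 us
tflops, gbps = gemm_metrics(4096, 7168, 16384, 508.5)
print(f"{tflops:.1f} TFLOPS, {gbps:.1f} GB/s")

# Triton (30.3 us) against DeepGEMM (37.5 us) for m=64, n=24576, k=1536
print(speedup_label(37.5, 30.3))
```

Under this reading, a label such as "0.74x slower" means the candidate reached 0.74 times the baseline's speed. Small differences from the printed tables are expected because the times shown there are rounded.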

Signed-off-by: yewentao256 <zhyanwentao@126.com>

@mergify mergify bot added the performance Performance-related issues label Jul 28, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the code by centralizing the per_block_cast_to_fp8 utility and removing its dependencies on the deep_gemm library, which improves maintainability. The changes are consistent across the codebase. There is one issue in test_deepgemm.py where the block_size was not being passed to the new utility function, which could lead to incorrect test behavior. A suggestion has been provided to fix it.
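
To illustrate why the block_size argument matters, here is a rough sketch of per-block scaling as used before an FP8 cast. It is not vLLM's per_block_cast_to_fp8: the function name, the list-of-lists input, and the 1e-4 floor on the block maximum are assumptions made for illustration, and the final rounding to FP8 is left out because standard Python has no FP8 type. The value 448 is the largest finite value of the float8 e4m3fn format.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def per_block_scale_sketch(x, block_size=(128, 128)):
    """Scale each block of a 2-D list so its largest magnitude maps to FP8_E4M3_MAX.

    Returns (scaled, scales), where scales[i][j] is the factor that restores
    block (i, j). A real kernel would cast `scaled` to FP8 afterwards.
    """
    block_m, block_n = block_size
    rows, cols = len(x), len(x[0])
    n_block_rows = math.ceil(rows / block_m)
    n_block_cols = math.ceil(cols / block_n)
    scaled = [[0.0] * cols for _ in range(rows)]
    scales = [[0.0] * n_block_cols for _ in range(n_block_rows)]
    for bi in range(n_block_rows):
        for bj in range(n_block_cols):
            r0, r1 = bi * block_m, min((bi + 1) * block_m, rows)
            c0, c1 = bj * block_n, min((bj + 1) * block_n, cols)
            amax = max(abs(x[r][c]) for r in range(r0, r1) for c in range(c0, c1))
            amax = max(amax, 1e-4)  # assumed floor, avoids dividing by zero
            scale = amax / FP8_E4M3_MAX
            scales[bi][bj] = scale
            for r in range(r0, r1):
                for c in range(c0, c1):
                    scaled[r][c] = x[r][c] / scale
    return scaled, scales


x = [[1.0, -2.0, 0.5, 8.0],
     [4.0, 0.25, -16.0, 2.0]]
_, scales_small = per_block_scale_sketch(x, block_size=(2, 2))
_, scales_default = per_block_scale_sketch(x)
print(scales_small)    # two blocks, each with its own scale
print(scales_default)  # one block covering the whole matrix
```

The same input yields different scales depending on the block shape, which is why a test that omits block_size can silently exercise a different quantization layout than the one it intends to check.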

@mgoin mgoin enabled auto-merge (squash) July 31, 2025 02:10
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 31, 2025
@mgoin mgoin merged commit 3700642 into vllm-project:main Aug 1, 2025
44 checks passed
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
…ies of DeepGEMM (vllm-project#21787)

Signed-off-by: yewentao256 <zhyanwentao@126.com>