
Conversation

@CalebDu
Contributor

@CalebDu CalebDu commented May 10, 2025

[1] Refactor the permute/unpermute kernels

  • remove the dependence on the token_expert_indices input and align the output with the triton kernel
  • refine parameters and remove unused ones

[2] Integrate the permute/unpermute kernels into DeepGemm MoE
[3] Fix a clamp that corrupts the argsort result (see the sketch below the snippet), cc @bnellnm

sorted_token_ids = sorted_token_ids.clamp(max=num_tokens - 1)
expert_ids = torch.repeat_interleave(expert_ids, block_m, dim=0)
inv_perm = torch.argsort(sorted_token_ids)[:num_tokens]
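
For context, a minimal sketch of why clamping before the argsort goes wrong (toy values, not the PR's code; padding slots are assumed to be filled with a sentinel equal to num_tokens):

import torch

num_tokens = 4
# Toy example: slots beyond the real tokens are padded with num_tokens (= 4).
sorted_token_ids = torch.tensor([2, 0, 4, 1, 3, 4, 4, 4])

# Clamping first makes every padded slot collide with token id 3, so argsort
# (not stable by default) may place a padded slot where token 3's position
# belongs, corrupting inv_perm.
bad = torch.argsort(sorted_token_ids.clamp(max=num_tokens - 1))[:num_tokens]

# Argsorting the unclamped ids keeps all padding after the real tokens.
good = torch.argsort(sorted_token_ids)[:num_tokens]  # tensor([1, 3, 0, 4])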

I will attach performance data on H100 later.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@CalebDu
Contributor Author

CalebDu commented May 10, 2025

Benchmark setup:

python3 benchmarks/kernels/benchmark_moe.py --dtype=fp8_w8a8 -tp ${TP} --model deepseek-ai/DeepSeek-V3 --trust-remote-code --use-deep-gemm

How to switch to the Python permute/unpermute implementation: comment out the customized permute/unpermute kernel calls at the two locations below.
https://github.com/CalebDu/vllm/blob/78b24807ae473ddf395a88df47de78033c16f515/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py#L289-L296
https://github.com/CalebDu/vllm/blob/78b24807ae473ddf395a88df47de78033c16f515/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py#L326C1-L331C59

Performance on H100 (unit: us)

| Batch | DeepGemm (python permute) TP=1 | DeepGemm (customized permute) TP=1 | DeepGemm (python permute) TP=2 | DeepGemm (customized permute) TP=2 | DeepGemm (python permute) TP=4 | DeepGemm (customized permute) TP=4 | DeepGemm (python permute) TP=8 | DeepGemm (customized permute) TP=8 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 128 | 6415.07 | 5506.54 | 3815.51 | 2943.05 | 2609.36 | 1727.12 | 1921.54 | 1067.80 |
| 256 | 6699.19 | 5631.23 | 4021.14 | 3088.93 | 2781.02 | 1868.46 | 2083.59 | 1149.40 |
| 512 | 6936.77 | 5673.94 | 4293.68 | 3279.71 | 3066.71 | 2037.61 | 2278.12 | 1259.98 |
| 1024 | 7786.69 | 5745.72 | 4848.20 | 3673.02 | 3515.67 | 2301.33 | 2720.41 | 1543.88 |
| 1536 | 8544.97 | 5967.12 | 5383.77 | 4015.44 | 4021.17 | 2643.02 | 3203.08 | 1811.40 |
| 2048 | 9283.23 | 6090.98 | 5905.14 | 4439.83 | 4515.79 | 2926.30 | 3657.95 | 2093.40 |
| 3072 | 10828.49 | 6454.82 | 7159.85 | 5183.76 | 5516.52 | 3554.23 | 4625.18 | 2639.90 |
| 4096 | 12025.13 | 7546.53 | 8233.51 | 5857.30 | 6508.91 | 4093.26 | 5654.33 | 3213.34 |

Re-benchmark including the Triton kernels (unit: us, values rounded to 0.1 us):

| Batch | DeepGemm (python permute) TP=1 | DeepGemm (customized permute) TP=1 | Triton 3.1 TP=1 | Triton 3.3 TP=1 | DeepGemm (python permute) TP=2 | DeepGemm (customized permute) TP=2 | Triton 3.1 TP=2 | Triton 3.3 TP=2 | DeepGemm (python permute) TP=4 | DeepGemm (customized permute) TP=4 | Triton 3.1 TP=4 | Triton 3.3 TP=4 | DeepGemm (python permute) TP=8 | DeepGemm (customized permute) TP=8 | Triton 3.1 TP=8 | Triton 3.3 TP=8 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 3602.5 | 2806.3 | 509.7 | 212.2 | 2348.6 | 1554.7 | 266.5 | 129.4 | 1705.8 | 889.9 | 154.0 | 94.9 | 1418.3 | 592.1 | 110.0 | 81.0 |
| 2 | 3743.2 | 2914.4 | 890.1 | 397.9 | 2414.3 | 1583.9 | 480.0 | 219.3 | 1740.5 | 924.0 | 290.6 | 150.0 | 1424.5 | 608.0 | 198.1 | 125.1 |
| 4 | 3871.7 | 3073.0 | 1592.4 | 692.2 | 2489.2 | 1677.9 | 849.5 | 388.5 | 1778.7 | 964.2 | 485.1 | 214.7 | 1458.3 | 635.9 | 292.3 | 152.4 |
| 8 | 4138.6 | 3303.9 | 2832.4 | 1233.1 | 2590.9 | 1774.5 | 1471.9 | 632.4 | 1848.7 | 1034.4 | 794.5 | 377.5 | 1487.1 | 676.3 | 473.5 | 212.5 |
| 16 | 4597.3 | 3818.0 | 4888.2 | 2121.8 | 2852.5 | 2053.5 | 2478.8 | 1073.3 | 2031.3 | 1208.6 | 1377.9 | 579.9 | 1560.8 | 742.2 | 740.7 | 380.3 |
| 24 | 4977.6 | 4183.8 | 6777.2 | 2817.7 | 3061.7 | 2237.6 | 3408.0 | 1408.7 | 2138.7 | 1312.5 | 1731.0 | 755.9 | 1627.1 | 797.9 | 937.5 | 423.1 |
| 32 | 5230.9 | 4425.2 | 7806.7 | 3352.6 | 3224.5 | 2360.5 | 3976.4 | 1666.5 | 2220.7 | 1387.4 | 2065.6 | 864.6 | 1672.7 | 852.6 | 1168.2 | 497.9 |
| 48 | 5727.7 | 4929.9 | 9916.4 | 4108.0 | 3459.0 | 2606.1 | 4855.7 | 2026.8 | 2388.5 | 1554.2 | 2593.1 | 1047.2 | 1764.0 | 930.4 | 1303.3 | 605.6 |
| 64 | 6016.5 | 5198.5 | 11005.1 | 4563.1 | 3587.0 | 2736.6 | 5392.6 | 2253.7 | 2472.7 | 1630.9 | 2783.7 | 1169.8 | 1827.6 | 979.3 | 1477.1 | 650.4 |
| 96 | 6321.8 | 5458.4 | 12071.9 | 5024.1 | 3768.1 | 2887.7 | 5813.4 | 2477.6 | 2562.9 | 1709.1 | 2992.6 | 1268.0 | 1883.1 | 1030.7 | 1608.5 | 713.2 |
| 128 | 6385.9 | 5585.2 | 12427.0 | 5209.0 | 3773.6 | 2953.5 | 6208.6 | 2583.9 | 2611.6 | 1744.3 | 3119.1 | 1325.3 | 1932.5 | 1058.4 | 1588.3 | 754.1 |
| 256 | 6702.0 | 5804.1 | 12395.0 | 5417.9 | 3971.8 | 3068.3 | 6324.7 | 2705.9 | 2772.2 | 1852.9 | 3220.1 | 1402.0 | 2058.6 | 1170.4 | 1718.4 | 807.9 |
| 512 | 7051.7 | 6053.5 | 13117.8 | 7817.0 | 4293.3 | 3310.6 | 6644.2 | 4067.0 | 3075.3 | 2037.6 | 3421.5 | 2237.7 | 2268.0 | 1265.3 | 1811.3 | 1341.6 |
| 1024 | 7784.9 | 6614.2 | 13313.1 | 8117.6 | 4878.8 | 3677.7 | 6797.3 | 4259.6 | 3509.0 | 2304.2 | 3541.7 | 2360.5 | 2745.7 | 1536.9 | 1891.0 | 1430.6 |
| 1536 | 8418.2 | 7044.0 | 13784.3 | 8481.8 | 5431.6 | 4012.5 | 7018.6 | 4471.6 | 4007.8 | 2609.7 | 3674.5 | 2482.0 | 3180.8 | 1822.0 | 1989.9 | 1526.6 |
| 2048 | 9282.1 | 7711.0 | 19847.0 | 11405.2 | 6026.8 | 4440.6 | 10098.1 | 5998.3 | 4513.0 | 2964.1 | 5223.0 | 3292.0 | 3672.4 | 2086.2 | 2816.1 | 1953.9 |
| 3072 | 10832.4 | 8821.0 | 27409.6 | 15059.1 | 7139.3 | 5179.8 | 13843.5 | 7970.7 | 5516.9 | 3547.9 | 7180.6 | 4363.3 | 4633.9 | 2637.9 | 3860.5 | 2620.4 |
| 4096 | 12191.9 | 9577.8 | 33881.4 | 18705.3 | 8246.2 | 5723.2 | 17242.1 | 9867.0 | 6593.5 | 4200.1 | 8927.5 | 5375.3 | 5651.0 | 3206.5 | 4784.9 | 3189.8 |

Going from Triton 3.1 to Triton 3.3 dramatically speeds up the Triton fused MoE kernel. I think DeepGemm MoE is slower than Triton when M is small because the DeepGemm grouped GEMM does not support skipping invalid blocks via m_indices = -1, so it must run the GEMM over the entire permuted, aligned row range, even though many blocks contain only padding (no routed tokens):

permuted_row_size = (permuted_row_size + n_expert * (align_block_size - 1) +
                     align_block_size - 1) // align_block_size * align_block_size
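
To see how much padding this worst-case alignment can add, a quick worked example with hypothetical sizes:

# Hypothetical sizes: 8 experts, align_block_size = 128, 1000 permuted rows.
n_expert, align_block_size, permuted_row_size = 8, 128, 1000

# Each expert may contribute up to (align_block_size - 1) pad rows, and the
# total is rounded up to a multiple of align_block_size.
aligned = (permuted_row_size + n_expert * (align_block_size - 1) +
           align_block_size - 1) // align_block_size * align_block_size
print(aligned)  # 2048 -- over half the rows are padding at this small M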

@mergify

mergify bot commented May 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @CalebDu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 13, 2025
@CalebDu CalebDu force-pushed the permute_integration branch from 78b2480 to 4dc1a00 Compare May 16, 2025 14:48
@mergify mergify bot removed the needs-rebase label May 16, 2025
@CalebDu CalebDu force-pushed the permute_integration branch from e20416d to 3f967f5 Compare May 16, 2025 15:08
@CalebDu
Contributor Author

CalebDu commented May 20, 2025

Re-benchmark after #15956 (unit: us, values rounded to 0.1 us):

| Batch | Modularized DeepGemm (python permute) TP=1 | Modularized DeepGemm (only customized permute) TP=1 | No Modularized DeepGemm (customized permute/unpermute) TP=1 | Triton 3.1 TP=1 | Triton 3.3 TP=1 | Modularized DeepGemm (python permute) TP=2 | Modularized DeepGemm (only customized permute) TP=2 | No Modularized DeepGemm (customized permute/unpermute) TP=2 | Triton 3.1 TP=2 | Triton 3.3 TP=2 | Modularized DeepGemm (python permute) TP=4 | Modularized DeepGemm (only customized permute) TP=4 | No Modularized DeepGemm (customized permute/unpermute) TP=4 | Triton 3.1 TP=4 | Triton 3.3 TP=4 | Modularized DeepGemm (python permute) TP=8 | Modularized DeepGemm (only customized permute) TP=8 | No Modularized DeepGemm (customized permute/unpermute) TP=8 | Triton 3.1 TP=8 | Triton 3.3 TP=8 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 128 | 6671.1 | 5897.5 | 5585.2 | 12427.0 | 5209.0 | 4002.3 | 3190.9 | 2953.5 | 6208.6 | 2583.9 | 2768.7 | 1968.5 | 1744.3 | 3119.1 | 1325.3 | 2081.5 | 1273.7 | 1058.4 | 1588.3 | 754.1 |
| 256 | 6955.7 | 6151.8 | 5804.1 | 12395.0 | 5417.9 | 4172.1 | 3404.2 | 3068.3 | 6324.7 | 2705.9 | 2959.2 | 2150.5 | 1852.9 | 3220.1 | 1402.0 | 2245.8 | 1451.8 | 1170.4 | 1718.4 | 807.9 |
| 512 | 7276.9 | 6592.9 | 6053.5 | 13117.8 | 7817.0 | 4504.2 | 3740.1 | 3310.6 | 6644.2 | 4067.0 | 3240.6 | 2481.4 | 2037.6 | 3421.5 | 2237.7 | 2455.4 | 1697.3 | 1265.3 | 1811.3 | 1341.6 |
| 1024 | 8083.1 | 7423.0 | 6614.2 | 13313.1 | 8117.6 | 5129.6 | 4454.8 | 3677.7 | 6797.3 | 4259.6 | 3717.3 | 3038.9 | 2304.2 | 3541.7 | 2360.5 | 2952.1 | 2259.7 | 1536.9 | 1891.0 | 1430.6 |
| 1536 | 8883.1 | 8261.5 | 7044.0 | 13784.3 | 8481.8 | 5722.7 | 5038.7 | 4012.5 | 7018.6 | 4471.6 | 4281.0 | 3635.4 | 2609.7 | 3674.5 | 2482.0 | 3405.9 | 2800.2 | 1822.0 | 1989.9 | 1526.6 |
| 2048 | 9670.2 | 9100.1 | 7711.0 | 19847.0 | 11405.2 | 6226.2 | 5658.1 | 4440.6 | 10098.1 | 5998.3 | 4760.5 | 4202.5 | 2964.1 | 5223.0 | 3292.0 | 3877.3 | 3298.1 | 2086.2 | 2816.1 | 1953.9 |
| 3072 | 11096.5 | 10726.9 | 8821.0 | 27409.6 | 15059.1 | 7456.4 | 6907.3 | 5179.8 | 13843.5 | 7970.7 | 5795.1 | 5284.9 | 3547.9 | 7180.6 | 4363.3 | 4900.7 | 4375.7 | 2637.9 | 3860.5 | 2620.4 |
| 4096 | 12252.8 | 11754.6 | 9577.8 | 33881.4 | 18705.3 | 8382.6 | 8035.4 | 5723.2 | 17242.1 | 9867.0 | 6811.0 | 6384.4 | 4200.1 | 8927.5 | 5375.3 | 5897.0 | 5522.4 | 3206.5 | 4784.9 | 3189.8 |

I only integrated the customized permute kernel into the modularized fused MoE, because the customized unpermute kernel needs an extra parameter, inv_perm, which would break the modularized interface in Experts and MoEPrepareAndFinalizeNoEP. So I integrated the permute kernel first, and attached customized permute+unpermute performance data in the table above.
@bnellnm @tlrmchlsmth

@CalebDu CalebDu force-pushed the permute_integration branch from 22a9f55 to 5b24c2c Compare May 20, 2025 13:25
@mergify

mergify bot commented May 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @CalebDu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 23, 2025
Comment on lines 130 to 137
Contributor

Can moe_permute get topk internally so that it doesn't need to be passed in as a separate argument?

Contributor Author

OK, I will get topk from topk_ids.
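
A minimal sketch of that change, assuming topk_ids has shape [num_tokens, topk]:

# topk is recoverable from the tensor itself, so the extra argument goes away.
topk = topk_ids.size(1)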

Comment on lines +154 to +158
Contributor

@bnellnm bnellnm May 27, 2025

Can the scale permutation also be done by the CUDA kernel?

Then it seems permuted_idx would not be required?

Contributor Author

Scale permutation has only a small cost, so there is no big performance gain from fusing it into the CUDA kernel. I want to keep it in Python for flexibility, and I will remove permuted_idx from the return value.
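
For illustration, what the Python-side scale permutation amounts to — a sketch with assumed names and shapes, not the PR's exact code:

import torch

# Assumed: a1q_scale is [num_tokens, num_scale_blocks] and src_row[r] is the
# source token feeding permuted row r. The permutation is then a cheap gather,
# which is why fusing it into the CUDA kernel buys little.
def permute_scales(a1q_scale: torch.Tensor, src_row: torch.Tensor) -> torch.Tensor:
    return a1q_scale[src_row]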

@bnellnm
Contributor

bnellnm commented May 27, 2025

LGTM. Just had a few minor comments.

@CalebDu CalebDu force-pushed the permute_integration branch from 5b24c2c to 759cd54 Compare May 29, 2025 12:52
@mergify mergify bot removed the needs-rebase label May 29, 2025
@CalebDu
Contributor Author

CalebDu commented May 29, 2025

@bnellnm I updated the code per your review. I noticed that the latest DeepGemm grouped GEMM supports skipping useless computation for unaligned M (deepseek-ai/DeepGEMM@8dfa329), so I tested filling m_indices with -1 and it passes the MoE test. I wanted to add a runtime check of the DeepGemm version so we could skip useless computation with -1, but DeepGemm has no __version__ to check, so I still have to fill invalid rows with 0 in m_indices.
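
What I had in mind was a runtime feature probe instead of a version check — a purely hypothetical sketch, where the probed attribute is a placeholder and not a real DeepGemm name:

import deep_gemm

# DeepGemm exposes no __version__, so the only runtime option would be probing
# for a symbol that ships together with the -1 skip support.
# "some_symbol_added_with_skip_support" is a placeholder, not a real name.
supports_skip = hasattr(deep_gemm, "some_symbol_added_with_skip_support")

# Rows with no routed token: -1 lets newer DeepGemm skip the block entirely;
# older versions need a valid expert id, so fall back to 0.
fill_value = -1 if supports_skip else 0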

@bnellnm
Contributor

bnellnm commented Jun 2, 2025

Modularized DeepGemm (only customized permute) TP=8

@CalebDu what's the difference between the Modularized DeepGemm (only customized permute) and No Modularized DeepGemm (customized permute/unpermute) columns?

Is it basically DeepGemm w/ custom permute vs. DeepGemm w/ custom permute+unpermute? There shouldn't be any significant perf differences between modularized and un-modularized kernels.

Contributor

@bnellnm bnellnm left a comment

LGTM! Thanks @CalebDu

@CalebDu
Contributor Author

CalebDu commented Jun 3, 2025

what's the difference between the Modularized DeepGemm (only customized permute) and No Modularized DeepGemm (customized permute/unpermute) columns?

Modularized DeepGemm (only customized permute) integrates only the customized permute kernel, on top of your modularization PR and without breaking the modularized interface. No Modularized DeepGemm (customized permute/unpermute) integrates both the permute and unpermute kernels, on the code from before your modularization PR.

Is it basically DeepGemm w/custom permute vs. DeepGemm w/custom permute+unpermute? There shouldn't be any significant perf differences between modularized and un-modularized kernels.

The perf difference comes from the unpermute kernel. Based on the permute/unpermute kernel benchmark in #14568 (comment), the customized unpermute kernel brings a large gain (thousands of us) when the batch is large.

@mergify

mergify bot commented Jun 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @CalebDu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 3, 2025
@CalebDu CalebDu force-pushed the permute_integration branch from 759cd54 to 62b4abb Compare June 4, 2025 14:15
@mergify mergify bot removed the needs-rebase label Jun 4, 2025
CalebDu added 3 commits July 24, 2025 18:04
@CalebDu CalebDu force-pushed the permute_integration branch from 54de5d3 to 8e17c51 Compare July 25, 2025 01:05
@CalebDu
Contributor Author

CalebDu commented Jul 25, 2025

Hey @CalebDu - can you try rebasing the PR. I think that should make all the tests pass. Thanks 🙌

CI has 3 failures: Distributed Tests (4 GPUs), TPU V1 Test, and Entrypoints Test (API Server). They seem unrelated to this PR according to the CI logs.

@varun-sundar-rabindranath
Contributor

@CalebDu I have triggered a retry on the failed tests. What is the commit on main this PR is rebased on? I can check whether those tests also fail on main and make an argument for a force merge. Thanks.

@varun-sundar-rabindranath
Contributor

I verified that commit fe56180c7f9088361db86c1ef40dbc54fa931e1f on main that this PR is based off of also fails the tpu-v1-test. @CalebDu can you update the PR description and title so we can merge this. Thanks.

@CalebDu
Contributor Author

CalebDu commented Jul 27, 2025

I verified that commit fe56180c7f9088361db86c1ef40dbc54fa931e1f on main that this PR is based off of also fails the tpu-v1-test. @CalebDu can you update the PR description and title so we can merge this. Thanks.

What should I update in the PR description and title? Should I describe that tpu-v1-test also fails on the main branch?

@varun-sundar-rabindranath
Contributor

I verified that commit fe56180c7f9088361db86c1ef40dbc54fa931e1f on main that this PR is based off of also fails the tpu-v1-test. @CalebDu can you update the PR description and title so we can merge this. Thanks.

What should I update in the PR description and title? Should I describe that tpu-v1-test also fails on the main branch?

Since this PR is not doing any integration,

  • can you change the title integrate permute/unpermute kernel into deepgemm moe to Fix CUDA permute/unpermute for use with DeepGemm MoE
  • remove point 2, [2] integrate permute/unpermute kernel into deepgemm moe, from the PR description.

Thanks.

@CalebDu CalebDu changed the title [kernel] integrate permute/unpermute kernel into deepgemm moe Fix CUDA permute/unpermute for use with DeepGemm Mo Jul 27, 2025
@CalebDu CalebDu changed the title Fix CUDA permute/unpermute for use with DeepGemm Mo Fix CUDA permute/unpermute for use with DeepGemm Moe Jul 27, 2025
@CalebDu
Contributor Author

CalebDu commented Jul 27, 2025

I verified that commit fe56180c7f9088361db86c1ef40dbc54fa931e1f on main that this PR is based off of also fails the tpu-v1-test. @CalebDu can you update the PR description and title so we can merge this. Thanks.

What should I update in the PR description and title? Should I describe that tpu-v1-test also fails on the main branch?

Since this PR is not doing any integration,

  • can you change the title integrate permute/unpermute kernel into deepgemm moe to Fix CUDA permute/unpermute for use with DeepGemm MoE
  • remove point 2, [2] integrate permute/unpermute kernel into deepgemm moe, from the PR description.

Thanks.

Done

@vllm-bot vllm-bot merged commit 57c22e5 into vllm-project:main Jul 27, 2025
100 of 102 checks passed
liuyumoye pushed a commit to liuyumoye/vllm that referenced this pull request Jul 31, 2025
HsChen-sys pushed a commit to HsChen-sys/vllm that referenced this pull request Aug 1, 2025
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
heyselbi pushed a commit to heyselbi/vllm that referenced this pull request Aug 8, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025

Labels

performance (Performance-related issues), ready (ONLY add when PR is ready to merge/full CI is needed)
