
Conversation

@CalebDu
Contributor

@CalebDu CalebDu commented May 10, 2025

[1] Refactor the permute/unpermute kernels

  • remove the dependence on the token_expert_indices input and align the output with the triton kernel
  • refine parameters and remove unused ones

[2] Integrate the permute/unpermute kernels into DeepGemm MoE
[3] Fix a clamp that corrupts the argsort result (see the sketch below the snippet), cc @bnellnm

sorted_token_ids = sorted_token_ids.clamp(max=num_tokens - 1)
expert_ids = torch.repeat_interleave(expert_ids, block_m, dim=0)
inv_perm = torch.argsort(sorted_token_ids)[:num_tokens]
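
For context, a minimal sketch of why clamping before the argsort goes wrong (toy values, not the PR's code; padding slots are assumed to be filled with a sentinel equal to num_tokens):

import torch

num_tokens = 4
# Toy example: slots beyond the real tokens are padded with num_tokens (= 4).
sorted_token_ids = torch.tensor([2, 0, 4, 1, 3, 4, 4, 4])

# Clamping first makes every padded slot collide with token id 3, so argsort
# (not stable by default) may place a padded slot where token 3's position
# belongs, corrupting inv_perm.
bad = torch.argsort(sorted_token_ids.clamp(max=num_tokens - 1))[:num_tokens]

# Argsorting the unclamped ids keeps all padding after the real tokens.
good = torch.argsort(sorted_token_ids)[:num_tokens]  # tensor([1, 3, 0, 4])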

I will attach performance data on H100 later.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@CalebDu
Contributor Author

CalebDu commented May 10, 2025

Benchmark setup:

python3 benchmarks/kernels/benchmark_moe.py --dtype=fp8_w8a8 -tp ${TP} --model deepseek-ai/DeepSeek-V3 --trust-remote-code --use-deep-gemm

How to switch to the Python permute/unpermute implementation: comment out the customized permute/unpermute kernel calls at the two locations below.
https://github.com/CalebDu/vllm/blob/78b24807ae473ddf395a88df47de78033c16f515/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py#L289-L296
https://github.com/CalebDu/vllm/blob/78b24807ae473ddf395a88df47de78033c16f515/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py#L326C1-L331C59

Performance on H100 (unit: us)

| Batch | DeepGemm (python permute) TP=1 | DeepGemm (customized permute) TP=1 | DeepGemm (python permute) TP=2 | DeepGemm (customized permute) TP=2 | DeepGemm (python permute) TP=4 | DeepGemm (customized permute) TP=4 | DeepGemm (python permute) TP=8 | DeepGemm (customized permute) TP=8 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 128 | 6415.07 | 5506.54 | 3815.51 | 2943.05 | 2609.36 | 1727.12 | 1921.54 | 1067.80 |
| 256 | 6699.19 | 5631.23 | 4021.14 | 3088.93 | 2781.02 | 1868.46 | 2083.59 | 1149.40 |
| 512 | 6936.77 | 5673.94 | 4293.68 | 3279.71 | 3066.71 | 2037.61 | 2278.12 | 1259.98 |
| 1024 | 7786.69 | 5745.72 | 4848.20 | 3673.02 | 3515.67 | 2301.33 | 2720.41 | 1543.88 |
| 1536 | 8544.97 | 5967.12 | 5383.77 | 4015.44 | 4021.17 | 2643.02 | 3203.08 | 1811.40 |
| 2048 | 9283.23 | 6090.98 | 5905.14 | 4439.83 | 4515.79 | 2926.30 | 3657.95 | 2093.40 |
| 3072 | 10828.49 | 6454.82 | 7159.85 | 5183.76 | 5516.52 | 3554.23 | 4625.18 | 2639.90 |
| 4096 | 12025.13 | 7546.53 | 8233.51 | 5857.30 | 6508.91 | 4093.26 | 5654.33 | 3213.34 |

Re-benchmark including the Triton kernels (unit: us, values rounded to 0.1 us):

| Batch | DeepGemm (python permute) TP=1 | DeepGemm (customized permute) TP=1 | Triton 3.1 TP=1 | Triton 3.3 TP=1 | DeepGemm (python permute) TP=2 | DeepGemm (customized permute) TP=2 | Triton 3.1 TP=2 | Triton 3.3 TP=2 | DeepGemm (python permute) TP=4 | DeepGemm (customized permute) TP=4 | Triton 3.1 TP=4 | Triton 3.3 TP=4 | DeepGemm (python permute) TP=8 | DeepGemm (customized permute) TP=8 | Triton 3.1 TP=8 | Triton 3.3 TP=8 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 3602.5 | 2806.3 | 509.7 | 212.2 | 2348.6 | 1554.7 | 266.5 | 129.4 | 1705.8 | 889.9 | 154.0 | 94.9 | 1418.3 | 592.1 | 110.0 | 81.0 |
| 2 | 3743.2 | 2914.4 | 890.1 | 397.9 | 2414.3 | 1583.9 | 480.0 | 219.3 | 1740.5 | 924.0 | 290.6 | 150.0 | 1424.5 | 608.0 | 198.1 | 125.1 |
| 4 | 3871.7 | 3073.0 | 1592.4 | 692.2 | 2489.2 | 1677.9 | 849.5 | 388.5 | 1778.7 | 964.2 | 485.1 | 214.7 | 1458.3 | 635.9 | 292.3 | 152.4 |
| 8 | 4138.6 | 3303.9 | 2832.4 | 1233.1 | 2590.9 | 1774.5 | 1471.9 | 632.4 | 1848.7 | 1034.4 | 794.5 | 377.5 | 1487.1 | 676.3 | 473.5 | 212.5 |
| 16 | 4597.3 | 3818.0 | 4888.2 | 2121.8 | 2852.5 | 2053.5 | 2478.8 | 1073.3 | 2031.3 | 1208.6 | 1377.9 | 579.9 | 1560.8 | 742.2 | 740.7 | 380.3 |
| 24 | 4977.6 | 4183.8 | 6777.2 | 2817.7 | 3061.7 | 2237.6 | 3408.0 | 1408.7 | 2138.7 | 1312.5 | 1731.0 | 755.9 | 1627.1 | 797.9 | 937.5 | 423.1 |
| 32 | 5230.9 | 4425.2 | 7806.7 | 3352.6 | 3224.5 | 2360.5 | 3976.4 | 1666.5 | 2220.7 | 1387.4 | 2065.6 | 864.6 | 1672.7 | 852.6 | 1168.2 | 497.9 |
| 48 | 5727.7 | 4929.9 | 9916.4 | 4108.0 | 3459.0 | 2606.1 | 4855.7 | 2026.8 | 2388.5 | 1554.2 | 2593.1 | 1047.2 | 1764.0 | 930.4 | 1303.3 | 605.6 |
| 64 | 6016.5 | 5198.5 | 11005.1 | 4563.1 | 3587.0 | 2736.6 | 5392.6 | 2253.7 | 2472.7 | 1630.9 | 2783.7 | 1169.8 | 1827.6 | 979.3 | 1477.1 | 650.4 |
| 96 | 6321.8 | 5458.4 | 12071.9 | 5024.1 | 3768.1 | 2887.7 | 5813.4 | 2477.6 | 2562.9 | 1709.1 | 2992.6 | 1268.0 | 1883.1 | 1030.7 | 1608.5 | 713.2 |
| 128 | 6385.9 | 5585.2 | 12427.0 | 5209.0 | 3773.6 | 2953.5 | 6208.6 | 2583.9 | 2611.6 | 1744.3 | 3119.1 | 1325.3 | 1932.5 | 1058.4 | 1588.3 | 754.1 |
| 256 | 6702.0 | 5804.1 | 12395.0 | 5417.9 | 3971.8 | 3068.3 | 6324.7 | 2705.9 | 2772.2 | 1852.9 | 3220.1 | 1402.0 | 2058.6 | 1170.4 | 1718.4 | 807.9 |
| 512 | 7051.7 | 6053.5 | 13117.8 | 7817.0 | 4293.3 | 3310.6 | 6644.2 | 4067.0 | 3075.3 | 2037.6 | 3421.5 | 2237.7 | 2268.0 | 1265.3 | 1811.3 | 1341.6 |
| 1024 | 7784.9 | 6614.2 | 13313.1 | 8117.6 | 4878.8 | 3677.7 | 6797.3 | 4259.6 | 3509.0 | 2304.2 | 3541.7 | 2360.5 | 2745.7 | 1536.9 | 1891.0 | 1430.6 |
| 1536 | 8418.2 | 7044.0 | 13784.3 | 8481.8 | 5431.6 | 4012.5 | 7018.6 | 4471.6 | 4007.8 | 2609.7 | 3674.5 | 2482.0 | 3180.8 | 1822.0 | 1989.9 | 1526.6 |
| 2048 | 9282.1 | 7711.0 | 19847.0 | 11405.2 | 6026.8 | 4440.6 | 10098.1 | 5998.3 | 4513.0 | 2964.1 | 5223.0 | 3292.0 | 3672.4 | 2086.2 | 2816.1 | 1953.9 |
| 3072 | 10832.4 | 8821.0 | 27409.6 | 15059.1 | 7139.3 | 5179.8 | 13843.5 | 7970.7 | 5516.9 | 3547.9 | 7180.6 | 4363.3 | 4633.9 | 2637.9 | 3860.5 | 2620.4 |
| 4096 | 12191.9 | 9577.8 | 33881.4 | 18705.3 | 8246.2 | 5723.2 | 17242.1 | 9867.0 | 6593.5 | 4200.1 | 8927.5 | 5375.3 | 5651.0 | 3206.5 | 4784.9 | 3189.8 |

Going from Triton 3.1 to Triton 3.3 dramatically speeds up the Triton fused MoE kernel. I think DeepGemm MoE is slower than Triton when M is small because the DeepGemm grouped GEMM does not support skipping invalid blocks via m_indices = -1, so it must run the GEMM over the entire permuted, aligned row range, even though many blocks contain only padding (no routed tokens):

permuted_row_size = (permuted_row_size + n_expert * (align_block_size - 1) +
                     align_block_size - 1) // align_block_size * align_block_size
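
To see how much padding this worst-case alignment can add, a quick worked example with hypothetical sizes:

# Hypothetical sizes: 8 experts, align_block_size = 128, 1000 permuted rows.
n_expert, align_block_size, permuted_row_size = 8, 128, 1000

# Each expert may contribute up to (align_block_size - 1) pad rows, and the
# total is rounded up to a multiple of align_block_size.
aligned = (permuted_row_size + n_expert * (align_block_size - 1) +
           align_block_size - 1) // align_block_size * align_block_size
print(aligned)  # 2048 -- over half the rows are padding at this small M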

@mergify

mergify bot commented May 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @CalebDu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 13, 2025
@CalebDu CalebDu force-pushed the permute_integration branch from 78b2480 to 4dc1a00 Compare May 16, 2025 14:48
@mergify mergify bot removed the needs-rebase label May 16, 2025
@CalebDu CalebDu force-pushed the permute_integration branch from e20416d to 3f967f5 Compare May 16, 2025 15:08
@CalebDu
Contributor Author

CalebDu commented May 20, 2025

Re-benchmark after #15956 (unit: us, values rounded to 0.1 us):

| Batch | Modularized DeepGemm (python permute) TP=1 | Modularized DeepGemm (only customized permute) TP=1 | No Modularized DeepGemm (customized permute/unpermute) TP=1 | Triton 3.1 TP=1 | Triton 3.3 TP=1 | Modularized DeepGemm (python permute) TP=2 | Modularized DeepGemm (only customized permute) TP=2 | No Modularized DeepGemm (customized permute/unpermute) TP=2 | Triton 3.1 TP=2 | Triton 3.3 TP=2 | Modularized DeepGemm (python permute) TP=4 | Modularized DeepGemm (only customized permute) TP=4 | No Modularized DeepGemm (customized permute/unpermute) TP=4 | Triton 3.1 TP=4 | Triton 3.3 TP=4 | Modularized DeepGemm (python permute) TP=8 | Modularized DeepGemm (only customized permute) TP=8 | No Modularized DeepGemm (customized permute/unpermute) TP=8 | Triton 3.1 TP=8 | Triton 3.3 TP=8 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 128 | 6671.1 | 5897.5 | 5585.2 | 12427.0 | 5209.0 | 4002.3 | 3190.9 | 2953.5 | 6208.6 | 2583.9 | 2768.7 | 1968.5 | 1744.3 | 3119.1 | 1325.3 | 2081.5 | 1273.7 | 1058.4 | 1588.3 | 754.1 |
| 256 | 6955.7 | 6151.8 | 5804.1 | 12395.0 | 5417.9 | 4172.1 | 3404.2 | 3068.3 | 6324.7 | 2705.9 | 2959.2 | 2150.5 | 1852.9 | 3220.1 | 1402.0 | 2245.8 | 1451.8 | 1170.4 | 1718.4 | 807.9 |
| 512 | 7276.9 | 6592.9 | 6053.5 | 13117.8 | 7817.0 | 4504.2 | 3740.1 | 3310.6 | 6644.2 | 4067.0 | 3240.6 | 2481.4 | 2037.6 | 3421.5 | 2237.7 | 2455.4 | 1697.3 | 1265.3 | 1811.3 | 1341.6 |
| 1024 | 8083.1 | 7423.0 | 6614.2 | 13313.1 | 8117.6 | 5129.6 | 4454.8 | 3677.7 | 6797.3 | 4259.6 | 3717.3 | 3038.9 | 2304.2 | 3541.7 | 2360.5 | 2952.1 | 2259.7 | 1536.9 | 1891.0 | 1430.6 |
| 1536 | 8883.1 | 8261.5 | 7044.0 | 13784.3 | 8481.8 | 5722.7 | 5038.7 | 4012.5 | 7018.6 | 4471.6 | 4281.0 | 3635.4 | 2609.7 | 3674.5 | 2482.0 | 3405.9 | 2800.2 | 1822.0 | 1989.9 | 1526.6 |
| 2048 | 9670.2 | 9100.1 | 7711.0 | 19847.0 | 11405.2 | 6226.2 | 5658.1 | 4440.6 | 10098.1 | 5998.3 | 4760.5 | 4202.5 | 2964.1 | 5223.0 | 3292.0 | 3877.3 | 3298.1 | 2086.2 | 2816.1 | 1953.9 |
| 3072 | 11096.5 | 10726.9 | 8821.0 | 27409.6 | 15059.1 | 7456.4 | 6907.3 | 5179.8 | 13843.5 | 7970.7 | 5795.1 | 5284.9 | 3547.9 | 7180.6 | 4363.3 | 4900.7 | 4375.7 | 2637.9 | 3860.5 | 2620.4 |
| 4096 | 12252.8 | 11754.6 | 9577.8 | 33881.4 | 18705.3 | 8382.6 | 8035.4 | 5723.2 | 17242.1 | 9867.0 | 6811.0 | 6384.4 | 4200.1 | 8927.5 | 5375.3 | 5897.0 | 5522.4 | 3206.5 | 4784.9 | 3189.8 |

I only integrated the customized permute kernel into the modularized fused MoE, because the customized unpermute kernel needs an extra parameter, inv_perm, which would break the modularized interface in Experts and MoEPrepareAndFinalizeNoEP. So I integrated the permute kernel first, and attached customized permute+unpermute performance data in the table above.
@bnellnm @tlrmchlsmth

@CalebDu CalebDu force-pushed the permute_integration branch from 22a9f55 to 5b24c2c Compare May 20, 2025 13:25
@mergify

mergify bot commented May 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @CalebDu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 23, 2025
Comment on lines 130 to 137
Contributor

Can moe_permute get topk internally so that it doesn't need to be passed in as a separate argument?

Contributor Author

OK, I will get topk from topk_ids.
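
A minimal sketch of that change, assuming topk_ids has shape [num_tokens, topk]:

# topk is recoverable from the tensor itself, so the extra argument goes away.
topk = topk_ids.size(1)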

Comment on lines +154 to +158
Contributor

@bnellnm bnellnm May 27, 2025

Can the scale permutation also be done by the CUDA kernel?

Then it seems permuted_idx would not be required?

Contributor Author

Scale permutation has only a small cost, so there is no big performance gain from fusing it into the CUDA kernel. I want to keep it in Python for flexibility, and I will remove permuted_idx from the return value.
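
For illustration, what the Python-side scale permutation amounts to — a sketch with assumed names and shapes, not the PR's exact code:

import torch

# Assumed: a1q_scale is [num_tokens, num_scale_blocks] and src_row[r] is the
# source token feeding permuted row r. The permutation is then a cheap gather,
# which is why fusing it into the CUDA kernel buys little.
def permute_scales(a1q_scale: torch.Tensor, src_row: torch.Tensor) -> torch.Tensor:
    return a1q_scale[src_row]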

@bnellnm
Contributor

bnellnm commented May 27, 2025

LGTM. Just had a few minor comments.

@CalebDu CalebDu force-pushed the permute_integration branch from 5b24c2c to 759cd54 Compare May 29, 2025 12:52
@mergify mergify bot removed the needs-rebase label May 29, 2025
@CalebDu
Contributor Author

CalebDu commented May 29, 2025

@bnellnm I updated the code per your review. I noticed that the latest DeepGemm grouped GEMM supports skipping useless computation for unaligned M (deepseek-ai/DeepGEMM@8dfa329), so I tested filling m_indices with -1 and it passes the MoE test. I wanted to add a runtime check of the DeepGemm version so we could skip useless computation with -1, but DeepGemm has no __version__ to check, so I still have to fill invalid rows with 0 in m_indices.
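
What I had in mind was a runtime feature probe instead of a version check — a purely hypothetical sketch, where the probed attribute is a placeholder and not a real DeepGemm name:

import deep_gemm

# DeepGemm exposes no __version__, so the only runtime option would be probing
# for a symbol that ships together with the -1 skip support.
# "some_symbol_added_with_skip_support" is a placeholder, not a real name.
supports_skip = hasattr(deep_gemm, "some_symbol_added_with_skip_support")

# Rows with no routed token: -1 lets newer DeepGemm skip the block entirely;
# older versions need a valid expert id, so fall back to 0.
fill_value = -1 if supports_skip else 0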

@bnellnm
Contributor

bnellnm commented Jun 2, 2025

Modularized DeepGemm (only customized permute) TP=8

@CalebDu what's the difference between the Modularized DeepGemm (only customized permute) and No Modularized DeepGemm (customized permute/unpermute) columns?

Is it basically DeepGemm w/ custom permute vs. DeepGemm w/ custom permute+unpermute? There shouldn't be any significant perf differences between modularized and un-modularized kernels.

Contributor

@bnellnm bnellnm left a comment

LGTM! Thanks @CalebDu

@CalebDu
Contributor Author

CalebDu commented Jun 3, 2025

what's the difference between the Modularized DeepGemm (only customized permute) and No Modularized DeepGemm (customized permute/unpermute) columns?

Modularized DeepGemm (only customized permute) integrates only the customized permute kernel, on top of your modularization PR and without breaking the modularized interface. No Modularized DeepGemm (customized permute/unpermute) integrates both the permute and unpermute kernels, on the code from before your modularization PR.

Is it basically DeepGemm w/custom permute vs. DeepGemm w/custom permute+unpermute? There shouldn't be any significant perf differences between modularized and un-modularized kernels.

The perf difference comes from the unpermute kernel. Based on the permute/unpermute kernel benchmark in #14568 (comment), the customized unpermute kernel brings a large gain (thousands of us) when the batch is large.

@mergify

mergify bot commented Jun 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @CalebDu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 3, 2025
@CalebDu CalebDu force-pushed the permute_integration branch from 759cd54 to 62b4abb Compare June 4, 2025 14:15
@mergify mergify bot removed the needs-rebase label Jun 4, 2025
CalebDu added 3 commits July 24, 2025 18:04
@CalebDu CalebDu force-pushed the permute_integration branch from 54de5d3 to 8e17c51 Compare July 25, 2025 01:05
@CalebDu
Contributor Author

CalebDu commented Jul 25, 2025

Hey @CalebDu - can you try rebasing the PR. I think that should make all the tests pass. Thanks 🙌

CI has 3 failures: Distributed Tests (4 GPUs), TPU V1 Test, and Entrypoints Test (API Server). They seem unrelated to this PR according to the CI logs.

@varun-sundar-rabindranath
Contributor

@CalebDu I have triggered a retry on the failed tests. What is the commit on main this PR is rebased on? I can check whether those tests also fail on main and make an argument for a force merge. Thanks.

@varun-sundar-rabindranath
Contributor

I verified that commit fe56180c7f9088361db86c1ef40dbc54fa931e1f on main that this PR is based off of also fails the tpu-v1-test. @CalebDu can you update the PR description and title so we can merge this. Thanks.

@CalebDu
Contributor Author

CalebDu commented Jul 27, 2025

I verified that commit fe56180c7f9088361db86c1ef40dbc54fa931e1f on main that this PR is based off of also fails the tpu-v1-test. @CalebDu can you update the PR description and title so we can merge this. Thanks.

What should I update in the PR description and title? Should I describe that tpu-v1-test also fails on the main branch?

@varun-sundar-rabindranath
Contributor

I verified that commit fe56180c7f9088361db86c1ef40dbc54fa931e1f on main that this PR is based off of also fails the tpu-v1-test. @CalebDu can you update the PR description and title so we can merge this. Thanks.

What should I update in the PR description and title? Should I describe that tpu-v1-test also fails on the main branch?

Since this PR is not doing any integration,

  • can you change the title integrate permute/unpermute kernel into deepgemm moe to Fix CUDA permute/unpermute for use with DeepGemm MoE
  • remove point 2, [2] integrate permute/unpermute kernel into deepgemm moe, from the PR description.

Thanks.

@CalebDu CalebDu changed the title [kernel] integrate permute/unpermute kernel into deepgemm moe Fix CUDA permute/unpermute for use with DeepGemm Mo Jul 27, 2025
@CalebDu CalebDu changed the title Fix CUDA permute/unpermute for use with DeepGemm Mo Fix CUDA permute/unpermute for use with DeepGemm Moe Jul 27, 2025
@CalebDu
Contributor Author

CalebDu commented Jul 27, 2025

I verified that commit fe56180c7f9088361db86c1ef40dbc54fa931e1f on main that this PR is based off of also fails the tpu-v1-test. @CalebDu can you update the PR description and title so we can merge this. Thanks.

What should I update in the PR description and title? Should I describe that tpu-v1-test also fails on the main branch?

Since this PR is not doing any integration,

  • can you change the title integrate permute/unpermute kernel into deepgemm moe to Fix CUDA permute/unpermute for use with DeepGemm MoE
  • remove point 2, [2] integrate permute/unpermute kernel into deepgemm moe, from the PR description.

Thanks.

Done

@vllm-bot vllm-bot merged commit 57c22e5 into vllm-project:main Jul 27, 2025
100 of 102 checks passed
liuyumoye pushed a commit to liuyumoye/vllm that referenced this pull request Jul 31, 2025
HsChen-sys pushed a commit to HsChen-sys/vllm that referenced this pull request Aug 1, 2025
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
heyselbi pushed a commit to heyselbi/vllm that referenced this pull request Aug 8, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025

Labels

performance (Performance-related issues), ready (ONLY add when PR is ready to merge/full CI is needed)
