
[TRITON] Add MoE GEMM a4w4 kernel#1358

Merged
nsusanto merged 21 commits into main from nsusanto/moe_gemm_a4w4
Jan 12, 2026

Conversation


@nsusanto (Contributor) commented Nov 6, 2025

Motivation

This PR adds a new kernel for mxfp4 x mxfp4 MoE GEMM. Both weights and activations must use mx (block) scaling; static fp4 scaling is not implemented since no current models use it. The kernel config is tuned for DeepSeek R1-0528 FP4 shapes.
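As background on the mx scaling referenced above: MX-format fp4 shares one power-of-two scale per block of 32 values, with each value stored as a 4-bit E2M1 code. A minimal NumPy sketch of that scheme (the helper names, rounding choices, and scale encoding here are illustrative assumptions, not this kernel's implementation):

```python
import numpy as np

# Signed E2M1 (fp4) representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([FP4_GRID, -FP4_GRID])  # 16 signed codes

def quantize_mxfp4(x, block=32):
    """Quantize a 1-D array to (fp4 codes, one shared scale per block)."""
    xb = x.reshape(-1, block)
    amax = np.abs(xb).max(axis=1, keepdims=True)
    # Power-of-two block scale chosen so the block max fits under 6.0,
    # mimicking an E8M0 shared exponent.
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 2.0**-126) / 6.0))
    # Snap each scaled value to the nearest representable fp4 value.
    idx = np.abs(xb[:, :, None] / scale[:, :, None]
                 - FP4_GRID[None, None, :]).argmin(axis=2)
    return idx.astype(np.uint8), scale.squeeze(1)

def dequantize_mxfp4(idx, scale, block=32):
    """Reconstruct real values from fp4 codes and per-block scales."""
    return (FP4_GRID[idx] * scale[:, None]).reshape(-1)
```

Values whose scaled form lands exactly on the E2M1 grid round-trip losslessly; everything else picks up block-relative quantization error, which is the trade the a4w4 path accepts for halved bandwidth versus fp8.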

Test Plan

Test cases are implemented in aiter/op_test/triton_tests/test_moe_gemm_a4w4.py
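A common pattern for such tests is to check the quantized kernel against a naive dense reference; a hedged sketch of what that reference might look like (the function name, shapes, and routing layout are assumptions for illustration, not the actual test code):

```python
import torch

def ref_moe_gemm(x, w, topk_ids, topk_weights):
    """Naive per-token MoE GEMM reference.

    x: (tokens, hidden) activations
    w: (experts, out_dim, hidden) expert weights
    topk_ids / topk_weights: (tokens, top_k) routing tables
    """
    tokens, top_k = topk_ids.shape
    out = torch.zeros(tokens, w.shape[1], dtype=torch.float32)
    for t in range(tokens):
        for slot in range(top_k):
            e = int(topk_ids[t, slot])
            # Each routed token accumulates its expert's GEMM output,
            # weighted by the router probability.
            out[t] += topk_weights[t, slot] * (w[e].float() @ x[t].float())
    return out
```

A unit test would then quantize x and w to mxfp4, run the Triton kernel, and assert closeness to this reference within a tolerance sized for fp4 rounding error.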

Test Result

```
python bench_moe_gemm_a4w4.py --shape 7168 4096 --experts 128 4 --op-regex .swiglu.
```

```
batch:    1 | Total latency (us):  106.52 | Kernel latency (us):  26.76 | TFLOPS:    8.779 | TBPS: 2.20
batch:    2 | Total latency (us):  119.73 | Kernel latency (us):  31.65 | TFLOPS:   14.84  | TBPS: 3.71
batch:    4 | Total latency (us):  137.05 | Kernel latency (us):  39.57 | TFLOPS:   23.74  | TBPS: 5.57
batch:    8 | Total latency (us):  188.48 | Kernel latency (us):  75.66 | TFLOPS:   24.83  | TBPS: 5.43
batch:   16 | Total latency (us):  272.97 | Kernel latency (us): 129.83 | TFLOPS:   28.95  | TBPS: 6.11
batch:   32 | Total latency (us):  383.48 | Kernel latency (us): 200.33 | TFLOPS:   37.52  | TBPS: 6.23
batch:   64 | Total latency (us):  490.69 | Kernel latency (us): 270.23 | TFLOPS:   55.63  | TBPS: 6.14
batch:  128 | Total latency (us):  560.20 | Kernel latency (us): 296.46 | TFLOPS:  101.4   | TBPS: 6.30
batch:  256 | Total latency (us):  614.14 | Kernel latency (us): 313.82 | TFLOPS:  191.6   | TBPS: 6.00
batch: 1024 | Total latency (us):  963.69 | Kernel latency (us): 464.46 | TFLOPS:  517.8   | TBPS: 4.09
batch: 4096 | Total latency (us): 1295.88 | Kernel latency (us): 585.66 | TFLOPS: 1643     | TBPS: 3.35
batch: 8192 | Total latency (us): 1959.17 | Kernel latency (us): 894.56 | TFLOPS: 2151     | TBPS: 2.28
```
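For orientation, TFLOPS figures like those above conventionally derive from the 2*M*N*K GEMM FLOP count divided by kernel latency. A minimal sketch under the assumption of top-k routing and two GEMM layers per routed token (the function and its FLOP accounting are illustrative, not this benchmark's exact bookkeeping):

```python
def moe_gemm_tflops(batch, hidden, inter, top_k, latency_us, gemm_layers=2):
    """Effective TFLOPS for a MoE GEMM pass.

    Assumes each routed token performs `gemm_layers` dense GEMMs of
    shape (hidden x inter), each costing 2 * hidden * inter FLOPs.
    """
    flops = 2 * batch * top_k * hidden * inter * gemm_layers
    seconds = latency_us * 1e-6
    return flops / seconds / 1e12
```

This makes the shape of the table intuitive: FLOPs grow linearly with batch while kernel latency grows sublinearly until the GPU saturates, so the effective TFLOPS climbs steeply, and the bandwidth-bound small-batch regime is why TBPS rather than TFLOPS is the telling metric at batch <= 256.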

Submission Checklist

@nsusanto nsusanto force-pushed the nsusanto/moe_gemm_a4w4 branch 2 times, most recently from c554aca to a902874 Compare November 12, 2025 20:08
@nsusanto nsusanto force-pushed the nsusanto/moe_gemm_a4w4 branch 4 times, most recently from 2e7c3ba to f35b2bb Compare December 3, 2025 19:32
@nsusanto nsusanto force-pushed the nsusanto/moe_gemm_a4w4 branch from 0e9a430 to aa64f61 Compare December 12, 2025 19:04
@nsusanto nsusanto requested a review from a team December 12, 2025 19:04
@nsusanto nsusanto force-pushed the nsusanto/moe_gemm_a4w4 branch 2 times, most recently from 5f33013 to 8111ed1 Compare December 12, 2025 19:07
@nsusanto nsusanto force-pushed the nsusanto/moe_gemm_a4w4 branch from 8111ed1 to 77bfdb1 Compare December 15, 2025 18:33
@nsusanto nsusanto force-pushed the nsusanto/moe_gemm_a4w4 branch 2 times, most recently from 37029fb to 4610e4d Compare December 22, 2025 16:14
@nsusanto nsusanto force-pushed the nsusanto/moe_gemm_a4w4 branch from 4610e4d to 8353f61 Compare December 22, 2025 16:18
@nsusanto nsusanto force-pushed the nsusanto/moe_gemm_a4w4 branch from f45a0e7 to 0dd7dbe Compare January 6, 2026 21:49
@nsusanto nsusanto force-pushed the nsusanto/moe_gemm_a4w4 branch from 0dd7dbe to e11d836 Compare January 6, 2026 21:54
@nsusanto nsusanto force-pushed the nsusanto/moe_gemm_a4w4 branch from 5c11894 to a4a859a Compare January 8, 2026 16:04
@lburzawa lburzawa self-requested a review January 9, 2026 12:40

@azaidy azaidy left a comment


LGTM!

@nsusanto nsusanto merged commit 9eecdec into main Jan 12, 2026
19 checks passed
@nsusanto nsusanto deleted the nsusanto/moe_gemm_a4w4 branch January 12, 2026 15:02
zhuyuhua-v pushed a commit that referenced this pull request Jan 14, 2026
* Implement a4w4 moe kernel

* tune testcase for a4w4 based on deepseek r1 shapes

* refactor activation quant to use deepseek fp4 quant

* skip a4w4 unit tests on MI300

* Add layer1/layer2 suffix for easier profiling

* Add --num-weight-inits flag to average MoE benchmark results
---------
