Merged
lburzawa approved these changes on Jan 12, 2026
zhuyuhua-v pushed a commit that referenced this pull request on Jan 14, 2026
* Implement a4w4 moe kernel
* tune testcase for a4w4 based on deepseek r1 shapes
* refactor activation quant to use deepseek fp4 quant
* skip a4w4 unit tests on MI300
* Add layer1/layer2 suffix for easier profiling
* Add --num-weight-inits flag to average MoE benchmark results
Motivation
This PR adds a new kernel for mxfp4 x mxfp4 MoE GEMM (a4w4). Both weights and activations must use mxfp4 scaling; static fp4 scaling is not implemented because no models currently use it. The kernel config is tuned for DeepSeek R1-0528 FP4 shapes.
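For readers unfamiliar with mxfp4 scaling, here is a minimal pure-Python sketch of block-scaled FP4 quantization. It is not the Triton kernel from this PR. It assumes the usual Microscaling conventions (32-element blocks, one power-of-two shared scale per block, E2M1 elements), and the function names are hypothetical. Rounding is simplified to nearest-value with ties going to the smaller magnitude.

```python
import math

# Representable magnitudes of an FP4 E2M1 element (sign is handled separately).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # assumed MX block size: 32 elements share one scale
E2M1_EMAX = 2    # exponent of the largest E2M1 power of two (4.0 == 2**2)


def quantize_block_mxfp4(block):
    """Return (shared_exponent, quantized_values) for one block of floats.

    Hypothetical helper for illustration only.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Pick the shared power-of-two scale so the block maximum lands in FP4 range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for x in block:
        # Saturate at the largest FP4 magnitude, then snap to the nearest value.
        mag = min(abs(x) / scale, FP4_E2M1_VALUES[-1])
        nearest = min(FP4_E2M1_VALUES, key=lambda v: abs(v - mag))
        quantized.append(math.copysign(nearest, x))
    return shared_exp, quantized


def dequantize_block_mxfp4(shared_exp, quantized):
    """Multiply each FP4 value by the block's shared scale."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


if __name__ == "__main__":
    block = [1.0, -3.0, 0.5, 6.0] + [0.0] * (BLOCK_SIZE - 4)
    exp, q = quantize_block_mxfp4(block)
    print(exp, dequantize_block_mxfp4(exp, q)[:4])
```

In an a4w4 GEMM both operands carry such per-block scales, so the kernel has to apply the activation scale and the weight scale to each block of the dot product.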
Test Plan
Test cases are implemented in aiter/op_test/triton_tests/test_moe_gemm_a4w4.py
Test Result
python bench_moe_gemm_a4w4.py --shape 7168 4096 --experts 128 4 --op-regex .swiglu.
batch: 1 | Total latency (us): 106.52 | Kernel latency (us): 26.76 | TFLOPS: 8.779 | TBPS: 2.20
batch: 2 | Total latency (us): 119.73 | Kernel latency (us): 31.65 | TFLOPS: 14.84 | TBPS: 3.71
batch: 4 | Total latency (us): 137.05 | Kernel latency (us): 39.57 | TFLOPS: 23.74 | TBPS: 5.57
batch: 8 | Total latency (us): 188.48 | Kernel latency (us): 75.66 | TFLOPS: 24.83 | TBPS: 5.43
batch: 16 | Total latency (us): 272.97 | Kernel latency (us): 129.83 | TFLOPS: 28.95 | TBPS: 6.11
batch: 32 | Total latency (us): 383.48 | Kernel latency (us): 200.33 | TFLOPS: 37.52 | TBPS: 6.23
batch: 64 | Total latency (us): 490.69 | Kernel latency (us): 270.23 | TFLOPS: 55.63 | TBPS: 6.14
batch: 128 | Total latency (us): 560.20 | Kernel latency (us): 296.46 | TFLOPS: 101.4 | TBPS: 6.30
batch: 256 | Total latency (us): 614.14 | Kernel latency (us): 313.82 | TFLOPS: 191.6 | TBPS: 6.00
batch: 1024 | Total latency (us): 963.69 | Kernel latency (us): 464.46 | TFLOPS: 517.8 | TBPS: 4.09
batch: 4096 | Total latency (us): 1295.88 | Kernel latency (us): 585.66 | TFLOPS: 1643. | TBPS: 3.35
batch: 8192 | Total latency (us): 1959.17 | Kernel latency (us): 894.56 | TFLOPS: 2151. | TBPS: 2.28
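To compare rows across batch sizes, the log lines above can be parsed into records. The sketch below is a hypothetical helper, not part of the benchmark script. It assumes the line layout shown in this log and adds one derived field, the share of total latency spent inside the kernel.

```python
import re

# Pattern for one benchmark line in the format printed above.
LINE_RE = re.compile(
    r"batch:\s*(\d+)\s*\|\s*Total latency \(us\):\s*([\d.]+)\s*\|\s*"
    r"Kernel latency \(us\):\s*([\d.]+)\s*\|\s*"
    r"TFLOPS:\s*([\d.]+)\s*\|\s*TBPS:\s*([\d.]+)"
)


def parse_bench_line(line):
    """Parse one result line into a dict, or return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    batch, total_us, kernel_us, tflops, tbps = m.groups()
    return {
        "batch": int(batch),
        "total_us": float(total_us),
        "kernel_us": float(kernel_us),
        "tflops": float(tflops),  # float() also accepts values like "1643."
        "tbps": float(tbps),
        # Derived: fraction of end-to-end latency spent in the kernel itself.
        "kernel_fraction": float(kernel_us) / float(total_us),
    }


if __name__ == "__main__":
    sample = ("batch: 1 | Total latency (us): 106.52 | "
              "Kernel latency (us): 26.76 | TFLOPS: 8.779 | TBPS: 2.20")
    print(parse_bench_line(sample))
```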