[TRITON] Support gfx1201 for triton gemm_a8w8_blockscale#1829

Open
big-yellow-duck wants to merge 29 commits into ROCm:main from EmbeddedLLM:support_gfx1201_min

@big-yellow-duck big-yellow-duck commented Jan 13, 2026

Motivation

This PR adds preliminary gfx1201 support for the Triton gemm_a8w8_blockscale kernel, which is used by Qwen/Qwen3-0.6B-FP8.

Moving forward, more Triton kernels can be tuned to optimize performance on gfx1201.

Technical Details

  • Added a base tuning script that can be adapted to other operations.
  • Added a tuning script that tunes the Triton kernel parameters for gemm_a8w8_blockscale.
  • The tuning script benchmarks different kernel parameters, such as num_warps and waves_per_eu, to find the configuration with the lowest execution time for a set of operations.
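
The parameter sweep described above can be sketched roughly as follows. This is a minimal illustration, not the tuning script from this PR: `candidate_configs`, `tune`, and the `run_kernel` callable are hypothetical names, and only the parameter names num_warps and waves_per_eu come from the description. It assumes `run_kernel` blocks until the kernel has finished (on a GPU that means synchronizing before returning) and that configurations the hardware rejects raise an exception.

```python
import itertools
import time
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Config = Dict[str, int]


def candidate_configs(space: Dict[str, List[int]]) -> Iterator[Config]:
    """Yield every combination of parameter values in the search space."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def tune(
    run_kernel: Callable[..., None],
    space: Dict[str, List[int]],
    warmup: int = 2,
    iters: int = 5,
) -> Tuple[Optional[Config], float]:
    """Return the config with the lowest mean wall-clock time per call.

    run_kernel is a hypothetical callable that launches the kernel with the
    given parameters and returns only once the work is complete.
    """
    best_cfg: Optional[Config] = None
    best_time = float("inf")
    for cfg in candidate_configs(space):
        try:
            # Warm-up runs absorb one-time costs such as compilation.
            for _ in range(warmup):
                run_kernel(**cfg)
            start = time.perf_counter()
            for _ in range(iters):
                run_kernel(**cfg)
            elapsed = (time.perf_counter() - start) / iters
        except Exception:
            # Skip configs that fail to compile or launch.
            continue
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg, best_time


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a GPU.
    def fake_kernel(num_warps: int, waves_per_eu: int) -> None:
        sum(range(1000 * num_warps + waves_per_eu))

    search_space = {"num_warps": [2, 4, 8], "waves_per_eu": [0, 2, 4]}
    print(tune(fake_kernel, search_space))
```

An exhaustive grid is the simplest option for a small search space; a larger space would likely need pruning or a smarter search strategy.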

Test Plan

Test the tuned configs using aiter/op_tests/triton_tests/gemm/basic/test_gemm_a8w8_blockscale.py:

pytest op_tests/triton_tests/gemm/basic/test_gemm_a8w8_blockscale.py

Test Result

126 tests passed.
2 skipped (cases where N or K does not meet the preshuffle kernel constraints: N must be a multiple of 16, K must be a multiple of 32).
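
The skip condition can be expressed as a small shape check. This is only a sketch of the constraint as stated here; the function name `meets_preshuffle_constraints` is hypothetical and is not taken from the test file.

```python
def meets_preshuffle_constraints(n: int, k: int) -> bool:
    """Check the stated preshuffle constraints on the GEMM shape.

    N must be a multiple of 16 and K must be a multiple of 32.
    """
    return n % 16 == 0 and k % 32 == 0


if __name__ == "__main__":
    # Illustrative shapes only; these are not the shapes used by the tests.
    for n, k in [(1024, 1024), (1000, 1024), (1024, 48)]:
        status = "run" if meets_preshuffle_constraints(n, k) else "skip"
        print(f"N={n}, K={k}: {status}")
```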

Submission Checklist

big-yellow-duck and others added 14 commits January 5, 2026 08:07
Co-authored-by: NAME Amir Balwel amoooori04@gmail.com
Co-authoured-by: Amir Balwel amoooori04@gmail.com
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: Jeff Aw <jeffaw99@hotmail.com>

Signed-off-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: Jeff Aw <jeffaw99@hotmail.com>

Signed-off-by: Amir Balwel <amoooori04@gmail.com>
 Co-authored-by: Amir Balwel <amoooori04@gmail.com>
@big-yellow-duck big-yellow-duck changed the title Support gfx1201 min Support gfx1201 for triton gemm_a8w8_blockscale Jan 16, 2026
@big-yellow-duck big-yellow-duck marked this pull request as ready for review January 23, 2026 02:50
@big-yellow-duck big-yellow-duck requested a review from a team January 23, 2026 02:50
@azaidy azaidy changed the title Support gfx1201 for triton gemm_a8w8_blockscale [TRITON] Support gfx1201 for triton gemm_a8w8_blockscale Jan 23, 2026
@azaidy azaidy requested review from azaidy and vgokhale January 23, 2026 03:29
@big-yellow-duck
Author

[Figures: ttft_comparison, latency_comparison, benchmark_comparison]

Using the aiter gemm_w8a8 kernels in vLLM shows a performance uplift on gfx1201 when running Qwen3-0.6B-FP8 at higher input and output token counts.
