[Triton] A8W8 blockscale GEMM tuning for Qwen3 #1195
Open
anhminhnguyenhoang wants to merge 23 commits into main from
Conversation
vgokhale (Contributor) reviewed Nov 20, 2025 and left a comment:
You have a ton of changes that your editor presumably auto added. Can you revert all changes except the ones to the blockscale GEMM?
Force-pushed from dfc4e1d to 0726067
Author: @vgokhale all clean now
Facilitate FP8 blockscale GEMM configs for the Qwen3 model, and improve performance by optimizing the block configs and by switching from loading BLOCK_SIZE_N FP32 scale factors of the B tensor to loading only the BLOCK_SIZE_N / GROUP_N unique scale factors and performing a group broadcast.
Performance comparison: main branch vs. this branch.
A finer-grained set of GEMM configs has been provisioned to maximize gains across the various M sizes of the Qwen3 model shapes.
The current kernel loads BLOCK_SIZE_N FP32 scale factors for the B tensor, which can be wasteful: given the (128, 128) blockscale shape used in op_tests/op_benchmarks/triton/bench_gemm_a8w8_blockscale.py, only BLOCK_SIZE_N / 128 of those values are actually unique. Adopting the idea in [Triton] e2e fused MoE for small N and fp8 blockscale MoE benching #1126, the BLOCK_SIZE_N / 128 unique scalars for the current tile are loaded and then group-broadcast (similar to torch.repeat_interleave). This potentially reduces wait time on tl.load.
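The group-broadcast idea can be illustrated with a minimal NumPy sketch (not the actual Triton kernel, which would do the equivalent with tl.load and in-register broadcasting); the BLOCK_SIZE_N and GROUP_N values below are example assumptions:

```python
import numpy as np

BLOCK_SIZE_N = 256  # tile width along N (example value)
GROUP_N = 128       # blockscale granularity along N, matching the (128, 128) scale shape

# Before: load BLOCK_SIZE_N fp32 scale factors per tile, most of them duplicates.
# After: load only the unique scales for the groups this tile covers ...
num_unique = BLOCK_SIZE_N // GROUP_N           # 2 unique scales for this tile
unique_scales = np.array([0.5, 2.0], dtype=np.float32)

# ... then group-broadcast them back to per-column scales,
# analogous to torch.repeat_interleave(unique_scales, GROUP_N).
b_scales = np.repeat(unique_scales, GROUP_N)   # shape (BLOCK_SIZE_N,)

assert b_scales.shape == (BLOCK_SIZE_N,)
assert b_scales[0] == 0.5 and b_scales[GROUP_N] == 2.0
```

The memory traffic for the B scales drops from BLOCK_SIZE_N loads to BLOCK_SIZE_N / GROUP_N loads per tile, with the broadcast happening in registers.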