
[dev/perf][Update] update dev/perf branch to main branch commit 1ec04f7#1838

Merged
zhuyuhua-v merged 122 commits into dev/perf from
yuhua/dev/perf
Jan 16, 2026
Conversation

@zhuyuhua-v
Contributor

Motivation

[dev/perf][Update] update dev/perf branch to main branch commit 1ec04f7

Author: azaidy
Date: Tue Jan 13 21:12:51 2026 -0500
Fix INT4 QR TP8 boundary condition (#1834)

Technical Details

This update differs from the main branch in two key aspects:

  1. It retains the Alibaba-specific MoE modifications in dev/perf. These have not been merged into main due to significant merge conflicts, and because the main branch is expected to gradually adopt CKTile instead.
  2. It preserves the allreduce optimizations in dev/perf, which are still under development and have not yet been completed or merged into main.

Test Plan

https://github.com/ROCm/rocActions/actions/workflows/vllm-benchmark-workflow.yaml

Test Result

yuguo68 and others added 30 commits January 13, 2026 12:39
* trying to fix: PyPI won't handle git dependencies

* fix format

* fix comments typo
* remove_iris_from_setup

* Update setup.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update

* update

* update

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* add one shot pa kernel

* fix buffer load in sliding window kernel

* fix typo

* revert

---------

Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Signed-off-by: Double Young <yang.yang2@amd.com>
/lgtm

The customer has tested the code; it works.

* topk uplift v1

* topk add api for choose topk_v1 or topk_v2

---------

Co-authored-by: yonshuai <yonshuai@amd.com>
Co-authored-by: yongshuai <yongshuai@amd.com>
* Remove the input parameter "out" in gemm_a4w4

* update

* format

---------

Co-authored-by: valarLip <Lingpeng.Jin@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* add fmoe co with tilesize 32x128

* add ps co

* fix pertoken co bug

* add co to csv

* add 128ntile logic for one stage asm

* fix mem fault during perf tuning

* en vs for pertoken kernel

---------

Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: zufayu <zufayu@amd.com>
* Introduce new grid config strategy for compatibility with cases where hdim is small.

* add launch bound to make sure that occupancy is always 8

* follow Copilot's suggestions
* enhance prebuild logic

* ATen.h build issues

* bug fix

* bug fix II

* bug fix III

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* QR cap implemented to limit QR to prefill

* test git config

* Fix to genericize qr comm cap

* Incorrect cap number
* open mla mtp and remove some logs

* fix qlen dense 128,N

* fix hint

* support sparse qlen input = 1

* change default splits
* add two fp4 tune shapes and tuned config

* change 32800 to 65536 to cover all cases between 32768 and 65536 as per feedback
* support moe a8w8 splitk  (#1654)

* Add support to a8w8_ck_moe_blk_gemm1 splitk

* add switch and add some logging

* tiny fix

* update ck 3rd party and add some logging

* add AITER_HEURISTIC_ONLY env

* update ck

* add condition to bypass tuned cfg

* change bypass type

* fix

* fix removed log

* update ck submodule

* fix lint

* force to run tests

---------

Co-authored-by: oscar <huaiguxu@amd.com>

* Zan/moe a8w4 (#1655)

* update

* update

* update quant

* ut ready

* update quant type

* compile pass

* python3 op_tests/test_moe_2stage.py -t 16 -e 1 -k 1 -dim 256,256 ready

* update aiter dispatcher for bf16&fp8

* support a16 a8 dispatch

* finish quant & sort

* update aiter framework for a8w4 moe

* update ck

* update

* update

* update for atom

* update

---------

Co-authored-by: Zzz9990 <Zzz9990>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* update ck

* fix dispatch

* fix too much logging

* update

* update ck

* update ck

* fix ruff code style

* revert aiter-test yaml

* fix ci

* fix ci

* fix ci

* add mocked tuned result and pad decoding cfg token count to next power of 2

* Update tuned_fmoe.csv

remove duplicate

* remove hack dtype

* fix black

* unique index

* add empty arg to ck_moe_stage1

* resolve bias into lru cache

* rename bypass cfg to AITER_BYPASS_TUNE_CONFIG

---------

Co-authored-by: oscar <huaiguxu@amd.com>
Co-authored-by: Zzz9990 <zanzhang@amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: felix <felix.li@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
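One commit above pads the decoding token count to the next power of two before the tuned-config lookup, so a single tuned entry covers every size in its bucket (e.g. the 65536 entry covers all cases between 32768 and 65536). A minimal sketch of that bucketing; the helper name is hypothetical, not a function from the aiter codebase:

```python
def next_power_of_2(n: int) -> int:
    # Round n up to the nearest power of two (requires n >= 1).
    return 1 << (n - 1).bit_length()

# Every token count in (32768, 65536] maps to the single tuned
# config entry keyed by 65536, so one row in the CSV covers the bucket.
assert next_power_of_2(32800) == 65536
assert next_power_of_2(65536) == 65536
assert next_power_of_2(1) == 1
```

The trade-off is the usual one for tuned-config tables: fewer entries to tune and store, at the cost of running a slightly over-provisioned config for sizes just above a power-of-two boundary.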
* bf16_gemm_clean_in_kl

* update

* update

* update

* update
* fix tuner

* Update gradlib/gradlib/GemmTuner.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Lin, Soga <soga.lin@amd.com>
Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com>
* fix llvm issue

* fix copilot
tenpercent and others added 28 commits January 14, 2026 03:10
* maybe fix build

* update ck submodule past 3359

* Update to ROCm/composable_kernel@e339101

---------

Co-authored-by: Ding, Yi <yi.ding@amd.com>
* add gen_fake for MLA RoPE operator

* fix code style

* sync logic in fake with actual function

* fix black error

* fix black error again

* remove in-place argument

* fix pytest of fused_kv_cache
* add kernel and config

* Update aiter/ops/triton/gemm_a16wfp4.py

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update op_tests/triton_tests/gemm/basic/test_gemm_a16wfp4.py

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* format

* black format

* clean

* fix

* update config

* fix api

* black format

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* add weight preshuffling for triton fp8 blockscale gemm

* add config interface

* add x_scale shuffle

* import

* add default config for gfx942

* fix get_config return

* fix

* Added tuned configs for gemm a8w8 blockscale preshuffled

* Fixed tuned configs keys

* resolve comments

* resolve comments

* Created a fused_kv_proj_cat kernel

* Created tests for the fused_kv_proj_cat kernel

* Renamed kernel

* Renamed R block size

* Ran black formatter

* UT comments

* move test file

* fix

* fix get_arch

* Implemented preshuffled GEMM + split + cat

* Ran black formatter

* Moved gemm to new folders

* Fixed merge

* Added transpose_scale parameter

* Added tests for fused reduce rms quant with transpose_scale

* Use ck from main

* Updated imports to follow dir structure

---------

Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
* add weight preshuffling for triton fp8 blockscale gemm

* add config interface

* add x_scale shuffle

* import

* add default config for gfx942

* fix get_config return

* fix

* Added tuned configs for gemm a8w8 blockscale preshuffled

* Fixed tuned configs keys

* resolve comments

* resolve comments

* update config

* add kernel, add config, standard return_y_pp flag

* update config

* fix

* update config and UT

---------

Co-authored-by: Farel Lukas <farlukas@amd.com>
* mv fmoe tune to csrc/ck_gemm_moe_2stages_codegen

* rename fmoe tune.py to gemm_moe_tune.py

* update tune readme for more usages

* update splitK info and fix splitK error

* Apply suggestions from code review

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix splitK error in cktile tune and clean mp_tuner log

* fix tune hang when get result error

* fix mp_tuner error

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* add a8w8 fp8 tune support

* add q_dtype_w to deal with different type and refine config csv file

---------

Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
* fix fuse_qk_rope_concat_and_cache_mla in rocm-6.4.1

* update

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update format

* revert set interface change

* use gmem in opus.h to replace ck_tile::buffer_view

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* bug fix

* fix flag

* update

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
…on (#1787)

* Support profiler pa and reduce kernel and remove p99 method to use test_common

* format code

* add more info for pa

* reformat

* add bs info

* use torch schedule to warm up

* remove some feature in csv

* refactor

* test(paps): enhance paps ut

Signed-off-by: Double Young <yang.yang2@amd.com>

* fix(paps): fix format

* fix(paps): fix format

* fix(paps): fix format

---------

Signed-off-by: Double Young <yang.yang2@amd.com>
Co-authored-by: Double Young <yang.yang2@amd.com>
* Implement a4w4 moe kernel

* tune testcase for a4w4 based on deepseek r1 shapes

* refactor activation quant to use deepseek fp4 quant

* skip a4w4 unit tests on MI300

* Add layer1/layer2 suffix for easier profiling

* Add --num-weight-inits flag to average MoE benchmark results
---------
* Add a8w8 blockscale MoE

* Add XCD swizzle to a8w8 blockscale

* Tune num_xcds for a8w8 blockscale

* Fix skewed logit routing

* add layer1/layer2 suffixes to kernel names for easier profiling

* Add --num-weight-inits flag to average MoE benchmark results
…overhead", fullgraph=True) (#1794)

* add log and mutex for test

* add thread_local

* value capture args

* fix: explicit assignment forces evaluation order and prevents the compiler from reordering operations

* delete some logs
… in Batch Prefill kernel (#1754)

* add page size 16 to test and op

* add num_total_pages to kernel parameter

* add is_sglang parameter

* change is_sglang to is_sglang_layout

* kv last page size=16 pass

* pass kv_last_page_lens to kernel

* add parameters check before calling kernel

* change kv layout to [page_num, page_size, nhead, hdim]

* adopt the changes of struct fmha_fwd_batch_prefill_traits

* change kv cache memory layout to [num_blocks, num_kv_heads, head_size/8, block_size, 8], [num_blocks, num_kv_heads, block_size/8, head_size, 8]

* [FMHA] Integrate vLLM block table support and enforce vectorized KV layout

Updated `mha_batch_prefill` API and tests to support vLLM-style block tables alongside SGLang-style page tables, while enforcing the new hardware-optimized 5D vectorized KV cache layout.

**Key Changes:**
*   **API**: Added `block_table` and `seqlen_k` arguments to python/C++ interfaces.
*   **Layout Enforcement**: Added strict checks for 5D vectorized KV layout (swizzled x=8) in host bindings and python wrappers.
*   **CodeGen**: Automatically select `VLLM_BLOCK_TABLE_2D` or `SGLANG_PAGE_TABLE_1D` trait based on input arguments.
*   **Tests**: Added `test_batch_prefill_vllm` to verify block table correctness and updated existing tests to use the vectorized layout.

* update CK

* update ck

* adopt api changes from fmha_batch_prefill_traits

* add support for linear kv cache layout

* update api

* Refactor the test code by gathering the different test functions into one

* update ck

* update ck

* Add profile measurements for batch prefill function

* update ck

* fix style

* fix style

* [FMHA] Support 3D linear layout (page_size=1) and non-contiguous KV tensors in batch prefill

- Enable 3D [N, H, D] K/V tensors for batch prefill, treating as linear layout with page_size=1.
- Relax contiguity checks to only require the last dimension to be contiguous.
- Update C++ stride calculations for 3D, 4D, and 5D layouts.
- Add tests for 3D layout and non-contiguous KV cache.

* update ck

---------

Co-authored-by: ltqin <letaoqin@amd.com>
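The 5D vectorized KV cache layout enforced in the batch-prefill change above ([num_blocks, num_kv_heads, head_size/8, block_size, 8] for K, with x=8 contiguous elements swizzled innermost) can be illustrated as a permute/reshape of a plain paged cache. This is an illustrative sketch of the layout's index mapping, not the kernel's actual packing code:

```python
import torch

num_blocks, block_size, num_kv_heads, head_size, x = 4, 16, 2, 64, 8

# Plain paged K cache: [num_blocks, block_size, num_kv_heads, head_size]
k = torch.randn(num_blocks, block_size, num_kv_heads, head_size)

# Vectorized layout: [num_blocks, num_kv_heads, head_size // x, block_size, x]
k_vec = (
    k.permute(0, 2, 3, 1)  # -> [blocks, heads, hdim, block]
     .reshape(num_blocks, num_kv_heads, head_size // x, x, block_size)
     .permute(0, 1, 2, 4, 3)  # move the x-group innermost
     .contiguous()
)

# Element (block b, token t, head h, dim d) of the plain cache lives at
# k_vec[b, h, d // x, t, d % x] in the vectorized cache.
b, t, h, d = 1, 5, 0, 37
assert torch.equal(k_vec[b, h, d // x, t, d % x], k[b, t, h, d])
```

Packing x=8 dimension elements contiguously lets the kernel issue one vectorized load per token per dimension group, which is why the host bindings add strict checks that incoming caches are already in this shape.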
* Standardize pattern of GMM kernel config file

* Refactor `get_gemm_config` calls to pass hardcoded strings as config names

* Add Python script to select which Triton tests to run

* Write tests to run to environment file

* Remove timestamps from logging messages

GitHub already has a "show timestamps" feature, so logging timestamps
isn't adding anything.

* [CI] Run test selection script on CI job

* Add benchmarks and test selection script to Triton paths filter.
* Install NetworkX dependency of test selection script.
* Fetch target branch from remote.
* Run test selection script.

* Comment out writing to `GITHUB_ENV` file

The 1st stage of test selection script is dry-run only. We'll evaluate
its correctness in the wild over time and, later on, fully enable it.
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
@zhuyuhua-v zhuyuhua-v merged commit 044604f into dev/perf Jan 16, 2026
@zhuyuhua-v zhuyuhua-v deleted the yuhua/dev/perf branch January 16, 2026 06:03