[dev/perf][Update] update dev/perf branch to main branch commit 1ec04f7 (#1838)
Merged
zhuyuhua-v merged 122 commits into dev/perf · Jan 16, 2026
Conversation
* trying to fix pypi not handling git dependencies
* fix format
* fix typo in comments
* remove_iris_from_setup
* Update setup.py
* update

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* add one shot pa kernel * fix buffer load in sliding window kernel * fix typo * revert --------- Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Signed-off-by: Double Young <yang.yang2@amd.com>
/lgtm The customer has tested the code and it works.
* topk uplift v1
* topk: add API to choose topk_v1 or topk_v2

Co-authored-by: yonshuai <yonshuai@amd.com>
Co-authored-by: yongshuai <yongshuai@amd.com>
* Remove the input parameter "out" in gemm_a4w4 * update * format --------- Co-authored-by: valarLip <Lingpeng.Jin@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* add fmoe co with tilesize 32x128
* add ps co
* fix pertoken co bug
* add co to csv
* add 128-ntile logic for one-stage asm
* fix mem fault during perf tuning
* en vs for pertoken kernel

Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: zufayu <zufayu@amd.com>
* Introduce a new grid config strategy for compatibility with cases where hdim is small
* add launch bound to make sure that occupancy is always 8
* follow the Copilot suggestions
* enhance prebuild logic * ATen.h build issues * bug fix * bug fix II * bug fix III --------- Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* Implement QR cap to limit QR to prefill
* test git config
* Fix to genericize QR comm cap
* Fix incorrect cap number
* open mla mtp and remove some logs * fix qlen dense 128,N * fix hint * support sparse qlen input = 1 * change default splits
* add two fp4 tune shapes and tuned config
* change 32800 to 65536 to cover all cases between 32768 and 65536, per review feedback
* support moe a8w8 splitk (#1654)
  * Add splitk support to a8w8_ck_moe_blk_gemm1
  * add switch and some logging
  * update ck 3rd party and add some logging
  * add AITER_HEURISTIC_ONLY env
  * add condition to bypass tuned cfg; change bypass type
  * fix removed log
  * update ck submodule
  * fix lint
  * force to run tests
* Zan/moe a8w4 (#1655)
  * update quant and quant type
  * ut ready; compile pass
  * python3 op_tests/test_moe_2stage.py -t 16 -e 1 -k 1 -dim 256,256 ready
  * update aiter dispatcher for bf16 & fp8
  * support a16/a8 dispatch
  * finish quant & sort
  * update aiter framework for a8w4 moe
  * update ck
  * update for atom
* update ck
* fix dispatch
* fix too much logging
* fix ruff code style
* revert aiter-test yaml
* fix ci
* add mocked tuned result; round decoding cfg token to the next power of 2
* Update tuned_fmoe.csv to remove a duplicate
* remove hack dtype
* fix black
* unique index
* add empty arg to ck_moe_stage1
* resolve bias into lru cache
* rename bypass cfg to AITER_BYPASS_TUNE_CONFIG

Co-authored-by: oscar <huaiguxu@amd.com>
Co-authored-by: Zzz9990 <zanzhang@amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: felix <felix.li@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
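One item above rounds the decoding token count up to the next power of two before looking up a tuned config, so nearby token counts share one tuned entry instead of each needing its own. A minimal sketch of that rounding (the function name is illustrative, not the aiter API):

```python
def next_power_of_2(n: int) -> int:
    """Round n up to the nearest power of two (n >= 1)."""
    # (n - 1).bit_length() gives the exponent of the next power of two for n > 1
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

# Token counts 5..8 all map to the same tuned-config bucket.
print([next_power_of_2(n) for n in [1, 3, 5, 8, 1000]])  # [1, 4, 8, 8, 1024]
```

Bucketing by powers of two keeps the tuned-config table small while guaranteeing every token count falls into some tuned entry.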
* bf16_gemm_clean_in_kl * update * update * update * update
* fix tuner * Update gradlib/gradlib/GemmTuner.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Lin, Soga <soga.lin@amd.com> Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com>
* fix llvm issue * fix copilot
* maybe fix build * update ck submodule past 3359 * Update to ROCm/composable_kernel@e339101 --------- Co-authored-by: Ding, Yi <yi.ding@amd.com>
* add gen_fake for MLA RoPE operator
* fix code style
* sync fake logic with the actual function
* fix black error
* remove in-place argument
* fix pytest of fused_kv_cache
* add kernel and config
* Update aiter/ops/triton/gemm_a16wfp4.py
* Update op_tests/triton_tests/gemm/basic/test_gemm_a16wfp4.py
* format; black format
* clean
* update config
* fix api

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* add weight preshuffling for triton fp8 blockscale gemm
* add config interface
* add x_scale shuffle
* add default config for gfx942
* fix get_config return
* Added tuned configs for gemm a8w8 blockscale preshuffled; fixed tuned config keys
* resolve comments
* Created a fused_kv_proj_cat kernel and its tests; renamed kernel and R block size
* Ran black formatter
* UT comments; move test file
* fix get_arch
* Implemented preshuffled GEMM + split + cat
* Moved gemm to new folders; fixed merge
* Added transpose_scale parameter and tests for fused reduce rms quant with transpose_scale
* Use ck from main
* Updated imports to follow dir structure

Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
* add weight preshuffling for triton fp8 blockscale gemm
* add config interface
* add x_scale shuffle
* add default config for gfx942
* fix get_config return
* Added tuned configs for gemm a8w8 blockscale preshuffled; fixed tuned config keys
* resolve comments
* add kernel, add config, standard return_y_pp flag
* update config and UT

Co-authored-by: Farel Lukas <farlukas@amd.com>
* mv fmoe tune to csrc/ck_gemm_moe_2stages_codegen
* rename fmoe tune.py to gemm_moe_tune.py
* update tune readme for more usages
* update splitK info and fix splitK error
* Apply suggestions from code review
* fix splitK error in cktile tune and clean mp_tuner log
* fix tune hang when getting a result errors
* fix mp_tuner error

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* add a8w8 fp8 tune support * add q_dtype_w to deal with different type and refine config csv file --------- Co-authored-by: solin <bingzhou@amd.com> Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
* fix fuse_qk_rope_concat_and_cache_mla in rocm-6.4.1 * update * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * update format * revert set interface change * use gmem in opus.h to replace ck_tile::buffer_view --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* bug fix * fix flag * update --------- Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
…on (#1787)
* Support profiler for pa and reduce kernels; remove p99 method in favor of test_common
* format code
* add more info for pa
* add bs info
* use torch schedule to warm up
* remove some features from csv
* refactor
* test(paps): enhance paps ut
* fix(paps): fix format

Signed-off-by: Double Young <yang.yang2@amd.com>
Co-authored-by: Double Young <yang.yang2@amd.com>
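The profiler change above drops the bespoke p99 method in favor of the shared test_common utilities. For reference, a nearest-rank percentile over per-iteration latencies can be sketched like this (names are illustrative, not the aiter test_common API):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct in (0, 100]."""
    s = sorted(samples)
    # ceil avoids rounding-mode ambiguity at exact .5 ranks
    rank = max(1, math.ceil(pct / 100 * len(s)))
    return s[rank - 1]

# Hypothetical per-iteration kernel latencies in microseconds.
lat_us = [10.0, 12.0, 11.0, 50.0, 10.5]
print(percentile(lat_us, 99), percentile(lat_us, 50))  # 50.0 11.0
```

Reporting p99 rather than the mean surfaces the occasional slow iteration that averaging would hide, which is why profilers typically track both.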
* Implement a4w4 moe kernel * tune testcase for a4w4 based on deepseek r1 shapes * refactor activation quant to use deepseek fp4 quant * skip a4w4 unit tests on MI300 * Add layer1/layer2 suffix for easier profiling * Add --num-weight-inits flag to average MoE benchmark results ---------
* Add a8w8 blockscale MoE * Add XCD swizzle to a8w8 blockscale * Tune num_xcds for a8w8 blockscale * Fix skewed logit routing * add layer1/layer2 suffixes to kernel names for easier profiling * Add --num-weight-inits flag to average MoE benchmark results
…overhead", fullgraph=True) (#1794)
* add log and mutex for test
* add thread_local
* capture args by value
* explicit assignment forces evaluation order and prevents the compiler from reordering operations
* delete some logs
… in Batch Prefill kernel (#1754)
* add page size 16 to test and op
* add num_total_pages to kernel parameter
* add is_sglang parameter; later renamed to is_sglang_layout
* kv last page size = 16 passes
* pass kv_last_page_lens to kernel
* add parameter checks before calling the kernel
* change kv layout to [page_num, page_size, nhead, hdim]
* adopt the changes of struct fmha_fwd_batch_prefill_traits
* change kv cache memory layout to [num_blocks, num_kv_heads, head_size/8, block_size, 8] and [num_blocks, num_kv_heads, block_size/8, head_size, 8]
* [FMHA] Integrate vLLM block table support and enforce vectorized KV layout. Updated `mha_batch_prefill` API and tests to support vLLM-style block tables alongside SGLang-style page tables, while enforcing the new hardware-optimized 5D vectorized KV cache layout. Key changes:
  * **API**: Added `block_table` and `seqlen_k` arguments to Python/C++ interfaces.
  * **Layout Enforcement**: Added strict checks for the 5D vectorized KV layout (swizzled x=8) in host bindings and Python wrappers.
  * **CodeGen**: Automatically select `VLLM_BLOCK_TABLE_2D` or `SGLANG_PAGE_TABLE_1D` trait based on input arguments.
  * **Tests**: Added `test_batch_prefill_vllm` to verify block table correctness and updated existing tests to use the vectorized layout.
* adopt api changes from fmha_batch_prefill_traits
* add support for linear kv cache layout
* update api
* Refactor the test code by gathering the different test functions into one
* update ck
* Add profile measurements for the batch prefill function
* fix style
* [FMHA] Support 3D linear layout (page_size=1) and non-contiguous KV tensors in batch prefill:
  * Enable 3D [N, H, D] K/V tensors for batch prefill, treated as a linear layout with page_size=1.
  * Relax contiguity checks to only require the last dimension to be contiguous.
  * Update C++ stride calculations for 3D, 4D, and 5D layouts.
  * Add tests for 3D layout and non-contiguous KV cache.
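The 5D "vectorized" KV cache layout above groups the head dimension into vectors of x=8 so that the 8 elements of one vector sit contiguously in memory. A small sketch of the index math implied by that layout (pure illustration; the real checks live in the C++ host bindings):

```python
def vec_index(block: int, head: int, pos: int, d: int,
              num_kv_heads: int, head_size: int, block_size: int,
              x: int = 8) -> int:
    """Flat offset of logical element (block, head, pos, d) in the
    [num_blocks, num_kv_heads, head_size // x, block_size, x] swizzled cache."""
    strides = (
        num_kv_heads * (head_size // x) * block_size * x,  # block
        (head_size // x) * block_size * x,                 # head
        block_size * x,                                    # d // x (vector group)
        x,                                                 # pos within block
        1,                                                 # d %  x (within vector)
    )
    idx = (block, head, d // x, pos, d % x)
    return sum(i * s for i, s in zip(idx, strides))

# With head_size=16, block_size=16: d = 0..7 of one position are contiguous,
# while d = 8 jumps to the next vector group (offset block_size * 8 = 128).
args = dict(num_kv_heads=2, head_size=16, block_size=16)
print(vec_index(0, 0, 0, 0, **args),
      vec_index(0, 0, 0, 7, **args),
      vec_index(0, 0, 0, 8, **args))  # 0 7 128
```

The point of the swizzle is that an 8-wide vector load fetches one aligned group of head-dim elements, which a flat [page, pos, head, hdim] layout cannot guarantee.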
* update ck --------- Co-authored-by: ltqin <letaoqin@amd.com>
* Standardize pattern of GMM kernel config file
* Refactor `get_gemm_config` calls to pass hardcoded strings as config names
* Add Python script to select which Triton tests to run
* Write tests-to-run to environment file
* Remove timestamps from logging messages (GitHub already has a "show timestamps" feature, so logging timestamps adds nothing)
* [CI] Run test selection script on CI job:
  * Add benchmarks and test selection script to Triton paths filter
  * Install NetworkX dependency of test selection script
  * Fetch target branch from remote
  * Run test selection script
* Comment out writing to `GITHUB_ENV` file: the first stage of the test selection script is dry-run only; we'll evaluate its correctness in the wild over time and fully enable it later
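The test selection script above picks which Triton tests to run from the files changed on the branch. The core idea, sketched in plain Python (the real script uses NetworkX; file names and the `test_` prefix convention here are purely illustrative):

```python
def reachable(graph: dict[str, list[str]], start: str) -> set[str]:
    """All files transitively imported by `start` (iterative DFS)."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

def select_tests(graph: dict[str, list[str]], changed: set[str]) -> set[str]:
    """Keep only tests that changed themselves or reach a changed file."""
    return {t for t in graph
            if t.startswith("test_")
            and (t in changed or reachable(graph, t) & changed)}

# Toy import graph: edges point from a file to the files it imports.
deps = {
    "test_gemm.py": ["gemm.py"],
    "gemm.py": ["utils.py"],
    "test_moe.py": ["moe.py"],
}
print(select_tests(deps, {"utils.py"}))  # {'test_gemm.py'}
```

Only `test_gemm.py` transitively imports `utils.py`, so a change to `utils.py` skips `test_moe.py` entirely, which is what makes selection cheaper than running the full suite.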
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
Motivation
[dev/perf][Update] update dev/perf branch to main branch commit 1ec04f7
Author: azaidy
Date: Tue Jan 13 21:12:51 2026 -0500
Fix INT4 QR TP8 boundary condition (#1834)
Technical Details
This update differs from the main branch in two key aspects:
* Changes to moe: these have not been merged into main due to significant merge conflicts, and because the main branch is expected to gradually adopt CKTile instead.
* Changes to allreduce: these are still under development and have not yet been fully completed or merged into main.

Test Plan
https://github.com/ROCm/rocActions/actions/workflows/vllm-benchmark-workflow.yaml
Test Result