
[dev/perf][Update] update dev/perf branch to main branch commit 1ec04f7#1838

Merged
zhuyuhua-v merged 122 commits into dev/perf from
yuhua/dev/perf
Jan 16, 2026
Conversation

@zhuyuhua-v
Contributor

Motivation

[dev/perf][Update] update dev/perf branch to main branch commit 1ec04f7

Author: azaidy
Date: Tue Jan 13 21:12:51 2026 -0500
Fix INT4 QR TP8 boundary condition (#1834)

Technical Details

This update differs from the main branch in two key aspects:

  1. It retains the Alibaba-specific MoE modifications in dev/perf. These have not been merged into main due to significant merge conflicts, and because the main branch is expected to gradually adopt CKTile instead.
  2. It preserves the allreduce optimizations in dev/perf, which are still under development and have not yet been completed or merged into main.

Test Plan

https://github.com/ROCm/rocActions/actions/workflows/vllm-benchmark-workflow.yaml

Test Result

yuguo68 and others added 30 commits January 13, 2026 12:39
* trying to fix: PyPI won't handle git dependencies

* fix format

* fix comments typo
* remove_iris_from_setup

* Update setup.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update

* update

* update

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* add one shot pa kernel

* fix buffer load in sliding window kernel

* fix typo

* revert

---------

Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Signed-off-by: Double Young <yang.yang2@amd.com>
/lgtm

The customer has tested the code; it works.

* topk uplift v1

* topk add api for choose topk_v1 or topk_v2

---------

Co-authored-by: yonshuai <yonshuai@amd.com>
Co-authored-by: yongshuai <yongshuai@amd.com>
* Remove the input parameter "out" in gemm_a4w4

* update

* format

---------

Co-authored-by: valarLip <Lingpeng.Jin@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* add fmoe co with tilesize 32x128

* add ps co

* fix pertoken co bug

* add co to csv

* add 128ntile logic for one stage asm

* fix mem fault during perf tuning

* en vs for pertoken kernel

---------

Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: zufayu <zufayu@amd.com>
* Introduce new grid config strategy for compatibility with cases where hdim is small.

* add launch bound to make sure that occupancy is always 8

* follow Copilot's suggestions
* enhance prebuild logic

* ATen.h build issues

* bug fix

* bug fix II

* bug fix III

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* QR cap implemented to limit QR to prefill

* test git config

* Fix to genericize qr comm cap

* Incorrect cap number
* open mla mtp and remove some logs

* fix qlen dense 128,N

* fix hint

* support sparse qlen input = 1

* change default splits
* add two fp4 tune shapes and tuned config

* change 32800 to 65536 to cover all cases between 32768 and 65536 as per feedback
* support moe a8w8 splitk  (#1654)

* Add support to a8w8_ck_moe_blk_gemm1 splitk

* add switch and add some logging

* tiny fix

* update ck 3rd party and add some logging

* add AITER_HEURISTIC_ONLY env

* update ck

* add condition to bypass tuned cfg

* change bypass type

* fix

* fix removed log

* update ck submodule

* fix lint

* force to run tests

---------

Co-authored-by: oscar <huaiguxu@amd.com>

* Zan/moe a8w4 (#1655)

* update

* update

* update quant

* ut ready

* update quant type

* compile pass

* python3 op_tests/test_moe_2stage.py -t 16 -e 1 -k 1 -dim 256,256 ready

* update aiter dispatcher for bf16&fp8

* support a16 a8 dispatch

* finish quant & sort

* update aiter framework for a8w4 moe

* update ck

* update

* update

* update for atom

* update

---------

Co-authored-by: Zzz9990 <Zzz9990>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* update ck

* fix dispatch

* fix too much logging

* update

* update ck

* update ck

* fix ruff code style

* revert aiter-test yaml

* fix ci

* fix ci

* fix ci

* add mocked tuned result and pad decoding cfg token count to next power of 2

* Update tuned_fmoe.csv

remove duplicate

* remove hack dtype

* fix black

* unique index

* add empty arg to ck_moe_stage1

* resolve bias into lru cache

* rename bypass cfg to AITER_BYPASS_TUNE_CONFIG

---------

Co-authored-by: oscar <huaiguxu@amd.com>
Co-authored-by: Zzz9990 <zanzhang@amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: felix <felix.li@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
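One commit above pads the decoding token count to the next power of two before the tuned-config lookup, so a single tuned entry covers every size in its bucket (e.g. the 65536 entry covers all cases between 32768 and 65536). A minimal sketch of that bucketing; the helper name is hypothetical, not a function from the aiter codebase:

```python
def next_power_of_2(n: int) -> int:
    # Round n up to the nearest power of two (requires n >= 1).
    return 1 << (n - 1).bit_length()

# Every token count in (32768, 65536] maps to the single tuned
# config entry keyed by 65536, so one row in the CSV covers the bucket.
assert next_power_of_2(32800) == 65536
assert next_power_of_2(65536) == 65536
assert next_power_of_2(1) == 1
```

The trade-off is the usual one for tuned-config tables: fewer entries to tune and store, at the cost of running a slightly over-provisioned config for sizes just above a power-of-two boundary.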
* bf16_gemm_clean_in_kl

* update

* update

* update

* update
* fix tuner

* Update gradlib/gradlib/GemmTuner.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Lin, Soga <soga.lin@amd.com>
Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com>
* fix llvm issue

* fix copilot
tenpercent and others added 28 commits January 14, 2026 03:10
* maybe fix build

* update ck submodule past 3359

* Update to ROCm/composable_kernel@e339101

---------

Co-authored-by: Ding, Yi <yi.ding@amd.com>
* add gen_fake for MLA RoPE operator

* fix code style

* sync logic in fake with actual function

* fix black error

* fix black error again

* remove in-place argument

* fix pytest of fused_kv_cache
* add kernel and config

* Update aiter/ops/triton/gemm_a16wfp4.py

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update op_tests/triton_tests/gemm/basic/test_gemm_a16wfp4.py

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* format

* black format

* clean

* fix

* update config

* fix api

* black format

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* add weight preshuffling for triton fp8 blockscale gemm

* add config interface

* add x_scale shuffle

* import

* add default config for gfx942

* fix get_config return

* fix

* Added tuned configs for gemm a8w8 blockscale preshuffled

* Fixed tuned configs keys

* resolve comments

* resolve comments

* Created a fused_kv_proj_cat kernel

* Created tests for the fused_kv_proj_cat kernel

* Renamed kernel

* Renamed R block size

* Ran black formatter

* UT comments

* move test file

* fix

* fix get_arch

* Implemented preshuffled GEMM + split + cat

* Ran black formatter

* Moved gemm to new folders

* Fixed merge

* Added transpose_scale parameter

* Added tests for fused reduce rms quant with transpose_scale

* Use ck from main

* Updated imports to follow dir structure

---------

Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
* add weight preshuffling for triton fp8 blockscale gemm

* add config interface

* add x_scale shuffle

* import

* add default config for gfx942

* fix get_config return

* fix

* Added tuned configs for gemm a8w8 blockscale preshuffled

* Fixed tuned configs keys

* resolve comments

* resolve comments

* update config

* add kernel, add config, standard return_y_pp flag

* update config

* fix

* update config and UT

---------

Co-authored-by: Farel Lukas <farlukas@amd.com>
* mv fmoe tune to csrc/ck_gemm_moe_2stages_codegen

* rename fmoe tune.py to gemm_moe_tune.py

* update tune readme for more usages

* update splitK info and fix splitK error

* Apply suggestions from code review

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix splitK error in cktile tune and clean mp_tuner log

* fix tune hang when get result error

* fix mp_tuner error

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* add a8w8 fp8 tune support

* add q_dtype_w to deal with different type and refine config csv file

---------

Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
* fix fuse_qk_rope_concat_and_cache_mla in rocm-6.4.1

* update

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update format

* revert set interface change

* use gmem in opus.h to replace ck_tile::buffer_view

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* bug fix

* fix flag

* update

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
…on (#1787)

* Support profiler pa and reduce kernel and remove p99 method to use test_common

* format code

* add more info for pa

* reformat

* add bs info

* use torch schedule to warm up

* remove some feature in csv

* refactor

* test(paps): enhance paps ut

Signed-off-by: Double Young <yang.yang2@amd.com>

* fix(paps): fix format

* fix(paps): fix format

* fix(paps): fix format

---------

Signed-off-by: Double Young <yang.yang2@amd.com>
Co-authored-by: Double Young <yang.yang2@amd.com>
* Implement a4w4 moe kernel

* tune testcase for a4w4 based on deepseek r1 shapes

* refactor activation quant to use deepseek fp4 quant

* skip a4w4 unit tests on MI300

* Add layer1/layer2 suffix for easier profiling

* Add --num-weight-inits flag to average MoE benchmark results
---------
* Add a8w8 blockscale MoE

* Add XCD swizzle to a8w8 blockscale

* Tune num_xcds for a8w8 blockscale

* Fix skewed logit routing

* add layer1/layer2 suffixes to kernel names for easier profiling

* Add --num-weight-inits flag to average MoE benchmark results
…overhead", fullgraph=True) (#1794)

* add log and mutex for test

* add thread_local

* value capture args

* fix: explicit assignment forces evaluation order and prevents the compiler from reordering operations

* delete some logs
… in Batch Prefill kernel (#1754)

* add page size 16 to test and op

* add num_total_pages to kernel parameter

* add is_sglang parameter

* change is_sglang to is_sglang_layout

* kv last page size=16 pass

* pass kv_last_page_lens to kernel

* add parameters check before calling kernel

* change kv layout to [page_num, page_size, nhead, hdim]

* adopt the changes of struct fmha_fwd_batch_prefill_traits

* change kv cache memory layout to [num_blocks, num_kv_heads, head_size/8, block_size, 8], [num_blocks, num_kv_heads, block_size/8, head_size, 8]

* [FMHA] Integrate vLLM block table support and enforce vectorized KV layout

Updated `mha_batch_prefill` API and tests to support vLLM-style block tables alongside SGLang-style page tables, while enforcing the new hardware-optimized 5D vectorized KV cache layout.

**Key Changes:**
*   **API**: Added `block_table` and `seqlen_k` arguments to python/C++ interfaces.
*   **Layout Enforcement**: Added strict checks for 5D vectorized KV layout (swizzled x=8) in host bindings and python wrappers.
*   **CodeGen**: Automatically select `VLLM_BLOCK_TABLE_2D` or `SGLANG_PAGE_TABLE_1D` trait based on input arguments.
*   **Tests**: Added `test_batch_prefill_vllm` to verify block table correctness and updated existing tests to use the vectorized layout.

* update CK

* update ck

* adopt api changes from fmha_batch_prefill_traits

* add support for linear kv cache layout

* update api

* Refactor the test code by gathering the different test functions into one

* update ck

* update ck

* Add profile measurements for batch prefill function

* update ck

* fix style

* fix style

* [FMHA] Support 3D linear layout (page_size=1) and non-contiguous KV tensors in batch prefill

- Enable 3D [N, H, D] K/V tensors for batch prefill, treating as linear layout with page_size=1.
- Relax contiguity checks to only require the last dimension to be contiguous.
- Update C++ stride calculations for 3D, 4D, and 5D layouts.
- Add tests for 3D layout and non-contiguous KV cache.

* update ck

---------

Co-authored-by: ltqin <letaoqin@amd.com>
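The 5D vectorized KV cache layout enforced in the batch-prefill change above ([num_blocks, num_kv_heads, head_size/8, block_size, 8] for K, with x=8 contiguous elements swizzled innermost) can be illustrated as a permute/reshape of a plain paged cache. This is an illustrative sketch of the layout's index mapping, not the kernel's actual packing code:

```python
import torch

num_blocks, block_size, num_kv_heads, head_size, x = 4, 16, 2, 64, 8

# Plain paged K cache: [num_blocks, block_size, num_kv_heads, head_size]
k = torch.randn(num_blocks, block_size, num_kv_heads, head_size)

# Vectorized layout: [num_blocks, num_kv_heads, head_size // x, block_size, x]
k_vec = (
    k.permute(0, 2, 3, 1)  # -> [blocks, heads, hdim, block]
     .reshape(num_blocks, num_kv_heads, head_size // x, x, block_size)
     .permute(0, 1, 2, 4, 3)  # move the x-group innermost
     .contiguous()
)

# Element (block b, token t, head h, dim d) of the plain cache lives at
# k_vec[b, h, d // x, t, d % x] in the vectorized cache.
b, t, h, d = 1, 5, 0, 37
assert torch.equal(k_vec[b, h, d // x, t, d % x], k[b, t, h, d])
```

Packing x=8 dimension elements contiguously lets the kernel issue one vectorized load per token per dimension group, which is why the host bindings add strict checks that incoming caches are already in this shape.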
* Standardize pattern of GMM kernel config file

* Refactor `get_gemm_config` calls to pass hardcoded strings as config names

* Add Python script to select which Triton tests to run

* Write tests to run to environment file

* Remove timestamps from logging messages

GitHub already has a "show timestamps" feature, so logging timestamps
isn't adding anything.

* [CI] Run test selection script on CI job

* Add benchmarks and test selection script to Triton paths filter.
* Install NetworkX dependency of test selection script.
* Fetch target branch from remote.
* Run test selection script.

* Comment out writing to `GITHUB_ENV` file

The 1st stage of test selection script is dry-run only. We'll evaluate
its correctness in the wild over time and, later on, fully enable it.
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
@zhuyuhua-v zhuyuhua-v merged commit 044604f into dev/perf Jan 16, 2026
@zhuyuhua-v zhuyuhua-v deleted the yuhua/dev/perf branch January 16, 2026 06:03