Add `qknorm+rope` fused kernel by xytpai · Pull Request #1590 · ROCm/aiter

xytpai · 2025-12-09T05:59:14Z

Motivation

This kernel targets Qwen3 models. The implementation is straightforward because we already have the `mrope_3d` HIP kernel template.
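For readers unfamiliar with the operation being fused, here is a minimal unfused reference sketch in plain Python. It is not the aiter kernel or its API: the function names, the NeoX-style (rotate-half) pairing, `eps=1e-6`, and `base=10000.0` are all assumptions chosen for illustration. It applies RMSNorm to each Q/K head vector and then rotary position embedding, which are the two steps a fused kernel would perform in a single pass.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one head vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rope_neox(x, position, base=10000.0):
    """Rotate-half RoPE (assumed layout): element i is paired with i + head_dim/2."""
    head_dim = len(x)
    half = head_dim // 2
    out = [0.0] * head_dim
    for i in range(half):
        # Rotation angle for this pair at the given token position.
        theta = position * base ** (-2.0 * i / head_dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


def qknorm_rope_reference(q_heads, k_heads, q_weight, k_weight, position):
    """Hypothetical unfused reference: per-head RMSNorm, then RoPE, for Q and K."""
    q_out = [rope_neox(rms_norm(h, q_weight), position) for h in q_heads]
    k_out = [rope_neox(rms_norm(h, k_weight), position) for h in k_heads]
    return q_out, k_out
```

In the unfused form each step reads and writes the full Q and K tensors, so fusing them mainly saves memory traffic and kernel launches.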

Test Result

4x speedup compared with the native implementation.

zhuyuhua-v · 2025-12-09T09:49:22Z

Test Plan

# server:
MODEL=/data/pretrained-models/Qwen3-235B-A22B-Instruct-2507-FP8/
rm -rf /root/.cache/atom/

python -m atom.entrypoints.openai_server --model ${MODEL} -tp 8 --kv_cache_dtype fp8 --enable-expert-parallel

# client:
MODEL=Qwen3-235B-A22B-Instruct-2507-FP8/
ISL=1000
OSL=1000
CONC=128
PORT=8000
RESULT_FILENAME="qwen3_235b_a22b_instrct_2507_FP8_isl${ISL}_osl${OSL}_conc${CONC}_infrrate"
# Remember to use scripts in this repo!
git clone https://github.com/kimbochen/bench_serving.git
python bench_serving/benchmark_serving.py \
--model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
--dataset-name=random \
--random-input-len=$ISL --random-output-len=$OSL \
--random-range-ratio 1 \
--num-prompts=$(( $CONC * 2)) \
--max-concurrency=$CONC \
--request-rate=inf --ignore-eos \
--save-result --percentile-metrics="ttft,tpot,itl,e2el" \
--result-dir=./ --result-filename=$RESULT_FILENAME.json

# accuracy:
lm_eval --model local-completions \
        --model_args model=${MODEL},base_url=http://localhost:8000/v1/completions,num_concurrent=128,max_retries=3,tokenized_requests=False \
        --tasks gsm8k \
        --num_fewshot 3

Test Result

Performance without this PR: 11290.90 tok/s total token throughput

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  45.35     
Total input tokens:                      256000    
Total generated tokens:                  256000    
Request throughput (req/s):              5.65      
Output token throughput (tok/s):         5645.45   
Total Token throughput (tok/s):          11290.90  
---------------Time to First Token----------------
Mean TTFT (ms):                          1858.31   
Median TTFT (ms):                        1775.52   
P99 TTFT (ms):                           3039.45   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.81     
Median TPOT (ms):                        20.79     
P99 TPOT (ms):                           22.32     
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.79     
Median ITL (ms):                         19.20     
P99 ITL (ms):                            25.13     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          22650.63  
Median E2EL (ms):                        22647.60  
P99 E2EL (ms):                           22768.93  
==================================================

Performance with this PR: 11805.87 tok/s total token throughput

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  43.37     
Total input tokens:                      256000    
Total generated tokens:                  256000    
Request throughput (req/s):              5.90      
Output token throughput (tok/s):         5902.93   
Total Token throughput (tok/s):          11805.87  
---------------Time to First Token----------------
Mean TTFT (ms):                          1852.94   
Median TTFT (ms):                        1812.18   
P99 TTFT (ms):                           3045.32   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.83     
Median TPOT (ms):                        19.94     
P99 TPOT (ms):                           21.36     
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.81     
Median ITL (ms):                         18.24     
P99 ITL (ms):                            24.24     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          21664.24  
Median E2EL (ms):                        21673.63  
P99 E2EL (ms):                           21750.85  
==================================================
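To make the two runs easier to compare, the relative change can be computed directly from the numbers reported above. The snippet below is only a small illustrative sketch; the helper name `relative_change` and the dictionary keys are made up here, while the values are copied from the two benchmark outputs.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after` (positive means an increase)."""
    return (after - before) / before * 100.0


# Values copied from the two "Serving Benchmark Result" blocks.
baseline = {"total_tok_s": 11290.90, "mean_tpot_ms": 20.81, "mean_e2el_ms": 22650.63}
with_pr = {"total_tok_s": 11805.87, "mean_tpot_ms": 19.83, "mean_e2el_ms": 21664.24}

for name in baseline:
    change = relative_change(baseline[name], with_pr[name])
    print(f"{name}: {change:+.2f}%")
```

Total token throughput rises by roughly 4.6%, and mean TPOT and end-to-end latency both drop by a similar margin, while TTFT is essentially unchanged.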

Accuracy result:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.3366|±  | 0.013|
|     |       |strict-match    |     3|exact_match|↑  |0.8795|±  | 0.009|

* add qknorm rope fused kernel * fix typo * fix lint

* Add `qknorm+rope` fused kernel (ROCm#1590) * add qknorm rope fused kernel * fix typo * fix lint * fix bug * add fused support --------- Co-authored-by: Yutao Xu <xytpai@foxmail.com>

xytpai added 2 commits December 9, 2025 05:56

add qknorm rope fused kernel

3e63910

fix typo

46e8270

zhuyuhua-v mentioned this pull request Dec 9, 2025

[Qwen3][fusion]port qknorm+rope fusion ROCm/ATOM#36

Open

xytpai added 2 commits December 9, 2025 17:43

Merge branch 'main' into xyt/qknorm_rope

0f959f7

fix lint

c5d8c40

xytpai requested a review from valarLip December 9, 2025 13:31

valarLip approved these changes Dec 9, 2025

xytpai merged commit a10e0f8 into main Dec 9, 2025
22 checks passed

xytpai deleted the xyt/qknorm_rope branch December 9, 2025 14:22

amd-youchen pushed a commit to amd-youchen/aiter that referenced this pull request Dec 10, 2025

Add qknorm+rope fused kernel (ROCm#1590)

e5bbf08

* add qknorm rope fused kernel * fix typo * fix lint

zhuyuhua-v pushed a commit that referenced this pull request Dec 17, 2025

Add qknorm+rope fused kernel (#1590)

cd0d956

* add qknorm rope fused kernel * fix typo * fix lint

ZhangLirong-amd pushed a commit that referenced this pull request Dec 29, 2025

Add qknorm+rope fused kernel (#1590)

5d6502d

* add qknorm rope fused kernel * fix typo * fix lint

xytpai merged 4 commits into main from xyt/qknorm_rope
