Add qknorm+rope fused kernel #1590

Merged
xytpai merged 4 commits into main from xyt/qknorm_rope
Dec 9, 2025

@xytpai xytpai commented Dec 9, 2025

Motivation

This kernel targets Qwen3 models. The change is straightforward because the existing mrope_3d HIP kernel template can be reused.
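For readers unfamiliar with the operation being fused, the following is a minimal pure-Python reference sketch of QK-norm followed by RoPE. It is not the HIP kernel from this PR. It assumes per-head RMSNorm over the head dimension and NeoX-style rotation (element `i` paired with element `i + d/2`); the function names, the `eps` default, and the `base` value are illustrative assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one head vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rope_neox(x, position, base=10000.0):
    """NeoX-style RoPE (assumed layout): rotate element i with element i + d/2."""
    dim = len(x)
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        # Rotation angle for pair i at this token position.
        theta = position * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


def qk_norm_rope(q_heads, k_heads, q_weight, k_weight, position):
    """Unfused reference: normalize each Q/K head, then rotate it.

    A fused kernel would do both steps in a single pass over each head
    instead of writing the normalized tensor back to memory in between.
    """
    q_out = [rope_neox(rms_norm(h, q_weight), position) for h in q_heads]
    k_out = [rope_neox(rms_norm(h, k_weight), position) for h in k_heads]
    return q_out, k_out
```

A reference like this is mainly useful for checking a fused kernel's output against the two-step result within a numerical tolerance.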

Test Result

4x speedup compared with the native implementation

@zhuyuhua-v

Test Plan

# server:
MODEL=/data/pretrained-models/Qwen3-235B-A22B-Instruct-2507-FP8/
rm -rf /root/.cache/atom/

python -m atom.entrypoints.openai_server --model ${MODEL} -tp 8 --kv_cache_dtype fp8 --enable-expert-parallel

# client:
MODEL=Qwen3-235B-A22B-Instruct-2507-FP8/
ISL=1000
OSL=1000
CONC=128
PORT=8000
RESULT_FILENAME="qwen3_235b_a22b_instrct_2507_FP8_isl${ISL}_osl${OSL}_conc${CONC}_infrrate"
# Remember to use scripts in this repo!
git clone https://github.com/kimbochen/bench_serving.git
python bench_serving/benchmark_serving.py \
--model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
--dataset-name=random \
--random-input-len=$ISL --random-output-len=$OSL \
--random-range-ratio 1 \
--num-prompts=$(( $CONC * 2)) \
--max-concurrency=$CONC \
--request-rate=inf --ignore-eos \
--save-result --percentile-metrics="ttft,tpot,itl,e2el" \
--result-dir=./ --result-filename=$RESULT_FILENAME.json

# accuracy:
lm_eval --model local-completions \
        --model_args model=${MODEL},base_url=http://localhost:8000/v1/completions,num_concurrent=128,max_retries=3,tokenized_requests=False \
        --tasks gsm8k \
        --num_fewshot 3

Test Result

Performance without this PR: 11290.90 tok/s (total token throughput)

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  45.35     
Total input tokens:                      256000    
Total generated tokens:                  256000    
Request throughput (req/s):              5.65      
Output token throughput (tok/s):         5645.45   
Total Token throughput (tok/s):          11290.90  
---------------Time to First Token----------------
Mean TTFT (ms):                          1858.31   
Median TTFT (ms):                        1775.52   
P99 TTFT (ms):                           3039.45   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.81     
Median TPOT (ms):                        20.79     
P99 TPOT (ms):                           22.32     
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.79     
Median ITL (ms):                         19.20     
P99 ITL (ms):                            25.13     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          22650.63  
Median E2EL (ms):                        22647.60  
P99 E2EL (ms):                           22768.93  
==================================================

Performance with this PR: 11805.87 tok/s (total token throughput)

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  43.37     
Total input tokens:                      256000    
Total generated tokens:                  256000    
Request throughput (req/s):              5.90      
Output token throughput (tok/s):         5902.93   
Total Token throughput (tok/s):          11805.87  
---------------Time to First Token----------------
Mean TTFT (ms):                          1852.94   
Median TTFT (ms):                        1812.18   
P99 TTFT (ms):                           3045.32   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.83     
Median TPOT (ms):                        19.94     
P99 TPOT (ms):                           21.36     
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.81     
Median ITL (ms):                         18.24     
P99 ITL (ms):                            24.24     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          21664.24  
Median E2EL (ms):                        21673.63  
P99 E2EL (ms):                           21750.85  
==================================================
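To put the two runs side by side, the relative change can be computed from the numbers reported above. This is a small illustrative sketch; the dictionary keys and the `relative_change` helper are hypothetical names and not part of the benchmark tooling.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after` (positive means an increase)."""
    return (after - before) / before * 100.0


# Values copied from the two benchmark outputs above.
baseline = {"total_tok_s": 11290.90, "mean_tpot_ms": 20.81, "mean_e2el_ms": 22650.63}
with_pr = {"total_tok_s": 11805.87, "mean_tpot_ms": 19.83, "mean_e2el_ms": 21664.24}

for metric in baseline:
    delta = relative_change(baseline[metric], with_pr[metric])
    print(f"{metric}: {delta:+.2f}%")
```

On these figures, total token throughput rises by roughly 4.6% and mean TPOT drops by roughly 4.7%, while mean TTFT is essentially unchanged between the two runs.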

Accuracy result:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match||0.3366|±  | 0.013|
|     |       |strict-match    |     3|exact_match||0.8795|±  | 0.009|

@xytpai xytpai requested a review from valarLip December 9, 2025 13:31
@xytpai xytpai merged commit a10e0f8 into main Dec 9, 2025
22 checks passed
@xytpai xytpai deleted the xyt/qknorm_rope branch December 9, 2025 14:22
amd-youchen pushed a commit to amd-youchen/aiter that referenced this pull request Dec 10, 2025
* add qknorm rope fused kernel

* fix typo

* fix lint
xytpai added a commit to ZLkanyo009/aiter that referenced this pull request Dec 11, 2025
* Add `qknorm+rope` fused kernel (ROCm#1590)

* add qknorm rope fused kernel

* fix typo

* fix lint

* fix bug

* add fused support

---------

Co-authored-by: Yutao Xu <xytpai@foxmail.com>
zhuyuhua-v pushed a commit that referenced this pull request Dec 17, 2025
* add qknorm rope fused kernel

* fix typo

* fix lint
ZhangLirong-amd pushed a commit that referenced this pull request Dec 29, 2025
* add qknorm rope fused kernel

* fix typo

* fix lint
