[Qwen3][fusion]port qknorm+rope fusion #36

zhuyuhua-v · 2025-12-09T09:03:30Z

Motivation

Port the QK-norm + RoPE fusion for Qwen3-235B.
Works together with PR ROCm/aiter#1590.

Technical Details

Test Plan

# server:
MODEL=/data/pretrained-models/Qwen3-235B-A22B-Instruct-2507-FP8/
rm -rf /root/.cache/atom/

python -m atom.entrypoints.openai_server --model ${MODEL} -tp 8 --kv_cache_dtype fp8 --enable-expert-parallel

# client:
MODEL=Qwen3-235B-A22B-Instruct-2507-FP8/
ISL=1000
OSL=1000
CONC=128
PORT=8000
RESULT_FILENAME="qwen3_235b_a22b_instrct_2507_FP8_isl${ISL}_osl${OSL}_conc${CONC}_infrrate"
# Remember to use scripts in this repo!
git clone https://github.com/kimbochen/bench_serving.git
python bench_serving/benchmark_serving.py \
--model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
--dataset-name=random \
--random-input-len=$ISL --random-output-len=$OSL \
--random-range-ratio 1 \
--num-prompts=$(( $CONC * 2)) \
--max-concurrency=$CONC \
--request-rate=inf --ignore-eos \
--save-result --percentile-metrics="ttft,tpot,itl,e2el" \
--result-dir=./ --result-filename=$RESULT_FILENAME.json

# accuracy:
lm_eval --model local-completions \
        --model_args model=${MODEL},base_url=http://localhost:8000/v1/completions,num_concurrent=128,max_retries=3,tokenized_requests=False \
        --tasks gsm8k \
        --num_fewshot 3

Test Result

Performance without this PR: 11290.90 tok/s (total token throughput)

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  45.35     
Total input tokens:                      256000    
Total generated tokens:                  256000    
Request throughput (req/s):              5.65      
Output token throughput (tok/s):         5645.45   
Total Token throughput (tok/s):          11290.90  
---------------Time to First Token----------------
Mean TTFT (ms):                          1858.31   
Median TTFT (ms):                        1775.52   
P99 TTFT (ms):                           3039.45   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.81     
Median TPOT (ms):                        20.79     
P99 TPOT (ms):                           22.32     
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.79     
Median ITL (ms):                         19.20     
P99 ITL (ms):                            25.13     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          22650.63  
Median E2EL (ms):                        22647.60  
P99 E2EL (ms):                           22768.93  
==================================================
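As a quick sanity check on the baseline table above, the throughput rows can be recomputed from the request count, token counts, and duration. This is only a sketch: the variable names are illustrative, and because the reported duration is rounded to two decimals, the recomputed values match the report only to within a small tolerance.

```python
# Values copied from the baseline "Serving Benchmark Result" table.
requests = 256
input_tokens = 256000
output_tokens = 256000
duration_s = 45.35  # rounded in the report, so results below are approximate

# Throughput is simply work divided by wall-clock duration.
request_tput = requests / duration_s
output_tput = output_tokens / duration_s
total_tput = (input_tokens + output_tokens) / duration_s

# With ISL = OSL = 1000 and CONC = 128, the test plan sends CONC * 2 prompts.
expected_prompts = 128 * 2
expected_tokens_per_direction = expected_prompts * 1000

print(f"request throughput ~ {request_tput:.2f} req/s")
print(f"output throughput  ~ {output_tput:.1f} tok/s")
print(f"total throughput   ~ {total_tput:.1f} tok/s")
```

The prompt and token counts line up with the test plan (256 requests, 256000 tokens in each direction), and the recomputed rates agree with the reported rows up to the rounding of the duration.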

Performance with this PR: 11805.87 tok/s (total token throughput)

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  43.37     
Total input tokens:                      256000    
Total generated tokens:                  256000    
Request throughput (req/s):              5.90      
Output token throughput (tok/s):         5902.93   
Total Token throughput (tok/s):          11805.87  
---------------Time to First Token----------------
Mean TTFT (ms):                          1852.94   
Median TTFT (ms):                        1812.18   
P99 TTFT (ms):                           3045.32   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.83     
Median TPOT (ms):                        19.94     
P99 TPOT (ms):                           21.36     
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.81     
Median ITL (ms):                         18.24     
P99 ITL (ms):                            24.24     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          21664.24  
Median E2EL (ms):                        21673.63  
P99 E2EL (ms):                           21750.85  
==================================================
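To put the two runs side by side, the relative change of a few headline metrics can be computed directly from the tables above. This is a minimal sketch; the dictionary keys are illustrative names, and each configuration was measured in a single run, so small differences may be within noise.

```python
# Headline metrics copied from the two benchmark tables above.
baseline = {
    "total_tok_s": 11290.90,
    "mean_ttft_ms": 1858.31,
    "mean_tpot_ms": 20.81,
    "mean_e2el_ms": 22650.63,
}
with_pr = {
    "total_tok_s": 11805.87,
    "mean_ttft_ms": 1852.94,
    "mean_tpot_ms": 19.83,
    "mean_e2el_ms": 21664.24,
}


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


changes = {name: percent_change(baseline[name], with_pr[name]) for name in baseline}
for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```

Total token throughput rises by roughly 4.6%, mean TPOT and mean end-to-end latency drop by roughly 4 to 5%, and mean TTFT is essentially unchanged.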

Accuracy result:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.3366|±  | 0.013|
|     |       |strict-match    |     3|exact_match|↑  |0.8795|±  | 0.009|

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

Copilot

Pull request overview

This PR introduces a performance optimization by implementing a fused QK-norm and RoPE (Rotary Position Embedding) operation for the Qwen3-235B model. The fusion combines query/key normalization with rotary position embedding into a single operation, reducing computational overhead.
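For readers unfamiliar with the two steps being fused, the following is a plain-Python reference of the unfused computation: RMS normalization of each query/key head followed by rotate-half (NeoX-style) RoPE. It is a sketch under assumptions, not the PR's implementation: the function names, the `eps` and `base` defaults, and the per-head RMSNorm layout are illustrative, and the real kernel lives in aiter and operates on GPU tensors.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one head vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def rope_neox(x, position, base=10000.0):
    """Rotate-half RoPE on one head vector of even length."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        # Frequency of the i-th pair; pairs are (i, i + half) in this layout.
        angle = position * base ** (-2.0 * i / len(x))
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = x[i], x[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out


def qk_norm_rope_reference(q_heads, k_heads, q_weight, k_weight, position):
    """Unfused reference: normalize each head, then rotate it."""
    q_out = [rope_neox(rms_norm(h, q_weight), position) for h in q_heads]
    k_out = [rope_neox(rms_norm(h, k_weight), position) for h in k_heads]
    return q_out, k_out


q, k = qk_norm_rope_reference(
    q_heads=[[3.0, 4.0, 0.0, 0.0]],
    k_heads=[[1.0, 1.0, 1.0, 1.0]],
    q_weight=[1.0] * 4,
    k_weight=[1.0] * 4,
    position=5,
)
```

A fused kernel would compute the same result in one pass over each head instead of two, which is where the saved overhead would come from.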

Key changes:

- Added environment variable ATOM_ENABLE_QK_NORM_ROPE_FUSION to toggle the fusion feature
- Implemented RotaryEmbeddingQKNormFused class that performs combined QK-norm and RoPE operations
- Modified Qwen3MoeAttention to conditionally use the fused implementation
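The toggle described in the changes above can be pictured as a small environment check that selects between the two code paths. Only the variable name `ATOM_ENABLE_QK_NORM_ROPE_FUSION` comes from the PR; the helper names, the accepted values, the disabled-by-default behavior, and the returned labels are all hypothetical and may differ from what `atom/utils/envs.py` actually does.

```python
import os

ENV_NAME = "ATOM_ENABLE_QK_NORM_ROPE_FUSION"  # name taken from the PR


def fusion_enabled(environ=None):
    """Hypothetical parser: "1", "true" or "on" enable the fusion (assumed default: off)."""
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_NAME, "0")
    return value.strip().lower() in {"1", "true", "on"}


def pick_rope_path(environ=None):
    """Return a label for the path an attention layer would take (illustrative only)."""
    if fusion_enabled(environ):
        return "fused_qk_norm_rope"
    return "separate_qk_norm_then_rope"


print(pick_rope_path({ENV_NAME: "1"}))
print(pick_rope_path({}))
```

Passing the environment as an argument keeps the check easy to test without mutating `os.environ`.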

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

|File|Description|
|----|-----------|
|atom/utils/envs.py|Adds environment variable for enabling QK-norm+RoPE fusion|
|atom/models/qwen3_moe.py|Implements fused RoPE class and integrates it into the attention mechanism|
|atom/model_engine/arg_utils.py|Adds import for envs module (unused in diff)|


atom/models/qwen3_moe.py

atom/model_engine/arg_utils.py

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


atom/models/qwen3_moe.py

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


atom/models/qwen3_moe.py


zhuyuhua-v · 2025-12-11T09:37:15Z

@valarLip Could you please help review this pr?

Guanbao Yu and others added 6 commits December 2, 2025 18:33

add qwen3 moe model support

9be415a

refine fused_ar_norm

0469e51

enhance tp support

be95f02

tp8 fix

084b384

Merge branch 'main' into guanbao/add_qwen3_moe

d82de64

[Qwen3]port qknorm+rope fusion

b2c2539

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

Base automatically changed from guanbao/add_qwen3_moe to main December 9, 2025 09:34

Merge branch 'main' into yuhua/norm+rope

e21773d

Copilot AI review requested due to automatic review settings December 9, 2025 09:36

Copilot AI reviewed Dec 9, 2025

atom/models/qwen3_moe.py (outdated, resolved)

atom/models/qwen3_moe.py (outdated, resolved)

atom/model_engine/arg_utils.py (outdated, resolved)

Copilot AI review requested due to automatic review settings December 9, 2025 09:38

Copilot AI reviewed Dec 9, 2025

atom/models/qwen3_moe.py (resolved)

atom/models/qwen3_moe.py (outdated, resolved)

Copilot AI review requested due to automatic review settings December 9, 2025 09:39

Copilot AI reviewed Dec 9, 2025

atom/models/qwen3_moe.py (resolved)

atom/models/qwen3_moe.py (outdated, resolved)

Apply suggestions from code review

374cc77

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

zhuyuhua-v force-pushed the yuhua/norm+rope branch from b414a35 to 374cc77 Compare December 11, 2025 08:54

zhuyuhua-v changed the title from "[Qwen3]port qknorm+rope fusion" to "[Qwen3][fusion]port qknorm+rope fusion" on Dec 11, 2025

zhuyuhua-v requested a review from valarLip December 11, 2025 09:37

valarLip approved these changes Dec 12, 2025

4 participants