@zhuyuhua-v zhuyuhua-v commented Dec 9, 2025

Motivation

Port the QK-norm + RoPE fusion for Qwen3-235B.
This works together with PR ROCm/aiter#1590.

Technical Details

Test Plan

# server:
MODEL=/data/pretrained-models/Qwen3-235B-A22B-Instruct-2507-FP8/
rm -rf /root/.cache/atom/

python -m atom.entrypoints.openai_server --model ${MODEL} -tp 8 --kv_cache_dtype fp8 --enable-expert-parallel

# client:
MODEL=Qwen3-235B-A22B-Instruct-2507-FP8/
ISL=1000
OSL=1000
CONC=128
PORT=8000
RESULT_FILENAME="qwen3_235b_a22b_instrct_2507_FP8_isl${ISL}_osl${OSL}_conc${CONC}_infrrate"
# Remember to use scripts in this repo!
git clone https://github.com/kimbochen/bench_serving.git
python bench_serving/benchmark_serving.py \
--model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
--dataset-name=random \
--random-input-len=$ISL --random-output-len=$OSL \
--random-range-ratio 1 \
--num-prompts=$(( $CONC * 2)) \
--max-concurrency=$CONC \
--request-rate=inf --ignore-eos \
--save-result --percentile-metrics="ttft,tpot,itl,e2el" \
--result-dir=./ --result-filename=$RESULT_FILENAME.json

# accuracy:
lm_eval --model local-completions \
        --model_args model=${MODEL},base_url=http://localhost:8000/v1/completions,num_concurrent=128,max_retries=3,tokenized_requests=False \
        --tasks gsm8k \
        --num_fewshot 3

Test Result

Performance without this PR: 11290.90 tok/s (total token throughput)

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  45.35     
Total input tokens:                      256000    
Total generated tokens:                  256000    
Request throughput (req/s):              5.65      
Output token throughput (tok/s):         5645.45   
Total Token throughput (tok/s):          11290.90  
---------------Time to First Token----------------
Mean TTFT (ms):                          1858.31   
Median TTFT (ms):                        1775.52   
P99 TTFT (ms):                           3039.45   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.81     
Median TPOT (ms):                        20.79     
P99 TPOT (ms):                           22.32     
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.79     
Median ITL (ms):                         19.20     
P99 ITL (ms):                            25.13     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          22650.63  
Median E2EL (ms):                        22647.60  
P99 E2EL (ms):                           22768.93  
==================================================

Performance with this PR: 11805.87 tok/s (total token throughput)

============ Serving Benchmark Result ============
Successful requests:                     256       
Benchmark duration (s):                  43.37     
Total input tokens:                      256000    
Total generated tokens:                  256000    
Request throughput (req/s):              5.90      
Output token throughput (tok/s):         5902.93   
Total Token throughput (tok/s):          11805.87  
---------------Time to First Token----------------
Mean TTFT (ms):                          1852.94   
Median TTFT (ms):                        1812.18   
P99 TTFT (ms):                           3045.32   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.83     
Median TPOT (ms):                        19.94     
P99 TPOT (ms):                           21.36     
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.81     
Median ITL (ms):                         18.24     
P99 ITL (ms):                            24.24     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          21664.24  
Median E2EL (ms):                        21673.63  
P99 E2EL (ms):                           21750.85  
==================================================
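
For a quick comparison of the two runs above, the relative change in each headline metric can be computed directly from the reported numbers. This is a minimal sketch; the helper name `relative_change` and the dictionary keys are illustrative and not part of the benchmark script.

```python
def relative_change(before, after):
    """Return the fractional change from `before` to `after`."""
    return (after - before) / before


# Values copied from the two "Serving Benchmark Result" blocks above.
baseline = {"total_tok_s": 11290.90, "mean_tpot_ms": 20.81, "mean_e2el_ms": 22650.63}
fused = {"total_tok_s": 11805.87, "mean_tpot_ms": 19.83, "mean_e2el_ms": 21664.24}

changes = {name: relative_change(baseline[name], fused[name]) for name in baseline}

for name, change in changes.items():
    print(f"{name}: {change:+.2%}")
```

On these numbers, total token throughput rises by roughly 4.6%, while mean TPOT and mean end-to-end latency each drop by a little over 4%. Mean TTFT is essentially unchanged (1858.31 ms vs. 1852.94 ms).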

Accuracy result:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match||0.3366|±  | 0.013|
|     |       |strict-match    |     3|exact_match||0.8795|±  | 0.009|

Base automatically changed from guanbao/add_qwen3_moe to main December 9, 2025 09:34
Copilot AI review requested due to automatic review settings December 9, 2025 09:36
Copilot AI left a comment

Pull request overview

This PR introduces a performance optimization by implementing a fused QK-norm and RoPE (Rotary Position Embedding) operation for the Qwen3-235B model. The fusion combines query/key normalization with rotary position embedding into a single operation, reducing computational overhead.

Key changes:

  • Added environment variable ATOM_ENABLE_QK_NORM_ROPE_FUSION to toggle the fusion feature
  • Implemented RotaryEmbeddingQKNormFused class that performs combined QK-norm and RoPE operations
  • Modified Qwen3MoeAttention to conditionally use the fused implementation
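
As a reference for what the fused operation computes, the unfused sequence can be sketched in plain Python: RMS-normalise each query/key head, then apply the rotary embedding. This is only an illustrative sketch, not the kernel from this PR or from aiter. The function names, the NeoX-style pairing of channels, `eps=1e-6`, and `base=10000.0` are all assumptions made for the example.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMS-normalise one head vector and apply a per-channel weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def rope_neox(x, position, base=10000.0):
    """Rotate channel pairs (x[i], x[i + d/2]) by position-dependent angles."""
    dim = len(x)
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        theta = position * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    return out


def qk_norm_rope_reference(q_heads, k_heads, q_weight, k_weight, position):
    """Unfused reference: per-head RMSNorm followed by RoPE, for q and k."""
    q_out = [rope_neox(rms_norm(h, q_weight), position) for h in q_heads]
    k_out = [rope_neox(rms_norm(h, k_weight), position) for h in k_heads]
    return q_out, k_out
```

A fused kernel would be expected to produce the same values as this two-step reference (up to floating-point tolerance) while reading and writing the q/k tensors once instead of twice.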

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| atom/utils/envs.py | Adds environment variable for enabling QK-norm+RoPE fusion |
| atom/models/qwen3_moe.py | Implements fused RoPE class and integrates it into the attention mechanism |
| atom/model_engine/arg_utils.py | Adds import for envs module (unused in diff) |
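
To illustrate how a toggle like `ATOM_ENABLE_QK_NORM_ROPE_FUSION` could gate the choice between the fused and unfused paths, here is a minimal sketch. The parsing rules (accepting "1" or "true", defaulting to disabled) and the helper names `fusion_enabled` and `pick_rope_path` are assumptions for this example; the actual handling in atom/utils/envs.py is not shown in this conversation and may differ.

```python
import os

FLAG = "ATOM_ENABLE_QK_NORM_ROPE_FUSION"


def fusion_enabled(environ=None):
    """Return True if the flag is set to a truthy value (assumed: '1' or 'true')."""
    env = os.environ if environ is None else environ
    return env.get(FLAG, "0").strip().lower() in {"1", "true"}


def pick_rope_path(environ=None):
    """Choose which attention path to build; the names are placeholders."""
    return "fused" if fusion_enabled(environ) else "unfused"


# Usage: pass a mapping explicitly so the example does not depend on the shell.
print(pick_rope_path({FLAG: "1"}))
print(pick_rope_path({}))
```

Keeping the check in one helper means the model code only needs a single branch when constructing the attention layer.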

Copilot AI review requested due to automatic review settings December 9, 2025 09:38
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


Copilot AI review requested due to automatic review settings December 9, 2025 09:39
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
@zhuyuhua-v changed the title from "[Qwen3]port qknorm+rope fusion" to "[Qwen3][fusion]port qknorm+rope fusion" on Dec 11, 2025
@zhuyuhua-v (author) commented:

@valarLip Could you please help review this PR?

@zhuyuhua-v zhuyuhua-v requested a review from valarLip December 11, 2025 09:37