Test Plan
# server:
MODEL=/data/pretrained-models/Qwen3-235B-A22B-Instruct-2507-FP8/
rm -rf /root/.cache/atom/
python -m atom.entrypoints.openai_server --model ${MODEL} -tp 8 --kv_cache_dtype fp8 --enable-expert-parallel
# client:
MODEL=Qwen3-235B-A22B-Instruct-2507-FP8/
ISL=1000
OSL=1000
CONC=128
PORT=8000
RESULT_FILENAME="qwen3_235b_a22b_instrct_2507_FP8_isl${ISL}_osl${OSL}_conc${CONC}_infrrate"
# Remember to use scripts in this repo!
git clone https://github.com/kimbochen/bench_serving.git
python bench_serving/benchmark_serving.py \
--model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
--dataset-name=random \
--random-input-len=$ISL --random-output-len=$OSL \
--random-range-ratio 1 \
--num-prompts=$(( $CONC * 2)) \
--max-concurrency=$CONC \
--request-rate=inf --ignore-eos \
--save-result --percentile-metrics="ttft,tpot,itl,e2el" \
--result-dir=./ --result-filename=$RESULT_FILENAME.json
# accuracy:
lm_eval --model local-completions \
--model_args model=${MODEL},base_url=http://localhost:8000/v1/completions,num_concurrent=128,max_retries=3,tokenized_requests=False \
--tasks gsm8k \
--num_fewshot 3
Test Result
perf without this pr: 11290.90 tok/s
============ Serving Benchmark Result ============
Successful requests: 256
Benchmark duration (s): 45.35
Total input tokens: 256000
Total generated tokens: 256000
Request throughput (req/s): 5.65
Output token throughput (tok/s): 5645.45
Total Token throughput (tok/s): 11290.90
---------------Time to First Token----------------
Mean TTFT (ms): 1858.31
Median TTFT (ms): 1775.52
P99 TTFT (ms): 3039.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 20.81
Median TPOT (ms): 20.79
P99 TPOT (ms): 22.32
---------------Inter-token Latency----------------
Mean ITL (ms): 20.79
Median ITL (ms): 19.20
P99 ITL (ms): 25.13
----------------End-to-end Latency----------------
Mean E2EL (ms): 22650.63
Median E2EL (ms): 22647.60
P99 E2EL (ms): 22768.93
==================================================
perf with this pr: 11805.87 tok/s
============ Serving Benchmark Result ============
Successful requests: 256
Benchmark duration (s): 43.37
Total input tokens: 256000
Total generated tokens: 256000
Request throughput (req/s): 5.90
Output token throughput (tok/s): 5902.93
Total Token throughput (tok/s): 11805.87
---------------Time to First Token----------------
Mean TTFT (ms): 1852.94
Median TTFT (ms): 1812.18
P99 TTFT (ms): 3045.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 19.83
Median TPOT (ms): 19.94
P99 TPOT (ms): 21.36
---------------Inter-token Latency----------------
Mean ITL (ms): 19.81
Median ITL (ms): 18.24
P99 ITL (ms): 24.24
----------------End-to-end Latency----------------
Mean E2EL (ms): 21664.24
Median E2EL (ms): 21673.63
P99 E2EL (ms): 21750.85
==================================================
Accuracy result:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.3366|±  | 0.013|
|     |       |strict-match    |     3|exact_match|↑  |0.8795|±  | 0.009|
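To make the before/after comparison easier to read, the sketch below computes the relative change for a few metrics copied from the two benchmark runs above. The dictionary keys and the `relative_change` helper are names chosen for this illustration only; they are not part of the benchmark script.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after` (positive means an increase)."""
    return (after - before) / before * 100.0

# Values copied from the two "Serving Benchmark Result" blocks above.
baseline = {"total_tok_s": 11290.90, "mean_tpot_ms": 20.81, "mean_e2el_ms": 22650.63}
with_pr = {"total_tok_s": 11805.87, "mean_tpot_ms": 19.83, "mean_e2el_ms": 21664.24}

changes = {name: relative_change(baseline[name], with_pr[name]) for name in baseline}

for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```

On these numbers, total token throughput rises by roughly 4.6% while mean TPOT and mean end-to-end latency each drop by a bit more than 4%. Mean TTFT is essentially unchanged (1858.31 ms vs 1852.94 ms), consistent with the gain coming mostly from the decode phase.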
valarLip approved these changes on Dec 9, 2025

amd-youchen pushed a commit to amd-youchen/aiter that referenced this pull request on Dec 10, 2025
* add qknorm rope fused kernel
* fix typo
* fix lint

xytpai added a commit to ZLkanyo009/aiter that referenced this pull request on Dec 11, 2025
* Add `qknorm+rope` fused kernel (ROCm#1590)
* add qknorm rope fused kernel
* fix typo
* fix lint
* fix bug
* add fused support
---------
Co-authored-by: Yutao Xu <xytpai@foxmail.com>

zhuyuhua-v pushed a commit that referenced this pull request on Dec 17, 2025
* add qknorm rope fused kernel
* fix typo
* fix lint

ZhangLirong-amd pushed a commit that referenced this pull request on Dec 29, 2025
* add qknorm rope fused kernel
* fix typo
* fix lint
Motivation
This kernel is for Qwen3 models. The work is straightforward because we already have the
mrope_3dhip kernel template.
Test Result
4x speedup compared with the native implementation
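
For readers unfamiliar with the operation being fused, here is a minimal unfused reference in plain Python. It is a sketch under stated assumptions, not the kernel from this PR: it assumes "qknorm" means a per-head RMSNorm applied to the query and key vectors (as in Qwen3 attention) and that RoPE uses NeoX-style pairing. The function names, the `base` default, and the single-head list layout are all chosen for illustration.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one head vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

def rope_neox(x, position, base=10000.0):
    """Rotary embedding with NeoX-style pairing: element i rotates with element i + d/2."""
    dim = len(x)
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        # Rotation angle for this pair: position * base^(-2i/d).
        angle = position * base ** (-2.0 * i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out

def qknorm_rope_reference(q, k, q_weight, k_weight, position):
    """Unfused reference: normalize q and k separately, then rotate both."""
    q_out = rope_neox(rms_norm(q, q_weight), position)
    k_out = rope_neox(rms_norm(k, k_weight), position)
    return q_out, k_out

# Toy single-head example with head_dim = 4 and unit norm weights (assumed values).
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 1.5, -2.0]
ones = [1.0] * 4
q_out, k_out = qknorm_rope_reference(q, k, ones, ones, position=3)
```

Run unfused, the normalization and the rotation are separate passes over the same data, each typically launched as its own GPU kernel. A fused kernel can do both in a single pass, which is a plausible source of the kernel-level speedup reported above.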