Conversation


@kliuae-amd kliuae-amd commented Oct 13, 2025

Purpose

This PR adds aiter's fused RMSNorm + FP8 quantization kernel, invoked from the rms_norm + quant_fp8 custom fusion pass.
To use this feature, enable aiter with VLLM_ROCM_USE_AITER=1 and set --compilation-config '{"pass_config": {"enable_fusion": true, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"]}' to enable the fusion pass.
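For reference, a minimal NumPy sketch (not the aiter kernel) of what the fused op computes: RMSNorm followed by dynamic per-tensor FP8 quantization. The value fp8_max=448.0 assumes the e4m3fn format, and the actual float8 cast is simulated here by clipping:

```python
import numpy as np

def rmsnorm_quant_fp8_ref(x, weight, eps=1e-6, fp8_max=448.0):
    """Reference sketch: RMSNorm then dynamic per-tensor FP8-style quantization."""
    # RMSNorm: scale each row by its reciprocal RMS, then apply the weight.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    normed = x / rms * weight
    # Dynamic per-tensor scale so the absolute max maps onto the FP8 range.
    scale = np.abs(normed).max() / fp8_max
    # Stand-in for the float8_e4m3fn cast: clip to the representable range.
    q = np.clip(normed / scale, -fp8_max, fp8_max)
    return q, scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = rmsnorm_quant_fp8_ref(x, np.ones(8, dtype=np.float32))
```

The fusion pass replaces the separate rms_norm and quant_fp8 custom ops with a single kernel computing the equivalent of this function, saving one round trip through global memory between the two ops.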

Test Plan

End-to-end test using RedHatAI/Qwen3-14B-FP8-dynamic model

Server command:

VLLM_ROCM_USE_AITER=1 VLLM_USE_V1=1 VLLM_DISABLE_COMPILE_CACHE=1 \
  vllm serve RedHatAI/Qwen3-14B-FP8-dynamic \
  --compilation-config '{"pass_config": {"enable_fusion": true, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"], "cudagraph_capture_sizes": [1,2,4,8,16,24,32,256]}' \
  --trust-remote-code --swap-space 16 --distributed-executor-backend mp

lm_eval command:

lm_eval --model local-completions --tasks gsm8k --model_args model=RedHatAI/Qwen3-14B-FP8-dynamic,base_url=http://localhost:9090/v1/completions --trust_remote_code --num_fewshot 5 --batch_size 128

Benchmark command:

vllm bench serve --backend vllm --model "RedHatAI/Qwen3-14B-FP8-dynamic" --dataset-name random --num-prompts 500 --random-input-len 1000 --random-output-len 1000 --endpoint /v1/completions --random-range-ratio 0.9

Test Result

lm_eval

w/o fusion

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7726|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.8825|±  |0.0089|

w/ fusion

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7665|±  |0.0117|
|     |       |strict-match    |     5|exact_match|↑  |0.8825|±  |0.0089|

Serving benchmark

| Metric | w/o fusion | w/ fusion |
|--------|-----------:|----------:|
| Successful requests | 500 | 500 |
| Benchmark duration (s) | 182.04 | 179.43 |
| Total input tokens | 520558 | 520558 |
| Total generated tokens | 459135 | 448224 |
| Request throughput (req/s) | 2.75 | 2.79 |
| Output token throughput (tok/s) | 2522.18 | 2498.05 |
| Peak output token throughput (tok/s) | 7756.00 | 7762.00 |
| Peak concurrent requests | 500.00 | 500.00 |
| Total token throughput (tok/s) | 5381.78 | 5399.23 |
| Mean TTFT (ms) | 36077.04 | 35873.91 |
| Median TTFT (ms) | 30327.35 | 30536.64 |
| P99 TTFT (ms) | 94438.56 | 93577.47 |
| Mean TPOT (ms) | 171.56 | 182.25 |
| Median TPOT (ms) | 120.75 | 123.33 |
| P99 TPOT (ms) | 936.65 | 987.21 |
| Mean ITL (ms) | 118.17 | 118.68 |
| Median ITL (ms) | 60.30 | 60.00 |
| P99 ITL (ms) | 2173.92 | 2181.85 |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@VllmInductorPass.time_and_log
def __call__(self, graph: fx.Graph):
    self.matched_count = self.patterns.apply(graph)
    print("Matched count:", self.matched_count)
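For intuition about what a pass like this does, here is a hypothetical, dependency-free stand-in (none of these names are vLLM's real API): scan an op sequence, fuse adjacent (rms_norm, quant_fp8) pairs, and report the matched count that the real pass gets back from patterns.apply:

```python
def apply_rmsnorm_quant_fusion(ops):
    """Fuse adjacent ("rms_norm", "quant_fp8") pairs in a flat op list,
    returning the rewritten list and the number of matches."""
    fused, count, i = [], 0, 0
    while i < len(ops):
        if ops[i] == "rms_norm" and i + 1 < len(ops) and ops[i + 1] == "quant_fp8":
            fused.append("rms_norm_quant_fp8_fused")  # stands in for the aiter fused kernel
            count += 1
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused, count
```

In the real pass the matching runs over a torch.fx graph via registered replacement patterns rather than a flat list, but the matched count plays the same role.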

@kliuae-amd can you remove this line of print?

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@tjtanaavllm

LGTM.

@tjtanaavllm tjtanaavllm merged commit e718694 into ROCm:dev/perf Oct 14, 2025
1 of 2 checks passed

tjtanaavllm commented Oct 17, 2025

Add --compilation-config '{"pass_config": {"enable_fusion": true, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"]}'

There are issues on MI300X when using newer AITER; KF could only test this Qwen3-Coder PTPC-FP8 model, and saw improvements.

local-completions (model=EmbeddedLLM/Qwen3-Coder-480B-A35B-Instruct-FP8-Dynamic,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8886|±  |0.0087|
|     |       |strict-match    |     5|exact_match|↑  |0.8650|±  |0.0094|
| Metric | Without Fused RMS Norm | With Fused RMS Norm | Difference | % Change |
|--------|-----------------------:|--------------------:|-----------:|---------:|
| *Overall Performance* | | | | |
| Successful requests | 640 | 640 | 0 | 0% |
| Benchmark duration (s) | 470.49 | 462.86 | -7.63 | -1.62% |
| Request throughput (req/s) | 1.36 | 1.38 | +0.02 | +1.47% |
| *Token Throughput* | | | | |
| Output token throughput (tok/s) | 1392.93 | 1415.90 | +22.97 | +1.65% |
| Peak output token throughput (tok/s) | 2304.00 | 2368.00 | +64.00 | +2.78% |
| Total token throughput (tok/s) | 6268.20 | 6371.54 | +103.34 | +1.65% |
| *Concurrency* | | | | |
| Peak concurrent requests | 71.00 | 75.00 | +4.00 | +5.63% |
| *Time to First Token (TTFT)* | | | | |
| Mean TTFT (ms) | 2281.44 | 2544.96 | +263.52 | +11.55% |
| Median TTFT (ms) | 2014.27 | 2116.78 | +102.51 | +5.09% |
| P99 TTFT (ms) | 11940.64 | 11891.41 | -49.23 | -0.41% |
| *Time Per Output Token (TPOT)* | | | | |
| Mean TPOT (ms) | 43.73 | 42.73 | -1.00 | -2.29% |
| Median TPOT (ms) | 44.62 | 43.40 | -1.22 | -2.73% |
| P99 TPOT (ms) | 46.01 | 45.53 | -0.48 | -1.04% |
| *Inter-token Latency (ITL)* | | | | |
| Mean ITL (ms) | 43.73 | 42.73 | -1.00 | -2.29% |
| Median ITL (ms) | 28.65 | 28.33 | -0.32 | -1.12% |
| P99 ITL (ms) | 685.79 | 672.58 | -13.21 | -1.93% |

@sunway513

Thanks for the PR, but we should start to move all new development to upstream first.
cc @wuhuikx


wuhuikx commented Oct 18, 2025

@sunway513 the corresponding PR for upstream is here vllm-project#26575

Yes, for each feature we will PR directly to upstream. But since the upstream PR process takes too long to merge, we port the upstream PR back to rocm/vllm and combine it with the other upstream PRs for performance verification.

@sunway513

@sunway513 the corresponding PR for upstream is here vllm-project#26575

Yes, for each feature we will PR directly to upstream. But since the upstream PR process takes too long to merge, we port the upstream PR back to rocm/vllm and combine it with the other upstream PRs for performance verification.

Makes sense. We're in the process of moving such usage to the upstream amd_dev branch:
https://github.com/vllm-project/vllm/tree/amd_dev


wuhuikx commented Oct 18, 2025

@sunway513 the corresponding PR for upstream is here vllm-project#26575
Yes, for each feature we will PR directly to upstream. But since the upstream PR process takes too long to merge, we port the upstream PR back to rocm/vllm and combine it with the other upstream PRs for performance verification.

Makes sense. We're in the process of moving such usage to the upstream amd_dev branch: https://github.com/vllm-project/vllm/tree/amd_dev

Got it. We will follow.
