Conversation


@kliuae-amd kliuae-amd commented Oct 13, 2025

Purpose

This PR adds aiter's fused RMSNorm + FP8 quantization kernel, invoked from the rms_norm + quant_fp8 custom fusion pass.
To use this feature, enable aiter with VLLM_ROCM_USE_AITER=1 and set --compilation-config '{"pass_config": {"enable_fusion": true, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"]}' to enable the fusion pass.
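For reference, a minimal NumPy sketch (not the aiter kernel) of what the fused op computes: RMSNorm followed by dynamic per-tensor FP8 quantization. The value fp8_max=448.0 assumes the e4m3fn format, and the actual float8 cast is simulated here by clipping:

```python
import numpy as np

def rmsnorm_quant_fp8_ref(x, weight, eps=1e-6, fp8_max=448.0):
    """Reference sketch: RMSNorm then dynamic per-tensor FP8-style quantization."""
    # RMSNorm: scale each row by its reciprocal RMS, then apply the weight.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    normed = x / rms * weight
    # Dynamic per-tensor scale so the absolute max maps onto the FP8 range.
    scale = np.abs(normed).max() / fp8_max
    # Stand-in for the float8_e4m3fn cast: clip to the representable range.
    q = np.clip(normed / scale, -fp8_max, fp8_max)
    return q, scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = rmsnorm_quant_fp8_ref(x, np.ones(8, dtype=np.float32))
```

The fusion pass replaces the separate rms_norm and quant_fp8 custom ops with a single kernel computing the equivalent of this function, saving one round trip through global memory between the two ops.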

Test Plan

End-to-end test using RedHatAI/Qwen3-14B-FP8-dynamic model

Server command:

VLLM_ROCM_USE_AITER=1 VLLM_USE_V1=1 VLLM_DISABLE_COMPILE_CACHE=1 \
  vllm serve RedHatAI/Qwen3-14B-FP8-dynamic \
  --compilation-config '{"pass_config": {"enable_fusion": true, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"], "cudagraph_capture_sizes": [1,2,4,8,16,24,32,256]}' \
  --trust-remote-code --swap-space 16 --distributed-executor-backend mp

lm_eval command:

lm_eval --model local-completions --tasks gsm8k --model_args model=RedHatAI/Qwen3-14B-FP8-dynamic,base_url=http://localhost:9090/v1/completions --trust_remote_code --num_fewshot 5 --batch_size 128

Benchmark command:

vllm bench serve --backend vllm --model "RedHatAI/Qwen3-14B-FP8-dynamic" --dataset-name random --num-prompts 500 --random-input-len 1000 --random-output-len 1000 --endpoint /v1/completions --random-range-ratio 0.9

Test Result

lm_eval

w/o fusion

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7726|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.8825|±  |0.0089|

w/ fusion

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7665|±  |0.0117|
|     |       |strict-match    |     5|exact_match|↑  |0.8825|±  |0.0089|

Serving benchmark

| Metric | w/o fusion | w/ fusion |
|--------|-----------:|----------:|
| Successful requests | 500 | 500 |
| Benchmark duration (s) | 182.04 | 179.43 |
| Total input tokens | 520558 | 520558 |
| Total generated tokens | 459135 | 448224 |
| Request throughput (req/s) | 2.75 | 2.79 |
| Output token throughput (tok/s) | 2522.18 | 2498.05 |
| Peak output token throughput (tok/s) | 7756.00 | 7762.00 |
| Peak concurrent requests | 500.00 | 500.00 |
| Total token throughput (tok/s) | 5381.78 | 5399.23 |
| Mean TTFT (ms) | 36077.04 | 35873.91 |
| Median TTFT (ms) | 30327.35 | 30536.64 |
| P99 TTFT (ms) | 94438.56 | 93577.47 |
| Mean TPOT (ms) | 171.56 | 182.25 |
| Median TPOT (ms) | 120.75 | 123.33 |
| P99 TPOT (ms) | 936.65 | 987.21 |
| Mean ITL (ms) | 118.17 | 118.68 |
| Median ITL (ms) | 60.30 | 60.00 |
| P99 ITL (ms) | 2173.92 | 2181.85 |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@VllmInductorPass.time_and_log
def __call__(self, graph: fx.Graph):
    self.matched_count = self.patterns.apply(graph)
    print("Matched count:", self.matched_count)
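For intuition about what a pass like this does, here is a hypothetical, dependency-free stand-in (none of these names are vLLM's real API): scan an op sequence, fuse adjacent (rms_norm, quant_fp8) pairs, and report the matched count that the real pass gets back from patterns.apply:

```python
def apply_rmsnorm_quant_fusion(ops):
    """Fuse adjacent ("rms_norm", "quant_fp8") pairs in a flat op list,
    returning the rewritten list and the number of matches."""
    fused, count, i = [], 0, 0
    while i < len(ops):
        if ops[i] == "rms_norm" and i + 1 < len(ops) and ops[i + 1] == "quant_fp8":
            fused.append("rms_norm_quant_fp8_fused")  # stands in for the aiter fused kernel
            count += 1
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused, count
```

In the real pass the matching runs over a torch.fx graph via registered replacement patterns rather than a flat list, but the matched count plays the same role.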

@kliuae-amd can you remove this line of print?

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@tjtanaavllm

LGTM.

@tjtanaavllm tjtanaavllm merged commit e718694 into ROCm:dev/perf Oct 14, 2025
1 of 2 checks passed

tjtanaavllm commented Oct 17, 2025

Add --compilation-config '{"pass_config": {"enable_fusion": true, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"]}'

There are issues on MI300X when using newer AITER; KF could only test this Qwen3-Coder PTPC-FP8 model, and saw improvements.

local-completions (model=EmbeddedLLM/Qwen3-Coder-480B-A35B-Instruct-FP8-Dynamic,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8886|±  |0.0087|
|     |       |strict-match    |     5|exact_match|↑  |0.8650|±  |0.0094|
| Metric | Without Fused RMS Norm | With Fused RMS Norm | Difference | % Change |
|--------|-----------------------:|--------------------:|-----------:|---------:|
| *Overall Performance* | | | | |
| Successful requests | 640 | 640 | 0 | 0% |
| Benchmark duration (s) | 470.49 | 462.86 | -7.63 | -1.62% |
| Request throughput (req/s) | 1.36 | 1.38 | +0.02 | +1.47% |
| *Token Throughput* | | | | |
| Output token throughput (tok/s) | 1392.93 | 1415.90 | +22.97 | +1.65% |
| Peak output token throughput (tok/s) | 2304.00 | 2368.00 | +64.00 | +2.78% |
| Total token throughput (tok/s) | 6268.20 | 6371.54 | +103.34 | +1.65% |
| *Concurrency* | | | | |
| Peak concurrent requests | 71.00 | 75.00 | +4.00 | +5.63% |
| *Time to First Token (TTFT)* | | | | |
| Mean TTFT (ms) | 2281.44 | 2544.96 | +263.52 | +11.55% |
| Median TTFT (ms) | 2014.27 | 2116.78 | +102.51 | +5.09% |
| P99 TTFT (ms) | 11940.64 | 11891.41 | -49.23 | -0.41% |
| *Time Per Output Token (TPOT)* | | | | |
| Mean TPOT (ms) | 43.73 | 42.73 | -1.00 | -2.29% |
| Median TPOT (ms) | 44.62 | 43.40 | -1.22 | -2.73% |
| P99 TPOT (ms) | 46.01 | 45.53 | -0.48 | -1.04% |
| *Inter-token Latency (ITL)* | | | | |
| Mean ITL (ms) | 43.73 | 42.73 | -1.00 | -2.29% |
| Median ITL (ms) | 28.65 | 28.33 | -0.32 | -1.12% |
| P99 ITL (ms) | 685.79 | 672.58 | -13.21 | -1.93% |

@sunway513

Thanks for the PR, but we should start to move all new development to upstream first.
cc @wuhuikx


wuhuikx commented Oct 18, 2025

@sunway513 the corresponding PR for upstream is here vllm-project#26575

Yes, for each feature we will PR directly to upstream. But since the upstream PR process takes too long to merge, we port the upstream PR back to rocm/vllm and combine it with the other upstream PRs for performance verification.

@sunway513

@sunway513 the corresponding PR for upstream is here vllm-project#26575

Yes, for each feature we will PR directly to upstream. But since the upstream PR process takes too long to merge, we port the upstream PR back to rocm/vllm and combine it with the other upstream PRs for performance verification.

Makes sense. We're in the process of moving such usage to the upstream amd_dev branch:
https://github.com/vllm-project/vllm/tree/amd_dev


wuhuikx commented Oct 18, 2025

@sunway513 the corresponding PR for upstream is here vllm-project#26575
Yes, for each feature we will PR directly to upstream. But since the upstream PR process takes too long to merge, we port the upstream PR back to rocm/vllm and combine it with the other upstream PRs for performance verification.

Makes sense. We're in the process of moving such usage to the upstream amd_dev branch: https://github.com/vllm-project/vllm/tree/amd_dev

Got it. We will follow.
