Add activation and mul and per-token dynamic FP8 quant fusion kernels #1771

Open
kliuae wants to merge 11 commits into ROCm:main from EmbeddedLLM:fused_act_mul_token_quant

Conversation

kliuae (Contributor) commented on Jan 6, 2026

Motivation

This PR adds kernels that fuse the activation-and-mul ops with dynamic per-token FP8 quantization, to speed up inference of ptpc-fp8 models.
The supported activation functions are silu, gelu, and gelu with tanh approximation.

Technical Details

Cache the intermediate act-and-mul results to avoid GDS loads during per-token max-scale evaluation and quantization.
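For reference, a minimal unfused PyTorch sketch of the semantics the fused kernel implements. This is illustrative only: the function name is made up, and the choice of torch.float8_e4m3fn as the output dtype is an assumption (the actual kernel may target a different FP8 variant).

    import torch
    import torch.nn.functional as F

    def ref_act_mul_per_token_fp8_quant(x: torch.Tensor, activation: str = "silu"):
        # x: [..., 2 * d]; first half is the gate input, second half the mul operand.
        d = x.shape[-1] // 2
        gate, up = x[..., :d], x[..., d:]
        if activation == "silu":
            act = F.silu(gate)
        elif activation == "gelu":
            act = F.gelu(gate)
        elif activation == "gelu_tanh":
            act = F.gelu(gate, approximate="tanh")
        else:
            raise ValueError(f"unsupported activation: {activation}")
        y = act * up  # intermediate that the fused kernel keeps cached on-chip

        fp8 = torch.float8_e4m3fn          # assumed output dtype for illustration
        fp8_max = torch.finfo(fp8).max
        # Dynamic per-token scale: one scale per row (token).
        amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        scale = amax.float() / fp8_max
        y_q = (y.float() / scale).clamp(-fp8_max, fp8_max).to(fp8)
        return y_q, scale.squeeze(-1)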

Test Plan

Added unit tests for act_and_mul and FP8 per-token quant fusion.
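A sketch of the shape such a unit test can take, reusing ref_act_mul_per_token_fp8_quant from the sketch above. The fused_op argument is a placeholder for the new fused op; its actual name and return convention in aiter may differ.

    import torch

    def check_fused_against_reference(fused_op, num_tokens=32, d=4096):
        # fused_op is a placeholder handle for the fused act-and-mul + per-token FP8 quant op.
        x = torch.randn(num_tokens, 2 * d, dtype=torch.bfloat16, device="cuda")
        ref_q, ref_scale = ref_act_mul_per_token_fp8_quant(x, activation="silu")
        out_q, out_scale = fused_op(x)
        # Compare after dequantization so small FP8 rounding differences are tolerated.
        torch.testing.assert_close(
            out_q.float() * out_scale[:, None].float(),
            ref_q.float() * ref_scale[:, None].float(),
            rtol=2e-2, atol=2e-2,
        )
        torch.testing.assert_close(out_scale.float(), ref_scale.float(),
                                   rtol=1e-3, atol=1e-3)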

Integration-tested as a fusion op in vLLM for end-to-end model inference. Tested on RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic with silu-mul-ptpcfp8 fusion.

Test Result

vLLM End-to-end tests with RedhatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic TP4 on MI300X

lm_eval with ChartQA

Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.89,
    "anywhere_in_answer_relaxed_correctness": 0.892
}

Throughput test

vllm bench serve --port 8088 --backend openai-chat --model RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat --hf-split train --endpoint /v1/chat/completions --max-concurrency 64

without fusion

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  169.15
Total input tokens:                      94327
Total generated tokens:                  114782
Request throughput (req/s):              5.91
Output token throughput (tok/s):         678.60
Peak output token throughput (tok/s):    2110.00
Peak concurrent requests:                121.00
Total token throughput (tok/s):          1236.26
---------------Time to First Token----------------
Mean TTFT (ms):                          6165.98
Median TTFT (ms):                        6321.34
P99 TTFT (ms):                           11041.16
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.59
Median TPOT (ms):                        41.07
P99 TPOT (ms):                           201.45
---------------Inter-token Latency----------------
Mean ITL (ms):                           112.09
Median ITL (ms):                         21.13
P99 ITL (ms):                            1945.71
==================================================

with fusion

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  163.26
Total input tokens:                      94327
Total generated tokens:                  114472
Request throughput (req/s):              6.13
Output token throughput (tok/s):         701.17
Peak output token throughput (tok/s):    2298.00
Peak concurrent requests:                121.00
Total token throughput (tok/s):          1278.95
---------------Time to First Token----------------
Mean TTFT (ms):                          5883.03
Median TTFT (ms):                        6015.88
P99 TTFT (ms):                           10457.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.45
Median TPOT (ms):                        40.10
P99 TPOT (ms):                           140.93
---------------Inter-token Latency----------------
Mean ITL (ms):                           119.03
Median ITL (ms):                         20.26
P99 ITL (ms):                            2053.98
==================================================

Submission Checklist

kliuae added 11 commits January 6, 2026 03:22
@kliuae kliuae requested a review from a team January 6, 2026 05:18
@valarLip valarLip requested a review from zufayu January 6, 2026 08:14
tjtanaa (Contributor) commented on Jan 14, 2026

@zufayu, could you take a look at this PR? Thank you.

zufayu (Contributor) commented on Jan 14, 2026

@kliuae
We would prefer not to add more APIs for fused_silu_quant; instead, add more if/else branches and more arguments in aiter/csrc/kernels/activation_kernels.cu:

  • Extend act_and_mul_kernel & scaled_act_and_mul_kernel to support the fused function.

  • void silu_and_mul(torch::Tensor& out,   // [..., d]
                      torch::Tensor& input, // [..., 2 * d]
                      .....
                      std::optional<xxxx::xxxx> quant_type,
                      .....)
    {
        LAUNCH_ACTIVATION_GATE_KERNEL(aiter::silu_kernel);
    }

No specific op test is added for this function; we use the old op test.

kliuae-amd (Contributor) commented:

Hi @zufayu,

Thank you for the feedback. Really appreciate the suggestions.

I did notice that aiter already implements other activation-quantization fusion kernels (act_mul_and_mxfp4_quant, act_mul_and_fp8_group_quant, fused_silu_mul_fp8_per_tensor_static_quant) as separate APIs for other quantization types. My initial thought was to follow this existing pattern for consistency, so before moving forward I wanted to ask whether adding these new activation per-token quant fusions as separate APIs would be a feasible option, to align with what aiter already has.

That said, I'm happy to consolidate the changes into activation_kernels.cu and into the unquantized silu_and_mul if you prefer that as the way forward. This would involve adding arguments such as quant_type and output_scale. Looking forward to hearing your thoughts.
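To make that option concrete, here is a purely hypothetical sketch of what a consolidated Python-facing signature could look like. Only the argument names quant_type and output_scale come from the discussion above; the enum values and everything else are illustrative placeholders, not aiter's actual API.

    from enum import Enum
    from typing import Optional
    import torch

    class QuantType(Enum):
        NO_QUANT = 0               # current unquantized silu_and_mul behaviour
        FP8_PER_TOKEN_DYNAMIC = 1  # the fusion proposed in this PR
        FP8_PER_TENSOR_STATIC = 2  # existing static per-tensor path

    def silu_and_mul(out: torch.Tensor,                      # [..., d]
                     input: torch.Tensor,                    # [..., 2 * d]
                     quant_type: QuantType = QuantType.NO_QUANT,
                     output_scale: Optional[torch.Tensor] = None) -> None:
        """Single entry point that dispatches to either the plain kernel or a
        fused-quantization kernel based on quant_type; output_scale would
        receive the per-token scales when a dynamic quant mode is selected."""
        raise NotImplementedError("illustrative signature only")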

tjtanaa (Contributor) commented on Feb 12, 2026

@zufayu, any thoughts on @kliuae's question?

