Add activation and mul and per-token dynamic FP8 quant fusion kernels #1771

Open
kliuae wants to merge 11 commits into ROCm:main from EmbeddedLLM:fused_act_mul_token_quant

Conversation

kliuae (Contributor) commented on Jan 6, 2026

Motivation

This PR adds kernels that fuse the activation-and-mul ops with dynamic per-token FP8 quantization, to speed up inference of ptpc-fp8 models.
The supported activation functions are silu, gelu, and gelu with tanh approximation.

Technical Details

Cache the intermediate act-and-mul results to avoid GDS loads during per-token max-scale evaluation and quantization.
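For reference, a minimal unfused PyTorch sketch of the semantics the fused kernel implements. This is illustrative only: the function name is made up, and the choice of torch.float8_e4m3fn as the output dtype is an assumption (the actual kernel may target a different FP8 variant).

    import torch
    import torch.nn.functional as F

    def ref_act_mul_per_token_fp8_quant(x: torch.Tensor, activation: str = "silu"):
        # x: [..., 2 * d]; first half is the gate input, second half the mul operand.
        d = x.shape[-1] // 2
        gate, up = x[..., :d], x[..., d:]
        if activation == "silu":
            act = F.silu(gate)
        elif activation == "gelu":
            act = F.gelu(gate)
        elif activation == "gelu_tanh":
            act = F.gelu(gate, approximate="tanh")
        else:
            raise ValueError(f"unsupported activation: {activation}")
        y = act * up  # intermediate that the fused kernel keeps cached on-chip

        fp8 = torch.float8_e4m3fn          # assumed output dtype for illustration
        fp8_max = torch.finfo(fp8).max
        # Dynamic per-token scale: one scale per row (token).
        amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        scale = amax.float() / fp8_max
        y_q = (y.float() / scale).clamp(-fp8_max, fp8_max).to(fp8)
        return y_q, scale.squeeze(-1)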

Test Plan

Added unit tests for act_and_mul and FP8 per-token quant fusion.
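A sketch of the shape such a unit test can take, reusing ref_act_mul_per_token_fp8_quant from the sketch above. The fused_op argument is a placeholder for the new fused op; its actual name and return convention in aiter may differ.

    import torch

    def check_fused_against_reference(fused_op, num_tokens=32, d=4096):
        # fused_op is a placeholder handle for the fused act-and-mul + per-token FP8 quant op.
        x = torch.randn(num_tokens, 2 * d, dtype=torch.bfloat16, device="cuda")
        ref_q, ref_scale = ref_act_mul_per_token_fp8_quant(x, activation="silu")
        out_q, out_scale = fused_op(x)
        # Compare after dequantization so small FP8 rounding differences are tolerated.
        torch.testing.assert_close(
            out_q.float() * out_scale[:, None].float(),
            ref_q.float() * ref_scale[:, None].float(),
            rtol=2e-2, atol=2e-2,
        )
        torch.testing.assert_close(out_scale.float(), ref_scale.float(),
                                   rtol=1e-3, atol=1e-3)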

Integration-tested as a fusion op in vLLM for end-to-end model inference. Tested on RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic with silu-mul-ptpcfp8 fusion.

Test Result

vLLM End-to-end tests with RedhatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic TP4 on MI300X

lm_eval with ChartQA

Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.89,
    "anywhere_in_answer_relaxed_correctness": 0.892
}

Throughput test

vllm bench serve --port 8088 --backend openai-chat --model RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat --hf-split train --endpoint /v1/chat/completions --max-concurrency 64

without fusion

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  169.15
Total input tokens:                      94327
Total generated tokens:                  114782
Request throughput (req/s):              5.91
Output token throughput (tok/s):         678.60
Peak output token throughput (tok/s):    2110.00
Peak concurrent requests:                121.00
Total token throughput (tok/s):          1236.26
---------------Time to First Token----------------
Mean TTFT (ms):                          6165.98
Median TTFT (ms):                        6321.34
P99 TTFT (ms):                           11041.16
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.59
Median TPOT (ms):                        41.07
P99 TPOT (ms):                           201.45
---------------Inter-token Latency----------------
Mean ITL (ms):                           112.09
Median ITL (ms):                         21.13
P99 ITL (ms):                            1945.71
==================================================

with fusion

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  163.26
Total input tokens:                      94327
Total generated tokens:                  114472
Request throughput (req/s):              6.13
Output token throughput (tok/s):         701.17
Peak output token throughput (tok/s):    2298.00
Peak concurrent requests:                121.00
Total token throughput (tok/s):          1278.95
---------------Time to First Token----------------
Mean TTFT (ms):                          5883.03
Median TTFT (ms):                        6015.88
P99 TTFT (ms):                           10457.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          43.45
Median TPOT (ms):                        40.10
P99 TPOT (ms):                           140.93
---------------Inter-token Latency----------------
Mean ITL (ms):                           119.03
Median ITL (ms):                         20.26
P99 ITL (ms):                            2053.98
==================================================

Submission Checklist

kliuae added 11 commits January 6, 2026 03:22
@kliuae kliuae requested a review from a team January 6, 2026 05:18
@valarLip valarLip requested a review from zufayu January 6, 2026 08:14
tjtanaa (Contributor) commented on Jan 14, 2026

@zufayu, could you take a look at this PR? Thank you.

zufayu (Contributor) commented on Jan 14, 2026

@kliuae
We would prefer not to add more APIs for fused_silu_quant; instead, add more if/else branches and more arguments in aiter/csrc/kernels/activation_kernels.cu:

  • Extend act_and_mul_kernel & scaled_act_and_mul_kernel to support the fused function.

  • void silu_and_mul(torch::Tensor& out,   // [..., d]
                      torch::Tensor& input, // [..., 2 * d]
                      .....
                      std::optional<xxxx::xxxx> quant_type,
                      .....)
    {
        LAUNCH_ACTIVATION_GATE_KERNEL(aiter::silu_kernel);
    }

No specific op test is added for this function; we use the old op test.

kliuae-amd (Contributor) commented:

Hi @zufayu,

Thank you for the feedback. Really appreciate the suggestions.

I did notice that aiter already implements other activation-quantization fusion kernels (act_mul_and_mxfp4_quant, act_mul_and_fp8_group_quant, fused_silu_mul_fp8_per_tensor_static_quant) as separate APIs for other quantization types. My initial thought was to follow this existing pattern for consistency, so before moving forward I wanted to ask whether adding these new activation per-token quant fusions as separate APIs would be a feasible option, to align with what aiter already has.

That said, I'm happy to consolidate the changes into activation_kernels.cu and into the unquantized silu_and_mul if you prefer that as the way forward. This would involve adding arguments such as quant_type and output_scale. Looking forward to hearing your thoughts.
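To make that option concrete, here is a purely hypothetical sketch of what a consolidated Python-facing signature could look like. Only the argument names quant_type and output_scale come from the discussion above; the enum values and everything else are illustrative placeholders, not aiter's actual API.

    from enum import Enum
    from typing import Optional
    import torch

    class QuantType(Enum):
        NO_QUANT = 0               # current unquantized silu_and_mul behaviour
        FP8_PER_TOKEN_DYNAMIC = 1  # the fusion proposed in this PR
        FP8_PER_TENSOR_STATIC = 2  # existing static per-tensor path

    def silu_and_mul(out: torch.Tensor,                      # [..., d]
                     input: torch.Tensor,                    # [..., 2 * d]
                     quant_type: QuantType = QuantType.NO_QUANT,
                     output_scale: Optional[torch.Tensor] = None) -> None:
        """Single entry point that dispatches to either the plain kernel or a
        fused-quantization kernel based on quant_type; output_scale would
        receive the per-token scales when a dynamic quant mode is selected."""
        raise NotImplementedError("illustrative signature only")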

tjtanaa (Contributor) commented on Feb 12, 2026

@zufayu, any thoughts on @kliuae's question?

