Add activation and mul and per-token dynamic FP8 quant fusion kernels #1771
Conversation
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@zufayu, could you help take a look at this PR? Thank you.
@kliuae
No specific op test was added for this function. We use the old op test.
Hi @zufayu, thank you for the feedback. I really appreciate the suggestions. I did notice that aiter already has other activation-quantization fusion kernels (act_mul_and_mxfp4_quant, act_mul_and_fp8_group_quant, fused_silu_mul_fp8_per_tensor_static_quant) for other quantization types, implemented as separate APIs within the library. My initial thought was to follow this existing pattern to maintain consistency, and before moving forward I was wondering whether adding these new activation per-token quant fusions as separate APIs would be a feasible option, to align with what aiter already has. That said, I'm happy to consolidate the changes into …
Motivation
This PR adds fusion kernels that fuse the act-and-mul ops with dynamic per-token FP8 quantization to speed up ptpc-fp8 model inference.
The supported activation functions are silu, gelu, and gelu with tanh approximation.
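For reference, a minimal unfused PyTorch sketch of the intended semantics follows; the FP8 dtype (e4m3fnuz) and the gate-first input layout are assumptions here, and the fused kernels in this PR perform these steps in a single pass:

```python
import torch
import torch.nn.functional as F

# FP8 representable maximum; e4m3fnuz is assumed as the FP8 format on MI300-class GPUs.
FP8_MAX = torch.finfo(torch.float8_e4m3fnuz).max

def ref_silu_mul_per_token_fp8(x: torch.Tensor):
    """Unfused reference: silu(x1) * x2 followed by dynamic per-token FP8 quant.

    x has shape (..., 2 * d); the first half is assumed to be the activated (gate) half.
    """
    d = x.shape[-1] // 2
    h = F.silu(x[..., :d]) * x[..., d:]
    # Dynamic per-token (per-row) scale from the row-wise absolute maximum.
    scale = (h.abs().amax(dim=-1, keepdim=True).float() / FP8_MAX).clamp(min=1e-10)
    q = (h / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fnuz)
    return q, scale
```

The gelu and gelu_tanh variants only swap the activation applied to the first half (e.g. F.gelu(x1) or F.gelu(x1, approximate="tanh")).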
Technical Details
Cache the intermediate act-and-mul results to avoid reloading them from GDS during per-token max-scale evaluation and quantization.
Test Plan
Added unit tests for act_and_mul and FP8 per-token quant fusion (see the sketch after this list).
Integration test as a fusion op in vLLM for end-to-end model inference. Tested on RedhatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic with silu-mul-ptpcfp8 fusion.
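A unit-test comparison along these lines could check the fused op against the unfused reference above; the fused op's name, signature, and return shapes are illustrative assumptions, not the actual aiter API:

```python
import torch

# Hypothetical entry point for illustration; the real aiter API name may differ.
# from aiter import act_mul_and_per_token_fp8_quant as fused_op

def check_silu_mul_per_token_fp8(fused_op):
    x = torch.randn(32, 2 * 4096, dtype=torch.bfloat16, device="cuda")
    q_ref, s_ref = ref_silu_mul_per_token_fp8(x)
    # Assumed to return (fp8 tensor, per-token scale of shape (rows, 1)).
    q_out, s_out = fused_op(x)
    torch.testing.assert_close(s_out.float(), s_ref, rtol=1e-3, atol=1e-3)
    # Compare dequantized values so FP8 rounding differences stay within tolerance.
    torch.testing.assert_close(q_out.float() * s_out.float(),
                               q_ref.float() * s_ref,
                               rtol=2e-2, atol=2e-2)
```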
Test Result
vLLM end-to-end tests with RedhatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic, TP4 on MI300X, lm_eval with ChartQA.
Throughput test (without fusion vs. with fusion).
Submission Checklist