
@gmagogsfm gmagogsfm commented Nov 20, 2025

  • This prototype implements a naive silu_mul_fp8 kernel and integrates it into vLLM's custom fusion pass as a custom op
  • Numerical accuracy is verified
  • On average, it is about 4x slower than vLLM's custom silu_mul_fp8 CUDA kernel
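
For readers unfamiliar with the fused op, here is a minimal pure-Python reference sketch of what a silu_mul_fp8 computation is generally understood to do. It is not the kernel from this PR. It assumes the usual convention that the input row is split in half into gate and up parts, that the output is `silu(gate) * up` divided by a static per-tensor `scale`, and that the result is saturated and rounded to the FP8 E4M3 (finite-only) grid with a maximum of 448. The names `silu`, `round_to_e4m3`, and `silu_mul_fp8` are illustrative, and results are returned as Python floats that sit on the FP8 grid rather than as packed 8-bit values.

```python
import math

FP8_E4M3_MAX = 448.0      # largest finite E4M3 (fn variant) value
FP8_E4M3_MIN_EXP = -6     # exponent of the smallest normal value


def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x), written to avoid overflow in exp()."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def round_to_e4m3(v: float) -> float:
    """Saturate v to [-448, 448] and round it to the nearest E4M3 value."""
    v = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))
    if v == 0.0:
        return 0.0
    # frexp returns v = m * 2**e with 0.5 <= |m| < 1, so floor(log2|v|) = e - 1.
    _, e = math.frexp(v)
    exponent = max(e - 1, FP8_E4M3_MIN_EXP)   # subnormals share the min exponent
    step = 2.0 ** (exponent - 3)              # 3 mantissa bits
    return round(v / step) * step             # round() is round-half-to-even


def silu_mul_fp8(row: list[float], scale: float) -> list[float]:
    """Assumed semantics: quantize(silu(row[:d]) * row[d:] / scale), d = len(row) // 2."""
    if len(row) % 2 != 0:
        raise ValueError("row length must be even (gate half + up half)")
    d = len(row) // 2
    gate, up = row[:d], row[d:]
    return [round_to_e4m3(silu(g) * u / scale) for g, u in zip(gate, up)]
```

A reference like this is only useful for checking numerics on small inputs; a fused kernel performs the same steps per element without materializing the intermediate `silu(gate) * up` tensor.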


Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>