Add CUDA kernel for per_token_group_quant_fp8 #14175
Conversation
where does the 512.f come from?
I copied this from `dynamic_per_token_scaled_fp8_quant_kernel` (vllm/csrc/quantization/fp8/common.cu, line 33 at f89978a):

    float const min_scaling_factor = 1.0f / (FP8_E4M3_MAX * 512.f);
Signed-off-by: mgoin <mgoin64@gmail.com>
Force-pushed 7905345 to 314d1a8
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Seeing a minor boost over #14476 when combined with that.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: mgoin <mgoin64@gmail.com>
Currently failing for some shapes.