[Perf] Using `__nv_fp8_e4m3` instead of `c10::e4m3` for `per_token_group_quant`
#21867
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Code Review
This pull request improves the performance of per-token-group FP8 quantization by replacing the software-emulated `c10::e4m3` type with the native CUDA `__nv_fp8_e4m3` type. The benchmark results clearly show the benefit of this change. The review includes one high-severity suggestion: verify that removing the header file doesn't introduce any regressions.
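To make the change concrete, here is a minimal, hypothetical sketch (not the actual vLLM kernel; the function name, scaling scheme, and indexing are assumptions for illustration) of a store path that writes through the native CUDA fp8 type:

```cuda
// Hypothetical sketch, not the vLLM kernel: quantize each element with a
// per-group inverse scale and store through the native CUDA E4M3 type.
#include <cuda_fp8.h>

__global__ void quant_store_sketch(const float* __restrict__ in,
                                   __nv_fp8_e4m3* __restrict__ out,
                                   const float* __restrict__ inv_scales,
                                   int group_size, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // One inverse scale per contiguous group of `group_size` elements.
    // The __nv_fp8_e4m3 constructor lowers to a hardware convert
    // instruction on SM89+/Hopper, whereas c10's type converts via
    // software bit manipulation.
    out[i] = __nv_fp8_e4m3(in[i] * inv_scales[i / group_size]);
  }
}
```

On recent hardware that single convert instruction is presumably where the speedup over the emulated path comes from.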
Cool! Could you run the kernel benchmark on a more standard range of M, like 2^1 to 2^14, and an lm-eval check? I think Qwen3 FP8 block should be fine to trigger this.
Sure~ Updated above.
@mgoin Updated using 8192 now |
Purpose
sgl-project/sglang#8499 shows we can get a performance improvement through `__nv_fp8_e4m3`. This PR does that. It doesn't change dynamic/static fp8 quant yet, since we will update it together in #19630.
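As an aside (not from the PR), the native conversion this relies on can be spot-checked on the host through the same `cuda_fp8.h` API; the values below are standard E4M3 properties, not results from this PR:

```cuda
// Host-side spot check of the native E4M3 conversion path (illustrative;
// not part of this PR). 448 is the E4M3 max finite value; larger inputs
// saturate under __NV_SATFINITE.
#include <cuda_fp8.h>
#include <cuda_fp16.h>
#include <cstdio>

int main() {
  const float samples[] = {0.0f, 1.0f, 448.0f, 1e6f};
  for (float x : samples) {
    __nv_fp8_storage_t bits =
        __nv_cvt_float_to_fp8(x, __NV_SATFINITE, __NV_E4M3);
    // Round-trip through half to inspect the value the fp8 byte encodes.
    float back = __half2float(__nv_cvt_fp8_to_halfraw(bits, __NV_E4M3));
    printf("%g -> 0x%02x -> %g\n", x, (unsigned)bits, back);
  }
  return 0;
}
```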
Test
(vllm-user-6) vllm-user-6@centralia:~/vllm/benchmarks/kernels$ python benchmark_per_token_group_quant.py

Acc Test
lm_eval --model vllm --model_args "pretrained=Qwen/Qwen3-30B-A3B-FP8,max_model_len=32768,enforce_eager=True" --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto

main:
Now: