[Bugfix] Enable padded FP4 quantization #25947
Conversation
Code Review
This pull request addresses a critical bug in padded FP4 quantization by ensuring that allocated tensors for quantized output and scales are zero-initialized. The previous use of torch.empty could lead to uninitialized garbage values in padded regions, causing incorrect and non-deterministic behavior. By switching to torch.zeros, the padded areas are correctly filled with zeros. Additionally, assertions that were incompatible with padded tensor shapes in flashinfer_scaled_fp4_mm have been correctly removed. These changes are essential for enabling models that require padding for FP4 quantization, such as Nemotron-Nano-9B-v2 with tensor parallelism. The fixes are correct and well-targeted.
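To make the failure mode concrete, here is a minimal sketch of the difference between the two allocations (the sizes and the 128-row rounding are illustrative assumptions, not the exact vLLM code):

import torch

# Illustrative sizes; the real shapes come from the FP4 quantization path.
m, n, block_size = 100, 256, 16
rounded_m = ((m + 127) // 128) * 128  # pad the row count up to a multiple of 128

# The quantization kernel only writes the first m rows; the remaining
# rounded_m - m "padding" rows are still read by the downstream GEMM.
scales_empty = torch.empty((rounded_m, n // block_size), dtype=torch.uint8)
scales_zeros = torch.zeros((rounded_m, n // block_size), dtype=torch.uint8)

scales_empty[:m] = 1  # stand-in for the kernel writing the real scales
scales_zeros[:m] = 1

# The padded rows of the empty allocation hold whatever bytes were already in
# memory, so results can change from run to run; the zeros allocation is stable.
print(scales_empty[m:].unique())  # arbitrary, nondeterministic values
print(scales_zeros[m:].unique())  # tensor([0], dtype=torch.uint8)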
Force-pushed from f733f88 to 3298b91
vllm/_custom_ops.py (outdated)
  # Two fp4 values will be packed into an uint8.
- output = torch.empty((m, n // 2), device=device, dtype=torch.uint8)
+ output = torch.zeros((m, n // 2), device=device, dtype=torch.uint8)
When a tensor of the required output shape is allocated, why do we also need to initialize it with zeros? The divisibility check is only on the reducing dimension; could you clarify this?
Did it for consistency's sake, but you are correct, this allocation can stay as torch.empty. I'll revert this line.
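To spell out the distinction from this thread, a short sketch (illustrative shapes, not the actual vLLM allocation code): the packed output is allocated at exactly the shape the kernel fills, so torch.empty never exposes uninitialized data there, whereas a buffer padded beyond what the kernel writes has to start at zero.

import torch

m, n = 100, 256  # illustrative sizes

# Exact output shape: every element is overwritten by the quantization kernel,
# so uninitialized memory is never observed and torch.empty is safe.
output = torch.empty((m, n // 2), dtype=torch.uint8)  # two FP4 values per uint8

# Padded scales: rows beyond m are never written by the kernel but are still
# consumed downstream, so they must be zero-initialized.
rounded_m = ((m + 127) // 128) * 128
scales = torch.zeros((rounded_m, n // 16), dtype=torch.uint8)  # one scale per 16-element block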
Thanks for the PR! Could you also post lm_eval results for any other FP4 model to ensure that previous paths don't break?
Force-pushed from 3298b91 to 227f836
Ran GSM8K on … and got the following results: … Before my changes, the results are: …
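For anyone wanting to reproduce the GSM8K check above, a rough sketch using lm-evaluation-harness with the vLLM backend might look like this (the model id and TP size are placeholders, and argument names may differ between lm_eval versions):

import lm_eval

# Placeholders: substitute a real FP4-quantized checkpoint and the TP size under test.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=<fp4-model-id>,tensor_parallel_size=2",
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])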
This comment makes it seem like we should still be checking or asserting shapes at the kernel dispatch level to guide users, but we can leave that for future work if this fix is needed first.
Purpose
This PR fixes an issue with padded FP4 quantization: when a tensor that requires padding is quantized to FP4, the padded regions of the allocated output and scale tensors are never overwritten, so they retain whatever values the allocation happened to contain.
This issue prevents running NVIDIA's new Nemotron-Nano-9B-v2 with TP2, as some of the tensor sizes in that configuration are not evenly divisible by 128.
Note that this specific model doesn't work with TP>2, and this PR doesn't solve the issues that come up in those scenarios. For TP4 the failure comes from inside the FP4 GEMM kernel, which expects a matrix dimension to be divisible by 32, and for TP8 it happens while creating the model weights, where some layers' input features are not divisible by 16. The same issues occur in TensorRT-LLM as well.
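As a rough illustration of the divisibility constraints described above (the dimension value below is made up, and the checks only mirror the failure modes listed here, not the actual kernel code):

def shard_report(dim: int, tp: int) -> dict:
    """Check a hypothetical per-rank shard against the constraints described above."""
    shard = dim // tp
    return {
        "tp": tp,
        "shard": shard,
        "needs_128_padding": shard % 128 != 0,    # the case this PR fixes in the quantization path
        "gemm_divisible_by_32": shard % 32 == 0,  # TP4 failure mode inside the FP4 GEMM kernel
        "weights_divisible_by_16": shard % 16 == 0,  # TP8 failure mode at weight creation
    }

for tp in (1, 2, 4, 8):
    print(shard_report(4480, tp))  # 4480 is a made-up dimension, not taken from the model config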
Test Plan
No need for new tests, as tests are already in place to verify FP4 quantization, both with and without padding.
Test Result
All existing FP4 quantization tests passed locally.