[Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging #25895
Conversation
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Code Review
This pull request introduces important fixes for FP8 MoE, particularly for TRT-LLM latency kernels. The changes correctly address an accuracy issue by ensuring e_score_correction_bias has the expected dtype, and fix a critical control flow bug by adding an early return after the MoE operation is complete. Additionally, the logic for selecting between FlashInfer, DeepGEMM, and Cutlass kernels is improved to prevent conflicts, disabling other kernels when FlashInfer MoE is active. The logging has also been updated to be more informative about which kernel is being used. The changes are well-structured and significantly improve the correctness and robustness of the FP8 MoE implementation.
e_score_correction_bias = (e_score_correction_bias.to(x.dtype)
                           if e_score_correction_bias is not None else None)
return torch.ops.vllm.flashinfer_fused_moe_blockscale_fp8(
This is a great fix that addresses two important issues:
- Casting `e_score_correction_bias` to `x.dtype` resolves a potential dtype mismatch that could lead to accuracy problems, which is critical for correctness.
- Adding the `return` statement corrects a significant control flow bug. Previously, the code would fall through and execute `select_experts` and other logic even after the complete MoE operation had been performed by `flashinfer_fused_moe_blockscale_fp8`. The early return ensures the function exits correctly.

This change is crucial for both the correctness and the control flow of the FP8 MoE path.
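For reference, a minimal sketch of the corrected control flow. The op name, the cast, and the early return come from the diff above; the enclosing function, its parameters, and the elided kernel arguments are assumptions for illustration only:

```python
import torch


def flashinfer_blockscale_fp8_moe(x: torch.Tensor, e_score_correction_bias):
    """Sketch only: shows the dtype cast plus the early return."""
    # Fix 1: match the bias dtype to the activation dtype so the kernel is
    # not fed a mismatched-precision bias (the source of the accuracy issue).
    e_score_correction_bias = (e_score_correction_bias.to(x.dtype)
                               if e_score_correction_bias is not None
                               else None)
    # Fix 2: return the kernel output directly. The op performs the complete
    # MoE, so falling through into select_experts() afterwards was a bug.
    return torch.ops.vllm.flashinfer_fused_moe_blockscale_fp8(
        ...  # real argument list elided; see the diff above
    )
```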
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
LGTM
elif not self.block_quant:
    logger.warning_once("Model is not block quantized. Not using"
                        " DeepGemm kernels")
elif self.flashinfer_moe_backend:
    logger.info_once("DeepGemm disabled: FlashInfer MOE is"
                     " enabled.")
Note for future self: we should clean up these logs now that VLLM_USE_DEEP_GEMM=1 by default
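To keep the intended precedence clear, here is a hedged sketch of the selection order implied above: FlashInfer MoE wins, DeepGEMM applies only to block-quantized models, and Cutlass is the fallback. The helper name and boolean parameters are hypothetical, not the actual vLLM method; only the two log messages come from the diff.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_fp8_moe_backend(flashinfer_moe_backend: bool,
                            block_quant: bool,
                            allow_deep_gemm: bool) -> str:
    """Hypothetical helper mirroring the priority in the snippet above."""
    if flashinfer_moe_backend:
        # FlashInfer MoE takes precedence, so DeepGEMM is switched off and
        # the two kernels never conflict.
        logger.info("DeepGemm disabled: FlashInfer MOE is enabled.")
        return "flashinfer"
    if allow_deep_gemm and block_quant:
        return "deepgemm"
    if not block_quant:
        logger.warning("Model is not block quantized. Not using DeepGemm kernels")
    return "cutlass"
```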
else:
    assert (not renormalize
            and custom_routing_function is not None)
    result = apply_flashinfer_per_tensor_scale_fp8(
Is this a bug too? It seems like we need to return here rather than write to `result`.
Yeah, it can potentially be a bug. I haven't tested that path, so I was hesitant to change it.
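If that per-tensor path does turn out to need the same treatment, the change would presumably mirror the block-scale fix, i.e. return the kernel output instead of assigning it. A fragment-style sketch only; `apply_flashinfer_per_tensor_scale_fp8` is the function from the diff (its import and argument list are omitted here), and nothing below has been validated:

```python
else:
    assert (not renormalize
            and custom_routing_function is not None)
    # Possible follow-up (untested): return the kernel output directly
    # instead of assigning it to `result` and falling through.
    return apply_flashinfer_per_tensor_scale_fp8(
        ...  # argument list elided; see the diff above
    )
```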
Purpose
Fixes #25189: accuracy issues for TRTLLM DSR1 Latency kernels
These bugs were introduced in #23991 and #23640.
This PR also fixes incorrect log messages about which kernels are used: when FlashInfer MoE is enabled, DeepGEMM is automatically disabled for SM100.
Test Plan
LM Eval with `VLLM_USE_FLASHINFER_MOE_FP8=1` and `VLLM_FLASHINFER_MOE_BACKEND="latency"`.
Test Result
Essential Elements of an Effective PR Description Checklist
- `supported_models.md` and `examples` for a new model.