
Conversation

@pavanimajety (Collaborator) commented on Sep 29, 2025

Purpose

Fixes #25189: accuracy issues for the TRTLLM DSR1 latency kernels. The bugs were introduced in #23991 and #23640.

Also fixes incorrect log messages about which kernels are used: when FlashInfer MOE is enabled, DeepGemm is automatically disabled for SM100.

(EngineCore_DP0 pid=101213) (Worker_TP3 pid=101225) INFO 09-29 10:14:46 [fp8.py:454] Detected Blackwell GPUs, using FlashInfer TensorRT-LLM kernels for FP8 MOE.
(EngineCore_DP0 pid=101213) (Worker_TP3 pid=101225) INFO 09-29 10:14:46 [fp8.py:468] DeepGemm disabled: FlashInfer MOE is enabled.
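For context, a minimal sketch (not the actual fp8.py code) of the precedence that produces the two log lines above; the attribute names `block_quant` and `flashinfer_moe_backend` follow the diff discussed in the review below, everything else is simplified:

```python
import logging
from typing import Optional

logger = logging.getLogger("fp8")


def deep_gemm_allowed(is_blackwell: bool,
                      flashinfer_moe_backend: Optional[str],
                      block_quant: bool) -> bool:
    """Return True if DeepGemm may be used for the FP8 MoE path."""
    if is_blackwell and flashinfer_moe_backend is not None:
        logger.info("Detected Blackwell GPUs, using FlashInfer "
                    "TensorRT-LLM kernels for FP8 MOE.")
    if not block_quant:
        logger.warning("Model is not block quantized. Not using"
                       " DeepGemm kernels")
        return False
    if flashinfer_moe_backend is not None:
        # FlashInfer MOE takes precedence, so DeepGemm is disabled.
        logger.info("DeepGemm disabled: FlashInfer MOE is enabled.")
        return False
    return True


# Example: Blackwell + latency backend + block-quantized model
# deep_gemm_allowed(True, "latency", True)  # -> False
```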

Test Plan

LM Eval with VLLM_USE_FLASHINFER_MOE_FP8=1 and VLLM_FLASHINFER_MOE_BACKEND="latency"

root@umbriel-b200-017:/workspace/scratch-pmaj-1/gh-pm-vllm# ./run_lm_eval.sh                                                                        
+ export FORCE_NUM_KV_SPLITS=1                                                                                                                      
+ FORCE_NUM_KV_SPLITS=1                                                                                                                             
+ export VLLM_USE_FLASHINFER_MOE_FP8=1                                                                                                              
+ VLLM_USE_FLASHINFER_MOE_FP8=1                                                                                                                     
+ export VLLM_FLASHINFER_MOE_BACKEND=latency                                                                                                        
+ VLLM_FLASHINFER_MOE_BACKEND=latency                                                                                                               
+ echo 'Running lm_eval with:'                                                                                                                      
Running lm_eval with:                                                                                                                               
                     
  Model: /models/DSR1-FP8/models--deepseek-ai--DeepSeek-R1-0528/snapshots/4236a6af538feda4548eca9ab308586007567f52/                                 
  Tensor Parallel Size: 8                                                                                                                           
  Quantization: fp8                                                                                                                                 
  GPU Memory Utilization: 0.90                                                                                                                      
                                                                      
+ lm_eval --model vllm --model_args pretrained=/models/DSR1-FP8/models--deepseek-ai--DeepSeek-R1-0528/snapshots/4236a6af538feda4548eca9ab308586007567f52/,quantization=fp8,tensor_parallel_size=8,gpu_memory_utilization=0.90,add_bos_token=True --gen_kwargs temperature=0.0,max_gen_toks=32768 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size 200 --limit 1319 --output_path results/4236a6af538feda4548eca9ab308586007567f52-gsm8k-1319samples_20250929_101419 --log_samples

Test Result

vllm (pretrained=/models/DSR1-FP8/models--deepseek-ai--DeepSeek-R1-0528/snapshots/4236a6af538feda4548eca9ab308586007567f52/,quantization=fp8,tensor_parallel_size=8,gpu_memory_utilization=0.90,add_bos_token=True,trust_remote_code=True), gen_kwargs: (temperature=0.0,max_gen_toks=32768), limit: 1319.0, num_fewshot: 5, batch_size: 200
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9666|±  |0.0049|
|     |       |strict-match    |     5|exact_match|↑  |0.9651|±  |0.0051|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • [N/A] (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Pavani Majety <pmajety@nvidia.com>
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces important fixes for FP8 MoE, particularly for TRT-LLM latency kernels. The changes correctly address an accuracy issue by ensuring e_score_correction_bias has the expected dtype, and fix a critical control flow bug by adding an early return after the MoE operation is complete. Additionally, the logic for selecting between FlashInfer, DeepGEMM, and Cutlass kernels is improved to prevent conflicts, disabling other kernels when FlashInfer MoE is active. The logging has also been updated to be more informative about which kernel is being used. The changes are well-structured and significantly improve the correctness and robustness of the FP8 MoE implementation.

Comment on lines 944 to 946
e_score_correction_bias = (e_score_correction_bias.to(x.dtype)
                           if e_score_correction_bias is not None else None)
return torch.ops.vllm.flashinfer_fused_moe_blockscale_fp8(

critical

This is a great fix that addresses two important issues:

  1. Casting e_score_correction_bias to x.dtype resolves a potential dtype mismatch that could lead to accuracy problems, which is critical for correctness.
  2. Adding the return statement corrects a significant control flow bug. Previously, the code would fall through and execute select_experts and other logic even after the complete MoE operation was performed by flashinfer_fused_moe_blockscale_fp8. This early return ensures the function exits correctly.

This change is crucial for both correctness and logic of the FP8 MoE path.
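For illustration, a condensed, self-contained sketch of the two fixes described above; the real vLLM function takes many more arguments, and the stub below only stands in for `torch.ops.vllm.flashinfer_fused_moe_blockscale_fp8`:

```python
import torch


def flashinfer_fused_moe_blockscale_fp8_stub(x, bias):
    # Stand-in for torch.ops.vllm.flashinfer_fused_moe_blockscale_fp8.
    return x


def apply_blockscale_fp8_moe(x: torch.Tensor,
                             e_score_correction_bias,
                             use_flashinfer: bool) -> torch.Tensor:
    if use_flashinfer:
        # Fix 1: cast the routing bias to the activation dtype so the
        # kernel does not see a mixed-dtype correction term.
        e_score_correction_bias = (e_score_correction_bias.to(x.dtype)
                                   if e_score_correction_bias is not None
                                   else None)
        # Fix 2: return immediately; previously control fell through to
        # select_experts even though the MoE op had already run.
        return flashinfer_fused_moe_blockscale_fp8_stub(
            x, e_score_correction_bias)
    # ... generic select_experts path would continue here ...
    return x
```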

@alexm-redhat (Collaborator) left a comment


LGTM

@mgoin added the ready and bug labels on Sep 29, 2025
Comment on lines 463 to +468
  elif not self.block_quant:
-     logger.warning_once("Model is not block quantized. Not using "
-                         "DeepGemm kernels")
+     logger.warning_once("Model is not block quantized. Not using"
+                         " DeepGemm kernels")
+ elif self.flashinfer_moe_backend:
+     logger.info_once("DeepGemm disabled: FlashInfer MOE is"
+                      " enabled.")
A Member commented:

Note for future self: we should clean up these logs now that VLLM_USE_DEEP_GEMM=1 by default

else:
    assert (not renormalize
            and custom_routing_function is not None)
    result = apply_flashinfer_per_tensor_scale_fp8(
A Member commented:

Is this a bug too? It seems like we need to return here rather than write to result

@pavanimajety (Collaborator, Author) replied:

Yeah, it could potentially be a bug; I haven't tested that path, so I was hesitant to change it.
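For illustration only, a hypothetical and untested sketch of what the suggested early return could look like; it is not part of this PR, and the stub merely stands in for vLLM's apply_flashinfer_per_tensor_scale_fp8:

```python
def apply_flashinfer_per_tensor_scale_fp8_stub(x):
    # Stand-in for vLLM's apply_flashinfer_per_tensor_scale_fp8.
    return x


def per_tensor_fp8_branch(x, renormalize, custom_routing_function):
    assert (not renormalize
            and custom_routing_function is not None)
    result = apply_flashinfer_per_tensor_scale_fp8_stub(x)
    # Returning here, instead of falling through to the generic path,
    # is the change the review comment suggests; it remains untested.
    return result
```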

@mgoin merged commit ef28354 into vllm-project:main on Sep 30, 2025
49 of 50 checks passed
Commits referencing this PR were later pushed by pdasigi (Oct 2, 2025), yewentao256 (Oct 3, 2025), tomeras91 (Oct 6, 2025), xuebwang-amd (Oct 10 and Oct 24, 2025), lywa1998 (Oct 20, 2025), and alhridoy (Oct 24, 2025).

Labels

bug, ready


Development

Successfully merging this pull request may close these issues.

[Bug]: B200 FlashInfer FP8 MoE low-latency - incorrect results
