
Conversation

Contributor

@varun-sundar-rabindranath varun-sundar-rabindranath commented Oct 31, 2025

Purpose

DeepGEMM requires the activation and weight scales to be in a specific format. When the scales are not provided in the desired format, DeepGEMM transforms them itself, which is usually very slow.

On H100, DeepGEMM needs the scales to be ColumnMajor and in float32.
On B200, DeepGEMM needs the scales to be ColumnMajor but in a packed E8M0 format.

Note that the H100 case is already handled on main. This PR adds partial support for the B200 case. Concretely:

  • Import transform_sf_into_required_layout from DeepGEMM to perform the weight-scales transformation. This function handles both SM90 and SM100 internally and can be shared by both.
  • The DeepEP low-latency dispatch supports dispatching the activation scales in packed E8M0 format. This PR enables that (see the sketch after this list).
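
For intuition, here is a minimal sketch of what the packed E8M0 layout amounts to, in plain PyTorch. It assumes the incoming scales are already powers of two and packs four UE8M0 exponent bytes per int32 (native little-endian byte order is an assumption); it only illustrates the data layout and is not DeepGEMM's transform_sf_into_required_layout, which additionally handles the ColumnMajor layout and alignment requirements.

```python
import torch


def pack_ue8m0(scales_f32: torch.Tensor) -> torch.Tensor:
    """Pack [X, K] float32 power-of-two scales into [X, K // 4] int32."""
    assert scales_f32.dtype == torch.float32 and scales_f32.shape[-1] % 4 == 0
    # A UE8M0 value is just the 8-bit IEEE-754 exponent (bits 23..30) of the
    # float32 scale; the mantissa is dropped because the scale is a power of 2.
    exponents = (scales_f32.contiguous().view(torch.int32) >> 23) & 0xFF
    e8m0_bytes = exponents.to(torch.uint8)  # [X, K] uint8
    # Reinterpret each group of 4 consecutive UE8M0 bytes as one int32
    # (byte order here follows the machine's native layout - an assumption).
    return e8m0_bytes.view(torch.int32)  # [X, K // 4] int32
```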

main: (screenshot)

PR: (screenshot)

Benchmark

full benchmark numbers - link

Server command: VLLM_ALL2ALL_BACKEND=${A2A} VLLM_USE_DEEP_GEMM=1 canhazgpu run -g2 -- vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010

Decode bench command: vllm bench serve --model Qwen/Qwen3-30B-A3B-FP8 --dataset-name random --num-prompts 128 --random-input-len 1 --random-output-len 1024 --request-rate 128 --ignore-eos --port 9010

Prefill bench command: vllm bench serve --model Qwen/Qwen3-30B-A3B-FP8 --dataset-name random --num-prompts 256 --random-input-len 8192 --random-output-len 1 --request-rate 256 --ignore-eos --port 9010 --backend vllm

B200 + deepep_low_latency + decode

| Metric | main | PR |
|---|---:|---:|
| Peak output token throughput (tok/s) | 9600 | 12028 |

B200 + deepep_high_throughput + prefill

| Metric | main | PR |
|---|---:|---:|
| Total token throughput (tok/s) | 89736.95 | 91310.99 |

H100 + deepep_low_latency + decode

| Metric | main | PR |
|---|---:|---:|
| Peak output token throughput (tok/s) | 8408 | 8422 |

H100 + deepep_high_throughput + prefill

| Metric | main | PR |
|---|---:|---:|
| Total token throughput (tok/s) | 73950.14 | 74095.51 |

Test Plan

Server command:
vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --tensor-parallel-size ${tp_size} --data-parallel-size ${dp_size} --enable-expert-parallel --no-enable-prefix-caching --port 9010

Server config combinations (VLLM_ALL2ALL_BACKEND, VLLM_USE_DEEP_GEMM, dp_size, tp_size):

  1. (deepep_low_latency, 1, 2, 1)
  2. (deepep_high_throughput, 1, 2, 1)
  3. (deepep_low_latency, 0, 2, 1) // This PR touches fp8 weight loading when DeepGEMM is enabled; this config tests for regressions
  4. (deepep_high_throughput, 0, 2, 1)
  5. (n/a, 1, 1, 2)
  6. (n/a, 0, 1, 2)

lm_eval command:

lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://localhost:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100

Test Result

lm_eval produces the desired results with this PR, on both H100 and B200.
An example of a desired result:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.87|±  |0.0338|
|     |       |strict-match    |     5|exact_match|↑  | 0.92|±  |0.0273|

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces performance optimizations for DeepGEMM on B200 hardware by correctly handling weight and activation scales in the required E8M0 format. The changes refactor the scale transformation logic into a centralized function, which improves code structure and adds support for B200 while maintaining H100 compatibility. The MoE framework is also correctly extended to support packed activation scales. My review includes one minor fix for an incorrect logging statement.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Member

@mgoin mgoin left a comment

LGTM, nice find. I just have a concern about applying the right ue8m0 format on both Hopper and Blackwell if the model requires it.

"""
DeepGemm supports packed ue8m0 activation scales format in devices >= sm100
"""
return current_platform.is_device_capability(100)
Member

The comment doesn't match this line, since is_device_capability(100) checks == rather than >=.
Also, isn't it the case that we still want to use UE8M0 on Hopper for cases like DeepSeek Terminus?

Member

@yewentao256 yewentao256 Nov 3, 2025

+1, actually we are using e8m0 on Hopper currently; this seems like a breaking change to me.
We should carefully test and benchmark before we use this.

Contributor Author

@varun-sundar-rabindranath varun-sundar-rabindranath Nov 3, 2025

The comment doesn't match this line, since is_device_capability(100) checks == rather than >=.
Updated the comment to == sm100, since the DeepGEMM README specifies sm100 explicitly. We can upgrade it as needed.

Also, isn't it the case that we still want to use UE8M0 on Hopper for cases like DeepSeek Terminus?

IIUC, this is the state of main:

Let ws be a weight-scales tensor of shape [X, 4096] and datatype float32.

  • On Hopper and Blackwell - when we use DeepGemm, we always (for block fp8 models) cast the weight scales to UE8M0 but keep them in float32. ~~i.e. each float32 value actually holds UE8M0 content (look here), i.e. only the first byte of each float32 value will have the actual contents.~~

[EDIT] The struck-out portion was wrong. We actually cast the weight scales to ue8m0 and then expand them back to float32 - effectively each scale value is one of {2^i where i in [-127, 127]}.

ws will be of shape [X, 4096] and of datatype float32.

This PR:

  • On Hopper - we don't change the behaviour.
  • On Blackwell - we requant to UE8M0 and then use transform_sf_into_required_layout() from DeepGEMM to pack the scales into an int32 tensor, i.e. ws will be of shape [X, 1024] and of datatype int32. Effectively each packed entry is one of {i where i in [-127, 127]}. (See the sketch below.)
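
To make the difference concrete, here is a minimal sketch of the requant step, assuming positive per-block float32 scales; the rounding direction (ceil) is an assumption, not a statement about vLLM's exact code:

```python
import torch


def requant_scales_to_ue8m0_in_float32(scales: torch.Tensor) -> torch.Tensor:
    """Force every block scale to a power of two, but keep it in float32.

    This mirrors the behaviour on main (Hopper and Blackwell): the values
    become {2^i}, while the tensor keeps its shape and dtype, e.g.
    [X, 4096] float32 in -> [X, 4096] float32 out.
    """
    # ceil keeps the new scale >= the original one, so requantized values
    # cannot overflow; assumes scales > 0.
    return torch.exp2(torch.ceil(torch.log2(scales)))
```

On Blackwell, this PR then additionally packs the 8-bit exponents of these power-of-two scales four per int32 (as in the packing sketch in the PR description), which is where the [X, 4096] float32 -> [X, 1024] int32 shape change comes from.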

+1, actually we are using e8m0 on Hopper currently; this seems like a breaking change to me.
We should carefully test and benchmark before we use this.
@yewentao256 I have added some benchmark and lm-eval numbers in the PR description.

Member

Okay so Blackwell just has the packing part specifically, understood

Member

@yewentao256 yewentao256 left a comment

Nice find and great performance improvement! Thanks for the work!
A few thoughts

"""
DeepGemm supports packed ue8m0 activation scales format in devices >= sm100
"""
return current_platform.is_device_capability(100)
Member

@yewentao256 yewentao256 Nov 3, 2025

+1, actually we are using e8m0 on Hopper currently; this seems like a breaking change to me.
We should carefully test and benchmark before we use this.

Comment on lines -52 to -54
if envs.VLLM_USE_FLASHINFER_MOE_FP8:
logger.info_once("DeepGEMM E8M0 disabled: FlashInfer MOE is enabled.")
return False
Member

I am not sure who added this before; could you take a further look?

Collaborator

May I know why this is removed? Is this because of MoE vs GEMM impl differences?
DeepGemm seems to always be enabled even when other MoE backends are enabled. We need a better check to identify the MoE backend.

Collaborator

For example, to run FlashInfer MoE we now need to run:
VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_USE_DEEP_GEMM=0 python ....

Contributor Author

Looks like #25895 added it. @pavanimajety, can you please take a look? Thanks.

Contributor Author

@varun-sundar-rabindranath varun-sundar-rabindranath Nov 3, 2025

@pavanimajety sorry, I missed your comment.

May I know why this is removed? Is this because of MoE vs GEMM impl differences?

I removed it in an effort to clean up. I think this function should depend only on DeepGEMM-specific attributes / envs.

DeepGemm seems to always be enabled even when other MoE backends are enabled. We need a better check to identify the MoE backend.

Yes, this is a problem. I have addressed this in fp8.py. Please take a look at comment https://github.com/vllm-project/vllm/pull/27897/files#r2487520906 .

Contributor Author

I tried running,

VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND="latency"  canhazgpu run -g8 --  lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-R1,quantization=fp8,tensor_parallel_size=8,gpu_memory_utilization=0.90,add_bos_token=True --gen_kwargs temperature=0.0,max_gen_toks=32768 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size 200 --limit 1319

from #25895 and got,

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9613|±  |0.0053|
|     |       |strict-match    |     5|exact_match|↑  |0.9621|±  |0.0053|

Collaborator

Thanks for testing, Varun. I added the check in #25895 because we see incorrect logs and unneeded tuning when FlashInfer MoE is enabled but DeepGemm is assumed to be the default anyway.


mergify bot commented Nov 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 4, 2025
Member

@yewentao256 yewentao256 left a comment

The idea looks good to me, could you also update all of the deepgemm unit tests accordingly?

# We don't have enough information to determine if we should dispatch
# activation scales in a packed ue8m0 format during object construction
# time. This setting is handled by setup_packed_ue8m0_scales_dispatch.
self.use_ue8m0 = False
Member

So is this flag only used in low-latency dispatch, and weight requant doesn't require it?

Contributor Author

@varun-sundar-rabindranath varun-sundar-rabindranath Nov 10, 2025

Yes, only the low_latency dispatch exposes this option.

and weight requant doesn't require it?

The transform_sf_into_required_layout function from DeepGEMM does this automatically when run on sm100.
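
For context, the deferred setup being discussed follows a pattern roughly like the sketch below. Only the use_ue8m0 flag and the setup_packed_ue8m0_scales_dispatch name come from the diff; the class name and call shape are hypothetical, not vLLM's actual implementation:

```python
class DeepEPLLPrepareFinalize:  # hypothetical name for illustration only
    def __init__(self) -> None:
        # At construction time we don't yet know whether the quant method
        # wants activation scales dispatched in packed ue8m0 format.
        self.use_ue8m0 = False

    def setup_packed_ue8m0_scales_dispatch(self) -> None:
        # Called later, once the scale format is known. Only the DeepEP
        # low-latency dispatch path exposes this option.
        self.use_ue8m0 = True
```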

Member

Could we rename it to use_ue8m0_dispatch? That way we avoid confusing it with the e8m0 scales themselves.


mergify bot commented Nov 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


mergify bot commented Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 11, 2025
Member

@yewentao256 yewentao256 left a comment

Please fix the conflicts and pre-commit issues so that we can land this.

Varun Sundar Rabindranath added 3 commits November 12, 2025 11:25
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
@mgoin mgoin added performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed deepseek Related to DeepSeek models nvidia labels Nov 12, 2025
"""
DeepGemm supports packed ue8m0 activation scales format in devices >= sm100
"""
return current_platform.is_device_capability(100)
Member

Okay so Blackwell just has the packing part specifically, understood

Comment on lines +1196 to +1197
layer.weight = torch.nn.Parameter(dg_weight, requires_grad=False)
layer.weight_scale = torch.nn.Parameter(dg_weight_scale, requires_grad=False)
Member

I think we want to preserve the attributes on the original parameter. cc @kylesayrs

Contributor

Yes, without the original attributes we won't be able to reload weights. More changes than this will be required to support reloading, so this is fine to land now and rebase later.

Contributor Author

Should I just use parameter.data.copy_(new_tensor) to avoid any unintended effects?
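
For reference, a generic PyTorch sketch of the trade-off being discussed (not the vLLM code): replacing the nn.Parameter drops any custom attributes the weight loader attached to it, while copying into the existing parameter preserves them but requires matching shape and dtype, which may not hold for the transformed scales here (they change from [X, 4096] float32 to [X, 1024] int32).

```python
import torch

layer = torch.nn.Linear(4, 4, bias=False)
layer.weight.my_loader_attr = "needed for weight reloading"  # custom attribute

new_w = torch.randn(4, 4)

# Option 1: replace the Parameter -> the custom attribute is lost.
layer.weight = torch.nn.Parameter(new_w.clone(), requires_grad=False)
print(hasattr(layer.weight, "my_loader_attr"))  # False

# Option 2: copy into the existing Parameter -> attributes survive,
# but the new tensor must match the original shape and dtype.
layer = torch.nn.Linear(4, 4, bias=False)
layer.weight.my_loader_attr = "needed for weight reloading"
layer.weight.data.copy_(new_w)
print(hasattr(layer.weight, "my_loader_attr"))  # True
```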

@vllm-bot vllm-bot merged commit 74a9a9f into vllm-project:main Nov 12, 2025
55 of 56 checks passed
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Nov 12, 2025
geodavic pushed a commit to geodavic/vllm that referenced this pull request Nov 16, 2025
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: George D. Torres <gdavtor@gmail.com>