[Performance][B200] Fix deepgemm prologue #27897
Conversation
Code Review
This pull request introduces performance optimizations for DeepGEMM on B200 hardware by correctly handling weight and activation scales in the required E8M0 format. The changes refactor the scale transformation logic into a centralized function, which improves code structure and adds support for B200 while maintaining H100 compatibility. The MoE framework is also correctly extended to support packed activation scales. My review includes one minor fix for an incorrect logging statement.
💡 Codex Review
Here are some automated review suggestions for this pull request.
mgoin
left a comment
LGTM, nice find. I just have a concern about applying the right UE8M0 format for both Hopper and Blackwell if the model requires it.
| """ | ||
| DeepGemm supports packed ue8m0 activation scales format in devices >= sm100 | ||
| """ | ||
| return current_platform.is_device_capability(100) |
The comment doesn't match this line, since `is_device_capability` checks `==` rather than `>=`.
Also, isn't it the case that we still want to use UE8M0 on Hopper for cases like DeepSeek Terminus?
+1. We are actually using E8M0 on Hopper currently, so this looks like a breaking change to me.
We should carefully test and benchmark before using this.
> The comment doesn't match this line, since `is_device_capability` checks `==` rather than `>=`.

Updated the comment to `== sm100`, since the DeepGEMM README specifies sm100 explicitly. We can upgrade as needed.
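For clarity, a standalone sketch of the `==` vs `>=` distinction being discussed. This is illustrative only; the real code goes through vLLM's `current_platform` abstraction shown in the diff above:

```python
import torch

def device_capability() -> int:
    # e.g. 90 for H100 (sm90), 100 for B200 (sm100)
    major, minor = torch.cuda.get_device_capability()
    return major * 10 + minor

# Exact match, as the updated docstring now says: sm100 only.
is_sm100 = device_capability() == 100
# A ">= sm100" check would also cover architectures newer than B200.
is_sm100_or_newer = device_capability() >= 100
```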
> Also, isn't it the case that we still want to use UE8M0 on Hopper for cases like DeepSeek Terminus?
IIUC, this is the state of main:
Let `ws` be a weight-scales tensor of shape `[X, 4096]` and datatype `float32`.
- On Hopper and Blackwell: when we use DeepGemm, we always (for block-fp8 models) cast the weight scales to UE8M0 ~~but keep the weight scales in float32, i.e. each float32 value actually holds UE8M0 content (look here), i.e. only the first byte of each float32 value will have the actual contents~~.
  [EDIT] The struck-out portion was wrong. We actually cast the weight scales to UE8M0 and then expand them back to float32; effectively each scale value can be one of `{2^i where i in [-127, 127]}`.
  `ws` will be of shape `[X, 4096]` and of datatype `float32`.
This PR:
- On Hopper: we don't change the behaviour.
- On Blackwell: we requant to UE8M0 and then use `transform_sf_into_required_layout()` from DeepGemm to pack the scales into an int32 tensor, i.e. `ws` will be of shape `[X, 1024]` and of datatype `int32`. Effectively the stored values can be one of `{i where i in [-127, 127]}` (see the sketch below).
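To make the Blackwell shape/dtype change concrete, here is a rough illustrative sketch of the packing effect. This is not DeepGEMM's `transform_sf_into_required_layout` (which also handles the required ColumnMajor layout); it only shows how four UE8M0 exponent bytes can end up in one int32:

```python
import torch

def pack_ue8m0_scales(ws: torch.Tensor) -> torch.Tensor:
    # ws: float32 power-of-two scales, shape [X, 4096]
    assert ws.dtype == torch.float32 and ws.shape[-1] % 4 == 0
    # Each scale is 2^i, so its UE8M0 byte is just the biased exponent.
    exponent_bytes = (
        (torch.log2(ws).round().to(torch.int32) + 127).clamp(0, 255).to(torch.uint8)
    )
    # Reinterpret groups of 4 exponent bytes as one int32. Byte order is
    # platform-dependent here; the actual DeepGEMM packing may differ.
    return exponent_bytes.contiguous().view(torch.int32)

ws = torch.pow(2.0, torch.randint(-10, 10, (8, 4096)).float())
packed = pack_ue8m0_scales(ws)
assert packed.shape == (8, 1024) and packed.dtype == torch.int32
```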
> +1. We are actually using E8M0 on Hopper currently, so this looks like a breaking change to me.
> We should carefully test and benchmark before using this.
@yewentao256 I have added some benchmark and lm-eval numbers in the PR description.
Okay so Blackwell just has the packing part specifically, understood
yewentao256
left a comment
Nice find and great performance improvement! Thanks for the work!
A few thoughts
| """ | ||
| DeepGemm supports packed ue8m0 activation scales format in devices >= sm100 | ||
| """ | ||
| return current_platform.is_device_capability(100) |
+1. We are actually using E8M0 on Hopper currently, so this looks like a breaking change to me.
We should carefully test and benchmark before using this.
```python
if envs.VLLM_USE_FLASHINFER_MOE_FP8:
    logger.info_once("DeepGEMM E8M0 disabled: FlashInfer MOE is enabled.")
    return False
```
I am not sure who added this before, could you take a further look?
May I know why this is removed? Is this because of MoE vs GEMM impl differences?
DeepGemm seems to always be enabled even when other MoE backends are enabled. We need a better check to identify the MoE backend.
For example, to run FlashInfer MoE we now need to run:
```bash
VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_USE_DEEP_GEMM=0 python ....
```
Looks like #25895 added it. @pavanimajety, can you please take a look? Thanks.
@pavanimajety sorry, I missed your comment.

> May I know why this is removed? Is this because of MoE vs GEMM impl differences?

I removed it in an effort to clean up. I think this function should depend only on DeepGemm-specific attributes / envs.

> DeepGemm seems to always be enabled even when other MoE backends are enabled. We need a better check to identify the MoE backend.

Yes, this is a problem. I have addressed this in fp8.py. Please take a look at the comment https://github.com/vllm-project/vllm/pull/27897/files#r2487520906 .
I tried running
```bash
VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND="latency" canhazgpu run -g8 -- lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-R1,quantization=fp8,tensor_parallel_size=8,gpu_memory_utilization=0.90,add_bos_token=True --gen_kwargs temperature=0.0,max_gen_toks=32768 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size 200 --limit 1319
```
from #25895 and got:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9613|± |0.0053|
| | |strict-match | 5|exact_match|↑ |0.9621|± |0.0053|
Thanks for testing, Varun. I added the check in #25895 because we were seeing incorrect logs and unrequired tuning when FlashInfer MoE is enabled but DeepGemm is assumed to be the default anyway.
Force-pushed from c6d1dda to 98bc916.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 98bc916 to 73bf402.
yewentao256
left a comment
The idea looks good to me, could you also update all of the deepgemm unit tests accordingly?
```python
# We don't have enough information to determine if we should dispatch
# activation scales in a packed ue8m0 format during object construction
# time. This setting is handled by setup_packed_ue8m0_scales_dispatch.
self.use_ue8m0 = False
```
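A minimal sketch of the deferred-configuration pattern the quoted comment describes. The class name and method body below are illustrative assumptions, not the actual vLLM implementation; only the attribute and hook names come from the diff:

```python
class PrepareAndFinalize:  # illustrative stand-in for the real prepare/finalize class
    def __init__(self) -> None:
        # Whether to dispatch activation scales in packed ue8m0 format is not
        # known at construction time, so default to False here.
        self.use_ue8m0 = False

    def setup_packed_ue8m0_scales_dispatch(self, use_ue8m0: bool) -> None:
        # Called later, once the quantization scheme and device are known.
        self.use_ue8m0 = use_ue8m0


pf = PrepareAndFinalize()
pf.setup_packed_ue8m0_scales_dispatch(use_ue8m0=True)
```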
So is this flag only used in the low-latency dispatch, and weight requant doesn't require it?
Yes, only the low_latency dispatch exposes this option.

> and weight requant doesn't require it?

The `transform_sf_into_required_layout` function from DeepGemm does this automatically when run on sm100.
Could we rename it to `use_ue8m0_dispatch`? That way we avoid mixing it up with the e8m0 scales.
Force-pushed from 73bf402 to d99e278.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from d99e278 to 63a5a22.
This pull request has merge conflicts that must be resolved before it can be merged.
yewentao256
left a comment
Please fix the conflicts and pre-commit issues so that we can land this.
Force-pushed from 63a5a22 to 67f9d13.
Force-pushed from 67f9d13 to ad9b2fb.
| """ | ||
| DeepGemm supports packed ue8m0 activation scales format in devices >= sm100 | ||
| """ | ||
| return current_platform.is_device_capability(100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so Blackwell just has the packing part specifically, understood
```python
layer.weight = torch.nn.Parameter(dg_weight, requires_grad=False)
layer.weight_scale = torch.nn.Parameter(dg_weight_scale, requires_grad=False)
```
I think we want to preserve the attributes on the original parameter. cc @kylesayrs
Yes, without the original attributes we won't be able to reload weights. More changes than this will be required to support reloading, so this is fine to land now and rebase later.
Should I just do `parameter.data.copy_(new_tensor)` to avoid any unintended effects?
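For reference, a minimal sketch of the difference being discussed. The attribute name `weight_loader` is just an example of metadata a loader might attach to the parameter, not necessarily what vLLM sets:

```python
import torch

layer = torch.nn.Linear(8, 8, bias=False)
# Metadata the weight loader attached to the original Parameter.
layer.weight.weight_loader = lambda *args, **kwargs: None

new_weight = torch.randn_like(layer.weight)

# Replacing the Parameter drops attributes that lived on the old object.
replaced = torch.nn.Parameter(new_weight, requires_grad=False)
assert not hasattr(replaced, "weight_loader")

# Copying in place keeps the Parameter object (and its attributes) intact,
# but only works when shape and dtype are unchanged.
layer.weight.data.copy_(new_weight)
assert hasattr(layer.weight, "weight_loader")
```

Note that the in-place copy only helps when the new tensor has the same shape and dtype, which is not the case for the packed scales on Blackwell.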
Purpose
DeepGEMM requires the activation and weight scales to be in a specific format. When the scales are not provided in the desired format, DeepGEMM transforms the scales itself. This is usually very slow.
On H100, DeepGEMM needs the scales to be ColumnMajor and in float32.
On B200, DeepGEMM needs the scales to be ColumnMajor but in a packed E8M0 format.
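For illustration only (a generic PyTorch sketch, not the DeepGEMM API): "ColumnMajor" scales simply means the float32 scales tensor is laid out column-major in memory, e.g.:

```python
import torch

X = 8
scales = torch.rand(X, 4096, dtype=torch.float32)  # row-major, strides (4096, 1)
col_major = scales.t().contiguous().t()            # same values, strides (1, X)
assert col_major.stride() == (1, X)
```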
Note that we handle the H100 case already on main. This PR adds partial support for the B200 case. Concretely, we use `transform_sf_into_required_layout` from DeepGEMM to perform the weight-scales transformation. This function handles both SM90 and SM100 internally and can be used commonly.
Benchmark
full benchmark numbers - link
Server command:
```bash
VLLM_ALL2ALL_BACKEND=${A2A} VLLM_USE_DEEP_GEMM=1 canhazgpu run -g2 -- vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010
```
Decode bench command:
```bash
vllm bench serve --model Qwen/Qwen3-30B-A3B-FP8 --dataset-name random --num-prompts 128 --random-input-len 1 --random-output-len 1024 --request-rate 128 --ignore-eos --port 9010
```
Prefill bench command:
```bash
vllm bench serve --model Qwen/Qwen3-30B-A3B-FP8 --dataset-name random --num-prompts 256 --random-input-len 8192 --random-output-len 1 --request-rate 256 --ignore-eos --port 9010 --backend vllm
```
B200 + deepep_low_latency + decode
B200 + deepep_high_throughput + prefill
H100 + deepep_low_latency + decode
H100 + deepep_high_throughput + prefill
Test Plan
Server command:
```bash
vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --tensor-parallel-size ${tp_size} --data-parallel-size ${dp_size} --enable-expert-parallel --no-enable-prefix-caching --port 9010
```
Server config combinations (`VLLM_ALL2ALL_BACKEND`, `VLLM_USE_DEEP_GEMM`, `dp_size`, `tp_size`):
lm_eval command:
Test Result
lm_eval produces desired results on the PR, on both H100 and B200.
example of a desired result,