
Conversation

@jinzhen-lin
Contributor

@jinzhen-lin jinzhen-lin commented Mar 14, 2025

#13232 scales the hidden states and residual of the DeepSeek models to avoid the fp16 overflow issue of DeepSeek-V2. However, it only considers the case where first_k_dense_replace == 1. With models that use first_k_dense_replace > 1 (DeepSeek-R1 uses 3), residual *= 1. / self.routed_scaling_factor would run multiple times, leading the model to generate meaningless text.

This PR fixes it by handling all possible values of first_k_dense_replace.

BTW, the fp16 overflow issue seems to only happen with deepseek_v2 models. Maybe we should disable this scaling for deepseek_v3 models? (Since deepseek_v3 is trained with fp8, it is less likely to have the fp16 overflow issue.)
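
For reference, a minimal sketch of the behavior this PR aims for (illustrative names, not the exact vLLM code):

```python
import torch


def maybe_rescale_residual(residual: torch.Tensor, layer_idx: int,
                           routed_scaling_factor: float) -> torch.Tensor:
    """Illustrative only: the residual stream is shared by all decoder layers,
    so it must be divided by routed_scaling_factor exactly once per forward."""
    if residual.dtype == torch.float16 and layer_idx == 0:
        # Scale only on the first layer. Before this fix the rescale ran in
        # every dense layer, i.e. first_k_dense_replace times, which corrupted
        # the activations whenever first_k_dense_replace > 1.
        residual = residual * (1.0 / routed_scaling_factor)
    return residual
```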

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@jeejeelee jeejeelee requested a review from mgoin March 14, 2025 10:57
@davidsyoung

davidsyoung commented Mar 16, 2025

Thanks for this; I believe I'm affected by it at long context with DeepSeek-R1. Will test!

UPDATE: My issue may be something else. I'm getting incoherent generation as the context gets longer.

@Concurrensee
Contributor

Concurrensee commented Mar 17, 2025

@jinzhen-lin
Sorry, I previously only considered first_k_dense_replace for DeepSeek-V2 and did not consider the case where first_k_dense_replace is not 1.

As for the DeepSeek-V2 fp16 overflow issue, since it is only observed in MoE, maybe we can reduce the number of element-wise multiplications in the MLP? Here is my suggestion (also an improvement to my previous PR): Concurrensee@646daad.

@LingxiaoShawn

LingxiaoShawn commented Mar 19, 2025

> @jinzhen-lin Sorry, I previously only considered first_k_dense_replace for DeepSeek-V2 and did not consider the case where first_k_dense_replace is not 1.
>
> As for the DeepSeek-V2 fp16 overflow issue, since it is only observed in MoE, maybe we can reduce the number of element-wise multiplications in the MLP? Here is my suggestion (also an improvement to my previous PR): Concurrensee@646daad.

@Concurrensee I think your design is incorrect: the variable fp16_residual_rescaled is not shared across layers; each layer has its own fp16_residual_rescaled.

You could change it to a class variable instead of an instance variable, but then you would still need to reset it back to False after the last layer.
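
To make the problem concrete, here is a minimal hypothetical reproduction (not the code from your commit):

```python
import torch


class MoELayerSketch(torch.nn.Module):
    def __init__(self, routed_scaling_factor: float):
        super().__init__()
        self.routed_scaling_factor = routed_scaling_factor
        # Instance attribute: every layer object gets its own flag, so the
        # guard below fires once *per layer*, not once per model.
        self.fp16_residual_rescaled = False

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        if not self.fp16_residual_rescaled:
            residual = residual * (1.0 / self.routed_scaling_factor)
            self.fp16_residual_rescaled = True
        return residual
```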

@LingxiaoShawn

@jinzhen-lin @Concurrensee Another thing I want to ask: why do you use routed_scaling_factor to scale down the fp16 computation? How can you be sure that routed_scaling_factor is large enough to resolve the overflow issue?

@Concurrensee
Contributor

Concurrensee commented Mar 20, 2025

> > @jinzhen-lin Sorry, I previously only considered first_k_dense_replace for DeepSeek-V2 and did not consider the case where first_k_dense_replace is not 1.
> >
> > As for the DeepSeek-V2 fp16 overflow issue, since it is only observed in MoE, maybe we can reduce the number of element-wise multiplications in the MLP? Here is my suggestion (also an improvement to my previous PR): Concurrensee@646daad.
>
> @Concurrensee I think your design is incorrect: the variable fp16_residual_rescaled is not shared across layers; each layer has its own fp16_residual_rescaled.
>
> You could change it to a class variable instead of an instance variable, but then you would still need to reset it back to False after the last layer.

Hi Lingxiao, I tweaked my code a little and updated my suggestion: Concurrensee@f40cc40.

So now the suggestion only applies to DeepSeek-V2, to avoid introducing further unknown issues.

@Concurrensee
Contributor

> @jinzhen-lin @Concurrensee Another thing I want to ask: why do you use routed_scaling_factor to scale down the fp16 computation? How can you be sure that routed_scaling_factor is large enough to resolve the overflow issue?

Originally, in DeepSeek-V2, the overflow occurred at the scaling step in the MoE layer. So if we instead exploit the non-linearity and scale the values down, we can prevent the overflow in the MoE layer and keep the model output the same as before.
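
A rough numeric illustration of the idea (example magnitudes only, not taken from a real run):

```python
import torch

routed_scaling_factor = 16.0  # example value of the order used by DeepSeek-V2

# fp16 tops out at 65504, so scaling an already-large expert output overflows.
x = torch.tensor([8192.0], dtype=torch.float16)
print(x * routed_scaling_factor)  # tensor([inf], dtype=torch.float16)

# Scaling the values down first keeps every intermediate inside the fp16 range,
# and the final result is unchanged because the factor cancels out.
print((x * (1.0 / routed_scaling_factor)) * routed_scaling_factor)
# tensor([8192.], dtype=torch.float16)
```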

@jinzhen-lin
Contributor Author

jinzhen-lin commented Mar 23, 2025

@Concurrensee It seems that your commit doesn't solve the issue of the residual being scaled multiple times (not tested, though). We should always scale the residual on the first layer (layer_idx == 0). Could you review my commit?

@mgoin I think we should solve this issue as soon as possible, since it has a significant impact on deepseek-v3 + fp16.

@Concurrensee
Contributor

> @Concurrensee It seems that your commit doesn't solve the issue of the residual being scaled multiple times (not tested, though). We should always scale the residual on the first layer (layer_idx == 0). Could you review my commit?
>
> @mgoin I think we should solve this issue as soon as possible, since it has a significant impact on deepseek-v3 + fp16.

My commit just limits the scope to DeepSeek-V2 and solves the multiple-scaling problem. Anyway, my commit is just a suggestion to reduce the number of element-wise multiplications. Your solution should work.

Member

@mgoin mgoin left a comment

Sorry for missing this. LGTM although it would be nice if this could be confirmed with a reproducible prompt or eval.

@mgoin mgoin added the bug (Something isn't working) and ready (ONLY add when PR is ready to merge/full CI is needed) labels Apr 6, 2025
@mgoin
Member

mgoin commented Apr 7, 2025

@jinzhen-lin I think you need to merge with main to fix the failing tests

@jinzhen-lin
Contributor Author

> @jinzhen-lin I think you need to merge with main to fix the failing tests

Sorry for the late reply. BTW, should we disable this scaling for config.model_type == 'deepseek_v3'? The DeepSeek-V3 series uses fp8 for training, and the maximum value of fp8 is 57344 (e5m2) or 448 (e4m3), so I think it should not have the fp16 overflow issue.
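
For reference, the dynamic ranges involved (assuming a PyTorch build that exposes the fp8 dtypes):

```python
import torch

print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0
print(torch.finfo(torch.float16).max)        # 65504.0
# Values representable in fp8 during training sit inside the fp16 range,
# which is why the fp16 overflow seems unlikely for fp8-trained checkpoints.
```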

@LagPixelLOL

@jinzhen-lin DeepSeek-V3 does have the overflow issue; back when I first quantized it with AWQ, it happened at the last layer.

@mgoin mgoin merged commit db10422 into vllm-project:main Apr 8, 2025
43 checks passed
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Yang Wang <elainewy@meta.com>
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
