
Conversation

@jinzhen-lin
Contributor

@jinzhen-lin jinzhen-lin commented Mar 14, 2025

#13232 scales the hidden states and residual of the DeepSeek models to avoid the fp16 overflow issue of DeepSeek-V2. However, it only considers the case where first_k_dense_replace == 1. With models that use first_k_dense_replace > 1 (DeepSeek-R1 uses 3), residual *= 1. / self.routed_scaling_factor would run multiple times, leading the model to generate meaningless text.

This PR fixes it by handling all possible values of first_k_dense_replace.

BTW, the fp16 overflow issue seems to only happen with deepseek_v2 models. Maybe we should disable this scaling for deepseek_v3 models? (Since deepseek_v3 is trained with fp8, it is less likely to have the fp16 overflow issue.)
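
For reference, a minimal sketch of the behavior this PR aims for (illustrative names, not the exact vLLM code):

```python
import torch


def maybe_rescale_residual(residual: torch.Tensor, layer_idx: int,
                           routed_scaling_factor: float) -> torch.Tensor:
    """Illustrative only: the residual stream is shared by all decoder layers,
    so it must be divided by routed_scaling_factor exactly once per forward."""
    if residual.dtype == torch.float16 and layer_idx == 0:
        # Scale only on the first layer. Before this fix the rescale ran in
        # every dense layer, i.e. first_k_dense_replace times, which corrupted
        # the activations whenever first_k_dense_replace > 1.
        residual = residual * (1.0 / routed_scaling_factor)
    return residual
```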

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@jeejeelee jeejeelee requested a review from mgoin March 14, 2025 10:57
@davidsyoung

davidsyoung commented Mar 16, 2025

Thanks for this; I believe I'm affected by it at long context with DeepSeek-R1. Will test!

UPDATE: My issue may be something else. I'm getting incoherent generation as the context gets longer.

@Concurrensee
Contributor

Concurrensee commented Mar 17, 2025

@jinzhen-lin
Sorry, I previously only considered first_k_dense_replace for DeepSeek-V2 and did not consider the case where first_k_dense_replace is not 1.

As for the DeepSeek-V2 fp16 overflow issue, since it is only observed in MoE, maybe we can reduce the number of element-wise multiplications in the MLP? Here is my suggestion (also an improvement to my previous PR): Concurrensee@646daad.

@LingxiaoShawn

LingxiaoShawn commented Mar 19, 2025

> @jinzhen-lin Sorry, I previously only considered first_k_dense_replace for DeepSeek-V2 and did not consider the case where first_k_dense_replace is not 1.
>
> As for the DeepSeek-V2 fp16 overflow issue, since it is only observed in MoE, maybe we can reduce the number of element-wise multiplications in the MLP? Here is my suggestion (also an improvement to my previous PR): Concurrensee@646daad.

@Concurrensee I think your design is incorrect: the variable fp16_residual_rescaled is not shared across layers; each layer has its own fp16_residual_rescaled.

You could change it to a class variable instead of an instance variable, but then you would still need to reset it back to False after the last layer.
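
To make the problem concrete, here is a minimal hypothetical reproduction (not the code from your commit):

```python
import torch


class MoELayerSketch(torch.nn.Module):
    def __init__(self, routed_scaling_factor: float):
        super().__init__()
        self.routed_scaling_factor = routed_scaling_factor
        # Instance attribute: every layer object gets its own flag, so the
        # guard below fires once *per layer*, not once per model.
        self.fp16_residual_rescaled = False

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        if not self.fp16_residual_rescaled:
            residual = residual * (1.0 / self.routed_scaling_factor)
            self.fp16_residual_rescaled = True
        return residual
```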

@LingxiaoShawn

@jinzhen-lin @Concurrensee Another thing I want to ask: why do you use routed_scaling_factor to scale down the fp16 computation? How can you be sure that routed_scaling_factor is large enough to resolve the overflow issue?

@Concurrensee
Contributor

Concurrensee commented Mar 20, 2025

> > @jinzhen-lin Sorry, I previously only considered first_k_dense_replace for DeepSeek-V2 and did not consider the case where first_k_dense_replace is not 1.
> >
> > As for the DeepSeek-V2 fp16 overflow issue, since it is only observed in MoE, maybe we can reduce the number of element-wise multiplications in the MLP? Here is my suggestion (also an improvement to my previous PR): Concurrensee@646daad.
>
> @Concurrensee I think your design is incorrect: the variable fp16_residual_rescaled is not shared across layers; each layer has its own fp16_residual_rescaled.
>
> You could change it to a class variable instead of an instance variable, but then you would still need to reset it back to False after the last layer.

Hi Lingxiao, I tweaked my code a little and updated my suggestion: Concurrensee@f40cc40.

So now the suggestion only applies to DeepSeek-V2, to avoid introducing further unknown issues.

@Concurrensee
Contributor

> @jinzhen-lin @Concurrensee Another thing I want to ask: why do you use routed_scaling_factor to scale down the fp16 computation? How can you be sure that routed_scaling_factor is large enough to resolve the overflow issue?

Originally, in DeepSeek-V2, the overflow occurred at the scaling step in the MoE layer. So if we instead exploit the non-linearity and scale the values down, we can prevent the overflow in the MoE layer and keep the model output the same as before.
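
A rough numeric illustration of the idea (example magnitudes only, not taken from a real run):

```python
import torch

routed_scaling_factor = 16.0  # example value of the order used by DeepSeek-V2

# fp16 tops out at 65504, so scaling an already-large expert output overflows.
x = torch.tensor([8192.0], dtype=torch.float16)
print(x * routed_scaling_factor)  # tensor([inf], dtype=torch.float16)

# Scaling the values down first keeps every intermediate inside the fp16 range,
# and the final result is unchanged because the factor cancels out.
print((x * (1.0 / routed_scaling_factor)) * routed_scaling_factor)
# tensor([8192.], dtype=torch.float16)
```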

@jinzhen-lin
Contributor Author

jinzhen-lin commented Mar 23, 2025

@Concurrensee It seems that your commit doesn't solve the issue of the residual being scaled multiple times (not tested, though). We should always scale the residual on the first layer (layer_idx == 0). Could you review my commit?

@mgoin I think we should solve this issue as soon as possible, since it has a significant impact on deepseek-v3 + fp16.

@Concurrensee
Contributor

> @Concurrensee It seems that your commit doesn't solve the issue of the residual being scaled multiple times (not tested, though). We should always scale the residual on the first layer (layer_idx == 0). Could you review my commit?
>
> @mgoin I think we should solve this issue as soon as possible, since it has a significant impact on deepseek-v3 + fp16.

My commit just limits the scope to DeepSeek-V2 and solves the multiple-scaling problem. Anyway, my commit is just a suggestion to reduce the number of element-wise multiplications. Your solution should work.

Member

@mgoin mgoin left a comment

Sorry for missing this. LGTM although it would be nice if this could be confirmed with a reproducible prompt or eval.

@mgoin mgoin added the bug (Something isn't working) and ready (ONLY add when PR is ready to merge/full CI is needed) labels Apr 6, 2025
@mgoin
Member

mgoin commented Apr 7, 2025

@jinzhen-lin I think you need to merge with main to fix the failing tests

@jinzhen-lin
Contributor Author

> @jinzhen-lin I think you need to merge with main to fix the failing tests

Sorry for the late reply. BTW, should we disable this scaling for config.model_type == 'deepseek_v3'? The DeepSeek-V3 series uses fp8 for training, and the maximum value of fp8 is 57344 (e5m2) or 448 (e4m3), so I think it should not have the fp16 overflow issue.
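
For reference, the dynamic ranges involved (assuming a PyTorch build that exposes the fp8 dtypes):

```python
import torch

print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0
print(torch.finfo(torch.float16).max)        # 65504.0
# Values representable in fp8 during training sit inside the fp16 range,
# which is why the fp16 overflow seems unlikely for fp8-trained checkpoints.
```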

@LagPixelLOL

@jinzhen-lin DeepSeek-V3 does have the overflow issue; back when I first quantized it with AWQ, it happened at the last layer.

@mgoin mgoin merged commit db10422 into vllm-project:main Apr 8, 2025
43 checks passed
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Yang Wang <elainewy@meta.com>
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
