[Bugfix] fix deepseek fp16 scale bug #14809
Conversation
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Thanks for this, I believe I'm affected by it at long context with Deepseek R1. Will test! UPDATE: My issue may be something else. I'm getting incoherent generation as the context gets longer.
@jinzhen-lin For the DeepSeek-V2 fp16 overflow issue, since it is only observed in the MoE layers, maybe we can reduce the number of element-wise multiplications in the MLP? Here is my suggestion (also an improvement to my previous PR): Concurrensee@646daad.
@Concurrensee I think your design is incorrect: the flag you added is an instance variable, so it cannot track whether the residual was already scaled by an earlier layer. You can try to change it to a class variable instead of an instance variable, but you would still need to reset it back to False after the last layer.
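A minimal, hypothetical sketch of the class-variable approach being discussed (names like `DecoderLayerSketch` and `maybe_scale_residual` are made up for illustration, not Concurrensee's actual code):

```python
import torch
import torch.nn as nn


class DecoderLayerSketch(nn.Module):
    # Hypothetical class-level flag shared by every layer instance.
    # An instance attribute would be per-layer and could not tell
    # whether an earlier layer already scaled the residual.
    residual_scaled = False

    def __init__(self, layer_idx: int, num_layers: int,
                 routed_scaling_factor: float):
        super().__init__()
        self.layer_idx = layer_idx
        self.num_layers = num_layers
        self.routed_scaling_factor = routed_scaling_factor

    def maybe_scale_residual(self, residual: torch.Tensor) -> torch.Tensor:
        cls = type(self)
        if not cls.residual_scaled:
            # Scale the shared residual exactly once per forward pass.
            residual = residual * (1.0 / self.routed_scaling_factor)
            cls.residual_scaled = True
        if self.layer_idx == self.num_layers - 1:
            # Reset after the last layer, otherwise the next forward
            # pass would skip the scaling entirely.
            cls.residual_scaled = False
        return residual
```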
@jinzhen-lin @Concurrensee Another thing that I want to ask is why you use
Hi Lingxiao, I tweaked my code a little bit and updated my suggestion in Concurrensee@f40cc40. Now the suggestion only applies to DeepSeek V2, to prevent further issues that we do not yet know about.
Originally, in DeepSeek V2, the overflow was caused at the scaling step in the MoE layer. So if we instead exploit the non-linearity and scale values down, we can prevent the overflow in the MoE layer and keep the model output the same as before.
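A rough sketch of the "scale values down" idea, under assumed names (`routed_out`, `shared_out`) and not the actual vLLM code: instead of multiplying the routed-expert output by `routed_scaling_factor`, which is where fp16 can overflow, scale the shared-expert output down by the same factor so the ratio between the two paths is preserved with smaller magnitudes.

```python
import torch


def moe_combine_fp16_safe(routed_out: torch.Tensor,
                          shared_out: torch.Tensor,
                          routed_scaling_factor: float) -> torch.Tensor:
    """Illustrative only: combine routed-expert and shared-expert outputs.

    Naively one would compute
        routed_out * routed_scaling_factor + shared_out
    but the multiplication can overflow in fp16. Scaling the other branch
    down keeps the same ratio between the two paths with smaller values
    (the overall 1/routed_scaling_factor factor is compensated elsewhere,
    e.g. by scaling the hidden states and residual as this PR does).
    """
    if routed_out.dtype == torch.float16:
        return routed_out + shared_out * (1.0 / routed_scaling_factor)
    return routed_out * routed_scaling_factor + shared_out
```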
@Concurrensee It seems that your commit doesn't solve the issue that the residual is scaled multiple times (not tested though). We should always scale the residual on the first layer (layer_idx == 0). Could you review my commit? @mgoin I think we should solve this issue as soon as possible, since it has a big impact on deepseek-v3 + fp16.
My commit just limits the scope to DeepSeek-V2 and solves the multiple-scaling problem. Anyway, my commit is just a suggestion to reduce the number of element-wise multiplications. Your solution should work.
mgoin left a comment
Sorry for missing this. LGTM although it would be nice if this could be confirmed with a reproducible prompt or eval.
@jinzhen-lin I think you need to merge with main to fix the failing tests |
Sorry for the late reply. BTW, should we disable this scaling for `deepseek_v3` models?
@jinzhen-lin DeepSeek V3 does have the overflow issue; back when I first quantized it with AWQ, it happened at the last layer.
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Yang Wang <elainewy@meta.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
#13232 scales the hidden states and residual of the deepseek model to avoid the fp16 overflow issue of deepseek-v2. However, it only considers the case where `first_k_dense_replace == 1`. With models that use `first_k_dense_replace > 1` (DeepSeek-R1 uses 3), `residual *= 1. / self.routed_scaling_factor` would run multiple times, leading the model to generate meaningless text. This PR fixes it by handling all values of `first_k_dense_replace`.

BTW, the fp16 overflow issue seems to only happen on `deepseek_v2` models; maybe we should disable it for `deepseek_v3` models? (Since `deepseek_v3` is trained with fp8, it is less likely to have fp16 overflow issues.)