Phi-2 requires a disabled autocast in attention layer #28673
Conversation
Thanks for the PR! We are not super super fans of context managers for such things. TBH it's not that bad! cc @amyeroberts what's your take?
Thanks for adding this fix @gugarosa! I don't mind this too much, it's pretty clean and simple :) Let's get @younesbelkada's opinion on whether this will break any other assumptions about weight loading in the library and possible alternatives.
No problem, thanks everyone for looking at it! Hopefully this is a one-time behavior and we will never see it again in future models 🙏
Hi @gugarosa - thanks a lot for your work on this!
I am afraid this might create some unexpected behaviour for users who run Trainer with fp16=True or bf16=True (which turns on autocast): by force-setting autocast to false in the forward method of PhiAttention, we silently disable autocast for those users. For example, autocast is crucial for correctly training with PEFT / QLoRA, so this PR might silently break QLoRA / PEFT convergence.
What about educating users on this fix in the documentation instead? Do you think we could reasonably convert this PR into adding a section to the Phi docs that explains how to fix the issue you mentioned in the PR?
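For illustration only (this sketch is not part of the original thread): a minimal standalone example of the autocast-nesting behaviour described above, assuming PyTorch and a CUDA device. An inner `enabled=False` block overrides the mixed-precision context that Trainer sets up with fp16=True / bf16=True, so the affected ops silently run in fp32 again.

```python
import torch

# Assumes a CUDA device; Trainer with fp16=True wraps the forward pass in a
# context equivalent to the outer autocast block below.
linear = torch.nn.Linear(4, 4).cuda()
x = torch.randn(2, 4, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    y_outer = linear(x)  # autocast active: the matmul runs in fp16
    with torch.autocast(device_type="cuda", enabled=False):
        y_inner = linear(x)  # autocast suppressed: back to full fp32

print(y_outer.dtype)  # torch.float16
print(y_inner.dtype)  # torch.float32
```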
Hi @gugarosa, it seems like we are still having the loss issue: #28488 (comment). Update: ignore my comment - apparently, my new installation of transformers didn't include your changes, so the same loss curves are expected. I tried to rerun training with the changes in your PR and training failed:
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Sorry for the extremely delayed response. Sounds good, I will update the documentation.
What does this PR do?
Phi-2 has an attention overflow issue, and since the model weights were released under an MIT license, there is no short-term option to replace them (re-training the model). Therefore, the only solution we could find that covers all corner cases of the overflow is to also disable autocast in the attention layer.
This update follows the current model file we have on the microsoft/phi-2 repository. Additionally, it follows the previous solution we used before the Phi integration. Please let me know if we can think of any different solutions, or if there is anything else we can do.
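As a rough, hypothetical sketch of the approach this PR describes (not the actual diff to modeling_phi.py; the class, attribute names, and sizes below are made up for illustration, and scaled_dot_product_attention assumes PyTorch >= 2.0), the idea is to wrap the attention math in a disabled-autocast block and upcast to fp32, along these lines:

```python
import torch
from torch import nn
import torch.nn.functional as F


class AttentionAutocastDisabledSketch(nn.Module):
    """Toy attention block; names and sizes are illustrative, not Phi's real config."""

    def __init__(self, hidden_size: int = 64, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, _ = hidden_states.shape
        # Force full precision for the attention math, even if the caller
        # (e.g. Trainer with fp16=True) wrapped this forward in autocast.
        # Assumes parameters are kept in fp32, as in the usual autocast setup.
        with torch.autocast(device_type=hidden_states.device.type, enabled=False):
            q, k, v = self.qkv(hidden_states.float()).chunk(3, dim=-1)
            q = q.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
            k = k.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
            v = v.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
            attn_out = F.scaled_dot_product_attention(q, k, v)
            attn_out = attn_out.transpose(1, 2).reshape(bsz, seq_len, -1)
        return self.out_proj(attn_out)
```

The trade-off, per the review comment above, is that any outer autocast context is silently overridden for this block, which is why the discussion moved toward documenting the fix instead.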
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@susnato @ArthurZucker