
Phi-2 requires a disabled autocast in attention layer #28673

Closed
wants to merge 2 commits into from

Conversation

@gugarosa (Contributor) commented Jan 23, 2024

What does this PR do?

Phi-2 has an attention overflow issue, and since the model weights were released under an MIT license, there is no short-term option of replacing them (that would require re-training the model). Therefore, the only solution we could find that covers all corner cases of the overflow is to also disable autocast in the attention layer.
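For context, a minimal illustration of the overflow (not from the PR; shapes and values are made up):

```python
import torch

# fp16 saturates around 65504, so a large query-key dot product
# overflows to inf before softmax even though each product fits.
q = torch.full((128,), 60.0, dtype=torch.float16)
k = torch.full((128,), 60.0, dtype=torch.float16)
score = (q * k).sum()  # 128 * 3600 = 460800 > 65504
print(score)           # tensor(inf, dtype=torch.float16)
```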

This update follows the current model file we have in the microsoft/phi-2 repository. It also follows the solution we used previously, before the Phi integration.
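A sketch of the kind of change this implies (hypothetical code, not the exact diff; the decorator form mirrors the `decorate_autocast` frames visible in the traceback further down this thread):

```python
import torch
from torch import nn

class PhiAttentionSketch(nn.Module):
    """Illustrative only: force-disable autocast around the attention projections."""

    def __init__(self, hidden_size: int = 2560):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, hidden_size)

    @torch.autocast("cuda", enabled=False)
    @torch.autocast("cpu", enabled=False)
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Inside this region, matmuls run in the weights' dtype (fp32 by
        # default), avoiding the fp16 overflow in the attention scores.
        return self.q_proj(hidden_states)
```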

Please let me know if we can think of any alternative solutions, or if there is anything else we can do.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@susnato @ArthurZucker

@gugarosa marked this pull request as ready for review on January 23, 2024 at 22:14
@ArthurZucker (Collaborator)

Thanks for the PR! We are not big fans of context managers for such things, though TBH it's not that bad! cc @amyeroberts, what's your take?

@amyeroberts (Collaborator)

Thanks for adding this fix @gugarosa!

I don't mind this too much, it's pretty clean and simple :) Let's get @younesbelkada's opinion on whether this will break any other assumptions about weight loading in the library, and on possible alternatives.

@gugarosa (Contributor, Author)

No problem, and thanks everyone for looking at it! Hopefully this is a one-time behavior and we will never see it again in future models 🙏

@younesbelkada (Contributor) left a comment

Hi @gugarosa - thanks a lot for your work on this!
I am afraid this might create some unexpected behaviour for users who run Trainer with fp16=True or bf16=True (which turns on autocast): by force-setting autocast to disabled in the forward method of PhiAttention, we silently turn it off for those users.
E.g., autocast is crucial for correctly training with PEFT / QLoRA, so this PR might silently break QLoRA / PEFT convergence.
What about educating users about this fix in the documentation instead? Do you think we could reasonably convert this PR into adding a section to the Phi docs that explains how to fix the issue you mentioned in the PR description?
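To make the concern concrete, a small hedged illustration (CPU/bf16 chosen here only for portability; the outer context stands in for what Trainer's fp16=True/bf16=True enables):

```python
import torch

with torch.autocast("cpu", dtype=torch.bfloat16):   # what Trainer turns on
    a = torch.randn(4, 4)
    print((a @ a).dtype)  # torch.bfloat16: mixed precision is active

    with torch.autocast("cpu", enabled=False):      # what the PR forces
        b = torch.randn(4, 4)
        print((b @ b).dtype)  # torch.float32: silently full precision again
```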

@akjindal53244 commented Feb 13, 2024

Hi @gugarosa, it seems like we are still having the loss issue: #28488 (comment)

Update: ignore my comment above. Apparently, my new installation of transformers didn't include your changes, so the same loss curves are expected. I then reran training with the changes from your PR, and training failed:

  File "/home/minimalist/miniconda3/envs/axolotl_Feb12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/minimalist/miniconda3/envs/axolotl_Feb12/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/minimalist/miniconda3/envs/axolotl_Feb12/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/minimalist/work/projects/transformers/src/transformers/models/phi/modeling_phi.py", line 318, in forward
    query_states = self.q_proj(hidden_states)
  File "/home/minimalist/miniconda3/envs/axolotl_Feb12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/minimalist/miniconda3/envs/axolotl_Feb12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/minimalist/miniconda3/envs/axolotl_Feb12/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16
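For reference, this dtype mismatch can be reproduced outside the model: an outer autocast yields low-precision activations, while the disabled-autocast attention region runs projections whose weights are in a different dtype, and F.linear refuses mixed inputs. A minimal reproduction sketch (an assumption about the failure mode, not the PR code), including one possible cast-based workaround:

```python
import torch
import torch.nn.functional as F

proj = torch.nn.Linear(8, 8)  # fp32 weights, standing in for q_proj

x = torch.randn(2, 8)
with torch.autocast("cpu", dtype=torch.bfloat16):
    h = F.linear(x, torch.randn(8, 8))  # autocast makes h bfloat16
    with torch.autocast("cpu", enabled=False):
        # proj(h) here raises RuntimeError: mat1 and mat2 must have the
        # same dtype (BFloat16 vs Float). Casting first avoids it:
        out = proj(h.to(proj.weight.dtype))
print(out.dtype)  # torch.float32
```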

@hackyon mentioned this pull request on Feb 19, 2024

github-actions bot commented Mar 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@gugarosa (Contributor, Author)

Sorry for the extremely delayed response.

Sounds good, I will update the documentation.

@gugarosa closed this on Mar 13, 2024