Phi-2 requires a disabled autocast in attention layer #28673
Conversation
Thanks for the PR! We are not super super fans of context managers for such things. TBH it's not that bad! cc @amyeroberts what's your take?
Thanks for adding this fix @gugarosa! I don't mind this too much, it's pretty clean and simple :) Let's get @younesbelkada's opinion on whether this will break any other assumptions about weight loading in the library and possible alternatives.
No problem, thanks everyone for looking at it! Hopefully this is a one-time behavior and we will never see it again in future models 🙏
Hi @gugarosa - thanks a lot for your work on this!
I am afraid this might create some unexpected behaviour for users who run Trainer with fp16=True or bf16=True (which turns on autocast): by force-setting autocast to false in the forward method of PhiAttention, we silently disable autocast for those users. For example, autocast is crucial for correctly training with PEFT / QLoRA, so this PR might silently break QLoRA / PEFT convergence.
What about educating users on this fix in the documentation instead? Do you think we could reasonably convert this PR into adding a section to the Phi docs that explains how to fix the issue you mentioned in the PR?
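For illustration only (this sketch is not part of the original thread): a minimal standalone example of the autocast-nesting behaviour described above, assuming PyTorch and a CUDA device. An inner `enabled=False` block overrides the mixed-precision context that Trainer sets up with fp16=True / bf16=True, so the affected ops silently run in fp32 again.

```python
import torch

# Assumes a CUDA device; Trainer with fp16=True wraps the forward pass in a
# context equivalent to the outer autocast block below.
linear = torch.nn.Linear(4, 4).cuda()
x = torch.randn(2, 4, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    y_outer = linear(x)  # autocast active: the matmul runs in fp16
    with torch.autocast(device_type="cuda", enabled=False):
        y_inner = linear(x)  # autocast suppressed: back to full fp32

print(y_outer.dtype)  # torch.float16
print(y_inner.dtype)  # torch.float32
```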
Hi @gugarosa, it seems like we are still having the loss issue: #28488 (comment). Update: ignore my comment - apparently, my new installation of transformers didn't include your changes, so the same loss curves are expected. I tried to rerun training with the changes in your PR and training failed:
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Sorry for the extremely delayed response. Sounds good, I will update the documentation.
What does this PR do?
Phi-2 has an attention overflow issue, and since the model weights were released under an MIT license, there is no short-term option to replace them (re-training the model). Therefore, the only solution we could find that covers all corner cases of the overflow is to also disable autocast in the attention layer.
This update follows the current model file we have on the microsoft/phi-2 repository. Additionally, it follows the previous solution we used before the Phi integration. Please let me know if we can think of any different solutions, or if there is anything else we can do.
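As a rough, hypothetical sketch of the approach this PR describes (not the actual diff to modeling_phi.py; the class, attribute names, and sizes below are made up for illustration, and scaled_dot_product_attention assumes PyTorch >= 2.0), the idea is to wrap the attention math in a disabled-autocast block and upcast to fp32, along these lines:

```python
import torch
from torch import nn
import torch.nn.functional as F


class AttentionAutocastDisabledSketch(nn.Module):
    """Toy attention block; names and sizes are illustrative, not Phi's real config."""

    def __init__(self, hidden_size: int = 64, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, _ = hidden_states.shape
        # Force full precision for the attention math, even if the caller
        # (e.g. Trainer with fp16=True) wrapped this forward in autocast.
        # Assumes parameters are kept in fp32, as in the usual autocast setup.
        with torch.autocast(device_type=hidden_states.device.type, enabled=False):
            q, k, v = self.qkv(hidden_states.float()).chunk(3, dim=-1)
            q = q.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
            k = k.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
            v = v.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
            attn_out = F.scaled_dot_product_attention(q, k, v)
            attn_out = attn_out.transpose(1, 2).reshape(bsz, seq_len, -1)
        return self.out_proj(attn_out)
```

The trade-off, per the review comment above, is that any outer autocast context is silently overridden for this block, which is why the discussion moved toward documenting the fix instead.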
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@susnato @ArthurZucker