
Fixes default value of softmax_scale in PhiFlashAttention2. #28537

Merged: 2 commits merged into huggingface:main on Jan 17, 2024

Conversation

gugarosa (Contributor) commented on Jan 16, 2024:

What does this PR do?

  • Phi has never used softmax_scale=1.0 with Flash-Attention, so the default is being moved to None. This tentatively fixes the issues reported when fine-tuning Phi-based checkpoints with Flash-Attention 2 turned on (see the sketch below this description).

  • Documentation is also updated to reflect the official Phi checkpoints.

Fixes #28488 (tentative)
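For context, a minimal sketch (not part of the PR) of loading a Phi checkpoint with Flash-Attention 2 enabled, which is the setup this fix targets. It assumes a recent transformers release that supports the attn_implementation argument and an installed flash-attn package; the checkpoint id is illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "phi-2"  # illustrative id; use the official Phi checkpoint from the Hub

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,                # flash-attention requires fp16/bf16
    attn_implementation="flash_attention_2",  # routes attention through PhiFlashAttention2
)

With this PR, the flash-attention path no longer overrides the softmax scale, so fine-tuning through it should behave like the non-flash path.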

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @susnato

Comment on lines -87 to +88
- >>> model = AutoModelForCausalLM.from_pretrained("susnato/phi-2")
- >>> tokenizer = AutoTokenizer.from_pretrained("susnato/phi-2")
+ >>> model = AutoModelForCausalLM.from_pretrained("phi-2")
+ >>> tokenizer = AutoTokenizer.from_pretrained("phi-2")
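For reference, a short doctest-style sketch of how the updated example would run end-to-end. The checkpoint id is copied from the diff above; the official Hub id may include an organization prefix, and the prompt is illustrative.

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model = AutoModelForCausalLM.from_pretrained("phi-2")      # id as in the doc diff above
>>> tokenizer = AutoTokenizer.from_pretrained("phi-2")
>>> inputs = tokenizer("def print_prime(n):", return_tensors="pt")
>>> outputs = model.generate(**inputs, max_new_tokens=30)
>>> print(tokenizer.batch_decode(outputs)[0])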

Contributor:
Thanks for changing these checkpoints. 🙌
I was about to open a PR to change these

gugarosa (Contributor, author):

No problems!

Comment on lines 508 to 510
  attn_output = self._flash_attention_forward(
-     query_states, key_states, value_states, attention_mask, q_len, dropout=attn_dropout, softmax_scale=1.0
+     query_states, key_states, value_states, attention_mask, q_len, dropout=attn_dropout, softmax_scale=None
  )

Contributor:

Ah, was this the reason for the issue regarding fine-tuning?
Now I am curious how the FA tests were passing before...

Anyway thanks a lot for fixing this!

gugarosa (Contributor, author):

I hope it is fixed; at least, I am now able to see the same fine-tuning loss with and without flash-attention.

We pre-trained the Phi models using 1 / sqrt(head_dim) as the softmax scale, and flash-attention uses that very same value when softmax_scale=None.
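To illustrate the point numerically, a rough pure-PyTorch reference of the same math (not the fused flash-attention kernel): with softmax_scale=None the scale falls back to 1/sqrt(head_dim), matching Phi's pre-training, whereas 1.0 skips the scaling entirely and yields different attention outputs.

import math
import torch

def reference_attention(q, k, v, softmax_scale=None):
    # Mirror the documented flash-attention default: None means 1/sqrt(head_dim).
    if softmax_scale is None:
        softmax_scale = 1.0 / math.sqrt(q.size(-1))
    scores = torch.matmul(q, k.transpose(-2, -1)) * softmax_scale
    return torch.matmul(torch.softmax(scores, dim=-1), v)

q = torch.randn(1, 8, 16, 64)  # (batch, heads, seq_len, head_dim); shapes are illustrative
k, v = torch.randn_like(q), torch.randn_like(q)

out_scaled = reference_attention(q, k, v, softmax_scale=None)   # new default: 1/sqrt(64)
out_unscaled = reference_attention(q, k, v, softmax_scale=1.0)  # old hard-coded value
print(torch.allclose(out_scaled, out_unscaled))  # False: the mismatch seen at fine-tuning time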

Contributor:

BTW, just to be sure, could you please run all the flash attention tests (for Phi) to check whether they pass?

RUN_SLOW=1 pytest -m flash_attn_test tests/models/phi --verbose

It should not be necessary since this already fixes the fine-tuning issue, but just to be sure.

gugarosa (Contributor, author):

Just ran and everything has passed!

Collaborator:

Yep, our CIs don't test flash attention, bit of a pity!

gugarosa marked this pull request as ready for review on January 16, 2024, 18:37.

gugarosa (Contributor, author) commented on Jan 16, 2024:

The loss=0.0 error while fine-tuning with FP16 is another issue and I do have an ugly fix, but will look into it with more patience (and use a separate PR).

ArthurZucker (Collaborator) left a comment:

Thanks a lot for this fix! Very tricky indeed


ArthurZucker merged commit d93ef7d into huggingface:main on Jan 17, 2024. 19 checks passed.
gugarosa (Contributor, author):
No problems! Thanks for the merge!

gugarosa deleted the fix-phi-tune branch on January 17, 2024, 13:25.
younesbelkada (Contributor):
Thanks very much @gugarosa for the deep dive and the fix!

HuggingFaceDocBuilderDev:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

wgifford pushed a commit to wgifford/transformers that referenced this pull request on Jan 21, 2024:
…ingface#28537)

* fix(phi): Phi does not use softmax_scale in Flash-Attention.

* chore(docs): Update Phi docs.

AjayP13 pushed a commit to AjayP13/transformers that referenced this pull request on Jan 22, 2024:
…ingface#28537)

* fix(phi): Phi does not use softmax_scale in Flash-Attention.

* chore(docs): Update Phi docs.
Successfully merging this pull request may close these issues.

fine tuning the updated Phi-2 with flash-attn-2 produces very high loss > 2