[T5] enable T5 fp16 #9487
Conversation
@@ -640,6 +640,11 @@ def forward(
        hidden_states, present_key_value_state = self_attention_outputs[:2]
        attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs and relative position weights

        # clamp inf values
        if torch.isinf(hidden_states).any():
            clamp_value = torch.finfo(hidden_states.dtype).max - 1000
why the -1000?
Just to be on the safer side: setting it to the exact max value might again lead to inf values in subsequent layers.
Okay, just noticed that we do the same in Bart as well.
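For illustration only (not part of the PR), a quick check of why leaving a margin below the fp16 max helps; the tensors and constants below are made up:

```python
import torch

# fp16 tops out at ~65504; anything beyond that overflows to inf.
max_fp16 = torch.finfo(torch.float16).max
print(max_fp16)  # 65504.0

x = torch.full((2,), max_fp16, dtype=torch.float16)
print(x * 2)  # tensor([inf, inf], dtype=torch.float16)

# Clamping to (max - 1000) leaves headroom, so a small residual addition
# in a subsequent layer no longer overflows.
clamp_value = max_fp16 - 1000
y = torch.full((2,), clamp_value, dtype=torch.float16)
print(torch.isinf(y + 100.0).any())  # tensor(False)
```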
@@ -640,6 +640,11 @@ def forward(
        hidden_states, present_key_value_state = self_attention_outputs[:2]
        attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs and relative position weights

        # clamp inf values
maybe improve comment slightly:
- # clamp inf values
+ # clamp inf values to enable fp16 training
This is great!
Dear @patil-suraj, can you tell me whether your code should fix fp16 for the google/t5-v1_1-xl model? Update: I ran my code on a Transformers branch built from your current PR #9487 merged with PR #9211, which is needed for DeepSpeed integration.
Hey @exelents, can you include a code snippet to reproduce your error, as well as the full stack trace?
As stated in #9432, this fix works for the following models and versions, with apex
Just did a small experiment with this as well. @exelents, by overflow error do you mean the gradient overflow warning thrown by the loss scaler?
Ah ok, we still see
Here is the error stack:
I'm again trying to locate where exactly in the model this happens. In case it's the same as above (first
I have checked the loss value, and it seems it is not NaN. It takes values like "48.7500" or "40.9688", which are valid values. Despite that, I see messages like "OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024.0, reducing to 512.0", which seems to mean that something bad happened with the model's loss.
Those warnings don't mean anything went wrong; with dynamic loss scaling it's expected that some loss scale values are too big at the beginning of training.
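For intuition, here is a rough sketch of what a dynamic loss scaler does; this is simplified toy logic with made-up names and constants, not apex's or DeepSpeed's actual implementation:

```python
import torch

def dynamic_loss_scale_step(optimizer, params, state):
    """Toy dynamic loss scaling step (illustrative only).

    If any gradient has overflowed to inf/NaN, skip the optimizer step and
    halve the loss scale (hence messages like "Attempted loss scale: 1024.0,
    reducing to 512.0"); after enough clean steps, grow the scale again.
    Assumes gradients have already been unscaled.
    """
    overflow = any(
        p.grad is not None and not torch.isfinite(p.grad).all() for p in params
    )
    if overflow:
        state["scale"] /= 2.0
        state["good_steps"] = 0
        optimizer.zero_grad()
        return False  # step skipped; this is expected early in training
    optimizer.step()
    optimizer.zero_grad()
    state["good_steps"] += 1
    if state["good_steps"] >= state.get("growth_interval", 2000):
        state["scale"] *= 2.0  # try a larger scale again
        state["good_steps"] = 0
    return True
```

Early in training the initial scale is usually too large, so a few skipped steps with halved scales are normal rather than a sign of a broken model.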
LGTM, thanks for fixing this!
Very cool! Thanks for working on this @patil-suraj!
What does this PR do?
This PR enables fp16 for T5 models by clamping the hidden states to the max value of the current data type.
As detailed in #9295, T5 produces large (`inf`) activations at 3 places:
- T5LayerFF
- T5LayerSelfAttention
- T5LayerCrossAttention
To avoid these `inf` activations, this PR clamps the `hidden_states` after the above 3 outputs.
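For reference, a minimal sketch of the clamping trick described above, assuming a PyTorch `hidden_states` tensor; the helper name is made up and this is not the exact code merged in the PR:

```python
import torch

def clamp_inf_hidden_states(hidden_states: torch.Tensor) -> torch.Tensor:
    # If fp16 activations have overflowed to inf, clamp them slightly below
    # the dtype's max so later layers don't keep producing inf/NaN.
    if torch.isinf(hidden_states).any():
        clamp_value = torch.finfo(hidden_states.dtype).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states
```

Applied after the feed-forward and the two attention blocks, this keeps the activations finite while leaving them unchanged in the common (non-overflow) case.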