T5 fp16 overflow in forward (T5DenseReluDense) #5651

Closed
2 of 4 tasks
lior1990 opened this issue Jul 10, 2020 · 1 comment
Comments

@lior1990

🐛 Bug

Using AutoModelWithLMHead.from_pretrained("t5-base") for fine-tuning, after 34 iterations I get nan loss from the forward method.
After debugging, I found that the source of the nan is an overflow that happens in T5DenseReluDense when running h = self.wo(h). The result of this forward pass is a tensor containing inf in one of its values, which later causes the nan loss.
I recomputed this operation in fp32 and saw that the inf is caused by a value of 66246.3906, which exceeds the fp16 maximum of 65504.
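The overflow itself is easy to reproduce in isolation. A minimal sketch using NumPy's float16 (which follows the same IEEE half-precision format as PyTorch's fp16; the value 66246.3906 is taken from the report above):

```python
import numpy as np

# fp16 can represent magnitudes only up to 65504; anything meaningfully
# larger overflows to inf when cast or computed in half precision.
fp16_max = np.finfo(np.float16).max  # 65504.0
overflowed = np.float16(66246.3906)  # the value observed in self.wo(h)

print(fp16_max)              # 65504.0
print(overflowed)            # inf
print(np.isinf(overflowed))  # True
```

Once a single inf appears in the activations, subsequent operations (e.g. the loss computation) propagate it into nan.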

This issue only happens with fp16 (opt_level="O1"); with opt_level="O0" everything is fine.
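One common mitigation, shown here purely as an illustration (it is not the transformers fix, and the function name and toy shapes are hypothetical), is to run the offending projection in fp32 and clamp the result into the representable fp16 range before casting back, sketched with NumPy:

```python
import numpy as np

def matmul_fp16_safe(h, wo_weight):
    """Compute h @ wo_weight.T in fp32, clamp into the fp16 range,
    then cast back to fp16 so downstream ops never see inf.

    h, wo_weight: np.float16 arrays (hypothetical stand-ins for the
    activation h and self.wo.weight in T5DenseReluDense).
    """
    out = h.astype(np.float32) @ wo_weight.astype(np.float32).T
    limit = float(np.finfo(np.float16).max)  # 65504.0
    out = np.clip(out, -limit, limit)
    return out.astype(np.float16)

# Toy example: each output entry is 4 * 300 * 60 = 72000, which would
# overflow a pure-fp16 matmul, but stays finite after clamping.
h = np.full((1, 4), 300.0, dtype=np.float16)
w = np.full((2, 4), 60.0, dtype=np.float16)
out = matmul_fp16_safe(h, w)
print(out)                  # clamped to 65504 instead of inf
print(np.isinf(out).any())  # False
```

The same idea applies in PyTorch with torch.clamp; the trade-off is that clamped activations are saturated rather than exact, which merely hides the dynamic-range problem rather than solving it.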

Information

Model I am using (Bert, XLNet ...): T5

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on are:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

I don't have step-by-step instructions, because I would need to upload my entire dataset for that.
I have a pickle of the vector h and the weights of self.wo that cause the overflow in T5DenseReluDense; I can upload it if it might help.

Expected behavior

Get a finite (non-nan) loss from the forward pass.

Environment info

  • transformers version: 3.0.2
  • Platform: Linux-5.3.0-1030-aws-x86_64-with-debian-buster-sid
  • Python version: 3.6.10
  • PyTorch version (GPU?): 1.5.1 (True)
  • Tensorflow version (GPU?): not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no
@patrickvonplaten
Contributor

See: #4586
