T5 fp16 forward yields nan #4287

Closed · binshengliu opened this issue May 11, 2020 · 6 comments · Fixed by #4436
binshengliu commented May 11, 2020

🐛 Bug

Information

Model I am using (Bert, XLNet ...): T5

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on are:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

I use pytorch-lightning to manage fp16. Here is a minimal example that reproduces the problem.

from transformers import T5Model, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5Model.from_pretrained("t5-base").cuda().half()
text = "hello world!"
inputs = tokenizer.encode(text, return_tensors="pt").cuda()
out = model(input_ids=inputs, decoder_input_ids=inputs)
print(out[0][:, :, :10])

output:

tensor([[[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]]], device='cuda:0',
       dtype=torch.float16, grad_fn=<SliceBackward>)

Expected behavior

Get non-nan values.

Environment info

  • transformers version: 2.9.0
  • Platform: Linux-4.15.0-88-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No
patrickvonplaten self-assigned this May 11, 2020
patrickvonplaten linked a pull request May 18, 2020 that will close this issue
patrickvonplaten (Contributor) commented

Thanks for the detailed error description @binshengliu! I linked a PR that should fix it :-)

binshengliu (Author) commented

Hi, there is still a chance of getting nan values during training. It does not always happen, but it is still pretty easy to reproduce. My current transformers version is 2.11.0.

import torch
from transformers import T5Model, T5Tokenizer

torch.manual_seed(0)
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5Model.from_pretrained("t5-base").cuda().half()
text = "hello world!"
model.train()
inputs = tokenizer.encode(text, return_tensors="pt").cuda()
for idx in range(1000):
    out = model(input_ids=inputs, decoder_input_ids=inputs)
    if torch.isnan(out[0]).any():
        print(idx)
        print(out[0][:, :, :10])
        exit()

Output:

143
tensor([[[nan, nan, nan, nan, nan, nan, nan, nan, nan, 0.],
         [nan, nan, nan, 0., nan, nan, nan, nan, nan, 0.],
         [nan, nan, 0., nan, nan, 0., 0., 0., nan, nan]]], device='cuda:0',
       dtype=torch.float16, grad_fn=<SliceBackward>)

binshengliu (Author) commented

Sorry, I just noticed you commented in #4586 (comment). I can confirm that the issue I encountered has the same cause.

yeliu918 commented

Hi, I have the same problem. I get NaN when using fp16, and when I set fp16=False, the NaN problem goes away.
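
For reference, a minimal fp32 variant of the repro above (just a sketch; it simply drops the .half() cast, which is what fp16=False amounts to for this snippet) should give finite values:

import torch
from transformers import T5Model, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
# No .half() here: weights and activations stay in fp32, which avoids the fp16 overflow.
model = T5Model.from_pretrained("t5-base").cuda()
inputs = tokenizer.encode("hello world!", return_tensors="pt").cuda()
out = model(input_ids=inputs, decoder_input_ids=inputs)
print(torch.isnan(out[0]).any())  # expected: tensor(False, device='cuda:0')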

patrickvonplaten (Contributor) commented

Yeah, that's still an open problem... not sure how easy it will be to solve it, see: #4586 (comment)
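
One mitigation worth noting (a rough sketch only; it is not verified in this thread to avoid the T5 fp16 overflow entirely) is to keep the weights in fp32 and use torch.cuda.amp.autocast (available from PyTorch 1.6) instead of casting the whole model with .half(); autocast runs only selected ops in half precision and keeps numerically sensitive ones in fp32:

import torch
from transformers import T5Model, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5Model.from_pretrained("t5-base").cuda()  # fp32 weights, no .half()
inputs = tokenizer.encode("hello world!", return_tensors="pt").cuda()

# autocast casts matmuls and similar ops to fp16 on the fly while leaving
# reductions and other sensitive ops in fp32, unlike a blanket model.half().
with torch.cuda.amp.autocast():
    out = model(input_ids=inputs, decoder_input_ids=inputs)
print(out[0][:, :, :10])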

SamsTheGreatest commented

Same when fine-tuning GPT Neo.
