T5-large FP16 produces nan in loss #11461

raviskolli · 2021-04-27T00:14:01Z

Environment info

transformers version: 4.6.0.dev0, commit hash: 5e04d70
Platform: Linux-4.15.0-1071-azure-x86_64-with-debian-buster-sid
Python version: 3.7.3
PyTorch version (GPU?): 1.8.1+cu111 (True)
Tensorflow version (GPU?): not installed (NA)
Using GPU in script?: True
Using distributed or parallel set-up in script?: False

Who can help

t5: @patrickvonplaten, @patil-suraj

Information

Model I am using (Bert, XLNet ...): t5-large

The problem arises when using:

the official example scripts: (give details below)

The tasks I am working on is:

an official GLUE/SQUaD task: (give the name)

To reproduce

Steps to reproduce the behavior:

cd examples/seq2seq
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=../../src USE_TF=0 ./run_translation.py
--model_name_or_path t5-large
--do_train --source_lang en --target_lang ro
--source_prefix "translate English to Romanian: "
--dataset_name wmt16 --dataset_config "ro-en"
--output_dir /tmp/tst-translation
--per_device_train_batch_size 4
--overwrite_output_dir
--predict_with_generate
--num_train_epochs 1 --fp16

Expected behavior

FP16 mode shouldn't produce nan in loss.

The text was updated successfully, but these errors were encountered:

raviskolli · 2021-04-27T00:27:02Z

I see nans creeping in at the T5Attention in decoder. I didn't find any inf or nan in either hidden_states or key_value_states but the computed values of both key_states and value_states have nan's

stas00 · 2021-04-27T03:59:04Z

FP16 mode shouldn't produce nan in loss.

Why do you believe this to be the case? This model was trained in bf16, which has a totally different numerical range from fp16. So it shouldn't produce NaNs under bf16 or fp32, but under fp16 it's almost guaranteed to not work. Please see: https://discuss.huggingface.co/t/mixed-precision-for-bfloat16-pretrained-models/5315

That's said, please try this branch #10956 that tries to use a workaround for AMP. Some users reported success. One user reported problems.

And you can also try the new over/underflow detector: #11274 if you want to get more precise info on where the problem emerges first. Just add --debug activation_overflow to the trainer command line and it will bail with the traces of the last frames as soon as nan or inf is encountered. I am reworking this tool to provide more info, and need to revamp the interface, but it's mostly done.

raviskolli · 2021-04-28T07:09:01Z

Thank you for the pointers to the discussion. Is it just finetuning or do you expect inference to be unstable as well in fp16 mode?

debug_activation_overflow looks like a great tool that can be useful in identifying the source of nans. I'll give #10956 a try and see if it helps with my runs.

stas00 · 2021-04-28T21:57:16Z

Is it just finetuning or do you expect inference to be unstable as well in fp16 mode?

There are less moving parts during inference. But more or less expect the same problems.

So the workaround is to identify where under/overflow happens and force the model to perform those ops in fp32 and then convert back to fp16.

In fact with finetuning if you don't have the problem happening right away like it does with mt5, you could try to stir the model into the fp16 range by punishing large activations. Please see the proposed loss calculation extra: #10956 (comment) (it in fact comes from the original t5 implementation but for some reason wasn't implemented in that ported model in transformers).

github-actions · 2021-05-27T15:09:35Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Jun 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

T5-large FP16 produces nan in loss #11461

T5-large FP16 produces nan in loss #11461

raviskolli commented Apr 27, 2021

raviskolli commented Apr 27, 2021

stas00 commented Apr 27, 2021 •

edited

Loading

raviskolli commented Apr 28, 2021

stas00 commented Apr 28, 2021 •

edited

Loading

github-actions bot commented May 27, 2021

T5-large FP16 produces nan in loss #11461

T5-large FP16 produces nan in loss #11461

Comments

raviskolli commented Apr 27, 2021

Environment info

Who can help

Information

To reproduce

Expected behavior

raviskolli commented Apr 27, 2021

stas00 commented Apr 27, 2021 • edited Loading

raviskolli commented Apr 28, 2021

stas00 commented Apr 28, 2021 • edited Loading

github-actions bot commented May 27, 2021

stas00 commented Apr 27, 2021 •

edited

Loading

stas00 commented Apr 28, 2021 •

edited

Loading