
[seq2seq] finetune.sh OOMs in fp16 w torch 1.6 on colab #6589

Closed
amanpreet692 opened this issue Aug 19, 2020 · 32 comments · Fixed by #7018

@amanpreet692
Contributor

When trying to fine-tune either T5 or BART for summarization, I was repeatedly hitting OOM with the latest code, whereas it used to work fine for me earlier, at least on Google Colab.
Looking at the startup scripts and the latest commits, I saw that optimizations for native PyTorch fp16 support were added recently. After removing the fp16 parameter from the script, fine-tuning works as expected again.
Could you check whether this is a real issue or just a dangling parameter that needs to be removed?
Thanks

@sshleifer @patil-suraj

@sshleifer
Contributor

sshleifer commented Aug 19, 2020

Try passing --fp16 --fp16_opt_level=O1
That is a relevant default that has changed. I have also run into some torch 1.6 issues, so I would love to know whether that helps.

Semi-relatedly, a good practice is to run

!pip freeze | grep transformers
!pip freeze | grep torch

at the top of the Colab notebook, so that when you come back later you know which versions you were on.
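
Equivalently, if you'd rather keep the check in Python inside the notebook itself, a minimal sketch (it only prints the installed versions; nothing here is specific to finetune.sh):

# Print library and CUDA versions in-notebook so a rerun records what was installed.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)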

@amanpreet692
Contributor Author

Thanks for the quick reply! I tried this, but it didn't work, unfortunately :(
I'm removing the fp16 parameter for now and fine-tuning without it.
I'll keep the Colab advice in mind :)

@setu4993
Contributor

+1 on this. Using fp16 with opt level O1 or O2 causes OOM even at batch size 1. Without fp16, fine-tuning works.

Torch 1.6.0, transformers 3.0.2, Linux, V100 GPU.

@sshleifer
Contributor

sshleifer commented Aug 31, 2020

This is a torch 1.6 issue.
I haven't gotten anything working well with torch 1.6 + fp16.
torch 1.5.1 with apex installed works well for me.

@sshleifer self-assigned this on Aug 31, 2020
@sshleifer changed the title from "CUDA Out of Memory on running finetune.sh for seq2seq" to "[seq2seq] finetune.sh OOMs in fp16 w torch 1.6" on Aug 31, 2020
@setu4993
Contributor

setu4993 commented Aug 31, 2020

I tried running fp16 training with amp_backend=apex and amp_backend=native (passing them as additional args). The latter does much better in terms of power consumption, but memory consumption is the same for both (per the wandb GPU graphs). However, both of them OOM during the validation step. It may have something to do with beam search, since my validation batch size is 1.

[Screenshot: wandb GPU power and memory graphs, Aug 31, 2020]

@sshleifer
Contributor

Can you try torch 1.5.1 ?

@setu4993
Contributor

Succeeds with 1.5.1, and power and temperature are in line with native.

@setu4993
Contributor

However, the process failing during generation on 1.6.0 suggests there is some optimization missing in the generation step, which causes the OOM.

@setu4993
Contributor

Another thing to note that might be related: validation (400 samples) takes 5x as long as one epoch of training (2400 samples). Even accounting for beam size (4x), that is much slower than expected.

@sshleifer
Contributor

Interesting! I would definitely be open to a PR here if you have a fix in mind!

@sshleifer added the "Help wanted" label on Aug 31, 2020
@setu4993
Contributor

setu4993 commented Sep 1, 2020

Thanks! I have a couple of ideas; I'll try them out and create a PR if any of them works.

@setu4993
Contributor

setu4993 commented Sep 1, 2020

I think the problem is that the _generative_step method calls _step inside it, causing two forward passes per validation step. Also, model.generate is inherently slower than an eval forward pass, even with num_beams=1 (about 30-60x slower). But this is a different problem from the OOM issue on 1.6.0. Maybe this should be split into a separate issue?

@setu4993
Contributor

setu4993 commented Sep 1, 2020

The problem is with model.generate, which causes the OOM on PyTorch 1.6. I switched to a custom validation_step that only uses _step and does not call model.generate; it succeeds and is fast. The drawback is that I cannot use beam search during validation, and I keep do_predict set to False so the test step does not execute. All of these are acceptable limitations to me in exchange for faster validation, no OOM during validation, and being able to use native fp16 with PyTorch 1.6.0.

I'm happy to create a PR for it if it makes sense to check it in.
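
For reference, a rough sketch of that idea (not the exact code from #7004): it assumes a SummarizationModule-like class whose _step(batch) returns a tuple of loss tensors named by self.loss_names, and the import path below is only illustrative.

# Sketch of a loss-only validation_step that never calls model.generate.
# Assumes the seq2seq example's SummarizationModule; the import path is
# illustrative and may differ in your checkout.
from finetune import SummarizationModule


class LossOnlyValidationModule(SummarizationModule):
    def validation_step(self, batch, batch_idx):
        # Teacher-forced forward pass only: no beam search, no generation.
        loss_tensors = self._step(batch)
        logs = dict(zip(self.loss_names, loss_tensors))
        return {"loss": loss_tensors[0], "log": logs}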

@sshleifer
Contributor

That PR would be interesting. More interesting would be figuring out why generate OOMs in these conditions.

@setu4993
Contributor

setu4993 commented Sep 8, 2020

The question of why generate OOMs is definitely interesting, but it's one I haven't found an answer to yet. I suggested a workaround in #7004 using the fix I described earlier.

@sshleifer
Contributor

OK, I'm gonna try to fix the underlying issue today/tomorrow, and if I fail, we'll move to your PR.
Thanks!

@sshleifer
Contributor

Does anyone have a snippet that replicates the OOM outside of Colab?
I have no trouble running examples/seq2seq/test_bash_script.py on self-hosted hardware with torch 1.6.

@sshleifer changed the title from "[seq2seq] finetune.sh OOMs in fp16 w torch 1.6" to "[seq2seq] finetune.sh OOMs in fp16 w torch 1.6 on colab" on Sep 8, 2020
@setu4993
Contributor

setu4993 commented Sep 8, 2020

The issue wasn't on Colab but on AWS.

@sshleifer
Contributor

What was your command/hardware?

@setu4993
Contributor

setu4993 commented Sep 8, 2020

Command: python script.py ... with a bunch of args (I have a custom wrapper that inherits SummarizationModule for initialization and adds extra args; I did not modify train/eval/test in it, so it should be identical to running python finetune.py from finetune.sh).
GPU: V100-SXM2-16GB.

@sshleifer linked a pull request on Sep 8, 2020 that will close this issue
@sshleifer
Contributor

sshleifer commented Sep 8, 2020

I can replicate on a V100 with CUDA 10.1, torch 1.6, and Python 3.7.
The problem is that during the first call to generate (during the validation sanity check), the untrained model generates config.max_length tokens, causing the OOM.

The easiest fix is adding --num_sanity_val_steps=0 to your command. Let me know if that works.
The linked PR above lets the user limit how many tokens are generated during validation, which may be independently helpful.
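
For context, --num_sanity_val_steps=0 maps directly onto the underlying PyTorch Lightning Trainer argument. A minimal sketch of the equivalent Trainer construction (finetune.py builds its Trainer from the CLI flags, so this only shows what the flag does, not the script's actual code; the other argument values are illustrative):

# Sketch only: the Trainer arguments behind the suggested flags.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=1,
    precision=16,            # native amp on torch 1.6 when --fp16 is passed
    num_sanity_val_steps=0,  # skip the pre-training validation pass that OOMs here
)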

@setu4993
Contributor

setu4993 commented Sep 8, 2020

Hmm, that makes sense. I'll also note that in the screenshots I attached earlier, the OOM occurred at the end of the first epoch during validation, so setting that new flag should help with that. It is a tricky trade-off to set a max_length for the generate steps that differs from the model's expected output length. I do prefer using a forward pass's output (my PR #7004) as a substitute for the runtime output, since it uses the correct max_length, rather than a shorter output that happens to fit in memory at that time.

However, this still does not explain the average generation time per batch being 30-60x higher (with equal batch sizes for training and validation).

Lastly, the same model does generate without producing an OOM at run time on similar hardware, producing up to max_length tokens, which continues to baffle me.

@sshleifer
Contributor

I don't have any slowdown after the validation sanity check in my replication. Maybe I haven't found your bug.

@sshleifer
Contributor

I don't understand

Lastly, the same model does generate without producing an OOM at run-time on similar hardware with the model producing upto max_length, which continues to baffle me.

Do you have a snippet that does not involve finetune.py (just calls generate) that OOMs/is way slower?

@setu4993
Contributor

setu4993 commented Sep 8, 2020

I don't have any slowdown after the validation sanity check in my replication. Maybe I haven't found your bug.

Or maybe it got resolved between when I tested it and this version. No worries.

Do you have a snippet that does not involve finetune.py (just calls generate) that OOMs/is way slower?

This might not matter anymore if the previous issue is fixed during training, since this one is specifically at runtime. Regardless, here's an example I have that is much slower.

%%timeit
with torch.no_grad():
    generated_ids = model.generate(
        tokenized_input["input_ids"].to("cuda"),
        skip_special_tokens=True, clean_up_tokenization_spaces=False,
        num_beams=3, top_p=0.9, repetition_penalty=10,
        decoder_start_token_id=model.config.decoder_start_token_id,
        max_length=model.config.max_length)

which produces:
10.4 s ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@setu4993
Contributor

setu4993 commented Sep 8, 2020

The sequence produced is 311 tokens long. While the output sequence is long (the maximum possible is 768), 10 seconds is still quite a lot.

@sshleifer
Contributor

Can you send a full working example that I can copy-paste and try in different torch versions?

@setu4993
Contributor

setu4993 commented Sep 9, 2020

Sure! I'm using a fine-tuned model and a custom dataset, so in the snippet below I changed the model to bart-large and removed the lines where the dataset is queried. Everything else is the same.

import torch
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
model = model.to("cuda")
model = model.eval()

tokenized_input = tokenizer(..., return_tensors="pt", max_length=model.config.max_position_embeddings)
with torch.no_grad():
    generated_ids = model.generate(
        tokenized_input["input_ids"].to("cuda"),
        skip_special_tokens=True, clean_up_tokenization_spaces=False,
        num_beams=3, top_p=0.9, repetition_penalty=10,
        decoder_start_token_id=model.config.decoder_start_token_id,
        max_length=model.config.max_length)

I'm running this in a notebook so I can time profile the generate step.

@setu4993
Contributor

I've spent quite some time today trying various combinations of generate. The increased delay comes from a large num_beams count, which also leads the model to produce longer outputs, compounding the generate time (roughly num_beams * max_length).

In conclusion, it doesn't appear to be a bug, but rather a property of generate being more memory intensive.
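
To make that concrete, a rough timing sketch (the model, input text, and any numbers printed are placeholders that depend entirely on your hardware and inputs; this is not data from this thread):

# Sketch: time generate() at several beam counts; cost grows with num_beams
# and with the length actually generated.
import time
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").to("cuda").eval()
input_ids = tokenizer(["some long article ..."], return_tensors="pt", truncation=True)["input_ids"].to("cuda")

for num_beams in (1, 3, 6):
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        out = model.generate(input_ids, num_beams=num_beams, max_length=model.config.max_length)
    torch.cuda.synchronize()
    print(f"num_beams={num_beams}: generated {out.shape[-1]} tokens in {time.time() - start:.2f}s")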

@vikigenius

@sshleifer, I have the same issue, and I am using the latest version that includes the PR you provided. I set eval_max_gen_length to 30 and am still getting OOM during the sanity check. Do I also have to set num_sanity_val_steps=0?

@setu4993
Contributor

@vikigenius: I don't think setting num_sanity_val_steps to 0 will solve it, since it will only delay what's bound to happen during the later validation step.

@sshleifer
Contributor

sshleifer commented Sep 17, 2020

  • --num_sanity_val_steps=0 fixes it for me.
  • I only have a problem in that sanity-check phase, when the model is untrained. Later calls to generate (without a very high config.min_length/eval_max_gen_length) don't OOM for me.
  • --eval_num_beams=2 may also help save memory.
  • I'd love to see somebody isolate the OOMing call to generate outside of the training logic so that I can reproduce it (a rough starting point is sketched below).
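
In case it helps anyone attempt that isolation, here is a rough, unconfirmed starting point: a randomly initialized BART (as in the sanity check) rarely emits EOS, so beam search should decode all the way to max_length. Batch size, input text, and lengths are placeholders, and this plain fp32 sketch does not replicate the thread's --fp16 / native amp setup.

# Unconfirmed sketch of an isolated reproduction attempt: random weights force
# beam search to run to max_length, mimicking the validation sanity check.
import torch
from transformers import BartConfig, BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
config = BartConfig.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration(config).to("cuda").eval()  # random, untrained weights

batch = tokenizer(["lorem ipsum " * 400] * 8, return_tensors="pt",
                  truncation=True, padding=True, max_length=1024)
with torch.no_grad():
    model.generate(batch["input_ids"].to("cuda"), num_beams=4,
                   max_length=config.max_position_embeddings)  # 1024 for BART, standing in for config.max_length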
