
[seq2seq] finetune.sh OOMs in fp16 w torch 1.6 on colab #6589

Closed
amanpreet692 opened this issue Aug 19, 2020 · 32 comments · Fixed by #7018

@amanpreet692
Contributor

When trying to fine-tune either T5 or BART for summarization, I was repeatedly hitting OOM with the latest code, whereas it used to work fine for me earlier, at least on Google Colab.
Looking at the startup scripts and the latest commits, I saw that optimizations for native PyTorch fp16 support were added recently. After removing the fp16 parameter from the script, fine-tuning works as expected again.
Could you check whether this is a real issue or just a dangling parameter that needs to be removed?
Thanks

@sshleifer @patil-suraj

@sshleifer
Contributor

sshleifer commented Aug 19, 2020

Try passing --fp16 --fp16_opt_level=O1
That is a relevant default that has changed. I have also run into some torch 1.6 issues, so I would love to know whether that helps.

Semi-relatedly, a good practice is to run

!pip freeze | grep transformers
!pip freeze | grep torch

at the top of the Colab notebook, so that when you come back later you know which versions you were on.
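
Equivalently, if you'd rather keep the check in Python inside the notebook itself, a minimal sketch (it only prints the installed versions; nothing here is specific to finetune.sh):

# Print library and CUDA versions in-notebook so a rerun records what was installed.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)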

@amanpreet692
Contributor Author

Thanks for the quick reply! I tried this, but it didn't work, unfortunately :(
I'm removing the fp16 parameter for now and fine-tuning without it.
I'll keep the Colab advice in mind :)

@setu4993
Contributor

+1 on this. Using fp16 with opt level O1 or O2 causes OOM even at batch size 1. Without fp16, fine-tuning works.

Torch 1.6.0, transformers 3.0.2, Linux, V100 GPU.

@sshleifer
Contributor

sshleifer commented Aug 31, 2020

This is a torch 1.6 issue.
I haven't gotten anything working well with torch 1.6 + fp16.
torch 1.5.1 with apex installed works well for me.

@sshleifer self-assigned this on Aug 31, 2020
@sshleifer changed the title from "CUDA Out of Memory on running finetune.sh for seq2seq" to "[seq2seq] finetune.sh OOMs in fp16 w torch 1.6" on Aug 31, 2020
@setu4993
Contributor

setu4993 commented Aug 31, 2020

I tried running fp16 training with amp_backend=apex and amp_backend=native (passing them as additional args). The latter does much better in terms of power consumption, but memory consumption is the same for both (per the wandb GPU graphs). However, both of them OOM during the validation step. It may have something to do with beam search, since my validation batch size is 1.

[Screenshot: wandb GPU power and memory graphs, Aug 31, 2020]

@sshleifer
Contributor

Can you try torch 1.5.1 ?

@setu4993
Contributor

Succeeds with 1.5.1, and power and temperature are in line with native.

@setu4993
Contributor

However, the process failing during generation on 1.6.0 suggests there is some optimization missing in the generation step, which causes the OOM.

@setu4993
Contributor

Another thing to note that might be related: validation (400 samples) takes 5x as long as one epoch of training (2400 samples). Even accounting for beam size (4x), that is much slower than expected.

@sshleifer
Contributor

Interesting! I would definitely be open to a PR here if you have a fix in mind!

@sshleifer added the "Help wanted" label on Aug 31, 2020
@setu4993
Contributor

setu4993 commented Sep 1, 2020

Thanks! I have a couple of ideas; I'll try them out and create a PR if any of them works.

@setu4993
Contributor

setu4993 commented Sep 1, 2020

I think the problem is that the _generative_step method calls _step inside it, causing two forward passes per validation step. Also, model.generate is inherently slower than an eval forward pass, even with num_beams=1 (about 30-60x slower). But this is a different problem from the OOM issue on 1.6.0. Maybe this should be split into a separate issue?

@setu4993
Contributor

setu4993 commented Sep 1, 2020

The problem is with model.generate, which causes the OOM on PyTorch 1.6. I switched to a custom validation_step that only uses _step and does not call model.generate; it succeeds and is fast. The drawback is that I cannot use beam search during validation, and I keep do_predict set to False so the test step does not execute. All of these are acceptable limitations to me in exchange for faster validation, no OOM during validation, and being able to use native fp16 with PyTorch 1.6.0.

I'm happy to create a PR for it if it makes sense to check it in.
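
For reference, a rough sketch of that idea (not the exact code from #7004): it assumes a SummarizationModule-like class whose _step(batch) returns a tuple of loss tensors named by self.loss_names, and the import path below is only illustrative.

# Sketch of a loss-only validation_step that never calls model.generate.
# Assumes the seq2seq example's SummarizationModule; the import path is
# illustrative and may differ in your checkout.
from finetune import SummarizationModule


class LossOnlyValidationModule(SummarizationModule):
    def validation_step(self, batch, batch_idx):
        # Teacher-forced forward pass only: no beam search, no generation.
        loss_tensors = self._step(batch)
        logs = dict(zip(self.loss_names, loss_tensors))
        return {"loss": loss_tensors[0], "log": logs}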

@sshleifer
Contributor

That PR would be interesting. More interesting would be figuring out why generate OOMs in these conditions.

@setu4993
Contributor

setu4993 commented Sep 8, 2020

The question of why generate OOMs is definitely interesting, but it's one I haven't found an answer to yet. I suggested a workaround in #7004 using the fix I described earlier.

@sshleifer
Contributor

OK, I'm gonna try to fix the underlying issue today/tomorrow, and if I fail, we'll move to your PR.
Thanks!

@sshleifer
Contributor

Does anyone have a snippet that replicates the OOM outside of Colab?
I have no trouble running examples/seq2seq/test_bash_script.py on self-hosted hardware with torch 1.6.

@sshleifer changed the title from "[seq2seq] finetune.sh OOMs in fp16 w torch 1.6" to "[seq2seq] finetune.sh OOMs in fp16 w torch 1.6 on colab" on Sep 8, 2020
@setu4993
Contributor

setu4993 commented Sep 8, 2020

The issue wasn't on Colab but on AWS.

@sshleifer
Contributor

What was your command/hardware?

@setu4993
Contributor

setu4993 commented Sep 8, 2020

Command: python script.py ... with a bunch of args (I have a custom wrapper that inherits SummarizationModule for initialization and adds extra args; I did not modify train/eval/test in it, so it should be identical to running python finetune.py from finetune.sh).
GPU: V100-SXM2-16GB.

@sshleifer linked a pull request on Sep 8, 2020 that will close this issue
@sshleifer
Contributor

sshleifer commented Sep 8, 2020

I can replicate on a V100 with CUDA 10.1, torch 1.6, and Python 3.7.
The problem is that during the first call to generate (during the validation sanity check), the untrained model generates config.max_length tokens, causing the OOM.

The easiest fix is adding --num_sanity_val_steps=0 to your command. Let me know if that works.
The linked PR above lets the user limit how many tokens are generated during validation, which may be independently helpful.
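
For context, --num_sanity_val_steps=0 maps directly onto the underlying PyTorch Lightning Trainer argument. A minimal sketch of the equivalent Trainer construction (finetune.py builds its Trainer from the CLI flags, so this only shows what the flag does, not the script's actual code; the other argument values are illustrative):

# Sketch only: the Trainer arguments behind the suggested flags.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=1,
    precision=16,            # native amp on torch 1.6 when --fp16 is passed
    num_sanity_val_steps=0,  # skip the pre-training validation pass that OOMs here
)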

@setu4993
Contributor

setu4993 commented Sep 8, 2020

Hmm, that makes sense. I'll also note that in the screenshots I attached earlier, the OOM occurred at the end of the first epoch during validation, so setting that new flag should help with that. It is a tricky trade-off to set a max_length for the generate steps that differs from the model's expected output length. I do prefer using a forward pass's output (my PR #7004) as a substitute for the runtime output, since it uses the correct max_length, rather than a shorter output that happens to fit in memory at that time.

However, this still does not explain the average generation time per batch being 30-60x higher (with equal batch sizes for training and validation).

Lastly, the same model does generate without producing an OOM at run time on similar hardware, producing up to max_length tokens, which continues to baffle me.

@sshleifer
Contributor

I don't have any slowdown after the validation sanity check in my replication. Maybe I haven't found your bug.

@sshleifer
Contributor

I don't understand

Lastly, the same model does generate without producing an OOM at run-time on similar hardware with the model producing upto max_length, which continues to baffle me.

Do you have a snippet that does not involve finetune.py (just calls generate) that OOMs/is way slower?

@setu4993
Contributor

setu4993 commented Sep 8, 2020

I don't have any slowdown after the validation sanity check in my replication. Maybe I haven't found your bug.

Or maybe it got resolved between when I tested it and this version. No worries.

Do you have a snippet that does not involve finetune.py (just calls generate) that OOMs/is way slower?

This might not matter anymore if the previous issue is fixed during training, since this one is specifically at runtime. Regardless, here's an example I have that is much slower.

%%timeit
with torch.no_grad():
    generated_ids = model.generate(
        tokenized_input["input_ids"].to("cuda"),
        skip_special_tokens=True, clean_up_tokenization_spaces=False,
        num_beams=3, top_p=0.9, repetition_penalty=10,
        decoder_start_token_id=model.config.decoder_start_token_id,
        max_length=model.config.max_length)

which produces:
10.4 s ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@setu4993
Contributor

setu4993 commented Sep 8, 2020

The sequence produced is 311 tokens long. While the output sequence is long (the maximum possible is 768), 10 seconds is still quite a lot.

@sshleifer
Contributor

Can you send a full working example that I can copy-paste and try in different torch versions?

@setu4993
Contributor

setu4993 commented Sep 9, 2020

Sure! I'm using a fine-tuned model and a custom dataset, so in the snippet below I changed the model to bart-large and removed the lines where the dataset is queried. Everything else is the same.

import torch
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
model = model.to("cuda")
model = model.eval()

tokenized_input = tokenizer(..., return_tensors="pt", max_length=model.config.max_position_embeddings)
with torch.no_grad():
    generated_ids = model.generate(
        tokenized_input["input_ids"].to("cuda"),
        skip_special_tokens=True, clean_up_tokenization_spaces=False,
        num_beams=3, top_p=0.9, repetition_penalty=10,
        decoder_start_token_id=model.config.decoder_start_token_id,
        max_length=model.config.max_length)

I'm running this in a notebook so I can time profile the generate step.

@setu4993
Contributor

I've spent quite some time today trying various combinations of generate. The increased delay comes from a large num_beams count, which also leads the model to produce longer outputs, compounding the generate time (roughly num_beams * max_length).

In conclusion, it doesn't appear to be a bug, but rather a property of generate being more memory intensive.
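
To make that concrete, a rough timing sketch (the model, input text, and any numbers printed are placeholders that depend entirely on your hardware and inputs; this is not data from this thread):

# Sketch: time generate() at several beam counts; cost grows with num_beams
# and with the length actually generated.
import time
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").to("cuda").eval()
input_ids = tokenizer(["some long article ..."], return_tensors="pt", truncation=True)["input_ids"].to("cuda")

for num_beams in (1, 3, 6):
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        out = model.generate(input_ids, num_beams=num_beams, max_length=model.config.max_length)
    torch.cuda.synchronize()
    print(f"num_beams={num_beams}: generated {out.shape[-1]} tokens in {time.time() - start:.2f}s")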

@vikigenius

@sshleifer, I have the same issue, and I am using the latest version that includes the PR you provided. I set eval_max_gen_length to 30 and am still getting OOM during the sanity check. Do I also have to set num_sanity_val_steps=0?

@setu4993
Contributor

@vikigenius: I don't think setting num_sanity_val_steps to 0 will solve it, since it will only delay what's bound to happen during the later validation step.

@sshleifer
Contributor

sshleifer commented Sep 17, 2020

  • --num_sanity_val_steps=0 fixes it for me.
  • I only have a problem in that sanity-check phase, when the model is untrained. Later calls to generate (without a very high config.min_length/eval_max_gen_length) don't OOM for me.
  • --eval_num_beams=2 may also help save memory.
  • I'd love to see somebody isolate the OOMing call to generate outside of the training logic so that I can reproduce it (a rough starting point is sketched below).
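
In case it helps anyone attempt that isolation, here is a rough, unconfirmed starting point: a randomly initialized BART (as in the sanity check) rarely emits EOS, so beam search should decode all the way to max_length. Batch size, input text, and lengths are placeholders, and this plain fp32 sketch does not replicate the thread's --fp16 / native amp setup.

# Unconfirmed sketch of an isolated reproduction attempt: random weights force
# beam search to run to max_length, mimicking the validation sanity check.
import torch
from transformers import BartConfig, BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
config = BartConfig.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration(config).to("cuda").eval()  # random, untrained weights

batch = tokenizer(["lorem ipsum " * 400] * 8, return_tensors="pt",
                  truncation=True, padding=True, max_length=1024)
with torch.no_grad():
    model.generate(batch["input_ids"].to("cuda"), num_beams=4,
                   max_length=config.max_position_embeddings)  # 1024 for BART, standing in for config.max_length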
