[seq2seq] finetune.sh OOMs in fp16 w torch 1.6 on colab #6589
Try passing that. Semi-relatedly, a good practice is to run a version check at the top of the colab so that when you go back you know what versions you were on.
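A minimal sketch of such a version check (the exact command from the original comment was not preserved, so this is only an assumed example):

```python
# Hypothetical version-check cell to record the environment in a colab notebook.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```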
Thanks for the quick reply! I tried this and it didn't work though :(
+1 on this. Using Torch 1.6.0, transformers 3.0.2, Linux, V100 GPU.
This is a torch 1.6 issue.
I tried running fp16 training with
Can you try torch 1.5.1?
Succeeds with 1.5.1, and power and temperature are in line with native.
However, the process failing during generation for 1.6.0 suggests there's some optimization missing during the generation steps that causes the OOM.
Another thing to note that might be related: validation (400 samples) takes 5x the time of one epoch of training (2400 samples). Even accounting for the beam size (4x), it is much slower.
Interesting! I would definitely be open to a PR here if you have a fix in mind!
Thanks! I have a couple ideas and will try them out and create a PR if any of them works.
I think the problem is that the
The problem is with that; I'm happy to create a PR for it if it makes sense to check it in.
That PR would be interesting. More interesting would be figuring out why generate OOMs in these conditions.
Definitely, the question of why generate OOMs is interesting, but one I haven't found an answer for yet. I suggested a workaround in #7004 using the fix I described earlier.
OK, I'm gonna try to fix the underlying issue today/tomorrow, and if I fail, we'll move to your PR.
Does anyone have a snippet that replicates the OOM outside of colab?
The issue wasn't on Colab but on AWS.
What was your command/hardware?
Command:
I can replicate on a V100 with CUDA 10.1, torch 1.6, python 3.7. The easiest fix is adding
Hmm, that makes sense. I'll also say that in the screenshots I attached earlier, it occurred at the end of the first epoch during validation, so setting that new flag should help with that. It is a tricky choice between setting a
However, this still does not explain the average gen time being 30-60x the time per batch (with equal batch sizes for training and validation). Lastly, the same model does generate without producing an OOM at run time on similar hardware, with the model producing up to
I don't have any slowdown after the validation sanity check in my replication. Maybe I haven't found your bug.
I don't understand
Do you have a snippet that does not involve finetune.py (just calls generate)?
Or maybe it got resolved between when I tested it and this version. No worries.
This might not matter anymore if the previous issue is fixed during training, since this is specifically at runtime. Regardless, here's an example I have that is much slower:

```python
%%timeit
with torch.no_grad():
    generated_ids = model.generate(tokenized_input["input_ids"].to("cuda"), skip_special_tokens=True, clean_up_tokenization_spaces=False,
                                   num_beams=3, top_p=0.9, repetition_penalty=10, decoder_start_token_id=model.config.decoder_start_token_id, max_length=model.config.max_length)
```

which produces:
The sequence produced is of length 311. While the output sequence length is long (the maximum possible is 768), 10 seconds is still quite a lot.
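For what it's worth, here is a standalone timing sketch (not from the original thread) that avoids the `%%timeit` magic; since CUDA launches kernels asynchronously, synchronizing before reading the clock gives a more reliable wall time. It assumes the same `model` and `tokenized_input` as above and keeps only the generation arguments needed for timing:

```python
import time
import torch

# Time a single generate call; assumes `model` and `tokenized_input`
# are already set up as in the snippet above.
input_ids = tokenized_input["input_ids"].to("cuda")

torch.cuda.synchronize()          # CUDA kernels run asynchronously
start = time.perf_counter()
with torch.no_grad():
    generated_ids = model.generate(input_ids, num_beams=3,
                                   max_length=model.config.max_length)
torch.cuda.synchronize()
print(f"generate took {time.perf_counter() - start:.2f}s")
```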
Can you send a full working example that I can copy-paste and try in different torch versions?
Sure! I'm using a finetuned model and a custom dataset, so I changed the snippet below to use `facebook/bart-large`:

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
model = model.to("cuda")
model = model.eval()

tokenized_input = tokenizer(..., return_tensors="pt", max_length=model.config.max_position_embeddings)

with torch.no_grad():
    generated_ids = model.generate(tokenized_input["input_ids"].to("cuda"), skip_special_tokens=True, clean_up_tokenization_spaces=False,
                                   num_beams=3, top_p=0.9, repetition_penalty=10, decoder_start_token_id=model.config.decoder_start_token_id, max_length=model.config.max_length)
```

I'm running this in a notebook so I can time-profile the generate step.
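As an aside (not from the original thread): `skip_special_tokens` and `clean_up_tokenization_spaces` are options understood by the tokenizer's decode step, so a hypothetical sketch of turning the generated ids back into text would look like this:

```python
# Hypothetical follow-up: decode the generated ids back to text.
summaries = tokenizer.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(summaries[0])
```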
I've spent quite some time today trying various combinations of settings. In conclusion, it doesn't appear to be a bug, but a property of generate being more memory-intensive.
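A rough way to see that memory behaviour directly (a sketch, not from the original thread, assuming the reproduction snippet above has already been run): `torch.cuda.max_memory_allocated` reports the peak allocation since the last reset.

```python
import torch

# Peak-memory check around a single generate call; assumes `model` and
# `tokenized_input` from the reproduction snippet above are already on the GPU.
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

with torch.no_grad():
    model.generate(tokenized_input["input_ids"].to("cuda"),
                   num_beams=3, max_length=model.config.max_length)

peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"peak GPU memory during generate: {peak_gib:.2f} GiB")
```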
@sshleifer, I have the same issue, and I am using the latest version that includes the PR you provided. I set `eval_max_gen_length` to 30 and am still getting OOM during the sanity check. Do I also have to set `num_sanity_val_steps=0`?
@vikigenius: I don't think setting
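For context, the sanity check mentioned here is PyTorch Lightning's short validation pass before training; a hedged sketch of disabling it when building a `Trainer` directly (whether finetune.py forwards this flag is not shown in this thread):

```python
import pytorch_lightning as pl

# Hypothetical direct Trainer construction: skip the pre-training validation
# sanity check so no generation runs before the first training epoch.
trainer = pl.Trainer(
    gpus=1,
    precision=16,            # fp16, the setting under discussion
    num_sanity_val_steps=0,  # default is 2 validation batches before training
)
```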
When trying to fine-tune either T5 or BART models for summarization, I was repeatedly encountering OOM with the latest code, whereas it used to work fine for me earlier, at least on Google Colab.
On checking the startup scripts and the latest commits, I saw that optimizations for native PyTorch fp16 support were added recently. After removing the fp16 parameter from the script, it started working as expected.
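For reference, the native fp16 path added in torch 1.6 is the `torch.cuda.amp` API; below is a generic sketch of a training step using it, not the actual finetune.py code, with `model`, `optimizer`, and `dataloader` assumed to exist:

```python
import torch

# Generic native-AMP (torch >= 1.6) training step; not the actual finetune.py code.
scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # forward pass runs in mixed precision
        outputs = model(**batch)
        loss = outputs[0]             # loss is the first output when labels are passed
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
```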
Could you check whether this is a real issue or just a matter of a dangling parameter that needs to be removed?
Thanks
@sshleifer @patil-suraj