CUDA out of memory (again) #736

Closed
emrul opened this issue May 17, 2019 · 21 comments
Labels
bug Something isn't working wontfix This will not be worked on

Comments

@emrul

emrul commented May 17, 2019

I previously got CUDA out of memory issues but resolved them by

  1. Ensuring no sentence is greater than 200 tokens
  2. Updating PyTorch

This worked well with my small dataset (12.5k training sentences) but now I am trying with a bigger dataset (160k sentences) and my CUDA issue is back.
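
The first workaround above (capping sentences at 200 tokens) can be applied as a preprocessing pass over the column-format file. This is a minimal sketch and not part of Flair: the helper names `read_sentences` and `drop_long_sentences` are hypothetical, and it assumes one token per line with a blank line between sentences.

```python
from typing import Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each sentence as a list of its raw token lines.

    Assumes column format: one token per line, blank line between sentences.
    """
    current: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            current.append(line)
        elif current:
            yield current
            current = []
    if current:  # the input may not end with a blank line
        yield current


def drop_long_sentences(lines: Iterable[str], max_tokens: int = 200) -> Iterator[List[str]]:
    """Yield only the sentences that have at most max_tokens tokens."""
    for sentence in read_sentences(lines):
        if len(sentence) <= max_tokens:
            yield sentence


# Tiny in-memory demo; a real run would pass an open file instead of a list.
demo = ["a O", "b O", "", "c O", "d O", "e O", ""]
kept = list(drop_long_sentences(demo, max_tokens=2))
print(kept)
```

Because both helpers are generators, the file is never held in memory as a whole; the kept sentences can be written back out one at a time.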

This is my code currently:
`
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher
from flair.embeddings import PooledFlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Column format: token text in column 0, NER tag in column 1
columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns)
print(corpus)

tag_type = 'ner'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

embeddings = StackedEmbeddings(embeddings=[
    WordEmbeddings('glove'),
    PooledFlairEmbeddings('news-forward', pooling='mean', only_capitalized=True, use_cache=True, chars_per_chunk=32),
    PooledFlairEmbeddings('news-backward', pooling='mean', only_capitalized=True, use_cache=True, chars_per_chunk=32),
    WordEmbeddings(embeddings="embeddings/markup2vec.wv")
])

tagger: SequenceTagger = SequenceTagger(hidden_size=128,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/v003',
              max_epochs=150,
              mini_batch_size=192,
              eval_mini_batch_size=16,
              checkpoint=True)
`

and this is the error

RuntimeError: CUDA out of memory. Tried to allocate 156.00 MiB (GPU 0; 24.00 GiB total capacity; 15.79 GiB already allocated; 142.52 MiB free; 3.81 GiB cached)
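
When comparing out-of-memory messages across runs, it can help to pull the figures into one unit. The sketch below parses a message of the shape shown above into MiB values. `parse_oom` is a hypothetical helper, and the regular expression assumes exactly the wording used in this thread, which may differ in other PyTorch versions.

```python
import re
from typing import Dict

_UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}
_SIZE = r"([\d.]+) (KiB|MiB|GiB)"
# Assumed message layout, copied from the error reported in this thread.
_PATTERN = re.compile(
    rf"Tried to allocate {_SIZE} \(GPU (\d+); {_SIZE} total capacity; "
    rf"{_SIZE} already allocated; {_SIZE} free; {_SIZE} cached\)"
)


def parse_oom(message: str) -> Dict[str, float]:
    """Return the sizes named in the message, all converted to MiB."""
    match = _PATTERN.search(message)
    if match is None:
        raise ValueError("message does not match the expected OOM format")
    groups = match.groups()

    def to_mib(index: int) -> float:
        # Each size is captured as a (number, unit) pair of adjacent groups.
        return float(groups[index]) * _UNIT_TO_MIB[groups[index + 1]]

    return {
        "requested": to_mib(0),
        "total": to_mib(3),
        "allocated": to_mib(5),
        "free": to_mib(7),
        "cached": to_mib(9),
    }


message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 156.00 MiB "
    "(GPU 0; 24.00 GiB total capacity; 15.79 GiB already allocated; "
    "142.52 MiB free; 3.81 GiB cached)"
)
for name, mib in parse_oom(message).items():
    print(f"{name:>10}: {mib:10.2f} MiB")
```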

I've tried using the latest Flair (GitHub master), running on Windows 10 with Python 3.7 and PyTorch 1.1.0. Any ideas on what else I can try to get training to work with a large number of samples?

@emrul emrul added the bug Something isn't working label May 17, 2019
@alanakbik
Collaborator

Hello @emrul, we have just pushed changes to the master branch that enable streaming data loading using the Dataset and DataLoader APIs.

To use it, replace the lines:

columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns) 

with:

from flair.datasets import ColumnCorpus
columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus = ColumnCorpus(data_folder, columns, in_memory=False)

By setting in_memory to False you make sure that the dataset is not stored in memory, but instead always read from disk.
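
To illustrate the difference between the two loading strategies, here is a minimal plain-Python sketch. It is not Flair's implementation: `stream_sentences` and `load_all` are hypothetical names, and the parsing assumes whitespace-separated columns with the token in column 0 and the tag in column 1.

```python
import io
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str]  # (text, tag)


def stream_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Lazily yield one parsed sentence at a time (streaming style)."""
    sentence: List[Token] = []
    for line in lines:
        fields = line.split()
        if not fields:  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        tag = fields[1] if len(fields) > 1 else ""
        sentence.append((fields[0], tag))
    if sentence:
        yield sentence


def load_all(lines: Iterable[str]) -> List[List[Token]]:
    """Eagerly parse every sentence into one list (in-memory style)."""
    return list(stream_sentences(lines))


# io.StringIO stands in for an open corpus file.
data = io.StringIO("George B-PER\nWashington I-PER\n\nBerlin B-LOC\n")
first = next(stream_sentences(data))  # only the first sentence has been parsed
print(first)
```

With the streaming variant, host memory use depends on how many sentences are consumed at once rather than on corpus size, at the cost of re-reading from disk on every pass.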

On another note, I would currently not recommend using the PooledFlairEmbeddings on large corpora. They are currently unoptimized, so they will blow up your memory and lead to very large models. An optimized version that is much more memory efficient (and more accurate to boot) is coming, and I hope to push it to master in the next few weeks.

@aayushsanghavi

@alanakbik from flair.datasets import ColumnCorpus gives an import error.

@alanakbik
Collaborator

Are you using the latest master branch?

@aayushsanghavi

I believe so. I'm using version 0.4.1. I noticed that the last commit on the datasets.py file was 2 hours ago so I tried pip install --upgrade flair (just in case the version was upgraded) before I changed those lines but it was the latest one already.

@alanakbik
Collaborator

Ah yes that is different from the master branch. If you install from pip you are using the latest release 0.4.1. The master branch already contains features that will be part of 0.4.2 when it releases, so the only way to use them right now is for you to work on the master branch instead. You can install master like this:

pip install git+https://github.com/zalandoresearch/flair.git
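
A quick way to tell whether the installed copy already ships the new module, sketched with only the standard library. `has_module` is a hypothetical helper; `flair.datasets` is the module name used earlier in this thread.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the dotted module name can be found in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (e.g. "flair") is missing or fails to import.
        return False


if has_module("flair.datasets"):
    print("flair.datasets is available")
else:
    print("flair.datasets not found: this install may predate the streaming loaders")
```

Checking for the module rather than the version string avoids guessing which version number a source install reports.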

@amitbcp

amitbcp commented May 26, 2019

Hi @alanakbik

Can you please share a code snippet for loading a text classification corpus via data streaming?

@alanakbik
Collaborator

Hello @amitbcp sure, here it is:

from flair.datasets import ClassificationCorpus
corpus = ClassificationCorpus("path/to/your/corpus/", in_memory=False)

If you set in_memory to True, it will read everything into memory like before. By setting it to False you get the streaming data loading. Everything else is the same so you can train models just like before.

@emrul
Author

emrul commented May 26, 2019

@alanakbik The streaming data loader works really well on my initial tests. I have switched to using regular FlairEmbeddings but I'm looking forward to the enhanced Pooled version that supports larger datasets. Thanks for your help!

@emrul
Author

emrul commented May 27, 2019

I spoke too soon - I was still using my small dataset. When using the larger one I still get OOM even though I've tried to make everything as lean as possible - my code is here: https://gist.github.com/emrul/74486783e9d750f2cb08695bf26719da

Again, any help would be really valuable.

@alanakbik
Collaborator

Hi @emrul, could you try running this code with embeddings_in_memory set to False? That is, uncomment the commented-out line.

@emrul
Author

emrul commented May 27, 2019

Hi, yes - same issue however:

2019-05-27 19:23:17,479 epoch 1 - iter 0/2516 - loss 132.78053284
Traceback (most recent call last):
  File "train_seq_opt.py", line 59, in <module>
    train()
  File "train_seq_opt.py", line 54, in train
    checkpoint=True)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\trainers\trainer.py", line 197, in train
    loss = self.model.forward_loss(batch)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\models\sequence_tagger_model.py", line 316, in forward_loss
    features = self.forward(sentences)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\models\sequence_tagger_model.py", line 368, in forward
    self.embeddings.embed(sentences)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\embeddings.py", line 145, in embed
    embedding.embed(sentences)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\embeddings.py", line 76, in embed
    self._add_embeddings_internal(sentences)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\embeddings.py", line 1248, in _add_embeddings_internal
    sentences_padded, self.chars_per_chunk
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\models\language_model.py", line 130, in get_representation
    prediction, rnn_output, hidden = self.forward(batch, hidden)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\flair\models\language_model.py", line 78, in forward
    output, hidden = self.rnn(emb, hidden)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\torch\nn\modules\rnn.py", line 559, in forward
    return self.forward_tensor(input, hx)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\torch\nn\modules\rnn.py", line 539, in forward_tensor
    output, hidden = self.forward_impl(input, hx, batch_sizes, max_batch_size, sorted_indices)
  File "C:\Users\emrul\AppData\Local\conda\conda\envs\flair\lib\site-packages\torch\nn\modules\rnn.py", line 522, in forward_impl
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 24.00 GiB total capacity; 19.03 GiB already allocated; 80.52 MiB free; 642.41 MiB cached)

@aayushsanghavi

@alanakbik sorry to bother you again but even after looking things up I couldn't fix my problem.
My sequence tagging dataset looks like -

token1   tag1
token2   tag2
... so on
tokenN   tagN

token1   tag1
token2   tag2
... so on
tokenM   tagM

Spaces separate the columns and newlines separate sentences. This was working fine earlier when I was using NLPTaskDataFetcher.load_column_corpus(). But after switching to ColumnCorpus (for handling bigger datasets) I am getting tons of this warning -
ACHTUNG: An empty Sentence was created! Are there empty strings in your dataset?

and this is what I'm doing

columns = {0: 'text', 1: 'tag'}
corpus = ColumnCorpus(args.data_folder, columns, in_memory=False)

I noticed that this log message was recently added to the Sentence class, so the same thing may well have been happening earlier; since there was no logging, I assumed that my dataset format was fine. I read a bit of the code that parses the file and creates Sentence objects, but I couldn't figure out why Sentence objects with empty text were being created.

Can you please tell me how to fix this?

@alanakbik
Collaborator

@aayushsanghavi yes this is an unfortunate bug that was added and then fixed yesterday in another PR. Could you test with the current master to confirm that the bug is now gone? Thanks!
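
Independently of that bug, the dataset side can be ruled out with a check along the following lines, which flags rows that might yield empty tokens or sentences. This is a sketch only: `find_suspect_lines` is a hypothetical helper, and it assumes whitespace-separated columns with single blank lines as sentence separators.

```python
from typing import Iterable, List, Tuple


def find_suspect_lines(lines: Iterable[str], n_columns: int = 2) -> List[Tuple[int, str]]:
    """Return (line_number, reason) pairs for rows that look malformed."""
    problems: List[Tuple[int, str]] = []
    previous_blank = True  # treat the start of the file like a preceding blank line
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            if previous_blank:
                problems.append((number, "blank line with no sentence before it"))
            previous_blank = True
            continue
        previous_blank = False
        if len(fields) < n_columns:
            problems.append((number, f"expected {n_columns} columns, found {len(fields)}"))
    return problems


sample = ["token1 tag1", "token2", "", "", "token3 tag3"]
for number, reason in find_suspect_lines(sample):
    print(f"line {number}: {reason}")
```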

@alanakbik
Collaborator

@emrul sorry, I just realized that your memory issues are CUDA related. This is different from the regular (host) memory issues and cannot be fixed with the embeddings_in_memory setting. Your script looks good, in particular since you've set the chars_per_chunk parameter to 64, which should be small enough for the amount of CUDA memory you have.

The only things I could think of trying would be to reduce the mini_batch_size or further reduce the chars_per_chunk since it seems that it gets a mini-batch that is too large for the available memory. But it really is strange that the memory is not enough. Are you running only one task on this GPU or is it also being used by other processes?
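
To make the effect of these two parameters concrete, here is a rough back-of-the-envelope sketch. It splits a padded batch into character chunks and estimates the size of a single float32 tensor. It is not Flair's code: the function names are hypothetical, `hidden_size=2048` is an assumed value chosen for illustration, and real GPU usage also includes model weights, gradients, and other activations.

```python
from typing import List


def chunk_batch(strings: List[str], chars_per_chunk: int) -> List[List[str]]:
    """Pad strings to equal length, then slice the batch into character chunks."""
    longest = max(len(s) for s in strings)
    padded = [s.ljust(longest) for s in strings]
    return [
        [s[start:start + chars_per_chunk] for s in padded]
        for start in range(0, longest, chars_per_chunk)
    ]


def tensor_mib(batch_size: int, n_chars: int, hidden_size: int, bytes_per_value: int = 4) -> float:
    """Size in MiB of one batch_size x n_chars x hidden_size tensor."""
    return batch_size * n_chars * hidden_size * bytes_per_value / 2**20


chunks = chunk_batch(["a short one", "a considerably longer sentence"], chars_per_chunk=8)
print(len(chunks), [len(chunk[0]) for chunk in chunks])

# hidden_size=2048 is an assumption for illustration only.
print(tensor_mib(192, 64, 2048))    # one chunk at mini_batch_size=192
print(tensor_mib(32, 64, 2048))     # one chunk at mini_batch_size=32
print(tensor_mib(192, 1000, 2048))  # a whole 1000-character batch kept at once
```

Under these assumptions, shrinking either mini_batch_size or chars_per_chunk shrinks each per-chunk tensor proportionally, while anything kept for the whole batch still scales with the longest sentence measured in characters.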

@aayushsanghavi

aayushsanghavi commented May 29, 2019

Ooh right. Because when I tried it again this morning it worked. I put num_workers=6 and mini_batch_size=8 to reduce memory usage so I didn't have to use in_memory=False

@aayushsanghavi

aayushsanghavi commented May 30, 2019

Hey @alanakbik whatever changes you guys pushed recently, they're pretty good. The memory management has improved so much. Earlier my model used to take up more than 12.5 GB of RAM but now it trains in about 7 GB of RAM. Nice work!

@emrul
Author

emrul commented Jun 2, 2019

@alanakbik Nothing else is using the GPU; the entire machine is dedicated to this task. It's an NVIDIA Quadro P6000, so it has a lot of memory. I've reduced the mini-batch size to 32 and the issue still persists.

I'll go through my training data to see if there's anything odd in there but other than that I am stumped.
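
One thing worth checking in the data: a 200-token cap does not bound the number of characters, and the chars_per_chunk parameter suggests the Flair embeddings operate on characters. The sketch below ranks sentences by character length so that outliers (for example very long tokens) stand out. `longest_sentences` is a hypothetical helper and assumes the token is in column 0.

```python
from itertools import chain
from typing import Iterable, List, Tuple


def longest_sentences(lines: Iterable[str], top: int = 5) -> List[Tuple[int, int, int]]:
    """Return up to `top` (n_chars, n_tokens, first_line_number) tuples, longest first.

    n_chars counts token characters plus one space between tokens.
    """
    stats: List[Tuple[int, int, int]] = []
    tokens: List[str] = []
    start = 0
    # The extra "" guarantees that the final sentence is flushed.
    for number, line in enumerate(chain(lines, [""]), start=1):
        fields = line.split()
        if fields:
            if not tokens:
                start = number
            tokens.append(fields[0])
        elif tokens:
            n_chars = sum(len(t) for t in tokens) + len(tokens) - 1
            stats.append((n_chars, len(tokens), start))
            tokens = []
    return sorted(stats, reverse=True)[:top]


sample = ["Hi O", "there O", "", "x" * 50 + " O"]
for n_chars, n_tokens, line_number in longest_sentences(sample):
    print(f"{n_chars} chars, {n_tokens} tokens, starts at line {line_number}")
```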

@alanakbik
Collaborator

@emrul sorry for the late reply (was busy attending NAACL) - have you made progress on this? What if you set the mini-batch size to something very low, like 2?
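
Rather than trying batch sizes by hand, the search could be automated. The sketch below bisects for the largest size that runs without an out-of-memory error. Two assumptions: `try_batch` is a hypothetical callable that runs one training step at the given size and raises a RuntimeError containing "out of memory" on failure (as in the tracebacks in this thread), and if one size fails, every larger size fails too.

```python
from typing import Callable, Optional


def largest_working_batch_size(
    try_batch: Callable[[int], None], low: int = 1, high: int = 192
) -> Optional[int]:
    """Bisect for the largest size in [low, high] that try_batch accepts.

    Assumes that if one size runs out of memory, every larger size does too.
    Returns None if even `low` fails.
    """
    best = None
    while low <= high:
        mid = (low + high) // 2
        try:
            try_batch(mid)
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # unrelated error: do not hide it
            high = mid - 1
        else:
            best = mid
            low = mid + 1
    return best


def fake_step(batch_size: int) -> None:
    """Stand-in for one real training step; pretends sizes above 37 do not fit."""
    if batch_size > 37:
        raise RuntimeError("CUDA out of memory.")


print(largest_working_batch_size(fake_step))
```

In a real run it may also be necessary to release cached GPU memory between attempts, for example with torch.cuda.empty_cache(). If even a size of 1 or 2 fails, that points at individual sentences rather than the batch size.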

@alanakbik
Collaborator

@aayushsanghavi thanks :) we hope to optimize the process more with future releases :)

@emrul
Author

emrul commented Jun 11, 2019

Hi @alanakbik - I think it may be an issue with my larger data set. I am investigating but I think the problem is likely to be in the dataset and not in Flair. Will update when I find something.

@stale

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Apr 30, 2020
@stale stale bot closed this as completed May 7, 2020