CUDA out of memory (again) #736
Hello @emrul we just now pushed new changes to the master branch that enable streaming data loading using the `ColumnCorpus` object. If you replace the lines:

```python
columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns)
```

with:

```python
from flair.datasets import ColumnCorpus

columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus = ColumnCorpus(data_folder, columns, in_memory=False)
```

the corpus is streamed from disk during training. By setting `in_memory=False`, the dataset is no longer loaded into memory up front, which should keep the memory footprint low even for large corpora. On another note, I would currently not recommend using the `PooledFlairEmbeddings` with very large datasets.
@alanakbik
Are you using the latest master branch?
I believe so. I'm using version 0.4.1. I noticed that the last commit on the datasets.py file was 2 hours ago so I tried …
Ah yes, that is different from the master branch. If you install from pip you are using the latest release, 0.4.1. The master branch already contains features that will be part of 0.4.2 when it releases, so the only way to use them right now is to work from the master branch instead. You can install master like this:

```
pip install git+https://github.com/zalandoresearch/flair.git
```
Hi @alanakbik, can you please share a code snippet for loading a TextClassification corpus via data streaming?
Hello @amitbcp sure, here it is:

```python
from flair.datasets import ClassificationCorpus

corpus = ClassificationCorpus("path/to/your/corpus/", in_memory=False)
```

If you set `in_memory=False`, the corpus is read from disk during training instead of being held in memory.
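For reference, flair's classification corpora read the FastText file format: one example per line, with one or more labels prefixed by `__label__` followed by the text. A minimal sketch of what a training file could look like (label names here are made up):

```
__label__POSITIVE this movie was a pleasant surprise
__label__NEGATIVE the plot made no sense at all
__label__NEGATIVE __label__LONG two hours I will never get back
```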
@alanakbik The streaming data loader works really well in my initial tests. I have switched to using regular FlairEmbeddings but I'm looking forward to the enhanced Pooled version that supports larger datasets. Thanks for your help!
I spoke too soon - I was still using my small dataset. When using the larger one I still get OOM even though I've tried to make everything as lean as possible - my code is here: https://gist.github.com/emrul/74486783e9d750f2cb08695bf26719da. Again, any help would be really valuable.
Hi @emrul could you try running this code with …
Hi, yes - same issue however:
@alanakbik sorry to bother you again, but even after looking things up I couldn't fix my problem.
Spaces separate the columns and newlines separate sentences. This was working fine earlier when I was using NLPTaskDataFetcher.load_column_corpus(), but after switching to ColumnCorpus (for handling bigger datasets) I am getting tons of warnings about empty Sentence objects being created - and this is what I'm doing:
I noticed that this log message was recently added to the Sentence class, so it may well have been happening earlier too - but since there was no logging then, I assumed that my dataset format was fine. I read a bit of the code that parses the file and creates Sentence objects, but I couldn't figure out why Sentence objects with empty text were being created. Can you please tell me how to fix this?
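For context, the column format described above looks roughly like this - one token and its tag per line, separated by a space, with a blank line between sentences (tags here are illustrative):

```
George B-PER
Washington I-PER
went O
to O
Washington B-LOC

He O
returned O
home O
```

Stray blank lines (for example, two consecutive empty lines, or empty lines at the start or end of the file) are a common way to end up with empty Sentence objects in this format.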
@aayushsanghavi yes this is an unfortunate bug that was introduced and then fixed yesterday in another PR. Could you test with the current master to confirm that the bug is now gone? Thanks!
@emrul sorry, I just realized that your memory issues are CUDA related - this is different from the regular memory issues and cannot be fixed with the `in_memory=False` option. The only things I could think of trying would be to reduce the `mini_batch_size`.
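As a concrete illustration of that suggestion, the batch sizes are controlled in the `train()` call from the original script; the values below are just a starting point to test whether the OOM disappears, not tuned settings:

```python
trainer.train('resources/taggers/v003',
              max_epochs=150,
              mini_batch_size=8,       # much smaller than the original 192
              eval_mini_batch_size=8,  # evaluation batches occupy GPU memory too
              checkpoint=True)
```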
Ooh right. Because when I tried it again this morning it worked. I put …
Hey @alanakbik whatever changes you guys pushed recently, they're pretty good. The memory management has improved so much. Earlier my model used to take up more than 12.5 GB of RAM, but now it trains in about 7 GB of RAM. Nice work!
@alanakbik Nothing else is using the GPU - the entire machine is dedicated to this task. It's an Nvidia Quadro P6000, so it has a lot of memory. I've reduced the mini-batch size to 32 and still the issue persists. I'll go through my training data to see if there's anything odd in there, but other than that I am stumped.
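One way to check how close training actually gets to the card's limit is PyTorch's built-in memory counters; a small diagnostic sketch (API as of PyTorch 1.1), printed between epochs or right before the point where the crash occurs:

```python
import torch

# current and peak GPU memory held by tensors in this process
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"peak:      {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```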
@emrul sorry for the late reply (was busy attending NAACL) - have you made progress on this? What if you set the mini-batch size to something very low, like 2?
@aayushsanghavi thanks :) we hope to optimize the process more with future releases :)
Hi @alanakbik - I think it may be an issue with my larger data set. I am investigating, but I think the problem is likely to be in the dataset and not in Flair. Will update when I find something.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I previously got CUDA out of memory issues but resolved them by …
This worked well with my small dataset (12.5k training sentences), but now I am trying with a bigger dataset (160k sentences) and my CUDA issue is back.
This is my code currently:
```python
# imports for flair 0.4.x
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher
from flair.embeddings import StackedEmbeddings, WordEmbeddings, PooledFlairEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# load a column-formatted NER corpus (token in column 0, tag in column 1)
columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns)
print(corpus)

tag_type = 'ner'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

embeddings = StackedEmbeddings(embeddings=[
    WordEmbeddings('glove'),
    PooledFlairEmbeddings('news-forward', pooling='mean', only_capitalized=True,
                          use_cache=True, chars_per_chunk=32),
    PooledFlairEmbeddings('news-backward', pooling='mean', only_capitalized=True,
                          use_cache=True, chars_per_chunk=32),
    WordEmbeddings(embeddings="embeddings/markup2vec.wv")
])

tagger: SequenceTagger = SequenceTagger(hidden_size=128,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)

trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/v003',
              max_epochs=150,
              mini_batch_size=192,
              eval_mini_batch_size=16,
              checkpoint=True)
```
and this is the error: …
I've tried using the latest Flair (GitHub master), running on Windows 10 with Python 3.7 and PyTorch 1.1.0. Any ideas on what else I can try to get training to work with a large number of samples?
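Pulling together the suggestions from this thread, a memory-leaner variant of the script above might look like the sketch below: streaming corpus loading, regular instead of pooled Flair embeddings, and much smaller batches. This is a sketch of the thread's advice, not a guaranteed fix, and it omits the custom markup2vec embeddings:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import StackedEmbeddings, WordEmbeddings, FlairEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# stream the corpus from disk instead of holding it in memory
corpus = ColumnCorpus('training/sequences', {0: 'text', 1: 'ner'}, in_memory=False)
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

embeddings = StackedEmbeddings(embeddings=[
    WordEmbeddings('glove'),
    # regular FlairEmbeddings instead of PooledFlairEmbeddings, whose
    # pooling cache grows with the vocabulary of the corpus; a small
    # chars_per_chunk trades speed for lower peak GPU memory
    FlairEmbeddings('news-forward', chars_per_chunk=32),
    FlairEmbeddings('news-backward', chars_per_chunk=32),
])

tagger = SequenceTagger(hidden_size=128,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type='ner')

trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/v003',
              max_epochs=150,
              mini_batch_size=8,       # small batches are the main CUDA-memory lever
              eval_mini_batch_size=8,
              checkpoint=True)
```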