CUDA out of memory (again) #736
Hello @emrul we just now pushed new changes to the master branch that enable streaming data loading using the `ColumnCorpus` object. If you replace the lines:

```python
columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns)
```

with:

```python
from flair.datasets import ColumnCorpus

columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus = ColumnCorpus(data_folder, columns, in_memory=False)
```

the corpus is streamed from disk during training. By setting `in_memory=False`, the dataset is no longer loaded into memory up front, which should keep the memory footprint low even for large corpora. On another note, I would currently not recommend using the `PooledFlairEmbeddings` with very large datasets.
@alanakbik
Are you using the latest master branch?
I believe so. I'm using version 0.4.1. I noticed that the last commit on the datasets.py file was 2 hours ago so I tried …
Ah yes, that is different from the master branch. If you install from pip you are using the latest release, 0.4.1. The master branch already contains features that will be part of 0.4.2 when it releases, so the only way to use them right now is to work from the master branch instead. You can install master like this:

```
pip install git+https://github.com/zalandoresearch/flair.git
```
Hi @alanakbik, can you please share a code snippet for loading a TextClassification corpus via data streaming?
Hello @amitbcp sure, here it is:

```python
from flair.datasets import ClassificationCorpus

corpus = ClassificationCorpus("path/to/your/corpus/", in_memory=False)
```

If you set `in_memory=False`, the corpus is read from disk during training instead of being held in memory.
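For reference, flair's classification corpora read the FastText file format: one example per line, with one or more labels prefixed by `__label__` followed by the text. A minimal sketch of what a training file could look like (label names here are made up):

```
__label__POSITIVE this movie was a pleasant surprise
__label__NEGATIVE the plot made no sense at all
__label__NEGATIVE __label__LONG two hours I will never get back
```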
@alanakbik The streaming data loader works really well in my initial tests. I have switched to using regular FlairEmbeddings but I'm looking forward to the enhanced Pooled version that supports larger datasets. Thanks for your help!
I spoke too soon - I was still using my small dataset. When using the larger one I still get OOM even though I've tried to make everything as lean as possible - my code is here: https://gist.github.com/emrul/74486783e9d750f2cb08695bf26719da. Again, any help would be really valuable.
Hi @emrul could you try running this code with …
Hi, yes - same issue however:
@alanakbik sorry to bother you again, but even after looking things up I couldn't fix my problem.
Spaces separate the columns and newlines separate sentences. This was working fine earlier when I was using NLPTaskDataFetcher.load_column_corpus(), but after switching to ColumnCorpus (for handling bigger datasets) I am getting tons of warnings about empty Sentence objects being created - and this is what I'm doing:
I noticed that this log message was recently added to the Sentence class, so it may well have been happening earlier too - but since there was no logging then, I assumed that my dataset format was fine. I read a bit of the code that parses the file and creates Sentence objects, but I couldn't figure out why Sentence objects with empty text were being created. Can you please tell me how to fix this?
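For context, the column format described above looks roughly like this - one token and its tag per line, separated by a space, with a blank line between sentences (tags here are illustrative):

```
George B-PER
Washington I-PER
went O
to O
Washington B-LOC

He O
returned O
home O
```

Stray blank lines (for example, two consecutive empty lines, or empty lines at the start or end of the file) are a common way to end up with empty Sentence objects in this format.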
@aayushsanghavi yes this is an unfortunate bug that was introduced and then fixed yesterday in another PR. Could you test with the current master to confirm that the bug is now gone? Thanks!
@emrul sorry, I just realized that your memory issues are CUDA related - this is different from the regular memory issues and cannot be fixed with the `in_memory=False` option. The only things I could think of trying would be to reduce the `mini_batch_size`.
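As a concrete illustration of that suggestion, the batch sizes are controlled in the `train()` call from the original script; the values below are just a starting point to test whether the OOM disappears, not tuned settings:

```python
trainer.train('resources/taggers/v003',
              max_epochs=150,
              mini_batch_size=8,       # much smaller than the original 192
              eval_mini_batch_size=8,  # evaluation batches occupy GPU memory too
              checkpoint=True)
```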
Ooh right. Because when I tried it again this morning it worked. I put …
Hey @alanakbik whatever changes you guys pushed recently, they're pretty good. The memory management has improved so much. Earlier my model used to take up more than 12.5 GB of RAM, but now it trains in about 7 GB of RAM. Nice work!
@alanakbik Nothing else is using the GPU - the entire machine is dedicated to this task. It's an Nvidia Quadro P6000, so it has a lot of memory. I've reduced the mini-batch size to 32 and still the issue persists. I'll go through my training data to see if there's anything odd in there, but other than that I am stumped.
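One way to check how close training actually gets to the card's limit is PyTorch's built-in memory counters; a small diagnostic sketch (API as of PyTorch 1.1), printed between epochs or right before the point where the crash occurs:

```python
import torch

# current and peak GPU memory held by tensors in this process
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"peak:      {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```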
@emrul sorry for the late reply (was busy attending NAACL) - have you made progress on this? What if you set the mini-batch size to something very low, like 2?
@aayushsanghavi thanks :) we hope to optimize the process more with future releases :)
Hi @alanakbik - I think it may be an issue with my larger data set. I am investigating, but I think the problem is likely to be in the dataset and not in Flair. Will update when I find something.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I previously got CUDA out of memory issues but resolved them by …
This worked well with my small dataset (12.5k training sentences), but now I am trying with a bigger dataset (160k sentences) and my CUDA issue is back.
This is my code currently:
```python
# imports for flair 0.4.x
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher
from flair.embeddings import StackedEmbeddings, WordEmbeddings, PooledFlairEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# load a column-formatted NER corpus (token in column 0, tag in column 1)
columns = {0: 'text', 1: 'ner'}
data_folder = 'training/sequences'
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns)
print(corpus)

tag_type = 'ner'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

embeddings = StackedEmbeddings(embeddings=[
    WordEmbeddings('glove'),
    PooledFlairEmbeddings('news-forward', pooling='mean', only_capitalized=True,
                          use_cache=True, chars_per_chunk=32),
    PooledFlairEmbeddings('news-backward', pooling='mean', only_capitalized=True,
                          use_cache=True, chars_per_chunk=32),
    WordEmbeddings(embeddings="embeddings/markup2vec.wv")
])

tagger: SequenceTagger = SequenceTagger(hidden_size=128,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)

trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/v003',
              max_epochs=150,
              mini_batch_size=192,
              eval_mini_batch_size=16,
              checkpoint=True)
```
and this is the error: …
I've tried using the latest Flair (GitHub master), running on Windows 10 with Python 3.7 and PyTorch 1.1.0. Any ideas on what else I can try to get training to work with a large number of samples?
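Pulling together the suggestions from this thread, a memory-leaner variant of the script above might look like the sketch below: streaming corpus loading, regular instead of pooled Flair embeddings, and much smaller batches. This is a sketch of the thread's advice, not a guaranteed fix, and it omits the custom markup2vec embeddings:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import StackedEmbeddings, WordEmbeddings, FlairEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# stream the corpus from disk instead of holding it in memory
corpus = ColumnCorpus('training/sequences', {0: 'text', 1: 'ner'}, in_memory=False)
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

embeddings = StackedEmbeddings(embeddings=[
    WordEmbeddings('glove'),
    # regular FlairEmbeddings instead of PooledFlairEmbeddings, whose
    # pooling cache grows with the vocabulary of the corpus; a small
    # chars_per_chunk trades speed for lower peak GPU memory
    FlairEmbeddings('news-forward', chars_per_chunk=32),
    FlairEmbeddings('news-backward', chars_per_chunk=32),
])

tagger = SequenceTagger(hidden_size=128,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type='ner')

trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/v003',
              max_epochs=150,
              mini_batch_size=8,       # small batches are the main CUDA-memory lever
              eval_mini_batch_size=8,
              checkpoint=True)
```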