"OverflowError: value too large to convert to int" when training word2vec on a large corpus #2578

Closed
miweru opened this issue Aug 14, 2019 · 9 comments
Labels: bug (Issue described a bug); impact MEDIUM (Big annoyance for affected users); reach HIGH (Affects most or all Gensim users)

Comments

@miweru

miweru commented Aug 14, 2019

Hi,
I am trying to train word2vec on a large corpus (>500GB) with the newest gensim version (3.8.0), and an error occurs in every thread:

Exception in thread Thread-10:
Traceback (most recent call last):
  File "/home/michael/miniconda3/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/michael/miniconda3/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/home/michael/miniconda3/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 175, in _worker_loop_corpusfile
    total_examples=total_examples, total_words=total_words, **kwargs)
  File "/home/michael/miniconda3/lib/python3.7/site-packages/gensim/models/word2vec.py", line 794, in _do_train_epoch
    total_examples, total_words, work, neu1, self.compute_loss)
  File "gensim/models/word2vec_corpusfile.pyx", line 379, in gensim.models.word2vec_corpusfile.train_epoch_cbow
OverflowError: value too large to convert to int

My command for training the model is:
Word2Vec(corpus_file="encc_tokenized", size=1024, window=8, workers=16)

Thank you for your help.

@piskvorky
Owner

piskvorky commented Aug 14, 2019

CC @persiyanov – is this expected? What "value" is that?

@piskvorky
Owner

piskvorky commented Aug 14, 2019

Looking at the word2vec_corpusfile.pyx code, it's indeed using int for all sorts of counters that could be large: expected_examples, effective_words, effective_sentences, total_sentences, indexes… This will need fixing :(

Or was there some reason for using only ints, @persiyanov? I see some vars do use long long, so I presume this was a deliberate choice.
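
To see why 32-bit counters are a problem at this scale, here is a minimal sketch that checks whether an estimated word count fits in a signed 32-bit versus a signed 64-bit integer. The helper `fits_signed` and the figure of 6 bytes per token are illustrative assumptions; neither comes from Gensim or from the reporter's corpus.

```python
def fits_signed(value: int, bits: int) -> bool:
    """Return True if `value` is representable as a signed integer of `bits` width."""
    lo = -(2 ** (bits - 1))
    hi = 2 ** (bits - 1) - 1
    return lo <= value <= hi


# Rough estimate only. The corpus size (>500GB) is from the report above;
# the average of 6 bytes per token is an assumption made for illustration.
corpus_bytes = 500 * 10**9
assumed_bytes_per_token = 6
estimated_words = corpus_bytes // assumed_bytes_per_token

print("estimated words:", estimated_words)
print("fits in 32-bit int:", fits_signed(estimated_words, 32))
print("fits in 64-bit int:", fits_signed(estimated_words, 64))
```

Under these assumptions the word count is far above 2**31 - 1, so any counter declared as a 32-bit int cannot hold it, while a 64-bit counter can.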

@miweru
Author

miweru commented Aug 15, 2019

The same problem also occurs with FastText:
Exception in thread Thread-8:
Traceback (most recent call last):
  File "/home/michael/miniconda3/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/michael/miniconda3/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/home/michael/miniconda3/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 175, in _worker_loop_corpusfile
    total_examples=total_examples, total_words=total_words, **kwargs)
  File "/home/michael/miniconda3/lib/python3.7/site-packages/gensim/models/fasttext.py", line 805, in _do_train_epoch
    total_examples, total_words, work, neu1)
  File "gensim/models/fasttext_corpusfile.pyx", line 215, in gensim.models.fasttext_corpusfile.train_epoch_cbow
OverflowError: value too large to convert to int

@piskvorky added the bug (Issue described a bug) label Aug 15, 2019
@mpenkov
Collaborator

mpenkov commented Sep 7, 2019

@persiyanov Ping on this.

@mpenkov self-assigned this Sep 7, 2019
@piskvorky added the reach HIGH (Affects most or all Gensim users) and impact LOW (Low impact on affected users) labels Oct 8, 2019
@piskvorky
Owner

Another instance of 32 vs 64 bit integers causing trouble for users:

https://groups.google.com/forum/#!topic/gensim/XbH5Sr6RBcI

Not sure if it's the exact same instance as in this ticket… perhaps we need to promote counters from 32 bit to 64 bit more consistently, everywhere.

@piskvorky added the impact MEDIUM (Big annoyance for affected users) label and removed the impact LOW (Low impact on affected users) label Nov 12, 2019
@persiyanov
Contributor

@mpenkov @piskvorky My apologies for the late response.

No, I don't think there was a deliberate choice of using int instead of long long, and I believe it's safe to just replace all int types with long long in all corpus-file-related Cython code.

Do you need a hand with this? I can devote some time next week to fix it.
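
The effect of widening a counter from int to long long can be sketched from Python with the fixed-width ctypes types, which stand in here for C `int` and `long long` (an assumption: C `int` is 32 bits on common platforms). Note that ctypes wraps silently, whereas the Cython code in the tracebacks above raised OverflowError, so this sketch only shows that the value does not survive in the narrow type. The counter value is hypothetical.

```python
import ctypes

# Hypothetical counter value, chosen to be above 2**31 - 1.
word_count = 3_000_000_000

narrow = ctypes.c_int32(word_count)  # stand-in for a 32-bit C int
wide = ctypes.c_int64(word_count)    # stand-in for a 64-bit C long long

# ctypes does no overflow checking, so the 32-bit value wraps around.
print("32-bit:", narrow.value)
print("64-bit:", wide.value)
print("32-bit preserves the count:", narrow.value == word_count)
print("64-bit preserves the count:", wide.value == word_count)
```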

@piskvorky
Owner

piskvorky commented Nov 16, 2019

Definitely. If you could grep / review all the *2vec C code for such 32=>64 bit fixes, that would be awesome. We don't want any 32 bit counters. Thanks!

@Xiaoxiong-Liu

@piskvorky
Is this fixed in version 3.8?
I get the same error when training word2vec on a large corpus.
Thanks.

@piskvorky
Owner

piskvorky commented Mar 10, 2021

@Xiaoxiong-Liu yes, a fix should have been a part of 4.0.0beta.

Let me know if this still appears after you upgrade Gensim: pip install --pre --upgrade gensim.
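
After upgrading, a quick way to confirm which Gensim is installed is sketched below. The helper `has_corpusfile_overflow_fix` is hypothetical and only compares the major version; it relies on the statement above that a fix should have been part of 4.0.0beta and does not verify the fix itself.

```python
from importlib import metadata


def has_corpusfile_overflow_fix(version: str) -> bool:
    """Return True if `version` has major version 4 or later.

    Assumption: per the comment above, a fix should be in 4.0.0beta and later.
    """
    major = int(version.split(".")[0])
    return major >= 4


try:
    # Read the installed version without importing the package itself.
    installed = metadata.version("gensim")
    print(installed, has_corpusfile_overflow_fix(installed))
except metadata.PackageNotFoundError:
    print("gensim is not installed in this environment")
```

When moving from 3.8 to 4.x, the `size` argument used in the original command is named `vector_size`, so the training call needs a matching update.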
