Fix scan vocab speed issue, build vocab from provided word frequencies #1599
Conversation
Thanks for your PR @jodevak, please make small fixes and I'll merge your PR.
gensim/models/word2vec.py (outdated)
```python
        """
        self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
        self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update)  # trim by min_count & precalculate downsampling
        self.finalize_vocab(update=update)  # build tables & arrays


```
Too many blank lines
gensim/models/word2vec.py (outdated)
```python
    def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False):
        """
```
Please add documentation in numpy-style
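For reference, a numpy-style docstring for this method might look roughly like this sketch (the parameter descriptions are illustrative, not the exact text that landed in gensim):

```python
def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False):
    """Build vocabulary from a dictionary of word frequencies.

    Parameters
    ----------
    word_freq : dict of (str, int)
        Mapping from a word to its total count in the corpus.
    keep_raw_vocab : bool, optional
        If False, delete the raw vocabulary after scaling is done, freeing up RAM.
    corpus_count : int, optional
        Number of sentences in the corpus, if known; stored on the model.
    trim_rule : function, optional
        Custom rule deciding whether a word should be kept or discarded,
        overriding the default ``count >= min_count`` check.
    update : bool, optional
        If True, add the provided words to the existing vocabulary
        instead of replacing it.
    """
```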
gensim/models/word2vec.py (outdated)
```python
    def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False):
```
Add test for this method
gensim/models/word2vec.py (outdated)
```diff
-                "PROGRESS: at sentence #%i, processed %i words, keeping %i word types",
-                sentence_no, sum(itervalues(vocab)) + total_words, len(vocab)
-            )
+            logger.info("PROGRESS: at sentence #%i, processed %i words, keeping %i word types",
```
Please use hanging indents
@jodevak I find 1) weird. Summing a bunch of values should be very fast, no matter the dictionary size. What was your progress_per?
@piskvorky 2- Summing the values requires iterating over all the dictionary values, in other words over all the stored word counts, which is definitely slower than incrementing a single counter. @menshikh-iv
Btw we'll be replacing all the counting stuff with Bounter, so this will be moot. Only needs some code style fixes (vertical indent), otherwise LGTM 👍
@piskvorky I edited the comment; progress_per is 10000, which is the default value. I hope you give it a try on some randomly generated word occurrences. Anyway, thank you :)
A sum of 2,700,000 dict values shouldn't take more than a few dozen milliseconds, and it's done only once every few seconds. Weird... but a timing is a timing! In any case, Bounter keeps a […]
@piskvorky Yes, using Bounter would be more elegant. And to make the speed issue clearer, consider this code. Thanks.

```python
from collections import defaultdict
from time import time

g = defaultdict(int, dict(enumerate(range(2700000))))
counter = 0
# […]
```
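The quoted benchmark is cut off above; a plausible completion (a sketch reconstructing the comparison the comment describes, not the exact code that was posted) times the per-report sum against a running counter:

```python
from collections import defaultdict
from time import time

# A 2.7M-entry word-count dictionary, as in the truncated snippet.
g = defaultdict(int, dict(enumerate(range(2700000))))
counter = 0

# Old approach: sum all stored counts on every progress report.
start = time()
for _ in range(100):  # simulate 100 progress reports
    total = sum(g.values())
print("sum over dict values: %.2f s" % (time() - start))

# New approach: maintain a running total, one O(1) increment per update.
start = time()
for _ in range(100):
    counter += 1
print("running counter: %.6f s" % (time() - start))
```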
gensim/models/word2vec.py (outdated)
```diff
@@ -329,7 +333,8 @@ def train_sg_pair(model, word, context_index, alpha, learn_vectors=True, learn_h
     return neu1e


-def train_cbow_pair(model, word, input_word_indices, l1, alpha, learn_vectors=True, learn_hidden=True, compute_loss=False,
+def train_cbow_pair(model, word, input_word_indices, l1, alpha, learn_vectors=True, learn_hidden=True,
```
Please use hanging indents (instead of vertical), here and everywhere.
"When using a hanging indent the following should be considered; there should be no arguments on the first line and further indentation should be used to clearly distinguish itself as a continuation line."
Is this what you need? If yes, do you recommend any tool other than autopep8 that would auto-format the file?
Sorry, incorrect line.
Vertical indents are OK for function/method definitions, that's all. For other situations in gensim we use hanging indents.
Unfortunately, I can't recommend any tool for it (we have no tool that checks this condition yet), only manual review.
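For illustration, the two styles under discussion side by side (a standalone sketch, not code from the PR):

```python
import logging

logger = logging.getLogger(__name__)
sentence_no, total_words = 10000, 1234567

# Vertical indent: continuation lines aligned under the opening parenthesis.
logger.info("PROGRESS: at sentence #%i, processed %i words",
            sentence_no, total_words)

# Hanging indent: no arguments on the first line; continuation lines get
# one extra level of indentation.
logger.info(
    "PROGRESS: at sentence #%i, processed %i words",
    sentence_no, total_words
)
```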
Commits 7b3b08a to c91b4cb
@menshikh-iv, is the indentation accepted now? If yes, I think only a test for build_vocab_from_freq is left to add. Many thanks.
@jodevak yes, please add the needed test and that's all 👍
Congrats on your first contribution @jodevak 🥇
@menshikh-iv Thanks 👍
Already merged, but some changes needed.
```python
            )
            for word in sentence:
                vocab[word] += 1
                total_words += 1
```
This is not a good idea, it may be (unnecessarily) slow. Why not add the entire len(sentence) at once?
Hmmm, although it won't noticeably affect the speed, yes, it should be incremented all at once 👍
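The reviewer's suggestion as a minimal sketch (variable names follow the diff above; the sentence is illustrative):

```python
from collections import defaultdict

vocab = defaultdict(int)
total_words = 0
sentence = ["the", "quick", "brown", "fox"]

for word in sentence:
    vocab[word] += 1           # per-word counts still need the loop
total_words += len(sentence)   # but the total is one addition per sentence
```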
```diff
         if self.max_vocab_size and len(vocab) > self.max_vocab_size:
-            total_words += utils.prune_vocab(vocab, min_reduce, trim_rule=trim_rule)
+            utils.prune_vocab(vocab, min_reduce, trim_rule=trim_rule)
```
I don't see any tests for this change during pruning, seems risky. Does it really work?
Hmmm, do you really think it needs a new test? prune_vocab has not been touched, only the counter.
Yes, definitely. You changed the semantics of how total_words works; for example, the return value of utils.prune_vocab is ignored now. It may be correct, but it is not obvious to me and deserves an explicit check.
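A minimal sketch of the kind of explicit check being requested (hypothetical test code, not the test that was eventually merged): gensim's utils.prune_vocab modifies the dict in place and returns the sum of the counts it discarded, i.e. the value that scan_vocab no longer adds to total_words.

```python
from gensim import utils

vocab = {"a": 5, "b": 1, "c": 1}
dropped = utils.prune_vocab(vocab, min_reduce=2)  # drop words with count < 2

assert vocab == {"a": 5}  # infrequent words removed in place
assert dropped == 2       # two occurrences ("b" and "c") were discarded
```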
```python
        Examples
        --------
        >>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
```
Code style: PEP8. Also, this is an instance method (cannot be called without an object).
```python
        >>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
        """
        logger.info("Processing provided word frequencies")
```
Be more concrete in the log: what was provided to what? (How many entries, total frequencies?) Logs at INFO level are important, we want to make them count.
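A more concrete INFO message along the requested lines might look like this (a sketch; the exact wording that landed in gensim may differ):

```python
import logging

logger = logging.getLogger(__name__)
word_freq = {"word1": 15, "word2": 20}  # the dict passed to build_vocab_from_freq

logger.info(
    "collected %i different raw words, with total frequency of %i",
    len(word_freq), sum(word_freq.values())
)
```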
```python
        """
        logger.info("Processing provided word frequencies")
        vocab = defaultdict(int, word_freq)
```
Won't this duplicate (double) the entire dictionary? Is it backward compatible in the sense that this refactoring won't consume much more memory?
Duplicating the entire vocab? It's just assigning a ready raw vocab (word count) dictionary. Is there a part I'm not getting?
I don't think so. The defaultdict constructor will copy the entire contents of word_freq, which may be memory intensive for large vocabularies.
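A quick demonstration of the reviewer's point: the defaultdict constructor builds a brand-new dict, copying every key/value pair from word_freq rather than wrapping the original object.

```python
from collections import defaultdict

word_freq = {"word1": 15, "word2": 20}
vocab = defaultdict(int, word_freq)

assert vocab == word_freq        # same contents...
assert vocab is not word_freq    # ...but a separate object: the data is copied
word_freq["word3"] = 7
assert "word3" not in vocab      # later changes to the original are not reflected
```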
```python
        self.corpus_count = corpus_count if corpus_count else 0
        self.raw_vocab = vocab

        self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update)  # trim by min_count & precalculate downsampling
```
This function could use some comments and invariants: what's the relationship between vocab vs raw_vocab vs word_freq?
word_freq is the same as raw_vocab, and vocab is the same as word_freq, so yes, I think I should use different naming.
Can you create a new PR and fix all the comments @jodevak?
Fix scan vocab speed issue, build vocab from provided word frequencies (#1599)
* fix build vocab speed issue, and new function to build vocab from previously provided word frequencies table
* fix build vocab speed issue, function build vocab from previously provided word frequencies table
* fix build vocab speed issue, function build vocab from previously provided word frequencies table
* fix build vocab speed issue, function build vocab from previously provided word frequencies table
* Removing the extra blank lines, documentation in numpy-style to build_vocab_from_freq, and hanging indents in build_vocab
* Fixing Indentation
* Fixing gensim/models/word2vec.py:697:1: W293 blank line contains whitespace
* Remove trailing white spaces
* Adding test
* fix spaces
@menshikh-iv sure
Continued in #1695. |
This request has two parts:
1. There was a noticeable speed issue with scan_vocab, which turned out to be caused by sum(itervalues(vocab)): it iterates over the entire vocab every progress_per sentences, which is costly for a big vocab. With this modification it took only 45 minutes to scan and build the whole vocab on 57 GB of production word co-occurrences (window=1), versus 270 minutes with the old implementation.
2. Since building the vocabulary is a single-threaded operation, it would be very helpful to have a function that builds a vocabulary from pre-computed word frequencies (the new build_vocab_from_freq function). For example, one could count words in a distributed way with Spark and then pipe the resulting word frequencies into gensim's word2vec.
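A usage sketch of the workflow described in part 2 (the frequency table and parameter values are illustrative, and constructor argument names vary across gensim versions, e.g. size in 3.x vs vector_size in 4.x):

```python
from gensim.models import Word2Vec

# Word counts produced elsewhere, e.g. by a distributed Spark job.
word_freq = {"system": 120, "graph": 87, "trees": 45, "minors": 12}

model = Word2Vec(vector_size=100, min_count=5)
model.build_vocab_from_freq(word_freq)  # build the vocab without scanning a corpus
print(len(model.wv.key_to_index))       # number of words that survived min_count
```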