Fix scan vocab speed issue, build vocab from provided word frequencies #1599
Changes from all commits
3f30e1e
c4f387e
8abd58b
8ec0433
b9f3a5f
0a5e8d6
644fcad
c91b4cb
1e4ef3e
9ae7a84
@@ -615,12 +615,49 @@ def build_vocab(self, sentences, keep_raw_vocab=False, trim_rule=None, progress_
        """
        Build vocabulary from a sequence of sentences (can be a once-only generator stream).
        Each sentence must be a list of unicode strings.

        """
        self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
        self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update)  # trim by min_count & precalculate downsampling
        self.finalize_vocab(update=update)  # build tables & arrays
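For context, a minimal sketch of how the existing sentence-based path above is typically driven (the toy corpus and parameters are illustrative, not part of this diff):

from gensim.models import Word2Vec

sentences = [["human", "interface", "computer"], ["survey", "user", "computer"]]
model = Word2Vec(min_count=1)   # no corpus passed, so no vocabulary is built yet
model.build_vocab(sentences)    # scan_vocab -> scale_vocab -> finalize_vocab, as above

The new method added below offers an alternative entry point for the case where only precomputed word counts, not the sentences themselves, are available.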
    def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False):
        """
        Build vocabulary from a dictionary of word frequencies.
        Build model vocabulary from a passed dictionary that contains (word, word count).
        Words must be of type unicode strings.

        Parameters
        ----------
        `word_freq` : dict
            Word,Word_Count dictionary.
        `keep_raw_vocab` : bool
            If not true, delete the raw vocabulary after the scaling is done and free up RAM.
        `corpus_count` : int
            Even if no corpus is provided, this argument can set corpus_count explicitly.
        `trim_rule` = vocabulary trimming rule, specifies whether certain words should remain
            in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count).
            Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and
            returns either `utils.RULE_DISCARD`, `utils.RULE_KEEP` or `utils.RULE_DEFAULT`.
        `update` : bool
            If true, the new provided words in `word_freq` dict will be added to model's vocab.

        Returns
        -------
        None

        Examples
        --------
        >>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
        """
        logger.info("Processing provided word frequencies")
Reviewer: Be more concrete in the log: what was provided to what? (how many entries, total frequencies?) Logs at …
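As an illustration of the kind of more concrete log line the reviewer is asking for, a hedged standalone sketch (the wording and variable setup are mine, not the PR's):

import logging

logger = logging.getLogger(__name__)
word_freq = {"word1": 15, "word2": 20}   # illustrative input

# Report how many distinct words and how many total counts were passed in.
logger.info(
    "collected %i distinct words with a total frequency of %i from the provided dictionary",
    len(word_freq), sum(word_freq.values())
)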
        vocab = defaultdict(int, word_freq)
Reviewer: Won't this duplicate (double) the entire dictionary? Is it backward compatible in the sense that this refactoring won't consume much more memory?

Author: Duplicating the entire vocab? It's just assigning a ready raw vocab (word count) dictionary. Is there a part I'm not getting?

Reviewer: I don't think so. The …
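For reference, a small self-contained illustration of what `defaultdict(int, word_freq)` does (plain Python behaviour, not gensim-specific): it builds a second dictionary with the same entries, so two hash tables coexist while both references are alive, although the keys and values are shared by reference rather than copied.

from collections import defaultdict

word_freq = {"word1": 15, "word2": 20}
vocab = defaultdict(int, word_freq)

assert vocab == word_freq              # same contents ...
assert vocab is not word_freq          # ... but a distinct dict object (a second hash table)
assert all(a is b for a, b in zip(sorted(word_freq), sorted(vocab)))   # key objects are shared, not re-created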
        self.corpus_count = corpus_count if corpus_count else 0
        self.raw_vocab = vocab

        self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update)  # trim by min_count & precalculate downsampling
Reviewer: This function could use some comments and invariants: what's the relationship between …

Author: word_freq is the same as raw_vocab, vocab is the same as word_freq, so yes, I think I should use a different naming.
        self.finalize_vocab(update=update)  # build tables & arrays
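A hedged sketch of how the new method is intended to be used, based on the signature above (the frequency dict and model parameters are illustrative, not taken from the PR):

from gensim.models import Word2Vec

word_freq = {"word1": 15, "word2": 20, "word3": 5}   # counts obtained elsewhere

model = Word2Vec(min_count=1)            # no corpus given, vocabulary still empty
model.build_vocab_from_freq(word_freq)   # raw_vocab is filled from the dict, then scaled and finalized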
    def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
        """Do an initial scan of all words appearing in sentences."""
        logger.info("collecting all words and their counts")

@@ -641,16 +678,16 @@ def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
            if sentence_no % progress_per == 0:
                logger.info(
                    "PROGRESS: at sentence #%i, processed %i words, keeping %i word types",
-                   sentence_no, sum(itervalues(vocab)) + total_words, len(vocab)
+                   sentence_no, total_words, len(vocab)
                )
            for word in sentence:
                vocab[word] += 1
+               total_words += 1
Reviewer: This is not a good idea, may be (unnecessarily) slow. Why not add the entire …

Author: Hmmm, although it won't noticeably affect the speed, yes, it should be incrementing at once 👍
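If the suggestion is to add the whole sentence length in one step rather than incrementing per word, a minimal self-contained sketch of that variant (the names mirror the diff, but this is not the PR's code):

from collections import defaultdict

sentence = ["the", "quick", "brown", "fox", "the"]
vocab = defaultdict(int)
total_words = 0

for word in sentence:
    vocab[word] += 1
total_words += len(sentence)   # one addition per sentence instead of one per word

assert total_words == 5 and vocab["the"] == 2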
            if self.max_vocab_size and len(vocab) > self.max_vocab_size:
-               total_words += utils.prune_vocab(vocab, min_reduce, trim_rule=trim_rule)
+               utils.prune_vocab(vocab, min_reduce, trim_rule=trim_rule)
Reviewer: I don't see any tests for this change during pruning, seems risky. Does it really work?

Author: Hmmm, do you really think it needs a new test? prune_vocab has not been touched, only the counter.

Reviewer: Yes, definitely. You changed the semantics of how the … It may be correct, but is not obvious to me and deserves an explicit check.
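One way to make that check explicit: a hedged, self-contained sketch comparing the old counting strategy (sum the survivors at the end, plus whatever pruning removed) against the new one (count every streamed word). The `prune` helper below only mimics `utils.prune_vocab`, and the thresholds are illustrative, not gensim's:

from collections import defaultdict

def prune(vocab, min_reduce):
    """Drop words with count < min_reduce; return the total count removed (mimics utils.prune_vocab)."""
    removed = sum(c for c in vocab.values() if c < min_reduce)
    for w in [w for w, c in vocab.items() if c < min_reduce]:
        del vocab[w]
    return removed

sentences = [["a", "b", "a"], ["c", "a", "b"], ["d"]]
MAX_VOCAB, MIN_REDUCE = 2, 2          # stand-ins for max_vocab_size / min_reduce

# Old strategy: count pruned words when pruning, sum the survivors at the end.
vocab_old, old_total = defaultdict(int), 0
for sentence in sentences:
    for word in sentence:
        vocab_old[word] += 1
    if len(vocab_old) > MAX_VOCAB:
        old_total += prune(vocab_old, MIN_REDUCE)
old_total += sum(vocab_old.values())

# New strategy: count every word as it streams by; pruning no longer touches the counter.
vocab_new, new_total = defaultdict(int), 0
for sentence in sentences:
    for word in sentence:
        vocab_new[word] += 1
        new_total += 1
    if len(vocab_new) > MAX_VOCAB:
        prune(vocab_new, MIN_REDUCE)

assert old_total == new_total == 7    # both report the same raw word total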
                min_reduce += 1

-       total_words += sum(itervalues(vocab))
        logger.info(
            "collected %i word types from a corpus of %i raw words and %i sentences",
            len(vocab), total_words, sentence_no + 1
Reviewer (on the `>>> build_vocab_from_freq(...)` docstring example): Code style: PEP8. Also, this is an instance method (cannot be called without an object).
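A hedged sketch of what the docstring example might look like once that comment is addressed (PEP8 spacing, called on an instance; the two-step model setup is illustrative and not taken from the PR):

>>> from gensim.models import Word2Vec
>>> model = Word2Vec()
>>> model.build_vocab_from_freq({"Word1": 15, "Word2": 20})
>>> model.build_vocab_from_freq({"Word3": 5}, update=True)   # later, extend the existing vocab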