
Fix scan vocab speed issue, build vocab from provided word frequencies #1599

Merged: 10 commits, Oct 19, 2017
45 changes: 41 additions & 4 deletions gensim/models/word2vec.py
@@ -615,12 +615,49 @@ def build_vocab(self, sentences, keep_raw_vocab=False, trim_rule=None, progress_
"""
Build vocabulary from a sequence of sentences (can be a once-only generator stream).
Each sentence must be a list of unicode strings.

"""
self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey
self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling
self.finalize_vocab(update=update) # build tables & arrays
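
For orientation, a minimal sketch of this entry point on a fresh model (the sentences and parameter values are illustrative, not from this PR):

from gensim.models import word2vec

sentences = [['human', 'interface', 'computer'], ['survey', 'user', 'computer']]
model = word2vec.Word2Vec(size=10, min_count=1)  # no sentences passed, so nothing is trained yet
model.build_vocab(sentences)  # runs scan_vocab -> scale_vocab -> finalize_vocab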

def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False):
"""
Build vocabulary from a dictionary of word frequencies.
Build the model vocabulary from a passed dictionary mapping each word to its count.
Words must be unicode strings.

Parameters
----------
`word_freq` : dict
    A mapping from word (unicode string) to its count.
`keep_raw_vocab` : bool
    If False, delete the raw vocabulary after the scaling is done, to free up RAM.
`corpus_count` : int
    Even if no corpus is provided, this argument can set corpus_count explicitly.
`trim_rule` : callable or None
    Vocabulary trimming rule that specifies whether certain words should remain in the
    vocabulary, be trimmed away, or handled using the default (discard if word count < min_count).
    Can be None (min_count will be used), or a callable that accepts parameters
    (word, count, min_count) and returns either `utils.RULE_DISCARD`, `utils.RULE_KEEP` or
    `utils.RULE_DEFAULT`; a short sketch of such a rule follows this docstring.
`update` : bool
    If True, the new words provided in the `word_freq` dict will be added to the model's vocab.

Returns
-------
None

Examples
--------
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
Owner:
Code style: PEP8. Also, this is an instance method (cannot be called without an object).
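
As a sketch of what this comment asks for, the example might become (the `model` name is illustrative):

model = word2vec.Word2Vec()
model.build_vocab_from_freq({"word1": 15, "word2": 20})
model.build_vocab_from_freq({"word3": 7}, update=True)  # later, extend the existing vocab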

"""
logger.info("Processing provided word frequencies")
Owner @piskvorky (Oct 19, 2017):
Be more concrete in the log: what was provided to what? (how many entries, total frequencies?) Logs at INFO level are important, we want to make them count.
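
One possible shape for a more concrete INFO line, per this comment (a sketch, not the merged code):

logger.info(
    "processing %i provided word frequencies (%i raw words in total)",
    len(word_freq), sum(itervalues(word_freq))
)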

vocab = defaultdict(int, word_freq)
Owner:
Won't this duplicate (double) the entire dictionary? Is it backward compatible in the sense that this refactoring won't consume much more memory?

Contributor Author:
Duplicating the entire vocab? It's just assigning a ready raw vocab (word count) dictionary. Is there a part I'm not getting?

Owner @piskvorky (Oct 24, 2017):
I don't think so. The defaultdict constructor will copy the entire contents of word_freq, which may be memory intensive for large vocabularies.
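
A quick demonstration of the copy being discussed — the defaultdict constructor builds a new dict holding its own copy of every entry (sketch):

from collections import defaultdict

word_freq = {'human': 2, 'graph': 3}
vocab = defaultdict(int, word_freq)  # new dict object; all key/value pairs are copied
assert vocab is not word_freq
word_freq['human'] = 99
assert vocab['human'] == 2  # the copy is independent of the original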


self.corpus_count = corpus_count if corpus_count else 0
self.raw_vocab = vocab

self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling
Owner:
This function could use some comments and invariants: what's the relationship between vocab vs raw_vocab vs word_freq?

Contributor Author:
word_freq is the same as raw_vocab, and vocab is the same as word_freq, so yes, I think I should use different naming.

self.finalize_vocab(update=update) # build tables & arrays

def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
"""Do an initial scan of all words appearing in sentences."""
logger.info("collecting all words and their counts")
@@ -641,16 +678,16 @@ def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
if sentence_no % progress_per == 0:
logger.info(
"PROGRESS: at sentence #%i, processed %i words, keeping %i word types",
sentence_no, sum(itervalues(vocab)) + total_words, len(vocab)
sentence_no, total_words, len(vocab)
)
for word in sentence:
vocab[word] += 1
total_words += 1
Owner @piskvorky (Oct 19, 2017):
This is not a good idea, may be (unnecessarily) slow. Why not add the entire len(sentence) at once?

Contributor Author:
Hmm, although it won't noticeably affect the speed, yes, it should be incremented all at once 👍
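
The batched form suggested above would replace the per-word increment with one addition per sentence (a sketch against the loop in this hunk):

for word in sentence:
    vocab[word] += 1
total_words += len(sentence)  # one addition per sentence instead of one per word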


if self.max_vocab_size and len(vocab) > self.max_vocab_size:
total_words += utils.prune_vocab(vocab, min_reduce, trim_rule=trim_rule)
utils.prune_vocab(vocab, min_reduce, trim_rule=trim_rule)
Owner:
I don't see any tests for this change during pruning, seems risky. Does it really work?

Contributor Author:
Hmm, do you really think it needs a new test? prune_vocab has not been touched, only the counter.

Owner @piskvorky (Oct 24, 2017):
Yes, definitely. You changed the semantics of how total_words works; for example, the return value of utils.prune_vocab is ignored now.

It may be correct, but is not obvious to me and deserves an explicit check.
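
A minimal sketch of the kind of explicit check being requested, exercising utils.prune_vocab directly (the counts are illustrative):

from gensim import utils

vocab = {'human': 1, 'graph': 2, 'computer': 5}
before = sum(vocab.values())
pruned = utils.prune_vocab(vocab, min_reduce=3)  # drops words with count < 3, returns their summed counts
assert pruned + sum(vocab.values()) == before  # no raw-word counts lost or double-counted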

min_reduce += 1

total_words += sum(itervalues(vocab))
logger.info(
"collected %i word types from a corpus of %i raw words and %i sentences",
len(vocab), total_words, sentence_no + 1
47 changes: 47 additions & 0 deletions gensim/test/test_word2vec.py
@@ -84,6 +84,53 @@ def load_on_instance():


class TestWord2VecModel(unittest.TestCase):
def testBuildVocabFromFreq(self):
"""Test that the algorithm is able to build vocabulary from given
frequency table"""
freq_dict = {
'minors': 2, 'graph': 3, 'system': 4,
'trees': 3, 'eps': 2, 'computer': 2,
'survey': 2, 'user': 3, 'human': 2,
'time': 2, 'interface': 2, 'response': 2
}
model_hs = word2vec.Word2Vec(size=10, min_count=0, seed=42, hs=1, negative=0)
model_neg = word2vec.Word2Vec(size=10, min_count=0, seed=42, hs=0, negative=5)
model_hs.build_vocab_from_freq(freq_dict)
model_neg.build_vocab_from_freq(freq_dict)
self.assertEqual(len(model_hs.wv.vocab), 12)
self.assertEqual(len(model_neg.wv.vocab), 12)
for word, count in freq_dict.items():
    self.assertEqual(model_hs.wv.vocab[word].count, count)
    self.assertEqual(model_neg.wv.vocab[word].count, count)
new_freq_dict = {'computer': 1, 'artificial': 4, 'human': 1, 'graph': 1, 'intelligence': 4, 'system': 1, 'trees': 1}
model_hs.build_vocab_from_freq(new_freq_dict, update=True)
model_neg.build_vocab_from_freq(new_freq_dict, update=True)
self.assertEqual(model_hs.wv.vocab['graph'].count, 4)
self.assertEqual(model_hs.wv.vocab['artificial'].count, 4)
self.assertEqual(len(model_hs.wv.vocab), 14)
self.assertEqual(len(model_neg.wv.vocab), 14)

def testOnlineLearning(self):
"""Test that the algorithm is able to add new words to the
vocabulary and to a trained model when using a sorted vocabulary"""