-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improving Scan_Vocab speed, build_vocab_from_freq function. Iteration 2 #1695
Conversation
…viously provided word frequencies table
…vided word frequencies table
…vided word frequencies table
…vided word frequencies table
…_vocab_from_freq, and hanging indents in build_vocab
…into build_vocab_freq
gensim/test/test_word2vec.py
Outdated
half_precision_model_kv = keyedvectors.KeyedVectors.load_word2vec_format( | ||
testfile(), binary=True, datatype=np.float16 | ||
) | ||
self.assertEqual(binary_model_kv.syn0.nbytes, half_precision_model_kv.syn0.nbytes * 2) | ||
self.assertEquals(binary_model_kv.syn0.nbytes, half_precision_model_kv.syn0.nbytes * 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.python.org/2/library/unittest.html#deprecated-aliases
assertEquals
is deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alright
gensim/test/test_word2vec.py
Outdated
["minors", "survey", "minors", "survey", "minors"] | ||
] | ||
model = word2vec.Word2Vec(sentences, size=10, min_count=0, max_vocab_size=2, seed=42, hs=1, negative=0) | ||
self.assertTrue(len(model.wv.vocab), 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you need assertEqual
, not assertTrue
, don't you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix
gensim/models/word2vec.py
Outdated
.. [#taddy] Taddy, Matt. Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics. | ||
.. [#deepir] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb | ||
.. [taddy] Taddy, Matt. Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics. | ||
.. [deepir] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, but why are you remove #
in citate ? (#1633)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
autopep8 tool did
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this file is merged with an older version.
gensim/models/word2vec.py
Outdated
@@ -647,15 +647,19 @@ def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=No | |||
|
|||
Examples | |||
-------- | |||
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True) | |||
>>> model.build_vocab_from_freq({"Word1":15,"Word2":20}, update=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: model.build_vocab_from_freq({"Word1": 15, "Word2": 20}, update=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, whats the problem with this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spaces after :
, ,
(in comment fixed variant)
@@ -647,15 +647,19 @@ def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=No | |||
|
|||
Examples | |||
-------- | |||
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Model is undefined, please create model first (docstring should be executable, i.e. I can copy-paste this code to console and I expect that code run successfully) we plan to add doctests to our CI soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
self.corpus_count = corpus_count if corpus_count else 0 | ||
self.raw_vocab = vocab | ||
self.corpus_count = corpus_count if corpus_count else 0 # Since no sentences are provided, this is to control the corpus_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8 - two spaces before #
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are 2 space, arent they ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, really, sorry
gensim/models/word2vec.py
Outdated
|
||
self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling | ||
self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule,update=update) # trim by min_count & precalculate downsampling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return previous variant
gensim/models/word2vec.py
Outdated
@@ -675,14 +679,14 @@ def scan_vocab(self, sentences, progress_per=10000, trim_rule=None): | |||
type(sentence) | |||
) | |||
checked_string_types += 1 | |||
if sentence_no % progress_per == 0: | |||
if sentence_no % progress_per == 0 and sentence_no != 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this need?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because 0% anything will equal to 0; so the logger will log a statement saying sentence 0 and processed 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we want that :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we want that :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add test based on #1599 (comment) and fix log message based on this comment - #1599 (comment)
After this - I'll merge your PR
gensim/models/word2vec.py
Outdated
@@ -647,13 +647,19 @@ def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=No | |||
|
|||
Examples | |||
-------- | |||
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True) | |||
>>> from gensim.models.word2vec import Word2Vec | |||
>>> model=Word2Vec() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8 model = Word2Vec()
@menshikh-iv function testPruneVocab is already there . |
@jodevak need add test for |
@menshikh-iv Do you have any suggestions to test total_words, other than adding new attributes to the model object nor returning total_words as a value ? |
I see 3 variants
@piskvorky what's variant looks best for you? |
@menshikh-iv Choice 3 seems most convenient to me. |
Thank you @jodevak 👍 |
…#1695) * fix build vocab speed issue, and new function to build vocab from previously provided word frequencies table * fix build vocab speed issue, function build vocab from previously provided word frequencies table * fix build vocab speed issue, function build vocab from previously provided word frequencies table * fix build vocab speed issue, function build vocab from previously provided word frequencies table * Removing the extra blank lines, documentation in numpy-style to build_vocab_from_freq, and hanging indents in build_vocab * Fixing Indentation * Fixing gensim/models/word2vec.py:697:1: W293 blank line contains whitespace * Remove trailing white spaces * Adding test * fix spaces * iteration 2 on code * iteration 2 on code * Fixing old version of word2vec.py merge problems * Fixing indent * Fixing Styling * Fixing Styling * test * test * adding total words count test * adding total words count test
As requested, this is a new pull request. Thanks