From b89be949fffd3cf3f8b09470f01bcf9c50af2721 Mon Sep 17 00:00:00 2001
From: Paul O'Leary McCann
Date: Tue, 8 Sep 2020 17:54:42 +0900
Subject: [PATCH 1/4] Fix docs about Word2Vec (fix #2934)

Docs say you can use a generator as the first argument, but you can't.
The tempfile path was also unused, so that's been removed.
---
 gensim/models/word2vec.py | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/gensim/models/word2vec.py b/gensim/models/word2vec.py
index 5a74796c3c..9e9e65ff10 100755
--- a/gensim/models/word2vec.py
+++ b/gensim/models/word2vec.py
@@ -39,18 +39,15 @@
 .. sourcecode:: pycon
- >>> from gensim.test.utils import common_texts, get_tmpfile
+ >>> from gensim.test.utils import common_texts
 >>> from gensim.models import Word2Vec
 >>>
- >>> path = get_tmpfile("word2vec.model")
- >>>
 >>> model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
 >>> model.save("word2vec.model")
-The training is streamed, meaning `sentences` can be a generator, reading input data
-from disk on-the-fly, without loading the entire corpus into RAM.
+`common_texts` is a list of lists of strings, where each list is a document.
-It also means you can continue training the model later:
+If you save the model you can continue training it later:
 .. sourcecode:: pycon

From 78eef97d9ece1614c45b68af5eaf460973460da8 Mon Sep 17 00:00:00 2001
From: Paul O'Leary McCann
Date: Tue, 8 Sep 2020 19:46:39 +0900
Subject: [PATCH 2/4] Fix language to make it clear streaming is supported

Technically a generator is a kind of iterator, so this clarifies that a
restartable iterator (as opposed to a consumable generator) is necessary.
---
 gensim/models/word2vec.py | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/gensim/models/word2vec.py b/gensim/models/word2vec.py
index 9e9e65ff10..9f4bf2f8b1 100755
--- a/gensim/models/word2vec.py
+++ b/gensim/models/word2vec.py
@@ -45,7 +45,14 @@
 >>> model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
 >>> model.save("word2vec.model")
-`common_texts` is a list of lists of strings, where each list is a document.
+
+The training is streamed, so `sentences` can be an iterable, reading input data
+from disk on-the-fly. This lets you avoid loading the entire corpus into RAM.
+However, note that because the iterable must be re-startable, `sentences` must
+not be a generator. For an example of an appropriate iterator see
+:class:`~gensim.models.word2vec.BrownCorpus`,
+:class:`~gensim.models.word2vec.Text8Corpus` or
+:class:`~gensim.models.word2vec.LineSentence`.
 If you save the model you can continue training it later:

From 652dcfbf5e382031e7b86a93e6baf4f00ea16a7c Mon Sep 17 00:00:00 2001
From: Michael Penkov
Date: Wed, 16 Sep 2020 16:39:09 +0900
Subject: [PATCH 3/4] Update gensim/models/word2vec.py

---
 gensim/models/word2vec.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gensim/models/word2vec.py b/gensim/models/word2vec.py
index 9f4bf2f8b1..1c042ba851 100755
--- a/gensim/models/word2vec.py
+++ b/gensim/models/word2vec.py
@@ -46,7 +46,7 @@
 >>> model.save("word2vec.model")
-The training is streamed, so `sentences` can be an iterable, reading input data
+The training is streamed, so ``sentences`` can be an iterable, reading input data
 from disk on-the-fly. This lets you avoid loading the entire corpus into RAM.
 However, note that because the iterable must be re-startable, `sentences` must
 not be a generator. For an example of an appropriate iterator see

From c1191ae318a1db4d822f26abf98c1e9c22551b79 Mon Sep 17 00:00:00 2001
From: Michael Penkov
Date: Wed, 16 Sep 2020 16:41:30 +0900
Subject: [PATCH 4/4] Update CHANGELOG.md

---
 CHANGELOG.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 56eef755ca..989a71a6dd 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -15,6 +15,7 @@ This release contains a major refactoring.
 ### :books: Tutorial and doc improvements
 * Clear up LdaModel documentation - remove claim that it accepts CSC matrix as input (PR [#2832](https://github.com/RaRe-Technologies/gensim/pull/2832), [@FyzHsn](https://github.com/FyzHsn))
+ * Fix "generator" language in word2vec docs (PR [#2935](https://github.com/RaRe-Technologies/gensim/pull/2935), __[@polm](https://github.com/polm)__)
 ## :warning: 3.8.x will be the last gensim version to support Py2.7. Starting with 4.0.0, gensim will only support Py3.5 and above
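
Note on the restartable-iterable wording in PATCH 2/4: the distinction the new docstring draws can be illustrated without gensim at all. The sketch below is only an illustration under stated assumptions. `RestartableCorpus` and `one_shot_corpus` are hypothetical names that do not exist in gensim, and the corpus is assumed to hold one whitespace-tokenized sentence per line. It shows that an object whose `__iter__` re-opens the file can be traversed repeatedly, while a generator is empty after its first pass, which is presumably why a generator cannot serve a trainer that reads the corpus more than once.

```python
import os
import tempfile


class RestartableCorpus:
    """Hypothetical helper (not part of gensim): re-opens the file on every pass."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Each call to iter() starts a fresh read from the top of the file,
        # so the corpus is streamed from disk but can be traversed many times.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


def one_shot_corpus(path):
    """Hypothetical generator function: its generator is exhausted after one pass."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.split()


# Write a tiny two-sentence corpus to a temporary file (illustrative data only).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("human interface computer\nsurvey user computer system\n")
    corpus_path = tmp.name

restartable = RestartableCorpus(corpus_path)
generator = one_shot_corpus(corpus_path)

# Traverse each source twice, as a multi-pass consumer would.
restartable_passes = [list(restartable), list(restartable)]
generator_passes = [list(generator), list(generator)]

os.remove(corpus_path)

print("restartable, second pass:", restartable_passes[1])
print("generator, second pass:", generator_passes[1])
```

An instance of a class like this, rather than the generator, is the kind of object the revised docstring describes as acceptable for ``sentences``. The `LineSentence` class the patch points to is the ready-made option for files in this one-sentence-per-line layout.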