diff --git a/CHANGELOG.md b/CHANGELOG.md
index 56eef755ca..989a71a6dd 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -15,6 +15,7 @@ This release contains a major refactoring.
 ### :books: Tutorial and doc improvements
 
  * Clear up LdaModel documentation - remove claim that it accepts CSC matrix as input (PR [#2832](https://github.com/RaRe-Technologies/gensim/pull/2832), [@FyzHsn](https://github.com/FyzHsn))
+ * Fix "generator" language in word2vec docs (PR [#2935](https://github.com/RaRe-Technologies/gensim/pull/2935), __[@polm](https://github.com/polm)__)
 
 ## :warning: 3.8.x will be the last gensim version to support Py2.7. Starting with 4.0.0, gensim will only support Py3.5 and above
 
diff --git a/gensim/models/word2vec.py b/gensim/models/word2vec.py
index 5a74796c3c..1c042ba851 100755
--- a/gensim/models/word2vec.py
+++ b/gensim/models/word2vec.py
@@ -39,18 +39,22 @@
 .. sourcecode:: pycon
 
-    >>> from gensim.test.utils import common_texts, get_tmpfile
+    >>> from gensim.test.utils import common_texts
     >>> from gensim.models import Word2Vec
     >>>
-    >>> path = get_tmpfile("word2vec.model")
-    >>>
     >>> model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
     >>> model.save("word2vec.model")
 
-The training is streamed, meaning `sentences` can be a generator, reading input data
-from disk on-the-fly, without loading the entire corpus into RAM.
-It also means you can continue training the model later:
+The training is streamed, so ``sentences`` can be an iterable, reading input data
+from disk on-the-fly. This lets you avoid loading the entire corpus into RAM.
+However, note that because the iterable must be re-startable, `sentences` must
+not be a generator. For an example of an appropriate iterator see
+:class:`~gensim.models.word2vec.BrownCorpus`,
+:class:`~gensim.models.word2vec.Text8Corpus` or
+:class:`~gensim.models.word2vec.LineSentence`.
+
+If you save the model you can continue training it later:
 
 .. sourcecode:: pycon