-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor documentation for *2Vec
models
#1944
Conversation
gensim/models/base_any2vec.py
Outdated
@@ -288,18 +281,67 @@ class BaseWordEmbeddingsModel(BaseAny2VecModel): | |||
|
|||
""" | |||
|
|||
def _clear_post_train(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't change the code in current PR, only docstrings.
This reverts commit feb3c32.
Great start @steremma, please continue! |
Hey @steremma can I help you with the documentation? I have been working on the same. |
Hello @somnathrakshit I believe the PR is almost finished, however I am unsure about certain argument in some of the helper methods, specifically those that are conditionally defined in It would be nice if you could take a look at the type and description I gave for these arguments and let me know if you disagree or if you can improve any of them. For more specific info feel free to ping me in gitter so that we don't spam this discussion. Another way to cooperate might be for you to submit PRs to my fork directly instead of telling me what to change.I think in this way, after merging you will get credit for your contribution as well. |
@CLearERR @anotherbugmaster @yurkai - guys, please discuss with @steremma how to make docstring guideline better (we need our contribution documentation guide) |
""" | ||
Base class containing common methods for training, using & evaluating word embeddings learning models. | ||
For example - `Word2Vec`, `FastText`, etc. | ||
"""Base class containing common methods for training, using & evaluating word embeddings learning models. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget about docstrings for BaseAny2VecModel
.
Also, don't forget about _
methods, this is really important for persons, who will add some *2vec implementations in future.
gensim/models/base_any2vec.py
Outdated
Can be simply a list of lists of tokens, but for larger corpora, | ||
consider an iterable that streams the sentences directly from disk/network. | ||
See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.word2vec.Text8Corpus` | ||
or :class:`~gensim.models.word2vec.LineSentence` in :mod:`~gensim.models.word2vec` module for such examples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to mention :mod:
here (because you already mentioned concrete class)
gensim/models/base_any2vec.py
Outdated
consider an iterable that streams the sentences directly from disk/network. | ||
See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.word2vec.Text8Corpus` | ||
or :class:`~gensim.models.word2vec.LineSentence` in :mod:`~gensim.models.word2vec` module for such examples. | ||
workers : int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
almost all arguments is optional, no?
gensim/models/base_any2vec.py
Outdated
|
||
Parameters | ||
---------- | ||
sentences : iterable of iterable of str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iterable of list of str
better (here and everywhere)
gensim/models/base_any2vec.py
Outdated
List of callbacks that need to be executed/run at specific stages during training. | ||
batch_words : int | ||
Number of words to be processed by a single job. | ||
trim_rule : function, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to describe function signature (in the description of the parameter). I see (word, count, min_count)
, but I need types here + show it more explicitly (maybe new sentence)
gensim/models/base_any2vec.py
Outdated
keep_raw_vocab : bool | ||
If not true, delete the raw vocabulary after the scaling is done and free up RAM. | ||
corpus_count : int | ||
word_freq : dict of (unicode str, int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dict of (str, int)
gensim/models/base_any2vec.py
Outdated
|
||
Returns | ||
------- | ||
dict of (str, int), optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How Returns
can be optional?
gensim/models/word2vec.py
Outdated
Examples | ||
-------- | ||
|
||
#. Initialize a model with e.g.:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to fix example (this should works fine + demonstrate more methods)
gensim/models/word2vec.py
Outdated
Obtain likelihood score for a single sentence in a fitted skip-gram representaion. | ||
The sentence is a list of Vocab objects (or None, when the corresponding | ||
word is not in the vocabulary). Called internally from `Word2Vec.score()`. | ||
Obtain likelihood score for a single sentence in a fitted skip-gram representation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be
"""Obtain ...
here and everywhere
gensim/models/doc2vec.py
Outdated
@@ -512,32 +648,49 @@ def train(self, documents, total_examples=None, total_words=None, | |||
queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks) | |||
|
|||
def _raw_word_count(self, job): | |||
"""Return the number of words in a given job.""" | |||
"""Return the number of words in a given job. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to use always Get
instead of Return
in docstring (here and everywhere)
*2Vec
models.
*2Vec
models.*2Vec
models
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general - looks awesome! Good work 👍
Missed things (for current moment)
- Examples for d2v/w2v
- Fasttext
- Poincare
- Docstrings for
.pyx
files (this will be done in last order, we'll discuss it later)
gensim/models/base_any2vec.py
Outdated
@@ -28,18 +28,40 @@ | |||
|
|||
class BaseAny2VecModel(utils.SaveLoad): | |||
"""Base class for training, using and evaluating any2vec model. | |||
Contains implementation for multi-threaded training. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix description in the head of current file please (make it more detailed, links to several related classes, etc).
gensim/models/base_any2vec.py
Outdated
A subclass should initialize the following attributes: | ||
- self.kv (instance of concrete implementation of `BaseKeyedVectors` interface) | ||
- self.vocabulary (instance of concrete implementation of `BaseVocabBuilder` abstract class) | ||
- self.trainables (instance of concrete implementation of `BaseTrainables` abstract class) | ||
|
||
Parameters | ||
---------- | ||
workers : int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional parameters (here and everywhere)
gensim/models/base_any2vec.py
Outdated
Parameters | ||
---------- | ||
workers : int | ||
Number of working threads, used for multiprocessing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multithreading :)
gensim/models/base_any2vec.py
Outdated
---------- | ||
job_queue : Queue of (list of object, dict) | ||
A queue of jobs still to be processed. The worker will take up jobs from this queue. | ||
Each job is represented by a tuple where the first element is the corpus chunk to be processed and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to add "toy" example, how looks element of queue
gensim/models/base_any2vec.py
Outdated
Parameters | ||
---------- | ||
data_iterator : iterable of list of object | ||
The input corpus. This will be split in chunks and these chunks will be pushed to the queue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not always corpus
gensim/models/base_any2vec.py
Outdated
---------- | ||
data_iterator : iterable of list of object | ||
The input corpus. This will be split in chunks and these chunks will be pushed to the queue. | ||
job_queue : Queue of (list of object, dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dict of ? here and everywhere
gensim/models/base_any2vec.py
Outdated
Multiplier for size of queue -> size = number of workers * queue_factor. | ||
report_delay : float, optional | ||
Number of seconds between two consecutive progress report messages in the logger. | ||
callbacks : list of :class: `~gensim.models.callbacks.CallbackAny2Vec`, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to add space between :class:
and link to model, please check all in rendered documentation (how this looks), this is important
gensim/models/base_any2vec.py
Outdated
@@ -526,7 +806,7 @@ def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=No | |||
len(raw_vocab), sum(itervalues(raw_vocab)) | |||
) | |||
|
|||
# Since no sentences are provided, this is to control the corpus_count | |||
# Since no sentences are provided, this is to control the `corpus_count` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this have no sense to use in comments (that presented as
# ... `)
Returns | ||
------- | ||
(np.ndarray, np.ndarray) | ||
Each worker threads private work memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't forget about empty line at the end of each docstring (except one-line docstrings), here and everywhere
gensim/models/word2vec.py
Outdated
---------- | ||
model : :class:`~gensim.models.word2Vec.Word2Vec` | ||
The Word2Vec model instance to train. | ||
sentences : iterable of iterable of str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iterable of list of str
better (here and everywhere)
@steremma If you'll finish with all mentioned stuff, some hints about docstrings for
|
…ome intuitive information taken from the papers but also references to usage examples for users that do not wish to understand the underlying theory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work @steremma, you are one of the best persons who worked on documentation 🔥
So, what're additional things should be done before a merge
- More examples (wider for module + examples in concrete methods)
- Cover
.pyx
files (instruction - Refactor documentation for*2Vec
models #1944 (comment))
gensim/models/base_any2vec.py
Outdated
----- | ||
Even though this is the usual case, not all embeddings transform text. | ||
For example :class:`~gensim.models.poincare.PoincareModel` operates on graph representations. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add See also
section here and mention several concrete implementations as w2v, fasttext, etc.
gensim/models/base_any2vec.py
Outdated
|
||
""" | ||
|
||
def __init__(self, workers=3, vector_size=100, epochs=5, callbacks=(), batch_words=10000): | ||
"""Initialize model parameters. | ||
|
||
Notes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better place for this notes is below (in class docstring instead of __init__
docstring)
gensim/models/base_any2vec.py
Outdated
------ | ||
IOError | ||
When methods are called on instance (should be called from class). | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing emptyline at the end of docstring (here and everywhere)
gensim/models/base_any2vec.py
Outdated
Parameters | ||
---------- | ||
job_params : dict of (str, obj) | ||
Unused. TODO: Delete this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can write something like UNUSED.
(without TODO)
gensim/models/doc2vec.py
Outdated
|
||
>>> from gensim.test.utils import common_texts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to add \t
before >>>
(here and everywhere)
return 60 * len(self.docvecs.offset2doctag) + 140 * len(self.docvecs.doctags) | ||
|
||
def infer_vector(self, doc_words, alpha=0.1, min_alpha=0.0001, steps=5): | ||
""" | ||
Infer a vector for given post-bulk training document. | ||
"""Infer a vector for given post-bulk training document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a warning about "this infer new vectors each time, it's fine that you retrieve different vectors for the same document, increate steps
for more stable representations (and similar but short tip for steps
parameter)
gensim/models/fasttext.py
Outdated
@@ -45,6 +81,7 @@ | |||
logger = logging.getLogger(__name__) | |||
|
|||
try: | |||
raise ImportError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
???
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to make sure that doctests work even without numpy installed so I did this hack to force my computer to use the python implementation. And then i forgot to remove it!
gensim/models/fasttext.py
Outdated
@@ -519,20 +601,31 @@ def train(self, sentences, total_examples=None, total_words=None, | |||
self.trainables.get_vocab_word_vecs(self.wv) | |||
|
|||
def init_sims(self, replace=False): | |||
""" | |||
"""Deletes the keyed vector syn1 structure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quiet
@@ -73,22 +74,32 @@ class PoincareModel(utils.SaveLoad): | |||
and :meth:`~gensim.models.poincare.PoincareModel.load` methods, or stored/loaded in the word2vec format | |||
via `model.kv.save_word2vec_format` and :meth:`~gensim.models.poincare.PoincareKeyedVectors.load_word2vec_format`. | |||
|
|||
Note that training cannot be resumed from a model loaded via `load_word2vec_format`, if you wish to train further, | |||
Notes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's about examples here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have example for each method in the PoincareKeyedVectors
class.
gensim/models/poincare.py
Outdated
"""Load model from disk, inherited from :class:`~gensim.utils.SaveLoad`.""" | ||
"""Load model from disk, inherited from :class:`~gensim.utils.SaveLoad`. | ||
|
||
See also :meth:`~gensim.models.poincare.PoincareModel.save` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace all "See also" to section
See also
------------
...
here and everywhere
PR continued in #2087. |
* Remove useless methods * started working on docstrings * more work done * Finished documentation for the `BaseWordEmbeddingsModel * PEP-8 * Revert "Remove useless methods" This reverts commit feb3c32. * added documentation for the class and all its helper methods * remove duplicated type info * Added documentation for `Doc2vec` model and all its helper methods * Fixed paper references and added documentation for �Doc2VecVocab * Fixed paper references * minor referencing fixes * sphinx identation * Added docstrings for the private methods in `BaseAny2Vec` * Applied all code review corrections, example fix still pending * Added missing docstrings * Fixed `int {1, 0}` -> `{1, 0}` * Fixed examples and code review corrections * Fixed examples and applied code review corrections (optional arguments, correct types, blank lines at end of docstrings * Applied code review corrections and added top level usage examples * Added high level explanation of the class hierarchy, fixed code review corrections * Final identation fixes * Documentation fixes * Fixed all examples * delete redundant reference to module * Added explanation for all important class attributes. These include some intuitive information taken from the papers but also references to usage examples for users that do not wish to understand the underlying theory. * documented public cython functions * documented public cython functions in doc2vec * Applied code review corrections * added documentation for public cython methods in `fasttext` * added documentation for C functions in the word2vec * fix build issues * add missing rst * fix base_any2vec * fix doc2vec[1] * fix doc2vec[2] * fix doc2vec[3ъ * fix doc2vec[4] * fix doc2vec_inner + remove unused imports * fix fasttext[1] * reformat example sections * word2vec doc fixes * more doc fixes * merging in changes from #1944 * review docs for doc2vec, base_any2vec * review fasttext docs * review poincare docs * minor typo fixes * simplify word2vec.train() docs * update alpha & epoch docs for *2vec models * add *_inner docs * fixing KeyedVectors docs * disable sphinx latex and errors (temporary, revert later) * hyperlink fixes * Fix build warnings * fix flake8 * enable strict doc building * embedsignature for w2v & ft * yes/no -> ✅/❌ * cleanup base_any2vec * clenup cython files * cleanup doc2vec * improve d2v example * cleanup fasttext * clenup utils_any2vec * clenup poincare * clenup keyedvectors * cleanup word2vec * add newline around module docstrings + re-generate *.c files (for correct doc building)
In the end this PR will fix the documentation for
BaseWordEmbeddingsModel
,BaseAny2Vec
,Word2Vec
and possibleDoc2Vec
. At the moment of opening only the first is done, however its useful to keep it online to get feedback as early as possible