Add Sent2Vec model. Fix #1376 #1619
Conversation
Fixes warnings in the .py files
@menshikh-iv Fixing warnings in the .py files according to the Google Code Style. Most of the warnings were due to indentation errors.
build succeeded, 21 warnings. Getting there. :-)
Now I'm down to, `build succeeded, 5 warnings.` However, I'm in a bit of a fix. Changing `doc2vec.rst` and `word2vec.rst` to `.inc` files removed the duplicate warnings but it also invalidates the references to these documents from my main toctree and the following warnings are produced. `apiref.rst:8: WARNING: toctree contains reference to nonexisting document u'models/doc2vec'` `apiref.rst:8: WARNING: toctree contains reference to nonexisting document u'models/word2vec'`
Rough initial code for sent2vec and tests in jupyter notebook
Great start :+1:
What needs to be added:
- Add docstrings in numpy format: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
- Add sent2vec.rst to docs/src/model and update apiref.rst
- Remove code duplication (exactly the same methods as in fasttext)
- Add tests for your class
- Add a notebook reproducing the results
gensim/models/sent2vec.py (outdated)
logger = logging.getLogger(__name__)

MAX_WORDS_IN_BATCH = 10000
You can import this constant instead of defining it explicitly.
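To illustrate the suggestion: a minimal sketch, assuming MAX_WORDS_IN_BATCH is still defined in gensim.models.word2vec as it was at the time:

    # Reuse the existing constant instead of redefining MAX_WORDS_IN_BATCH = 10000 here.
    from gensim.models.word2vec import MAX_WORDS_IN_BATCH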
gensim/models/sent2vec.py (outdated)
self, sentences=None, sg=0, hs=0, size=100, alpha=0.2, window=5, min_count=5,
max_vocab_size=None, word_ngrams=2, loss='ns', sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000,
trim_rule=None, batch_words=MAX_WORDS_IN_BATCH, dropoutK=2):
All parameters should be lowercase (dropoutK).
gensim/models/sent2vec.py (outdated)
from numpy import dot
from gensim import utils, matutils

from gensim.models.word2vec import Word2Vec
Useless import
gensim/models/sent2vec.py (outdated)
trim_rule=None, batch_words=MAX_WORDS_IN_BATCH, dropoutK=2):

# sent2vec specific params
#dropoutK is the number of ngrams dropped while training a sent2vec model
Missing space after # (here and everywhere).
gensim/models/sent2vec.py (outdated)
trim_rule=trim_rule, sorted_vocab=sorted_vocab, batch_words=batch_words)

def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
    """Do an initial scan of all words appearing in sentences."""
Please add more description in the docstring: what's the difference between fasttext and sent2vec here?
gensim/models/sent2vec.py (outdated)
line_size = len(sentence)
discard = [False] * line_size
while (num_discarded < self.dropoutK and line_size - num_discarded > 2):
    token_to_discard = randint(0,line_size-1)
Need a space after the comma and spaces around the minus, i.e. randint(0, line_size - 1).
gensim/models/sent2vec.py (outdated)
discard = [False] * line_size
while (num_discarded < self.dropoutK and line_size - num_discarded > 2):
    token_to_discard = randint(0,line_size-1)
    if discard[token_to_discard] == False:
if discard[token_to_discard] == False:
-> if not discard[token_to_discard]:
gensim/models/sent2vec.py (outdated)
def word_vec(self, word, use_norm=False):
    return FastTextKeyedVectors.word_vec(self.wv, word, use_norm=use_norm)

def sent_vec(self, sentence):
This method shouldn't split a raw string into tokens itself; it should accept, for example, a list of tokens.
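A minimal sketch of the kind of signature being requested; model.word_vec and model.vector_size are assumptions standing in for the PR's actual attributes, and ngram vectors are ignored for brevity:

    import numpy as np

    def sent_vec(model, sentence):
        # `sentence` is already a list of unicode tokens; no raw-string splitting here
        vec = np.zeros(model.vector_size)
        for token in sentence:
            vec += model.word_vec(token)
        return vec / max(len(sentence), 1)  # simple average of word vectors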
gensim/models/sent2vec.py (outdated)
word_count=0, queue_factor=2, report_delay=1.0):
    super(Sent2Vec, self).train(sentences, total_examples=total_examples, epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha)

def __getitem__(self, word):
What's the difference between __getitem__ and word_vec?
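For context, in gensim's other *2vec models __getitem__ is usually just a thin wrapper over the vector lookup; a sketch of that convention (not necessarily what this PR does):

    def __getitem__(self, word):
        # conventionally identical to word_vec(word); two diverging code paths
        # for the same lookup is what prompted the question
        return self.word_vec(word)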
Adding word ngrams for sentences and changing training function accordingly
Only one question: why don't you re-use the same functionality from the current fasttext implementation?
gensim/models/sent2vec.py (outdated)
# TODO: add docstrings and tests


class Entry():
Please add inheritance from object explicitly (here and everywhere).
For Entry's purposes, you can use a namedtuple.
I can't use a namedtuple because it is immutable.
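For context, a minimal sketch of the two options discussed in this thread, using the Entry fields visible in the diff (word, count, subwords); the mutable-class variant reflects the author's argument that counts are updated in place:

    from collections import namedtuple

    # Reviewer's suggestion: fine when entries never change after creation.
    EntryTuple = namedtuple('EntryTuple', ['word', 'count', 'subwords'])

    # Author's counterpoint: counts are incremented during the vocabulary scan,
    # so a mutable class (explicitly inheriting from object) is used instead.
    class Entry(object):
        def __init__(self, word=None, count=0, subwords=None):
            self.word = word
            self.count = count
            self.subwords = subwords if subwords is not None else []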
gensim/models/sent2vec.py (outdated)
self.subwords = subwords


class Dictionary():
We already have a Dictionary class in gensim.corpora; please rename this one to avoid confusion.
Done. Kindly verify in the current commit.
gensim/models/sent2vec.py (outdated)
self.words[self.word2int[h]].count += 1

def read(self, sentences, min_count):
    minThreshold = 1
No camelCase, only lowercase_with_underscores (here and everywhere).
Done. Kindly verify in the current commit.
gensim/models/sent2vec.py (outdated)
self.threshold(minThreshold)

self.threshold(min_count)
self.initTableDiscard()
Same as for variables (no camelCase), here and everywhere.
Done. Kindly verify in the current commit.
gensim/models/sent2vec.py (outdated)
return ntokens, hashes, words


class Model():
Do you really need this class? (Why isn't it a part of Sent2Vec?)
Done. Merged with sent2vec now.
gensim/models/sent2vec.py (outdated)

logger = logging.getLogger(__name__)
# TODO: add logger statements instead of print statements
# TODO: add docstrings and tests
It's time to resolve these TODOs; start with the logger and tests.
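A minimal sketch of the logging change being asked for, based on the print statements quoted in the next diff; the helper name and message format here are illustrative:

    import logging
    import time

    logger = logging.getLogger(__name__)

    def report_progress(progress, lr, loss, nexamples, start_time):
        # replaces the Python 2 print statements with module-level logger calls
        logger.info("Progress: %.2f%% lr: %.5f loss: %.5f", progress * 100, lr, loss / nexamples)
        logger.info("Total training time: %.2f seconds", time.time() - start_time)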
Done. Kindly verify in the current commit.
gensim/models/sent2vec.py (outdated)
print "Progress: ", progress * 100, "% lr: ", lr, " loss: ", self.model.loss / self.model.nexamples
print "\n\nTotal training time: %s seconds" % (time.time() - start_time)

def sentence_vectors(self, sentence_string):
Unnecessary method (there is no need to do tokenization inside the model).
Done. Kindly verify in the current commit. Now sentence is passed as a list of unicode strings.
sent_vec *= (1.0 / len(line))
return sent_vec

def similarity(self, sent1, sent2):
sent1 and sent2 should already be lists of tokens (no need to tokenize them here).
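A minimal sketch of a similarity over already-tokenized sentences, assuming a sent_vec method like the one discussed above; the cosine computation is plain numpy, not gensim-specific:

    import numpy as np

    def sentence_similarity(model, sent1, sent2):
        # both arguments are lists of unicode tokens; no tokenization happens here
        v1, v2 = model.sent_vec(sent1), model.sent_vec(sent2)
        return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))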
Done. Kindly verify in the current commit.
gensim/models/sent2vec.py (outdated)
"""

def __init__(self, vector_size=100, lr=0.2, lr_update_rate=100, epochs=5,
min_count=5, neg=10, word_ngrams=2, loss_type='ns', bucket=2000000, t=0.0001,
Needs a vertical indent (for the method definition only).
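One common reading of the request is a hanging indent for the continuation lines of the long signature; a sketch with the parameter list from the diff above (purely illustrative):

    def __init__(
            self, vector_size=100, lr=0.2, lr_update_rate=100, epochs=5,
            min_count=5, neg=10, word_ngrams=2, loss_type='ns', bucket=2000000, t=0.0001):
        pass  # body unchanged; only the continuation-line indentation differs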
""" | ||
|
||
def __init__(self, word=None, count=0, subwords=[]): | ||
""" |
Please use numpy-style, here and everywhere.
Done. Kindly verify in the current commit.
gensim/models/sent2vec.py (outdated)
word and character ngrams.
"""

def __init__(self, t, bucket, minn, maxn, max_vocab_size=30000000, max_line_size=1024):
Please add docstrings everywhere (with parameter description + types)
Done. Kindly verify in the current commit.
gensim/models/sent2vec.py (outdated)
`dropoutk` = Number of ngrams dropped when training a sent2vec model. Default is 2.
"""

random.seed(seed)
Incorrect: this pins the "global" random seed. Please use
from gensim.utils import get_random_state
random_state = get_random_state(seed)
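A sketch of how the per-model random state would then replace the module-level random calls; the attribute name random_state and the skeleton class are assumptions:

    from gensim.utils import get_random_state

    class Sent2VecSketch(object):
        def __init__(self, seed=1):
            # a numpy RandomState scoped to this model instead of seeding the global RNG
            self.random_state = get_random_state(seed)

        def pick_token_to_discard(self, line_size):
            # numpy's randint excludes the upper bound, so this matches
            # the original randint(0, line_size - 1)
            return self.random_state.randint(0, line_size)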
Done. Kindly verify in the current commit.
gensim/models/sent2vec.py (outdated)
For Sent2Vec, each sentence must be a list of unicode strings.
"""

logger.info("Creating dictionary...")
Change this method to match word2vec (train in __init__ if sentences is provided, and so on).
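A minimal sketch of the word2vec-style constructor pattern being referred to; build_vocab and train are stubs here, standing in for the PR's real methods:

    class Sent2VecSketch(object):
        def __init__(self, sentences=None, size=100, seed=1):
            self.vector_size = size
            self.seed = seed
            if sentences is not None:
                # word2vec-style convenience: build the vocabulary and train
                # immediately when a corpus is passed to the constructor
                self.build_vocab(sentences)
                self.train(sentences)

        def build_vocab(self, sentences):
            pass  # stub: the real implementation scans the corpus

        def train(self, sentences):
            pass  # stub: the real implementation runs the training loop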
Done. Kindly verify in the current commit.
gensim/models/sent2vec.py (outdated)
@@ -75,6 +75,25 @@ class ModelDictionary():
"""

def __init__(self, t, bucket, minn, maxn, max_vocab_size=30000000, max_line_size=1024):
    """
    Initialize a sent2vec dictionary.
Numpy-style please: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt and http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html (here and everywhere)
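A minimal sketch of the numpy docstring style being requested, applied to this constructor; the parameter descriptions are illustrative guesses, not the PR's final wording:

    def __init__(self, t, bucket, minn, maxn, max_vocab_size=30000000, max_line_size=1024):
        """Initialize a sent2vec dictionary.

        Parameters
        ----------
        t : float
            Subsampling threshold for frequent words.
        bucket : int
            Number of hash buckets for word and character ngrams.
        minn : int
            Minimum length of character ngrams.
        maxn : int
            Maximum length of character ngrams.
        max_vocab_size : int, optional
            Maximum number of entries kept in the vocabulary.
        max_line_size : int, optional
            Maximum number of tokens processed per sentence.
        """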
Sorry, my bad! I used the word2vec code as a reference. I've updated the docstrings. Kindly verify in the latest commit.
Adding numpy docstrings, a function to read the corpus directly from disk, a link to evaluation scripts in the notebook, and an evaluation of the original C++ sent2vec in the final table
for j from i + 1 <= j < line_size:
    if j >= i + n or discard[j] == 1:
        break
    h = h * 116049371 + line[j]
It's not at all obvious what the magic number 116049371 is.
This mimics the FB implementation.
It would be a good idea to include that info in a comment -- so someone doesn't accidentally change the magic numbers in the future.
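A sketch of the suggested comment, applied to the line quoted above:

    h = h * 116049371 + line[j]  # multiplier copied from Facebook's C++ fastText/sent2vec implementation; do not change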
gensim/models/sent2vec_inner.pyx (outdated)
return ntokens


cdef void add_ngrams_train(vector[int] &line, int n, int k, int bucket, int size)nogil:
Shouldn't there be a space before nogil?
Not necessary as I remember
yeah, unnecessary, but r e a d a b i l i t y
I'll fix that myself of course
Blocked by #2313 (it should be merged before we can continue with the current PR).
@prerna135 sent2vec built successfully (some issues with FT and python2, but unrelated)
import logging
from gensim.models import Sent2Vec
import gensim.downloader as api
from gensim.utils import simple_preprocess
import numpy as np
from scipy.spatial.distance import cdist

logging.basicConfig(level=logging.INFO)

corpus = [simple_preprocess(_["data"]) for _ in api.load("20-newsgroups")]
model = Sent2Vec(corpus)

c_vectors = np.array([model[d] for d in corpus])
fst_vector = c_vectors[0]
similarities = (1 - cdist(fst_vector.reshape((1, fst_vector.shape[0])), c_vectors, metric='cosine')).reshape(-1)
print(similarities, similarities.mean())
# (array([0.99890759, 0.99885709, 0.99865863, ..., 0.99843101, 0.99866404,
#         0.99762219]), 0.998304)
I.e. all vectors from the corpus are extremely close to each other, which is very suspicious IMO, even if I try to
Unfortunately, I can't merge the PR in its current state :( Not ready for
gensim/models/sent2vec.py (outdated)
if line not in ['\n', '\r\n']:
    sentence = list(tokenize(line))
    if sentence:
        yield sentence
If line consists only of newline chars, then the previous sentence will be yielded twice. Is it a bug or a feature?
hm, I guess that's a bug
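A minimal sketch of a reader that avoids the double-yield by handling blank lines before touching the previous sentence; the function name and file handling are assumptions, tokenize is gensim.utils.tokenize:

    from gensim.utils import tokenize

    def read_sentences(path):
        # yield each non-empty line exactly once, as a list of unicode tokens
        with open(path, 'r') as fin:
            for line in fin:
                if line.strip():  # skip newline-only lines entirely
                    sentence = list(tokenize(line))
                    if sentence:
                        yield sentence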
Hi @menshikh-iv. The bad performance part is surprising since it outperformed
@prerna135 I think I know the reason for the bad performance (a problem in the hash function). @mpenkov will fix it soon, then I'll update your PR and ping you, ok?
Possible problem (hash func):
# Conflicts:
#	gensim/models/_utils_any2vec.c
#	gensim/models/word2vec_inner.c
#	setup.py
@prerna135 I fixed the performance issue (it really was a hashing issue) and the build itself. So, TODO for you:
@prerna135 also, please fix the calls (to avoid deprecation warnings).
ping @prerna135, any updates?
@prerna135 are you able to finish this PR?
@menshikh-iv @mpenkov I have been very busy with grad school in the past year. Apologies for the delay. I'll try to look into this over the summer.
@prerna135 ping. This project has been under way for nearly 2 years.
@prerna135 what's the status? Summer is nearly over. @mpenkov unless Prerna finishes the PR, we'll have to kill it + her incubator blog post. It's getting ridiculous.
@menshikh-iv @piskvorky I tried to look into the code over the summer (the hash function and distance computation issue). I'd have to go over the entire original C++ code to detect the bug, refactor the code to avoid deprecation warnings, retrain the models, and run the benchmarking experiments again. I'm afraid I won't be able to devote the time required to do this. Apologies for dragging this out. I tried to run the existing code to replicate the blog post results, but too many things were breaking for me to figure out the issue quickly.
Rough initial code for sent2vec.