Add Sent2Vec model. Fix #1376 #1619

Closed
wants to merge 143 commits into from

Contributor

@prerna135 prerna135 commented Oct 10, 2017

Rough initial code for sent2vec.

prerna135 added 16 commits June 23, 2017 02:10
Fixes warnings in the .py files
@menshikh-iv Fixing warnings in the .py files according to the Google Code Style. Most of the warnings were due to indentation errors.
build succeeded, 21 warnings.
Getting there. :-)
Now I'm down to,
`build succeeded, 5 warnings.`

However, I'm in a bit of a fix. Changing `doc2vec.rst` and `word2vec.rst` to `.inc` files removed the duplicate warnings but it also invalidates the references to these documents from my main toctree and the following warnings are produced.

`apiref.rst:8: WARNING: toctree contains reference to nonexisting document u'models/doc2vec'`
`apiref.rst:8: WARNING: toctree contains reference to nonexisting document u'models/word2vec'`
Rough initial code for sent2vec and tests in jupyter notebook
@menshikh-iv menshikh-iv added the "incubator project" label Oct 10, 2017
Contributor

@menshikh-iv menshikh-iv left a comment

Great start :+1:

What needs to be added:


logger = logging.getLogger(__name__)

MAX_WORDS_IN_BATCH = 10000
Contributor

You can import this constant instead of defining it explicitly.

self, sentences=None, sg=0, hs=0, size=100, alpha=0.2, window=5, min_count=5,
max_vocab_size=None, word_ngrams=2, loss='ns', sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000,
trim_rule=None, batch_words=MAX_WORDS_IN_BATCH, dropoutK=2):
Contributor

All parameters should be in lowercase (dropoutK)

from numpy import dot
from gensim import utils, matutils

from gensim.models.word2vec import Word2Vec
Contributor

Useless import

trim_rule=None, batch_words=MAX_WORDS_IN_BATCH, dropoutK=2):

# sent2vec specific params
#dropoutK is the number of ngrams dropped while training a sent2vec model
Contributor

Missing space after # (here and everywhere).

trim_rule=trim_rule, sorted_vocab=sorted_vocab, batch_words=batch_words)

def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
"""Do an initial scan of all words appearing in sentences."""
Contributor

Please add more description to the docstring: what is the difference between fastText and sent2vec?

line_size = len(sentence)
discard = [False] * line_size
while (num_discarded < self.dropoutK and line_size - num_discarded > 2):
token_to_discard = randint(0,line_size-1)
Contributor

Need a space after `,` and spaces around `-`.

discard = [False] * line_size
while (num_discarded < self.dropoutK and line_size - num_discarded > 2):
token_to_discard = randint(0,line_size-1)
if discard[token_to_discard] == False:
Contributor

if discard[token_to_discard] == False: -> if not discard[token_to_discard]:
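
For illustration, the dropout loop from the excerpt could be written with both style fixes applied. This is a minimal sketch, not the PR's code: the function name `choose_discards`, the explicit `rng` argument, and the two lines inside the `if` (which the excerpt cuts off) are assumptions.

```python
import random


def choose_discards(line_size, dropout_k, rng):
    """Mark up to `dropout_k` positions as discarded, keeping at least two."""
    discard = [False] * line_size
    num_discarded = 0
    while num_discarded < dropout_k and line_size - num_discarded > 2:
        token_to_discard = rng.randint(0, line_size - 1)  # bounds are inclusive
        if not discard[token_to_discard]:
            # Assumed loop body: mark the position and count it once.
            discard[token_to_discard] = True
            num_discarded += 1
    return discard


mask = choose_discards(line_size=6, dropout_k=2, rng=random.Random(1))
```

Passing the generator in explicitly also keeps the function independent of any global random state.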

def word_vec(self, word, use_norm=False):
return FastTextKeyedVectors.word_vec(self.wv, word, use_norm=use_norm)

def sent_vec(self, sentence):
Contributor

This method is bad: there is no need to split a raw string into tokens here. It should accept, for example, a list of tokens.

word_count=0, queue_factor=2, report_delay=1.0):
super(Sent2Vec, self).train(sentences, total_examples=total_examples, epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha)

def __getitem__(self, word):
Contributor

What's the difference between __getitem__ and word_vec?

Contributor

@menshikh-iv menshikh-iv left a comment

Only one question: why don't you reuse the same functionality from the current fastText implementation?

# TODO: add docstrings and tests


class Entry():
Contributor

Please inherit from object explicitly (here and everywhere).

Contributor

For Entry's purposes, you can use a namedtuple.

Contributor Author

I can't use tuple because it is immutable.
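
To illustrate the trade-off discussed here: a namedtuple cannot be updated in place, so a counter such as `count` would need a new tuple on every increment, while a small class with `__slots__` stays mutable. A minimal sketch, assuming the field names `word`, `count`, and `subwords` from the diff; `FrozenEntry` is a hypothetical name used only for comparison.

```python
from collections import namedtuple


class Entry(object):
    """Mutable vocabulary record; __slots__ avoids a per-instance __dict__."""
    __slots__ = ("word", "count", "subwords")

    def __init__(self, word=None, count=0, subwords=None):
        self.word = word
        self.count = count
        # A fresh list per instance avoids sharing one default list.
        self.subwords = [] if subwords is None else subwords


FrozenEntry = namedtuple("FrozenEntry", ["word", "count", "subwords"])

entry = Entry("cat")
entry.count += 1  # in-place update works on the class

frozen = FrozenEntry("cat", 0, ())
bumped = frozen._replace(count=frozen.count + 1)  # builds a new tuple instead
```

Whether the extra allocations of `_replace` matter depends on how often counts are updated during vocabulary building.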

self.subwords = subwords


class Dictionary():
Contributor

We already have a Dictionary class in gensim.corpora; please rename this one to avoid confusion.

Contributor Author

Done. Kindly verify in the current commit.

self.words[self.word2int[h]].count += 1

def read(self, sentences, min_count):
minThreshold = 1
Contributor

No camelCase, only lowercase_with_underscores (here and everywhere).

Contributor Author

Done. Kindly verify in the current commit.

self.threshold(minThreshold)

self.threshold(min_count)
self.initTableDiscard()
Contributor

Same as for variables (no camelCase), here and everywhere.

Contributor Author

Done. Kindly verify in the current commit.

return ntokens, hashes, words


class Model():
Contributor

Do you really need this class? Why isn't it part of Sent2Vec?

Contributor Author

Done. Merged with sent2vec now.


logger = logging.getLogger(__name__)
# TODO: add logger statements instead of print statements
# TODO: add docstrings and tests
Contributor

It's time to resolve these TODOs; start with the logger and tests.

Contributor Author

Done. Kindly verify in the current commit.

print "Progress: ", progress * 100, "% lr: ", lr, " loss: ", self.model.loss / self.model.nexamples
print "\n\nTotal training time: %s seconds" % (time.time() - start_time)

def sentence_vectors(self, sentence_string):
Contributor

Useless method (no need to tokenize inside the model).

Contributor Author

Done. Kindly verify in the current commit. Now sentence is passed as a list of unicode strings.
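
A rough sketch of what a token-list interface might look like, assuming a hypothetical `word_vectors` mapping from token to vector. It averages over known tokens only, which may differ from the PR, where the excerpt divides by `len(line)`.

```python
def sentence_vector(tokens, word_vectors, size):
    """Average the vectors of known tokens; return zeros if none are known."""
    total = [0.0] * size
    known = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # skip out-of-vocabulary tokens
        known += 1
        for i, value in enumerate(vec):
            total[i] += value
    if known:
        total = [value / known for value in total]
    return total


# Tokenization is the caller's job, so any tokenizer can be plugged in.
toy_vectors = {"good": [1.0, 3.0], "day": [3.0, 5.0]}
vec = sentence_vector("good day".split(), toy_vectors, size=2)
```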

sent_vec *= (1.0 / len(line))
return sent_vec

def similarity(self, sent1, sent2):
Contributor

sent1 and sent2 should already be lists of tokens (no need to tokenize them).

Contributor Author

Done. Kindly verify in the current commit.

"""

def __init__(self, vector_size=100, lr=0.2, lr_update_rate=100, epochs=5,
min_count=5, neg=10, word_ngrams=2, loss_type='ns', bucket=2000000, t=0.0001,
Contributor

Needs a vertical indent (for method definitions only).

"""

def __init__(self, word=None, count=0, subwords=[]):
"""
Contributor

Please use numpy-style docstrings, here and everywhere.

Contributor Author

Done. Kindly verify in the current commit.

word and character ngrams.
"""

def __init__(self, t, bucket, minn, maxn, max_vocab_size=30000000, max_line_size=1024):
Contributor

Please add docstrings everywhere (with parameter description + types)

Contributor Author

Done. Kindly verify in the current commit.
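
As an illustration of the requested numpy-style docstring, here is a sketch for the `__init__` signature quoted above. The parameter descriptions are assumptions inferred from fastText's argument names, not taken from the PR.

```python
class ModelDictionary(object):
    def __init__(self, t, bucket, minn, maxn, max_vocab_size=30000000, max_line_size=1024):
        """Initialize a sent2vec dictionary.

        Parameters
        ----------
        t : float
            Threshold for subsampling frequent words (assumed meaning).
        bucket : int
            Number of hash buckets used for n-grams (assumed meaning).
        minn : int
            Minimum length of character n-grams (assumed meaning).
        maxn : int
            Maximum length of character n-grams (assumed meaning).
        max_vocab_size : int, optional
            Upper bound on the number of vocabulary entries (assumed meaning).
        max_line_size : int, optional
            Maximum number of tokens read per line (assumed meaning).

        """
        self.t = t
        self.bucket = bucket
        self.minn = minn
        self.maxn = maxn
        self.max_vocab_size = max_vocab_size
        self.max_line_size = max_line_size


dictionary = ModelDictionary(t=1e-4, bucket=2000000, minn=3, maxn=6)
```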

`dropoutk` = Number of ngrams dropped when training a sent2vec model. Default is 2.
"""

random.seed(seed)
Contributor

Incorrect: this pins the "global" random seed. Please use

from gensim.utils import get_random_state
random_state = get_random_state(seed)
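
The difference can be illustrated with the standard library alone. This sketch uses `random.Random` as a stand-in for the object returned by `get_random_state`; the helper names are hypothetical.

```python
import random


def sample_ids_global(seed, n):
    random.seed(seed)  # side effect: resets the generator shared by the whole process
    return [random.randint(0, 99) for _ in range(n)]


def sample_ids_local(seed, n):
    rng = random.Random(seed)  # private generator; global state is untouched
    return [rng.randint(0, 99) for _ in range(n)]


# Reference: the next global draw after seeding with 123.
random.seed(123)
expected_next = random.random()

# A local generator leaves the caller's global sequence intact.
random.seed(123)
sample_ids_local(seed=1, n=5)
local_is_isolated = random.random() == expected_next

# Re-seeding globally disturbs the caller's sequence.
random.seed(123)
sample_ids_global(seed=1, n=5)
global_is_isolated = random.random() == expected_next
```

Here `local_is_isolated` should be True, while `global_is_isolated` is expected to be False, because the helper overwrote the caller's seed.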

Contributor Author

Done. Kindly verify in the current commit.

For Sent2Vec, each sentence must be a list of unicode strings.
"""

logger.info("Creating dictionary...")
Contributor

Change this method to follow w2v (train in __init__ if sentences are provided, and so on).

Contributor Author

Done. Kindly verify in the current commit.
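
The word2vec-style constructor pattern referred to here can be sketched as follows. `TinyModel` and its placeholder training loop are hypothetical; only the control flow (build the vocabulary and train inside `__init__` when `sentences` is given) is the point.

```python
class TinyModel(object):
    def __init__(self, sentences=None, epochs=5):
        self.epochs = epochs
        self.vocab = {}
        self.trained_epochs = 0
        # Train immediately only when a corpus is supplied; otherwise the
        # caller is expected to call build_vocab() and train() later.
        if sentences is not None:
            self.build_vocab(sentences)
            self.train(sentences, epochs=self.epochs)

    def build_vocab(self, sentences):
        for sentence in sentences:
            for token in sentence:
                self.vocab[token] = self.vocab.get(token, 0) + 1

    def train(self, sentences, epochs):
        # Placeholder for the real update loop; it only counts passes.
        for _ in range(epochs):
            for _sentence in sentences:
                pass
            self.trained_epochs += 1


eager = TinyModel([["a", "b"], ["a"]])
lazy = TinyModel()
```

Note that `sentences` is iterated several times, so it has to be a re-iterable object such as a list, not a one-shot generator.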

@@ -75,6 +75,25 @@ class ModelDictionary():
"""

def __init__(self, t, bucket, minn, maxn, max_vocab_size=30000000, max_line_size=1024):
"""
Initialize a sent2vec dictionary.
Contributor

@menshikh-iv menshikh-iv Nov 8, 2017

Contributor Author

Sorry, my bad! I used the word2vec code as a reference. I've updated the docstrings. Kindly verify in the latest commit.

Adding numpy docstrings, function to read corpus directly from disk, link to evaluation scripts in the notebook, evaluation of original c++ sent2vec to final table
for j from i + 1 <= j < line_size:
if j >= i + n or discard[j] == 1:
break
h = h * 116049371 + line[j]
Contributor

It's not at all obvious what the magic number 116049371 is.

Contributor

This mimics the FB implementation.

Owner

A good idea to include that info in a comment -- so someone doesn't accidentally change the magic numbers in the future.
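
A small Python sketch of the hashing step, with the constant documented as suggested. The multiplier comes from the diff; the 64-bit mask (emulating unsigned overflow in C/C++) and the final modulo by `bucket` are assumptions about the surrounding code, and the `discard` handling from the excerpt is left out.

```python
# Multiplier copied from the diff; it mimics the FB implementation, so
# changing it would break compatibility with models hashed the same way.
NGRAM_HASH_MULTIPLIER = 116049371
UINT64_MASK = (1 << 64) - 1  # assumption: emulate unsigned 64-bit wraparound


def word_ngram_buckets(line, n, bucket):
    """Return a bucket index for every word n-gram of length 2..n in `line`."""
    buckets = []
    for i in range(len(line)):
        h = line[i]
        # Extend the n-gram one token at a time, up to n tokens in total.
        for j in range(i + 1, min(i + n, len(line))):
            h = (h * NGRAM_HASH_MULTIPLIER + line[j]) & UINT64_MASK
            buckets.append(h % bucket)
    return buckets


bigram_buckets = word_ngram_buckets([1, 2, 3], n=2, bucket=10)
```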

return ntokens


cdef void add_ngrams_train(vector[int] &line, int n, int k, int bucket, int size)nogil:
Contributor

Shouldn't there be a space before nogil?

Contributor

Not necessary, as far as I remember.

Contributor

yeah, unnecessary, but r e a d a b i l i t y

Contributor

I'll fix that myself of course

@menshikh-iv
Contributor

blocked by #2313 (should be merged before we can continue with current PR)

@menshikh-iv
Contributor

menshikh-iv commented Jan 15, 2019

@prerna135 sent2vec built successfully (some issues with FT and python2, but unrelated)

  1. Bad performance (you can check it yourself; it may be related to the correctness of the hash function, but I'm not sure)
  2. It looks like the model doesn't really learn, for example:
import logging
from gensim.models import Sent2Vec
import gensim.downloader as api
from gensim.utils import simple_preprocess
import numpy as np
from scipy.spatial.distance import cdist

logging.basicConfig(level=logging.INFO)

corpus = [simple_preprocess(_["data"]) for _ in api.load("20-newsgroups")]
model = Sent2Vec(corpus)

c_vectors = np.array([model[d] for d in corpus])
fst_vector = c_vectors[0]

similarities = (1 - cdist(fst_vector.reshape((1, fst_vector.shape[0])), c_vectors, metric='cosine')).reshape(-1)
print(similarities, similarities.mean())

# (array([0.99890759, 0.99885709, 0.99865863, ..., 0.99843101, 0.99866404,
#       0.99762219]), 0.998304)

I.e. all vectors from the corpus are extremely close to each other, which is very suspicious IMO. Even model.similarity(random_words, different_random_words) is too high. Any ideas what's wrong here?
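
One possible sanity check when every cosine similarity is close to 1 is to see whether a shared offset dominates the vectors. The sketch below uses made-up toy vectors and pure Python; it is not a diagnosis of this PR, only a way to test that one hypothesis.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def center(vectors):
    """Subtract the per-dimension mean so a shared offset cannot dominate."""
    dims = len(vectors[0])
    mean = [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dims)]
    return [[vec[d] - mean[d] for d in range(dims)] for vec in vectors]


# Toy data: three vectors that differ only slightly around a large common offset.
vectors = [[10.0, 10.1], [10.1, 10.0], [10.0, 9.9]]
raw = [cosine(vectors[0], v) for v in vectors]

centered_vectors = center(vectors)
centered = [cosine(centered_vectors[0], v) for v in centered_vectors]
```

If the raw similarities are all near 1 but the centered ones spread out, the vectors share a large common component; if both stay near 1, the cause lies elsewhere.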

  3. The original sent2vec project released several pre-trained models that we can't load (an important feature, I think; see https://github.com/epfml/sent2vec#downloading-pre-trained-models)

Unfortunately, I can't merge the PR in its current state :( Not ready for 3.7.0.
@prerna135 when will you have time to resolve the problems mentioned?

@menshikh-iv menshikh-iv removed the 3.7.0 label Jan 15, 2019
if line not in ['\n', '\r\n']:
sentence = list(tokenize(line))
if sentence:
yield sentence
Contributor

@horpto horpto Jan 15, 2019

If line consists only of newline characters, the previous sentence will be yielded twice.
Is this a bug or a feature?

Contributor

hm, I guess that's a bug
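
A sketch of the problem and one possible fix. The excerpt's indentation is lost, so the buggy variant below assumes the `if sentence:` check sits at loop level; `str.split` stands in for the PR's `tokenize`.

```python
def iter_sentences_buggy(lines, tokenize=str.split):
    sentence = []
    for line in lines:
        if line not in ['\n', '\r\n']:
            sentence = list(tokenize(line))
        # On a blank line `sentence` still holds the previous value,
        # so the same sentence is yielded a second time.
        if sentence:
            yield sentence


def iter_sentences(lines, tokenize=str.split):
    """Yield one token list per non-blank line."""
    for line in lines:
        sentence = list(tokenize(line))  # rebuilt on every iteration
        if sentence:
            yield sentence


lines = ["a b\n", "\n", "c\n"]
buggy = list(iter_sentences_buggy(lines))
fixed = list(iter_sentences(lines))
```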

@prerna135
Contributor Author

prerna135 commented Jan 20, 2019

Hi @menshikh-iv. The bad performance part is surprising since it outperformed doc2vec in all the evaluation tasks as you can see here. I'll try to check what could be causing the problem while calculating sentence similarities. Can't promise quick results though, as my semester begins next week.

@menshikh-iv
Contributor

@prerna135 I think I know the reason for the bad performance (a problem in the hash function). @mpenkov will fix it soon, then I'll update your PR and ping you, ok?

@menshikh-iv
Contributor

Possible problem (hash function):
#1261 (comment)
Will be fixed in #2340 (which blocks the current PR)

@menshikh-iv
Contributor

menshikh-iv commented Jan 24, 2019

@prerna135 I fixed the performance issue (it really was a hashing issue) and the build itself.
Unfortunately, the quality of the results is still low (item 2 from #1619 (comment)). Feel free to investigate (btw, also check my changes to make sure I didn't break anything myself).

So, TODO for you

@menshikh-iv
Contributor

@prerna135 also, please fix these calls (to avoid deprecation warnings):

gensim/test/test_sent2vec.py::TestSent2VecModel::test_online_learning
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:638: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
gensim/test/test_sent2vec.py::TestSent2VecModel::test_online_learning2
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
gensim/test/test_sent2vec.py::TestSent2VecModel::test_online_learning_after_save
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:638: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
gensim/test/test_sent2vec.py::TestSent2VecModel::test_persistence
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
gensim/test/test_sent2vec.py::TestSent2VecModel::test_sent2vec_for_document
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
gensim/test/test_sent2vec.py::TestSent2VecModel::test_training
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.workers, self.vocabulary.size, self.vector_size, self.sample, self.negative)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:397: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.min_count = min_count
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:400: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
    self.sample = sample
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:619: DeprecationWarning: Call to deprecated `min_count` (Attribute will be removed in 4.0.0, use self.vocabulary.min_count instead).
    self.corpus_count = self.vocabulary.read(sentences=sentences, min_count=self.min_count)
  C:\projects\gensim-431bq\gensim\models\sent2vec.py:452: DeprecationWarning: Call to deprecated `sample` (Attribute will be removed in 4.0.0, use self.vocabulary.sample instead).
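
The warnings above come from reading or writing the deprecated attributes on the model instead of on `self.vocabulary`. The following is only a sketch of that pattern with hypothetical `Vocabulary` and `Model` classes; it is not gensim's implementation.

```python
import warnings


class Vocabulary(object):
    def __init__(self, min_count=5, sample=1e-3):
        self.min_count = min_count
        self.sample = sample


class Model(object):
    def __init__(self, min_count=5, sample=1e-3):
        # Store the settings on the vocabulary object directly,
        # so constructing the model triggers no warning.
        self.vocabulary = Vocabulary(min_count=min_count, sample=sample)

    @property
    def min_count(self):
        # Deprecated read path: still works, but warns the caller.
        warnings.warn(
            "Call to deprecated `min_count` (use self.vocabulary.min_count instead).",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.vocabulary.min_count


model = Model(min_count=3)
quiet_value = model.vocabulary.min_count  # preferred access, no warning
```

Replacing `self.min_count` and `self.sample` with `self.vocabulary.min_count` and `self.vocabulary.sample` at the listed call sites would be the corresponding change in the PR, assuming the attributes behave as the warning text describes.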

@menshikh-iv
Contributor

ping @prerna135, any updates?

@mpenkov
Collaborator

mpenkov commented Apr 21, 2019

@prerna135 are you able to finish this PR?

@prerna135
Contributor Author

@menshikh-iv @mpenkov I have been very busy with grad school in the past year. Apologies for the delay. I'll try to look into this over the summer.

@piskvorky
Owner

@prerna135 ping. This project has been under way for nearly 2 years.

@piskvorky
Owner

piskvorky commented Aug 21, 2019

@prerna135 what's the status? Summer is nearly over.

@mpenkov unless Prerna finishes the PR, we'll have to kill it + her incubator blog post. It's getting ridiculous.

@prerna135
Contributor Author

@menshikh-iv @piskvorky I tried to look into the code over the summer (the hash function and distance computation issues). I'd have to go over the entire original C++ code to detect the bug, refactor the code to avoid deprecation warnings, retrain the models, and run the benchmarking experiments again. I'm afraid I won't be able to devote the time required to do this. Apologies for dragging this out. I tried to run the existing code to replicate the blog post results, but too many things were breaking for me to figure out the issue quickly.

@mpenkov mpenkov closed this Jun 10, 2020
Labels
incubator project: PR is RaRe incubator project
interesting PR ⭐: Interesting PR topic, but not ready (need much work to finish)
6 participants