Fix test for `gensim.models.KeyedVectors.similarity_matrix` method. Fix #1961 #1966

nataliedurgin · 2018-03-08T21:53:00Z

Resolves #1961.

Changes allow the similarity_matrix unit tests to be found and run.

menshikh-iv · 2018-03-09T06:13:02Z

CC: @Witiko

menshikh-iv

Good start @nataliedurgin, need to make very small changes & I'll merge current PR!

menshikh-iv · 2018-03-09T06:15:09Z

gensim/test/test_keyedvectors.py

        self.assertEquals(18, np.sum(similarity_matrix == 0))

        # checking that exponent works as expected
-        similarity_matrix = self.similarity_matrix(corpus, dictionary, exponent=1.0).todense()
+        similarity_matrix = self.vectors.similarity_matrix(dictionary, exponent=1.0).todense()
        self.assertAlmostEqual(9.5788956, np.sum(similarity_matrix))


Better to add places=5 to relax threshold (to avoiding FP failures like this one https://ci.appveyor.com/project/piskvorky/gensim-431bq/build/1.0.1883/job/03r7weolpj81olp7)

menshikh-iv · 2018-03-09T06:15:27Z

gensim/test/test_keyedvectors.py

        self.assertEquals(20, np.sum(similarity_matrix == 0))

+        # TODO: Add unit test to check that supplied tfidf has desired effect


we are waiting for :)

menshikh-iv · 2018-03-09T06:16:17Z

gensim/test/test_keyedvectors.py

-        corpus = [["government", "denied", "holiday"], ["holiday", "slowing", "hollingworth"]]
-        dictionary = Dictionary(corpus)
-        corpus = [dictionary.doc2bow(document) for document in corpus]
+        documents = [["government", "denied", "holiday"],


nitpick: we using 120-characters limit, no need to move part of documents to new line

menshikh-iv · 2018-03-09T06:16:33Z

gensim/models/keyedvectors.py

@@ -511,6 +511,7 @@ def similarity_matrix(self, dictionary, tfidf=None, threshold=0.0, exponent=2.0,
            if w1 not in self.vocab:
                num_skipped += 1
                continue  # A word from the dictionary is not present in the word2vec model.
+


unrelated change

Witiko · 2018-03-09T12:29:39Z

Looks ok to me except for the exponent test, which requires smaller precision (places=…) in order to pass. Thank you, @nataliedurgin, for taking your time fixing my mess. 😅

nataliedurgin · 2018-03-09T17:47:42Z

Regarding the tfidf test: It seems like the test should highlight that fewer operations are being performed when the tfidf model is supplied. Does python have a good way to measure such a thing (it seems that things like timeit are not exactly what we want here)? If python does, what would be the smallest example for which it would make a difference? Does the benefit kick in for the common_corpus in any measurable way? Are there examples of similar testing scenarios in the repository that I could look at?

A low-hanging test would be to check that the output is consistent whether or not the tfidf is supplied. I can do this with the common corpus. However, it seems like there should also be some test that helps to demonstrate/highlight the specific purpose of the tfidf parameter.

Witiko · 2018-03-09T19:02:45Z

A low-hanging test would be to check that the output is consistent whether or not the tfidf is supplied.

I suggest that the test should check that the output is not consistent. Different line processing order will yield a different similarity matrix:

nataliedurgin · 2018-03-09T20:08:18Z

Cool. I beg your patience for a few more questions so I can make sure I understand the big picture. If we are doing reordering of any kind (whether rows or columns) is it correct to say that before we go to compute softcossim here,

https://github.com/RaRe-Technologies/gensim/blob/2ee7a2c3b376d62020992f4b31bdaa79cf9de8db/gensim/matutils.py#L825

both the documents (vec1, vec2) and the similarity matrix (both rows and columns) need to have a consistent notion of term/feature order? If so, is that part of what this block of code is meant to accomplish through word_indices?

https://github.com/RaRe-Technologies/gensim/blob/2ee7a2c3b376d62020992f4b31bdaa79cf9de8db/gensim/matutils.py#L810

Witiko · 2018-03-09T23:14:06Z

It is assumed that the term ids are consistent between the vectors and the similarity matrix, i.e. that they were all produced using the same dictionary.

With vec1 = dict(vec1), we are just extracting the set word_indices of all unique term ids in the word vectors. Later in the softcossim function, we take a small dense slice of the term similarity matrix S that corresponds to just the terms in the two vectors. This is for performance reasons, because Scipy's x^T * S * y is suboptimal and computes x^T * S for all columns of S even if most corresponding elements in y are zero.

Witiko · 2018-03-11T13:10:16Z

Do you want to finish the PR, @nataliedurgin, or should I add the finishing touches?

nataliedurgin · 2018-03-12T15:36:36Z

I think I will have some time today/tonight, but if I don't have it by tomorrow morning, feel free to finish it off.

Witiko · 2018-03-14T22:04:55Z

Ok, I will finish the missing test during the weekend.

menshikh-iv · 2018-03-18T11:33:25Z

@nataliedurgin continued by @Witiko in #1984.

Natalie Durgin added 2 commits March 8, 2018 11:00

Fix unit tests for similarity_matrix property

da492b3

Fix unit test to check for ones on diagonals

1a3be38

menshikh-iv suggested changes Mar 9, 2018

View reviewed changes

menshikh-iv changed the title ~~Similarity matrix unit tests: Resolves #1961~~ Fix test for gensim.models.KeyedVectors.similarity_matrix method. Fix #1961 Mar 9, 2018

Witiko mentioned this pull request Mar 17, 2018

Fix tests for EuclideanKeyedVectors.similarity_matrix. Fix #1961 #1984

Merged

menshikh-iv closed this Mar 18, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix test for `gensim.models.KeyedVectors.similarity_matrix` method. Fix #1961 #1966

Fix test for `gensim.models.KeyedVectors.similarity_matrix` method. Fix #1961 #1966

nataliedurgin commented Mar 8, 2018

menshikh-iv commented Mar 9, 2018

menshikh-iv left a comment •

edited

Loading

menshikh-iv Mar 9, 2018

menshikh-iv Mar 9, 2018

menshikh-iv Mar 9, 2018

menshikh-iv Mar 9, 2018

Witiko commented Mar 9, 2018 •

edited

Loading

nataliedurgin commented Mar 9, 2018 •

edited

Loading

Witiko commented Mar 9, 2018 •

edited

Loading

nataliedurgin commented Mar 9, 2018

Witiko commented Mar 9, 2018 •

edited

Loading

Witiko commented Mar 11, 2018

nataliedurgin commented Mar 12, 2018

Witiko commented Mar 14, 2018

menshikh-iv commented Mar 18, 2018

		self.assertEquals(20, np.sum(similarity_matrix == 0))

		# TODO: Add unit test to check that supplied tfidf has desired effect

Fix test for gensim.models.KeyedVectors.similarity_matrix method. Fix #1961 #1966

Fix test for gensim.models.KeyedVectors.similarity_matrix method. Fix #1961 #1966

Conversation

nataliedurgin commented Mar 8, 2018

menshikh-iv commented Mar 9, 2018

menshikh-iv left a comment • edited Loading

Choose a reason for hiding this comment

menshikh-iv Mar 9, 2018

Choose a reason for hiding this comment

menshikh-iv Mar 9, 2018

Choose a reason for hiding this comment

menshikh-iv Mar 9, 2018

Choose a reason for hiding this comment

menshikh-iv Mar 9, 2018

Choose a reason for hiding this comment

Witiko commented Mar 9, 2018 • edited Loading

nataliedurgin commented Mar 9, 2018 • edited Loading

Witiko commented Mar 9, 2018 • edited Loading

nataliedurgin commented Mar 9, 2018

Witiko commented Mar 9, 2018 • edited Loading

Witiko commented Mar 11, 2018

nataliedurgin commented Mar 12, 2018

Witiko commented Mar 14, 2018

menshikh-iv commented Mar 18, 2018

Fix test for `gensim.models.KeyedVectors.similarity_matrix` method. Fix #1961 #1966

Fix test for `gensim.models.KeyedVectors.similarity_matrix` method. Fix #1961 #1966

menshikh-iv left a comment •

edited

Loading

Witiko commented Mar 9, 2018 •

edited

Loading

nataliedurgin commented Mar 9, 2018 •

edited

Loading

Witiko commented Mar 9, 2018 •

edited

Loading

Witiko commented Mar 9, 2018 •

edited

Loading