Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix test for gensim.models.KeyedVectors.similarity_matrix method. Fix #1961 #1966

Conversation

nataliedurgin
Copy link

Resolves #1961.

Changes allow the similarity_matrix unit tests to be found and run.

@menshikh-iv
Copy link
Contributor

CC: @Witiko

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start @nataliedurgin, need to make very small changes & I'll merge current PR!

self.assertEquals(18, np.sum(similarity_matrix == 0))

# checking that exponent works as expected
similarity_matrix = self.similarity_matrix(corpus, dictionary, exponent=1.0).todense()
similarity_matrix = self.vectors.similarity_matrix(dictionary, exponent=1.0).todense()
self.assertAlmostEqual(9.5788956, np.sum(similarity_matrix))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to add places=5 to relax threshold (to avoiding FP failures like this one https://ci.appveyor.com/project/piskvorky/gensim-431bq/build/1.0.1883/job/03r7weolpj81olp7)

self.assertEquals(20, np.sum(similarity_matrix == 0))

# TODO: Add unit test to check that supplied tfidf has desired effect
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are waiting for :)

corpus = [["government", "denied", "holiday"], ["holiday", "slowing", "hollingworth"]]
dictionary = Dictionary(corpus)
corpus = [dictionary.doc2bow(document) for document in corpus]
documents = [["government", "denied", "holiday"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: we using 120-characters limit, no need to move part of documents to new line

@@ -511,6 +511,7 @@ def similarity_matrix(self, dictionary, tfidf=None, threshold=0.0, exponent=2.0,
if w1 not in self.vocab:
num_skipped += 1
continue # A word from the dictionary is not present in the word2vec model.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated change

@menshikh-iv menshikh-iv changed the title Similarity matrix unit tests: Resolves #1961 Fix test for gensim.models.KeyedVectors.similarity_matrix method. Fix #1961 Mar 9, 2018
@Witiko
Copy link
Contributor

Witiko commented Mar 9, 2018

Looks ok to me except for the exponent test, which requires smaller precision (places=…) in order to pass. Thank you, @nataliedurgin, for taking your time fixing my mess. 😅

@nataliedurgin
Copy link
Author

nataliedurgin commented Mar 9, 2018

Regarding the tfidf test: It seems like the test should highlight that fewer operations are being performed when the tfidf model is supplied. Does python have a good way to measure such a thing (it seems that things like timeit are not exactly what we want here)? If python does, what would be the smallest example for which it would make a difference? Does the benefit kick in for the common_corpus in any measurable way? Are there examples of similar testing scenarios in the repository that I could look at?

A low-hanging test would be to check that the output is consistent whether or not the tfidf is supplied. I can do this with the common corpus. However, it seems like there should also be some test that helps to demonstrate/highlight the specific purpose of the tfidf parameter.

@Witiko
Copy link
Contributor

Witiko commented Mar 9, 2018

A low-hanging test would be to check that the output is consistent whether or not the tfidf is supplied.

I suggest that the test should check that the output is not consistent. Different line processing order will yield a different similarity matrix:

tfidf_test

@nataliedurgin
Copy link
Author

Cool. I beg your patience for a few more questions so I can make sure I understand the big picture. If we are doing reordering of any kind (whether rows or columns) is it correct to say that before we go to compute softcossim here,

https://github.com/RaRe-Technologies/gensim/blob/2ee7a2c3b376d62020992f4b31bdaa79cf9de8db/gensim/matutils.py#L825

both the documents (vec1, vec2) and the similarity matrix (both rows and columns) need to have a consistent notion of term/feature order? If so, is that part of what this block of code is meant to accomplish through word_indices?

https://github.com/RaRe-Technologies/gensim/blob/2ee7a2c3b376d62020992f4b31bdaa79cf9de8db/gensim/matutils.py#L810

@Witiko
Copy link
Contributor

Witiko commented Mar 9, 2018

It is assumed that the term ids are consistent between the vectors and the similarity matrix, i.e. that they were all produced using the same dictionary.

With vec1 = dict(vec1), we are just extracting the set word_indices of all unique term ids in the word vectors. Later in the softcossim function, we take a small dense slice of the term similarity matrix S that corresponds to just the terms in the two vectors. This is for performance reasons, because Scipy's x^T * S * y is suboptimal and computes x^T * S for all columns of S even if most corresponding elements in y are zero.

@Witiko
Copy link
Contributor

Witiko commented Mar 11, 2018

Do you want to finish the PR, @nataliedurgin, or should I add the finishing touches?

@nataliedurgin
Copy link
Author

I think I will have some time today/tonight, but if I don't have it by tomorrow morning, feel free to finish it off.

@Witiko
Copy link
Contributor

Witiko commented Mar 14, 2018

Ok, I will finish the missing test during the weekend.

@menshikh-iv
Copy link
Contributor

@nataliedurgin continued by @Witiko in #1984.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants