-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix test for gensim.models.KeyedVectors.similarity_matrix
method. Fix #1961
#1966
Fix test for gensim.models.KeyedVectors.similarity_matrix
method. Fix #1961
#1966
Conversation
CC: @Witiko |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start @nataliedurgin, need to make very small changes & I'll merge current PR!
self.assertEquals(18, np.sum(similarity_matrix == 0)) | ||
|
||
# checking that exponent works as expected | ||
similarity_matrix = self.similarity_matrix(corpus, dictionary, exponent=1.0).todense() | ||
similarity_matrix = self.vectors.similarity_matrix(dictionary, exponent=1.0).todense() | ||
self.assertAlmostEqual(9.5788956, np.sum(similarity_matrix)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to add places=5
to relax threshold (to avoiding FP failures like this one https://ci.appveyor.com/project/piskvorky/gensim-431bq/build/1.0.1883/job/03r7weolpj81olp7)
self.assertEquals(20, np.sum(similarity_matrix == 0)) | ||
|
||
# TODO: Add unit test to check that supplied tfidf has desired effect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are waiting for :)
corpus = [["government", "denied", "holiday"], ["holiday", "slowing", "hollingworth"]] | ||
dictionary = Dictionary(corpus) | ||
corpus = [dictionary.doc2bow(document) for document in corpus] | ||
documents = [["government", "denied", "holiday"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: we using 120-characters limit, no need to move part of documents
to new line
@@ -511,6 +511,7 @@ def similarity_matrix(self, dictionary, tfidf=None, threshold=0.0, exponent=2.0, | |||
if w1 not in self.vocab: | |||
num_skipped += 1 | |||
continue # A word from the dictionary is not present in the word2vec model. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated change
gensim.models.KeyedVectors.similarity_matrix
method. Fix #1961
Looks ok to me except for the exponent test, which requires smaller precision ( |
Regarding the tfidf test: It seems like the test should highlight that fewer operations are being performed when the tfidf model is supplied. Does python have a good way to measure such a thing (it seems that things like timeit are not exactly what we want here)? If python does, what would be the smallest example for which it would make a difference? Does the benefit kick in for the common_corpus in any measurable way? Are there examples of similar testing scenarios in the repository that I could look at? A low-hanging test would be to check that the output is consistent whether or not the tfidf is supplied. I can do this with the common corpus. However, it seems like there should also be some test that helps to demonstrate/highlight the specific purpose of the tfidf parameter. |
Cool. I beg your patience for a few more questions so I can make sure I understand the big picture. If we are doing reordering of any kind (whether rows or columns) is it correct to say that before we go to compute softcossim here, both the documents (vec1, vec2) and the similarity matrix (both rows and columns) need to have a consistent notion of term/feature order? If so, is that part of what this block of code is meant to accomplish through word_indices? |
It is assumed that the term ids are consistent between the vectors and the similarity matrix, i.e. that they were all produced using the same dictionary. With |
Do you want to finish the PR, @nataliedurgin, or should I add the finishing touches? |
I think I will have some time today/tonight, but if I don't have it by tomorrow morning, feel free to finish it off. |
Ok, I will finish the missing test during the weekend. |
@nataliedurgin continued by @Witiko in #1984. |
Resolves #1961.
Changes allow the similarity_matrix unit tests to be found and run.