
SparseTermSimilarityMatrix - TypeError: 'numpy.float32' object is not iterable #2496

Closed
magiob opened this issue May 17, 2019 · 12 comments · Fixed by #2497

magiob commented May 17, 2019

I am using gensim 3.7.3 and Python 3.6.

I am following the exact SoftCosineSimilarity example at https://radimrehurek.com/gensim/similarities/docsim.html, but with my own dataset and embeddings trained with FastText.
Dictionary and WordEmbeddingSimilarityIndex execute properly, but I then get an error with SparseTermSimilarityMatrix. I found a similar issue that was solved in the pull request below, yet I still get this error. However, the exact same code worked with Word2Vec and gensim's bundled common_texts. Why doesn't it work in my case? Is it related to FastText?

#2356

My code:

from gensim.models import FastText
from gensim.corpora import Dictionary
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix

model = FastText.load('fasttext_vector_100')
# this line works
model.wv.most_similar(positive=['test'], topn=2)
termsim_index = WordEmbeddingSimilarityIndex(model.wv)
# texts is similar to common_texts, list of lists of strings
dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(document) for document in texts]
# it fails here
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
The traceback:

TypeError                                 Traceback (most recent call last)
<ipython-input-129-c33cf3beaa3e> in <module>()
----> 1 similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)  # construct similarity matrix

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\similarities\termsim.py in __init__(self, source, dictionary, tfidf, symmetric, positive_definite, nonzero_limit, dtype)
    232             most_similar = [
    233                 (dictionary.token2id[term], similarity)
--> 234                 for term, similarity in index.most_similar(t1, num_rows)
    235                 if term in dictionary.token2id]
    236 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\similarities\termsim.py in <listcomp>(.0)
    231             num_rows = nonzero_limit - num_nonzero
    232             most_similar = [
--> 233                 (dictionary.token2id[term], similarity)
    234                 for term, similarity in index.most_similar(t1, num_rows)
    235                 if term in dictionary.token2id]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\keyedvectors.py in most_similar(self, t1, topn)
   1418         else:
   1419             most_similar = self.keyedvectors.most_similar(positive=[t1], topn=topn, **self.kwargs)
-> 1420             for t2, similarity in most_similar:
   1421                 if similarity > self.threshold:
   1422                     yield (t2, similarity**self.exponent)

TypeError: 'numpy.float32' object is not iterable

piskvorky (Owner) commented:

@Witiko can you please have a look?


mkoa commented May 17, 2019

Hi there,

I came across the same issue a few days ago.
I was about to open an issue as well to offer a fix I found.

The issue lies at line 248 of gensim/similarities/termsim.py.
There is a test intended to ensure that no more than nonzero_limit elements are inserted. The test is off by one and allows the limit to be exceeded, which makes the loop break on the next iteration.

The fix is simply to replace the "<=" test with a "<".
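
For illustration, a minimal sketch of the pattern I mean (a paraphrase of the suggestion, not the actual gensim source):

# Hypothetical paraphrase of the guard around the insertion loop.
if column_nonzero <= nonzero_limit:  # off by one: admits nonzero_limit + 1 elements
    ...
# Replacing "<=" with "<" stops the insertion exactly at the limit:
if column_nonzero < nonzero_limit:
    ...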

Hope this helps you guys!

Witiko (Contributor) commented May 17, 2019

@piskvorky I will take a closer look when I am on a PC, but this seems to be the same issue as the one reported by @tvrbanec earlier (#2105 (comment), #2356, #2461). There is a large amount of duplication among the most_similar methods, and it seems that some implementations (FastText) still interpret topn=0 as topn=None, returning an array of all similarities instead of a list of (word, similarity) pairs.
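
For example (a minimal sketch of the suspected behavior; model stands for any trained FastText model):

# With a positive topn, most_similar returns a list of (word, similarity) pairs:
model.wv.most_similar(positive=['test'], topn=2)
# -> e.g. [('tests', 0.91...), ('testing', 0.87...)]
# With topn=0 treated as topn=None, it instead returns a raw array of
# similarities, so the caller's "for t2, similarity in most_similar" tries to
# unpack numpy.float32 scalars and fails with the reported TypeError.
model.wv.most_similar(positive=['test'], topn=0)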

@mkoa Thank you for the report, but this seems unrelated. Moreover, the limit should not be broken, because column_nonzero counts the diagonal elements, whereas nonzero_limit is the maximum number of nonzero elements outside the diagonal, so the invariant should be preserved (although the naming is a little confusing, I'll admit).


mkoa commented May 17, 2019

@Witiko Thank you very much for the heads-up!
You are right, this is the same issue as the one you reference above, which I had missed.

I can confirm I got the same error as @magiob, but with a word2vec model on my side. index.most_similar(t1, num_rows) is called with num_rows=0 and returns a numeric array even with the latest pull.

My fix aims at directly preventing a call to index.most_similar with topn=0, then, but does not address the actual root cause. Thanks for the explanation!

Witiko (Contributor) commented May 17, 2019

@mkoa That is a useful suggestion. Changing termsim.py as follows should fix this issue:

232,235c232,238
<             most_similar = [
<                 (dictionary.token2id[term], similarity)
<                 for term, similarity in index.most_similar(t1, num_rows)
<                 if term in dictionary.token2id]
---
>             if num_rows > 0:
>                 most_similar = [
>                     (dictionary.token2id[term], similarity)
>                     for term, similarity in index.most_similar(t1, topn=num_rows)
>                     if term in dictionary.token2id]
>             else:
>                 most_similar = []

Even though this does not address the root cause, it is still good defensive programming.

Witiko (Contributor) commented May 17, 2019

Suggested fixes:

Witiko (Contributor) commented May 17, 2019

I found the root cause: num_rows is an np.int64, whereas most_similar requires topn to be an int. That is good news, because it means the fixes in #2356 and #2461 were fine. It is also good motivation for another fix, since most_similar should accept any integer (numbers.Integral), not just int.
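
The mismatch is easy to demonstrate (standard Python and NumPy behavior):

import numbers
import numpy as np

num_rows = np.int64(10)
print(isinstance(num_rows, int))               # False: np.int64 is not a subclass of int
print(isinstance(num_rows, numbers.Integral))  # True: NumPy integer types are
                                               # registered with the Integral ABC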

@piskvorky I hope I will not be the root cause of a 3.7.4 bugfix release. 😅

tvrbanec commented:

Yes, I can confirm it: code that used to work on gensim==3.7.2 now, on gensim==3.7.3, throws the error
TypeError: cannot unpack non-iterable numpy.float32 object
when executing:
SparseTermSimilarityMatrix(similarity_index, dictionary)

mpenkov self-assigned this May 21, 2019

piofel commented Jun 12, 2019

I use Gensim 3.7.3. When I executed:

word_vectors = Word2Vec.load(WORD_EMBEDDING_DIR + WORD_EMBEDDING_FILENAME).wv
similarity_matrix = word_vectors.similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100)

I received:
File "/home/piotr/.local/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1420, in most_similar for t2, similarity in most_similar: TypeError: cannot unpack non-iterable numpy.float32 object

And I fixed it completely by #2356 (comment)

Witiko (Contributor) commented Jun 12, 2019

@piofel Thank you for confirming the fix. After #2497 is merged, this should no longer be an issue.

mehmetilker commented:

I am having the same problem with my own word2vec model while following the tutorial here:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb

Is there any timetable for publishing the fix? Or any workaround other than downgrading to 3.7.2?

Witiko (Contributor) commented Jun 19, 2019

@mehmetilker: The fix is published, see #2496 (comment). Hopefully, #2497 will be merged soon; what do you think, @mpenkov?
