
SparseTermSimilarityMatrix - TypeError: 'numpy.float32' object is not iterable #2496

Closed
magiob opened this issue May 17, 2019 · 12 comments · Fixed by #2497

magiob commented May 17, 2019

I am using gensim 3.7.3 and Python 3.6.

I am following the exact SoftCosineSimilarity example at https://radimrehurek.com/gensim/similarities/docsim.html, but with my own dataset and embeddings trained with FastText.
Dictionary and WordEmbeddingSimilarityIndex execute properly, but I then get an error with SparseTermSimilarityMatrix. I found a similar issue that was solved in the pull request below, yet I still get this error. However, the exact same code worked with Word2Vec and gensim's bundled common_texts. Why doesn't it work in my case? Is it related to FastText?

#2356

My code:

from gensim.models import FastText
from gensim.corpora import Dictionary
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix

model = FastText.load('fasttext_vector_100')
# this line works
model.wv.most_similar(positive=['test'], topn=2)
termsim_index = WordEmbeddingSimilarityIndex(model.wv)
# texts is similar to common_texts, list of lists of strings
dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(document) for document in texts]
# it fails here
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
The traceback:

TypeError                                 Traceback (most recent call last)
<ipython-input-129-c33cf3beaa3e> in <module>()
----> 1 similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)  # construct similarity matrix

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\similarities\termsim.py in __init__(self, source, dictionary, tfidf, symmetric, positive_definite, nonzero_limit, dtype)
    232             most_similar = [
    233                 (dictionary.token2id[term], similarity)
--> 234                 for term, similarity in index.most_similar(t1, num_rows)
    235                 if term in dictionary.token2id]
    236 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\similarities\termsim.py in <listcomp>(.0)
    231             num_rows = nonzero_limit - num_nonzero
    232             most_similar = [
--> 233                 (dictionary.token2id[term], similarity)
    234                 for term, similarity in index.most_similar(t1, num_rows)
    235                 if term in dictionary.token2id]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\keyedvectors.py in most_similar(self, t1, topn)
   1418         else:
   1419             most_similar = self.keyedvectors.most_similar(positive=[t1], topn=topn, **self.kwargs)
-> 1420             for t2, similarity in most_similar:
   1421                 if similarity > self.threshold:
   1422                     yield (t2, similarity**self.exponent)

TypeError: 'numpy.float32' object is not iterable

piskvorky (Owner) commented:

@Witiko can you please have a look?


mkoa commented May 17, 2019

Hi there,

I came across the same issue a few days ago.
I was about to open an issue as well to offer a fix I found.

The issue lies at line 248 of gensim/similarities/termsim.py.
There is a test intended to ensure that no more than nonzero_limit elements are inserted. The test is off by one and allows the limit to be exceeded, which makes the loop break on the next iteration.

The fix is simply to replace the "<=" test with a "<".
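
For illustration, a minimal sketch of the pattern I mean (a paraphrase of the suggestion, not the actual gensim source):

# Hypothetical paraphrase of the guard around the insertion loop.
if column_nonzero <= nonzero_limit:  # off by one: admits nonzero_limit + 1 elements
    ...
# Replacing "<=" with "<" stops the insertion exactly at the limit:
if column_nonzero < nonzero_limit:
    ...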

Hope this helps you guys!

Witiko (Contributor) commented May 17, 2019

@piskvorky I will take a closer look when I am on a PC, but this seems to be the same issue as the one reported by @tvrbanec earlier (#2105 (comment), #2356, #2461). There is a large amount of duplication among the most_similar methods, and it seems that some implementations (FastText) still interpret topn=0 as topn=None, returning an array of all similarities instead of a list of (word, similarity) pairs.
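
For example (a minimal sketch of the suspected behavior; model stands for any trained FastText model):

# With a positive topn, most_similar returns a list of (word, similarity) pairs:
model.wv.most_similar(positive=['test'], topn=2)
# -> e.g. [('tests', 0.91...), ('testing', 0.87...)]
# With topn=0 treated as topn=None, it instead returns a raw array of
# similarities, so the caller's "for t2, similarity in most_similar" tries to
# unpack numpy.float32 scalars and fails with the reported TypeError.
model.wv.most_similar(positive=['test'], topn=0)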

@mkoa Thank you for the report, but this seems unrelated. Moreover, the limit should not be broken, because column_nonzero counts the diagonal elements, whereas nonzero_limit is the maximum number of nonzero elements outside the diagonal, so the invariant should be preserved (although the naming is a little confusing, I'll admit).


mkoa commented May 17, 2019

@Witiko Thank you very much for the heads-up!
You are right, this is the same issue as the one you reference above, which I had missed.

I can confirm I got the same error as @magiob, but with a word2vec model on my side. index.most_similar(t1, num_rows) is called with num_rows=0 and returns a numeric array even with the latest pull.

My fix aims at directly preventing a call to index.most_similar with topn=0, then, but does not address the actual root cause. Thanks for the explanation!

Witiko (Contributor) commented May 17, 2019

@mkoa That is a useful suggestion. Changing termsim.py as follows should fix this issue:

232,235c232,238
<             most_similar = [
<                 (dictionary.token2id[term], similarity)
<                 for term, similarity in index.most_similar(t1, num_rows)
<                 if term in dictionary.token2id]
---
>             if num_rows > 0:
>                 most_similar = [
>                     (dictionary.token2id[term], similarity)
>                     for term, similarity in index.most_similar(t1, topn=num_rows)
>                     if term in dictionary.token2id]
>             else:
>                 most_similar = []

Even though this does not address the root cause, it is still good defensive programming.

Witiko (Contributor) commented May 17, 2019

Suggested fixes:

Witiko (Contributor) commented May 17, 2019

I found the root cause: num_rows is an np.int64, whereas most_similar requires topn to be an int. That is good news, because it means the fixes in #2356 and #2461 were fine. It is also good motivation for another fix, since most_similar should accept any integer (numbers.Integral), not just int.
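
The mismatch is easy to demonstrate (standard Python and NumPy behavior):

import numbers
import numpy as np

num_rows = np.int64(10)
print(isinstance(num_rows, int))               # False: np.int64 is not a subclass of int
print(isinstance(num_rows, numbers.Integral))  # True: NumPy integer types are
                                               # registered with the Integral ABC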

@piskvorky I hope I will not be the root cause of a 3.7.4 bugfix release. 😅

tvrbanec commented:

Yes, I can confirm it: code that used to work on gensim==3.7.2 now, on gensim==3.7.3, throws the error
TypeError: cannot unpack non-iterable numpy.float32 object
when executing:
SparseTermSimilarityMatrix(similarity_index, dictionary)

mpenkov self-assigned this May 21, 2019

piofel commented Jun 12, 2019

I use Gensim 3.7.3. When I executed:

word_vectors = Word2Vec.load(WORD_EMBEDDING_DIR + WORD_EMBEDDING_FILENAME).wv
similarity_matrix = word_vectors.similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100)

I received:
File "/home/piotr/.local/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 1420, in most_similar for t2, similarity in most_similar: TypeError: cannot unpack non-iterable numpy.float32 object

And I fixed it completely by #2356 (comment)

Witiko (Contributor) commented Jun 12, 2019

@piofel Thank you for confirming the fix. After #2497 is merged, this should no longer be an issue.

mehmetilker commented:

I am having the same problem with my own word2vec model while following the tutorial here:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb

Is there any timetable for publishing the fix? Or any workaround other than downgrading to 3.7.2?

Witiko (Contributor) commented Jun 19, 2019

@mehmetilker: The fix is published, see #2496 (comment). Hopefully, #2497 will be merged soon; what do you think, @mpenkov?
