Custom glove vectors throw tuple index out of range error #1831

samrensenhouse · 2018-01-12T00:30:06Z

I tried loading in some custom glove vectors using the demo provided here:
https://github.com/stanfordnlp/GloVe/blob/master/demo.sh

I then made a directory called vectors with a vectors.50.d.bin inside as well as a vectors.txt

However, when I use the code below I get an IndexError: tuple index out of range

import spacy

parser = spacy.load('en_core_web_sm')
# Raw string so the backslashes in the Windows path are taken literally
parser.vocab.vectors.from_glove(r'C:\dev\glovepy\vectors')
spacy_doc = parser('I am happy.')
for word in spacy_doc:
    print(word.vector)

Info about spaCy

spaCy version: 2.0.5
Platform: Windows-10-10.0.16299-SP0
Python version: 3.6.3
Models: en

fabiocapsouza · 2018-01-17T20:01:25Z

I'm experiencing the same issue.
I downloaded a trained GloVe model for Portuguese from this repository. It comes as a single .txt file, so I loaded it using gensim's KeyedVectors and converted it to binary format, together with a vocab.txt file, using this command:

word_vectors.save_word2vec_format('vectors.50.f.bin', fvocab='vocab.txt', binary=True)

Then I loaded it into spaCy:

nlp = spacy.load('pt')
nlp.vocab.vectors.from_glove('/path/to/vectors')

The error happens when I try to read the has_vector or vector properties.

Information

spaCy version: 2.0.5
Platform: Ubuntu 16.04
Python version: 3.6.4
Model: pt
GloVe model: GloVe 50 dimensions

ZackKorman · 2018-01-18T14:11:43Z

I think the problem is in self.data.shape[0] * self.data.shape[1], as the GloVe array is shape (some_num,). self.data.shape[1] therefore returns the index out of range error. I don't have a fix for this, though.
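A minimal sketch of the failure described above, using only NumPy. The array `flat` is a hypothetical stand-in for the data that from_glove ends up with, not spaCy's actual code path: a 1D array has a one-element shape tuple, so reading shape[1] raises the same IndexError.

```python
import numpy as np

# Hypothetical stand-in for vector data loaded as a flat, 1D array.
flat = np.zeros(6, dtype="float32")
print(flat.shape)  # (6,)

try:
    # The shape tuple has only one entry, so index 1 does not exist.
    n_values = flat.shape[0] * flat.shape[1]
except IndexError as err:
    error_message = str(err)
    print(error_message)  # tuple index out of range
```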

imranarshad · 2018-01-18T14:34:20Z

Having the same issue.
@honnibal, is there any workaround until you get it fixed?

honnibal · 2018-01-22T18:15:23Z

Thanks for the report, especially @Lankey22 for the suggestion.

Perhaps we need this in from_glove()?

if self.data.ndim == 1:
    self.data = self.data.reshape((self.data.size//width, width))

If so, the following mitigation should work until the next version:

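import spacy

width = 50  # example value; set this to the dimensionality of your vectors
nlp = spacy.load('pt')
nlp.vocab.vectors.from_glove('/path/to/vectors')
# Reshape the flat array into one row of `width` values per word
if nlp.vocab.vectors.data.ndim == 1:
    nlp.vocab.vectors.data = nlp.vocab.vectors.data.reshape((nlp.vocab.vectors.data.size//width, width))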

You'll need to know the width of the vectors you're loading.
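The reshape step can be tried on its own without spaCy. This is only a sketch: `reshape_flat_vectors` is a hypothetical helper, not part of spaCy, and the divisibility check is an added assumption meant to catch a wrong width early.

```python
import numpy as np


def reshape_flat_vectors(data, width):
    """Return `data` as an (n_rows, width) array (hypothetical helper, not spaCy API)."""
    if data.ndim != 1:
        # Already shaped; nothing to do.
        return data
    if data.size % width != 0:
        raise ValueError(
            "array of size %d cannot be split into rows of width %d" % (data.size, width)
        )
    return data.reshape((data.size // width, width))


# Three made-up vectors of width 50, stored flat.
flat = np.arange(150, dtype="float32")
table = reshape_flat_vectors(flat, 50)
print(table.shape)  # (3, 50)
```

If the width is wrong and does not divide the array size, the helper raises a ValueError instead of silently producing misaligned rows.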

fako · 2018-02-05T13:12:57Z

I also came across this issue and I'm using the same workaround. I find it weird that from_glove is using numpy.fromfile. The documentation states that using tofile and fromfile is not suitable for data storage: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.fromfile.html

If you used np.load, it would load a 2D array if the data was stored as such. np.fromfile always loads a 1D array. I'm not 100% sure how GloVe's binary format is stored, but I would expect a 2D array. I'm loading word2vec embeddings myself and I saved the conversion in a 2D array.
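A small sketch of the difference between the two approaches, assuming plain NumPy and made-up file names in a temporary directory. It says nothing about how GloVe itself writes its binary files; it only shows that tofile/fromfile drops the shape while save/load keeps it.

```python
import os
import tempfile

import numpy as np

# A made-up 3 x 2 table standing in for word vectors.
table = np.arange(6, dtype="float32").reshape((3, 2))

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "vectors.bin")
    npy_path = os.path.join(tmp, "vectors.npy")

    table.tofile(raw_path)    # raw bytes only, no shape or dtype stored
    np.save(npy_path, table)  # .npy format, shape and dtype stored in a header

    from_raw = np.fromfile(raw_path, dtype="float32")
    from_npy = np.load(npy_path)

print(from_raw.shape)  # (6,)
print(from_npy.shape)  # (3, 2)
```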

Another thing that strikes me is that the documentation states that the dtype in the file format should be either 'f' or 'd'. That means that any file read in this manner will get flattened by np.ascontiguousarray, because neither equals the string 'float32'. After flattening, it would get reshaped again to a 2D array. The relevant line is here:

spaCy/spacy/vectors.pyx

Line 311 in 2e7391e

if dtype != 'float32':

I might have made some wrong assumptions, but it seems to me that this code is not running as efficiently as it could. It would be great to hear why certain choices were made. I love working with spaCy and hope it becomes even better in the future :)

lock · 2018-05-08T00:55:09Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the bug (Bugs and behaviour differing from documentation) label Jan 12, 2018

honnibal closed this as completed in 29897ed Jan 22, 2018

fako mentioned this issue Apr 13, 2018

Loading vectors with from_glove seems inefficient #2216

Closed

lock bot locked as resolved and limited conversation to collaborators May 8, 2018
