Bucket bug for FastText.load_fasttext_format #1779
Comments
Thanks for the report, @saroufimc1. We'll try to reproduce it and look into it with @manneshiva later (the file takes a very long time to download).
Hi @saroufimc1, thanks for the detailed report and ideas. Re: computing n-grams: no, the fastText .bin file doesn't include the n-grams themselves. But you are right that the loading time is quite high, and it might be possible to get away without computing the n-grams from the vocab ourselves. Will post about it in more detail in #1261.
Thanks!
Edit: I'm mistaken, this is indeed a bug.
Discussed this with @jayantj and narrowed the bug down to an error while trimming |
Thanks @manneshiva! In other words, num_words_in_vocab corresponds to the embeddings of the vocabulary words.
@saroufimc1 The |
Reproduced with 3.6.0:
Hi,
I tried loading a pretrained model from Facebook's fastText into gensim using FastText.load_fasttext_format, and it looks like there is a bug to be fixed for bucket here as well.
I noticed that for bucket = 2,000,000 we had model.wv.syn0_ngrams.shape[0] = 7,221,731 instead of 2,000,000.
Also, in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/wrappers/fasttext.py
I don't see why we have to compute the n-grams from the word vocabulary again. Aren't these already imported from the fastText .bin file?
Steps/Code/Corpus to Reproduce
First download the zip file from https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip.
Then unzip and put the file 'wiki.en.bin' in your working directory.
Expected Results
A value of 2,000,000, which is the default value of bucket.
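To illustrate why the number of n-gram rows is expected to be capped at bucket, here is a minimal, self-contained sketch of hashing character n-grams into a fixed number of buckets. It is not gensim's or fastText's actual code: the helper names (fnv1a_32, char_ngrams, bucket_rows) are hypothetical, the hash is plain 32-bit FNV-1a used as a stand-in for whatever fastText does internally, and the n-gram lengths 3 to 6 are assumed defaults.

```python
def fnv1a_32(text: str) -> int:
    """Plain 32-bit FNV-1a over UTF-8 bytes (stand-in hash, an assumption)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list:
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    ]


def bucket_rows(words, bucket: int):
    """Return (distinct n-grams, distinct bucket indices they hash to)."""
    ngrams = {gram for word in words for gram in char_ngrams(word)}
    rows = {fnv1a_32(gram) % bucket for gram in ngrams}
    return len(ngrams), len(rows)


# Toy vocabulary with a deliberately tiny bucket count.
n_ngrams, n_rows = bucket_rows(["loading", "bucket", "vector", "ngram"], bucket=8)
print(n_ngrams, n_rows)
```

Because every n-gram is mapped to an index modulo bucket, the number of distinct rows can never exceed bucket, however many n-grams the vocabulary produces. Once there are more n-grams than buckets, some of them must share a row.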
Actual Results
7,221,731, which here is equal to len(model.wv.ngrams).
In other words, it looks like there were no collisions even though we had more n-grams than buckets.
Also, please note that it took 10 minutes to load the fasttext model. I wonder if some parts of the code (especially in load_vectors) are really needed.
Thanks,
Carl
Versions