Regarding the bio_embedding_intrinsic download file #1
@coolabhishek The wiki page explains how to load BioWordVec: https://github.com/ncbi-nlp/BioSentVec/wiki
Hi, the paper says that all words are converted to lower case. So if I use the model file to get the word vector for a word that contains capital letters (e.g. Adrenaline), how will the word embedding be computed for such words, since there will be no n-grams with capital letters? Could you please help me here? Thanks,
Here's my understanding of how the embedding vector would be computed for Adrenaline. BioWordVec - improving biomedical word embeddings with subword information and MeSH by Zhang et al. (2019) mentions that all words were lowercased:
Here it mentions how the subword embedding model is used to compute word embeddings:
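The composition itself can be sketched in a few lines. This is a toy illustration of the subword idea, not the actual FastText code, and the lookup table in the usage below is hypothetical:

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    w = "<" + word + ">"
    return [w[i:i + n] for i in range(len(w))
            for n in range(minn, maxn + 1) if i + n <= len(w)]

def word_vector(word, ngram_vectors):
    """Average the vectors of the word's n-grams (hypothetical lookup table)."""
    rows = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    return [sum(c) / len(rows) for c in zip(*rows)]
```

Because every n-gram (seen during training or not) ultimately maps to some row of the n-gram matrix via hashing, even a capitalized form like Adrenaline still composes to a vector.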
Now let's first look into FastText's source code:
This calls
In this function,
https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc#L91 This checks whether the word is present in the vocabulary or not. https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc#L172
This function computes character n-grams, which are hashed using the Fowler-Noll-Vo hashing function, as mentioned in Enriching Word Vectors with Subword Information by Bojanowski et al. (2017).
Now let's look into gensim's FastText source code: loading the model, and extracting the word vector. You can also refer to Polm23's answer in https://stackoverflow.com/questions/50828314/how-does-the-gensim-fasttext-pre-trained-model-get-vectors-for-out-of-vocabulary
Importance of upper case in corpus text: though all words in BioWordVec were lowercased, there's a discussion thread in fastText that mentions why we might want to keep upper-case characters in the corpus.
Thanks for your reply. It was really informative. I still have two questions:
Thanks,
Regarding your 1st question: Execute the following script in
I am getting the output as
Regarding your 2nd question: From the FastText paper:
Along with the n-grams of subwords, it will create an n-gram of the word itself, provided it's within the max size, i.e. 6. Thanks for correcting me. In the source code, have a look at
For i = 0, this part of the code will create the n-gram for the entire word. In case you are wondering what does
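A short sketch of that loop (mirroring computeSubwords with the default minn=3, maxn=6; the function name is mine, not fastText's) makes the ghrl case concrete. Since '<ghrl>' is only 6 characters with the boundary markers, the whole bracketed word appears among its own n-grams:

```python
def compute_subwords(word, minn=3, maxn=6):
    """All character n-grams of '<word>' with lengths minn..maxn."""
    w = "<" + word + ">"  # fastText adds '<' and '>' word-boundary markers
    return [w[i:i + n] for i in range(len(w))
            for n in range(minn, maxn + 1) if i + n <= len(w)]

print(compute_subwords("ghrl"))
# ['<gh', '<ghr', '<ghrl', '<ghrl>', 'ghr', 'ghrl', 'ghrl>', 'hrl', 'hrl>', 'rl>']
```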
Thanks @kaushikacharya, the first question is clear: the substring Adr would hash into some bucket, and the n-grams in that bucket will give the embeddings for the substring. For the second question, I understood that an n-gram for the entire word "ghrl" would be created, but my question is: when I ask the FastText model for the embedding of "ghrl", will it just return the vector learned for the entire word "ghrl", or will all its constituent n-grams (<gh, ghr, etc.) be added along with ghrl to get its embedding?
@adijad20
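If it helps: as far as I can tell from fastText's getWordVector/getSubwords, for an in-vocabulary word the returned vector is the average of the word's own input row together with all of its n-gram rows, so "ghrl" contributes both its whole-word vector and the <gh, ghr, etc. vectors. A toy sketch (the lookup tables below are hypothetical, not the real model):

```python
def in_vocab_vector(word, word_rows, ngram_rows, minn=3, maxn=6):
    """Sketch of getWordVector for an in-vocabulary word: average the word's
    own input row together with all of its n-gram rows."""
    w = "<" + word + ">"
    grams = [w[i:i + n] for i in range(len(w))
             for n in range(minn, maxn + 1) if i + n <= len(w)]
    rows = [word_rows[word]] + [ngram_rows[g] for g in grams if g in ngram_rows]
    return [sum(c) / len(rows) for c in zip(*rows)]
```

For an out-of-vocabulary word the word row is simply absent, and the average is taken over the n-gram rows alone.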
I see the following error while loading the model:
UnpicklingError: unpickling stack underflow
Looks like this could be due to the old scipy format of the saved file. Is there a way to get the file in txt format?
Thanks
Abhishek