Regarding the bio_embedding_intrinsic download file #1
@coolabhishek The wiki page explains how to load BioWordVec: https://github.com/ncbi-nlp/BioSentVec/wiki
Hi, the paper says that all words are converted to lower case. So if I use the model file to get the word vector for a word that contains capital letters (e.g. Adrenaline), how will the word embedding be computed for such words, since there will be no n-grams with capital letters? Could you please help me here? Thanks,
Here's my understanding of how the embedding vector would be computed for Adrenaline. BioWordVec - improving biomedical word embeddings with subword information and MeSH by Zhang et al. (2019) mentions that all words were lowercased:
Here it mentions how the subword embedding model is used to compute word embeddings:
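The composition itself can be sketched in a few lines. This is a toy illustration of the subword idea, not the actual FastText code, and the lookup table in the usage below is hypothetical:

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    w = "<" + word + ">"
    return [w[i:i + n] for i in range(len(w))
            for n in range(minn, maxn + 1) if i + n <= len(w)]

def word_vector(word, ngram_vectors):
    """Average the vectors of the word's n-grams (hypothetical lookup table)."""
    rows = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    return [sum(c) / len(rows) for c in zip(*rows)]
```

Because every n-gram (seen during training or not) ultimately maps to some row of the n-gram matrix via hashing, even a capitalized form like Adrenaline still composes to a vector.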
Now let's first look into FastText's source code:
This calls
In this function,
https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc#L91 This checks whether the word is present in the vocabulary or not. https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc#L172
This function computes character n-grams, which are hashed using the Fowler-Noll-Vo hashing function, as mentioned in Enriching Word Vectors with Subword Information by Bojanowski et al. (2017).
Now let's look into gensim's FastText source code: loading the model, and extracting the word vector. You can also refer to Polm23's answer in https://stackoverflow.com/questions/50828314/how-does-the-gensim-fasttext-pre-trained-model-get-vectors-for-out-of-vocabulary
Importance of upper case in corpus text: though all words in BioWordVec were lowercased, there's a discussion thread in fastText that mentions why we might want to keep upper-case characters in the corpus.
Thanks for your reply. It was really informative. I still have two questions:
Thanks,
Regarding your 1st question: Execute the following script in
I am getting the output as
Regarding your 2nd question: From the FastText paper:
Along with the n-grams of subwords, it will create an n-gram of the word itself, provided it's within the max size, i.e. 6. Thanks for correcting me. In the source code, have a look at
For i = 0, this part of the code will create the n-gram for the entire word. In case you are wondering what does
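A short sketch of that loop (mirroring computeSubwords with the default minn=3, maxn=6; the function name is mine, not fastText's) makes the ghrl case concrete. Since '<ghrl>' is only 6 characters with the boundary markers, the whole bracketed word appears among its own n-grams:

```python
def compute_subwords(word, minn=3, maxn=6):
    """All character n-grams of '<word>' with lengths minn..maxn."""
    w = "<" + word + ">"  # fastText adds '<' and '>' word-boundary markers
    return [w[i:i + n] for i in range(len(w))
            for n in range(minn, maxn + 1) if i + n <= len(w)]

print(compute_subwords("ghrl"))
# ['<gh', '<ghr', '<ghrl', '<ghrl>', 'ghr', 'ghrl', 'ghrl>', 'hrl', 'hrl>', 'rl>']
```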
Thanks @kaushikacharya, the first question is clear: the substring Adr would hash into some bucket, and the n-grams in that bucket will give the embeddings for the substring. For the second question, I understood that an n-gram for the entire word "ghrl" would be created, but my question is: when I ask the FastText model for the embedding of "ghrl", will it just return the vector learned for the entire word "ghrl", or will all its constituent n-grams (<gh, ghr, etc.) be added along with ghrl to get its embedding?
@adijad20
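If it helps: as far as I can tell from fastText's getWordVector/getSubwords, for an in-vocabulary word the returned vector is the average of the word's own input row together with all of its n-gram rows, so "ghrl" contributes both its whole-word vector and the <gh, ghr, etc. vectors. A toy sketch (the lookup tables below are hypothetical, not the real model):

```python
def in_vocab_vector(word, word_rows, ngram_rows, minn=3, maxn=6):
    """Sketch of getWordVector for an in-vocabulary word: average the word's
    own input row together with all of its n-gram rows."""
    w = "<" + word + ">"
    grams = [w[i:i + n] for i in range(len(w))
             for n in range(minn, maxn + 1) if i + n <= len(w)]
    rows = [word_rows[word]] + [ngram_rows[g] for g in grams if g in ngram_rows]
    return [sum(c) / len(rows) for c in zip(*rows)]
```

For an out-of-vocabulary word the word row is simply absent, and the average is taken over the n-gram rows alone.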
I see the following error while loading the model:
UnpicklingError: unpickling stack underflow
Looks like this could be due to the old scipy format of the saved file. Is there a way to get the file in txt format?
Thanks
Abhishek