Home
Unfortunately, we cannot provide the corpora because of copyright restrictions. The PubMed abstracts can be downloaded from https://www.ncbi.nlm.nih.gov/pubmed. The MIMIC-III Clinical Database can be downloaded from https://physionet.org/works/MIMICIIIClinicalDatabase/access.shtml.
BioWordVec is distributed in the binary word2vec C format. One way to read the model is with gensim. The following example is copied from the gensim website.
To use the BioWordVec vectors:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(filename, binary=True)
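For readers curious about what the binary word2vec C format contains, here is a minimal sketch of a reader written with the standard library only. It assumes the common layout: a text header with the vocabulary size and dimension, then for each entry the word, a space, and the vector as little-endian 32-bit floats, optionally followed by a newline. The function name `read_word2vec_binary` and the toy data are hypothetical, chosen for illustration; for real use, the gensim call above is the better choice.

```python
import io
import struct


def read_word2vec_binary(stream, limit=None):
    """Parse word2vec C binary data from a binary file-like object.

    Returns a dict mapping each word to a tuple of floats.
    `limit` optionally caps the number of entries read.
    """
    # Header line: "<vocab_size> <dim>\n" in plain text.
    vocab_size, dim = (int(x) for x in stream.readline().decode("utf-8").split())
    count = vocab_size if limit is None else min(limit, vocab_size)

    vectors = {}
    for _ in range(count):
        # Read the word byte by byte until the separating space.
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that may separate entries
                chars.extend(ch)
        word = chars.decode("utf-8", errors="ignore")

        # Read the vector: dim float32 values, assumed little-endian.
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of data while reading a vector")
        vectors[word] = struct.unpack("<%df" % dim, raw)
    return vectors


def make_toy_file():
    """Build a tiny in-memory file in the same layout (illustrative data only)."""
    buf = io.BytesIO()
    buf.write(b"2 3\n")
    for word, vec in (("heart", (1.0, 0.0, 0.5)), ("cardiac", (0.5, 0.25, 0.0))):
        buf.write(word.encode("utf-8") + b" " + struct.pack("<3f", *vec) + b"\n")
    return buf.getvalue()


vectors = read_word2vec_binary(io.BytesIO(make_toy_file()))
print(sorted(vectors))
```

To read a file on disk, open it in binary mode (`open(filename, "rb")`) and pass the handle to the function. The `limit` argument is useful for inspecting the first few entries without loading the whole vocabulary.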
To use the BioWordVec model:
Based on a recent speed test, we recommend using the fasttext library to load the BioWordVec model:
import fasttext
model = fasttext.load_model(filename)
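Alternatively, you can use gensim. The call below works with gensim 3.x; gensim 4.0 removed load_fasttext_format in favor of gensim.models.fasttext.load_facebook_model: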
from gensim.models import FastText
model = FastText.load_fasttext_format(filename)
BioSentVec is built upon sent2vec. To infer sentence embeddings, please see the "Directly from Python" section of the sent2vec documentation. The following example is copied from their website:
import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .")
embs = model.embed_sentences(["first sentence .", "another sentence"])
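A common next step with sentence embeddings is to compare two of them. The sketch below computes cosine similarity in plain Python; it is an illustration rather than part of BioSentVec or sent2vec. The function name `cosine_similarity` and the toy vectors are hypothetical. It expects one-dimensional sequences of numbers, so when working with the arrays returned by sent2vec, pass a single row (for example `embs[0]`).

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; report 0.0 instead of dividing by zero.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real sentence embeddings.
emb_a = [0.2, 0.1, 0.7]
emb_b = [0.4, 0.2, 1.4]   # same direction as emb_a
emb_c = [0.7, -0.1, -0.2]

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values near 0 or below indicate unrelated or opposed directions.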
The preprocessing methods can be found in the src folder. In general, the text was first tokenized using NLTK and then lowercased.
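The tokenize-then-lowercase order described above can be sketched as follows. This is not the authors' script: it swaps in a simple regular-expression tokenizer as a stand-in for NLTK so that the example runs without extra dependencies, and the name `preprocess` is hypothetical. The stand-in splits hyphenated terms such as "IL-2" into separate tokens, which NLTK's tokenizers may not do, so use the scripts in the src folder when the output needs to match the released models.

```python
import re

# Stand-in tokenizer (assumption): runs of word characters, or single
# punctuation marks. The original pipeline used NLTK instead.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def preprocess(text):
    """Tokenize the text, lowercase each token, and join with spaces."""
    tokens = _TOKEN_RE.findall(text)
    return " ".join(token.lower() for token in tokens)


print(preprocess("Breast cancers with HER2 amplification have a higher risk of CNS metastasis."))
```

The resulting string, with tokens separated by single spaces, is the kind of input that can be passed to `embed_sentence`.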
The bash scripts can be found in the src folder.
@article{chen2018biosentvec,
title={BioSentVec: creating sentence embeddings for biomedical texts},
author={Chen, Qingyu and Peng, Yifan and Lu, Zhiyong},
journal={arXiv preprint arXiv:1810.09302},
year={2018}
}