
# BELB: Biomedical Entity Linking Benchmark

The Biomedical Entity Linking Benchmark (BELB) is a collection of datasets and knowledge bases to train and evaluate biomedical entity linking models.

## Citing

If you use BELB in your work, please cite:

```bibtex
@article{10.1093/bioinformatics/btad698,
    author = {Garda, Samuele and Weber-Genzel, Leon and Martin, Robert and Leser, Ulf},
    title = {{BELB}: a {B}iomedical {E}ntity {L}inking {B}enchmark},
    journal = {Bioinformatics},
    pages = {btad698},
    year = {2023},
    month = {11},
    issn = {1367-4811},
    doi = {10.1093/bioinformatics/btad698},
    url = {https://doi.org/10.1093/bioinformatics/btad698},
    eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad698/53483107/btad698.pdf},
}
```

## Data

### Knowledge Bases

| Knowledge base | Entity | Public | Versioned | Website | Download |
| --- | --- | --- | --- | --- | --- |
| NCBI Gene | Gene | | | homepage | kb, history |
| NCBI Taxonomy | Species | | | homepage | kb, history |
| CTD Diseases (MEDIC) | Disease | | | homepage | kb |
| CTD Chemicals | Chemical | | | homepage | kb |
| dbSNP | Variant | | | homepage | kb, history |
| Cellosaurus | Cell line | | | homepage | kb, history |
| UMLS | General | | | homepage | - |

### Corpora

| Corpus | Entity | Public | Website | Download |
| --- | --- | --- | --- | --- |
| GNormPlus (improved BC2) | Gene | | homepage | link |
| NLM-Gene | Gene | | homepage | link |
| NCBI-Disease | Disease | | homepage | link |
| BC5CDR | Disease, Chemical | | homepage | link |
| NLM-Chem | Chemical | | homepage | link |
| Linnaeus | Species | | homepage | link |
| S800 | Species | | homepage | link |
| BioID | Cell, Species, Gene | | homepage | link |
| Osiris | Gene, Variant | | homepage | link |
| Thomas2011 | Variant | | homepage | link |
| tmVar (v3) | Gene, Species, Variant | | homepage | link |
| MedMentions | UMLS | | homepage | link |

## Setup

We assume that all data is stored in a single directory. This reduces flexibility, but since all the data (corpora and KBs) is interconnected, it is a trade-off we make to ease accessibility.

### PubTator database

Download the PubTator raw data (compressed: ~19GB) and the PMCID->PMID mapping (compressed: ~155MB). These are needed to add annotations to certain corpora and to add text to those corpora which provide only annotations.

```shell
mkdir -p <PUBTATOR>
cd <PUBTATOR>
wget https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/bioconcepts2pubtatorcentral.offset.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz
python -m scripts.build_pubtator \
       --pubtator <PUBTATOR>/bioconcepts2pubtatorcentral.offset.gz \
       --pmicid_pmid <PUBTATOR>/PMC-ids.csv.gz \
       --output pubtator.db \
       --overwrite
```
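The offset file uses PubTator's plain-text exchange format: one blank-line-separated block per document, with a `PMID|t|title` line, a `PMID|a|abstract` line, and then tab-separated annotation lines. The snippet below is an illustrative sketch of how such a file can be read; it is not the `scripts.build_pubtator` implementation, and the dictionary layout is an assumption made here for clarity.

```python
import gzip
from typing import Dict, Iterator


def parse_pubtator(path: str) -> Iterator[Dict]:
    """Yield one document per blank-line-separated block of the
    PubTator offset format."""

    def empty() -> Dict:
        return {"pmid": None, "title": "", "abstract": "", "annotations": []}

    doc = empty()
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                # blank line closes the current document
                if doc["pmid"] is not None:
                    yield doc
                doc = empty()
            elif "\t" in line:
                # pmid <TAB> start <TAB> end <TAB> mention <TAB> type <TAB> identifier
                _, start, end, mention, etype, identifier = line.split("\t")[:6]
                doc["annotations"].append(
                    {
                        "start": int(start),
                        "end": int(end),
                        "mention": mention,
                        "type": etype,
                        "id": identifier,
                    }
                )
            else:
                pmid, kind, text = line.split("|", 2)
                doc["pmid"] = pmid
                if kind == "t":
                    doc["title"] = text
                elif kind == "a":
                    doc["abstract"] = text
    if doc["pmid"] is not None:
        yield doc
```

Offsets in annotation lines are character offsets into the concatenated title and abstract, which is why the script needs the full text to attach annotations to corpora.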

### Knowledge Bases

All knowledge bases will be automatically downloaded for you, with two exceptions: dbSNP and UMLS.

#### dbSNP

As dbSNP is a large resource (>100GB), it is best to launch a separate process to fetch it.

Essentially it boils down to:

```shell
mkdir -p <DBSNP>
cd <DBSNP>

echo "Fetch dbSNP latest release..."
wget --continue "ftp://ftp.ncbi.nlm.nih.gov/snp/redesign/latest_release/JSON/refsnp-chr*.bz2"
wget --continue "ftp://ftp.ncbi.nlm.nih.gov/snp/redesign/latest_release/JSON/refsnp-unsupported.json.bz2"
wget --continue "ftp://ftp.ncbi.nlm.nih.gov/snp/redesign/latest_release/JSON/refsnp-withdrawn.json.bz2"

echo "Identify corrupted files: please delete and re-initiate download for all corrupted files..."
find . -name "*.bz2" -exec bunzip2 --test {} \;
```
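If you prefer to run the integrity check from Python, a minimal equivalent of `bunzip2 --test` might look like the sketch below (illustrative only, not part of BELB; the `find_corrupted` helper is an assumption). It stream-decompresses each file without writing the output anywhere, so it needs no extra disk space.

```python
import bz2
from pathlib import Path
from typing import List


def find_corrupted(directory: str, chunk_size: int = 1 << 20) -> List[str]:
    """Stream-decompress every .bz2 file in `directory` and return the
    names of those that fail, mirroring `bunzip2 --test`."""
    corrupted = []
    for path in sorted(Path(directory).glob("*.bz2")):
        try:
            with bz2.open(path, "rb") as handle:
                # read and discard; a bad stream raises during read
                while handle.read(chunk_size):
                    pass
        except (OSError, EOFError):  # invalid stream or truncated file
            corrupted.append(path.name)
    return corrupted
```

Files reported by `find_corrupted` should be deleted and re-downloaded, as the shell snippet above suggests.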

See here for more details.

#### UMLS

See here for more details on how to request a license.

You need to download the 2017AA full version, as this is the one used by the MedMentions corpus.

In principle the parser should work with later versions too: it expects as input a folder (usually called META) containing the files MRCONSO.RRF and MRCUI.RRF.

The 2017AA is the last release that does not provide direct access to the UMLS raw data ("Metathesaurus Files"). To access the data without setting up a MySQL database, you can do the following:

```shell
unzip umls-2017AA-full.zip
cd 2017AA-full
# poorly disguised zip files...
unzip 2017aa-1-meta.nlm
unzip 2017aa-2-meta.nlm
cd 2017AA/META
gunzip MRCONSO.RRF.aa.gz MRCONSO.RRF.ab.gz MRCUI.RRF.gz
cat MRCONSO.RRF.aa MRCONSO.RRF.ab > MRCONSO.RRF
```
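For orientation, MRCONSO.RRF is plain pipe-delimited text whose column layout is fixed by the UMLS Rich Release Format. The sketch below is illustrative only (it is not BELB's UMLS parser; the `load_cui_names` helper and its output layout are assumptions) and shows how the file can be read directly:

```python
import csv
from collections import defaultdict
from typing import Dict, Set

# Column layout of MRCONSO.RRF (pipe-delimited, one trailing "|" per row)
MRCONSO_FIELDS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def load_cui_names(path: str, lang: str = "ENG") -> Dict[str, Set[str]]:
    """Map each UMLS concept (CUI) to its surface names (STR) for one
    language, skipping rows marked obsolete (SUPPRESS == "O")."""
    names = defaultdict(set)
    with open(path, encoding="utf-8") as handle:
        # QUOTE_NONE: RRF is not quoted CSV; names may contain quote chars
        for row in csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE):
            record = dict(zip(MRCONSO_FIELDS, row))
            if record["LAT"] != lang or record["SUPPRESS"] == "O":
                continue
            names[record["CUI"]].add(record["STR"])
    return dict(names)
```

MRCUI.RRF follows the same pipe-delimited convention and records CUIs that were retired or merged between releases.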

Once you have downloaded these two resources you can launch the script:

```shell
python -m belb.scripts.build_kbs --dir <BELB> --cores 20 --umls <path/to/umls/META> --dbsnp <path/to/dbsnp>
```

This will fetch the data of all the other KBs, convert it to a unified schema and store it as TSV files.

Each KB can also be processed individually with its corresponding module, e.g.:

```shell
python -m belb.kbs.umls \
       --dir /belb/directory \
       --data_dir /path/to/umls/data \
       --db ./db.yaml
```

By default all KBs are stored as SQLite databases. You can edit db.yaml if you wish to store the data in a database server instead. This feature is only partially tested and supports only Postgres.

### Corpora

Once all KBs are ready, you can create all benchmark corpora via:

```shell
python -m belb.scripts.build_corpora --dir <BELB> --pubtator <BELB>/pubtator/pubtator.db
```

Similarly to the KBs, you can also create a single corpus:

```shell
python -m belb.corpora.ncbi_disease --dir /belb/directory --sentences
```

This will fetch the NCBI-Disease corpus, preprocess it, split the text into sentences (`--sentences`) and store it in the BELB directory.

## API

Every resource (corpus, KB) is represented by a module which also acts as a standalone script. This means you can access a resource programmatically:

```python
from belb.kbs.kb import BelbKb
from belb.kbs.ncbi_gene import NcbiGeneKbConfig
from belb.corpora.nlm_gene import NlmGeneCorpusParser
```

For ease of access we provide two classes to instantiate corpora and KBs respectively, simply by providing an identifying name (a poor reproduction of the Auto* classes in the transformers library).

```python
from belb import AutoBelbCorpus, AutoBelbKb
from belb.resources import Corpora, Kbs

corpus = AutoBelbCorpus.from_name(directory="path_to_belb", name=Corpora.NCBI_DISEASE.name)
kb = AutoBelbKb.from_name(directory="path_to_belb", name=Kbs.CTD_DISEASES.name)
```

## Roadmap

Datasets:

Knowledge Bases:

## Snapshot

Create snapshots regularly for ease of reproducibility. This would require contacting the resource providers and verifying that it is doable, i.e. redistribution issues may arise.