This repository contains the source code for the paper: Leveraging Wikidata for Biomedical Entity Linking in a Low-Resource Setting: A Case Study for German
Finetuned SapBERT for German: https://huggingface.co/permediq/SapBERT-DE
German Biomedical Entity Linking Knowledge Base (UMLS-Wikidata): https://zenodo.org/records/11003203
- Update pip:
python -m pip install --upgrade pip
- Install scispacy from the repository's setup.py:
pip install -e .
The data folder has the following structure. Folders labeled (not in repo) can be generated as described in the scispacy.ipynb notebook. A short sketch for inspecting these files follows the tree.
│ de_1k_test_query.txt # mention dataset
│ de_wikimed_bel_dev_query.txt # mention dataset
│ de_wikimed_bel_test_query.txt # mention dataset
│ de_wikimed_bel_train_query.txt # mention dataset
│ qids_with_cui_kb.txt # UMLS_Wikidata KB
│ umls_onto_all_lang_cased_wikimed_only_399931.txt # UMLS_SapBERT KB
│
├───processed (not in repo)
│ └───kbs
│ kb_from_sapbert.jsonl
│ kb_from_wikidata_sparql.jsonl
│
└───raw
├───BEL-silver-standard
│ └───WikiMed-DE-BEL
│ dev_data_bel.json
│ test_data_bel.json
│ train_data_bel.json
│ WikiMed-DE-BEL.json
│
└───kbs (not in repo)
qids_with_cui.csv
qids_with_cui_output.jsonl
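To peek at these files, here is a minimal sketch. It assumes only the JSON Lines layout implied by the .jsonl extension (one JSON record per line); the per-record schema is repo-specific and not asserted here.

```python
import json

# Inspect a processed KB file (JSON Lines: one entity record per line).
# The path follows the tree above; the record fields are repo-specific.
with open("data/processed/kbs/kb_from_wikidata_sparql.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(len(records), "entries; example keys:", list(records[0].keys()))
```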
The artifacts folder contains the ANN indexes and vectorizers for the KBs.
├───sapbert
│ concept_aliases.json
│ nmslib_index.bin
│ tfidf_vectorizer.joblib
│ tfidf_vectors_sparse.npz
│
└───sparql
concept_aliases.json
nmslib_index.bin
tfidf_vectorizer.joblib
tfidf_vectors_sparse.npz
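These filenames match scispacy's candidate-generator layout (a character-ngram TF-IDF vectorizer plus an nmslib HNSW index over sparse alias vectors), so candidate retrieval can be sketched as below. This is an illustrative sketch under that assumption, not the exact code path used in the notebooks; the query mention is made up.

```python
import json
import joblib
import nmslib
from scipy.sparse import load_npz

# Load the artifacts for one KB (paths follow the tree above).
aliases = json.load(open("artifacts/sparql/concept_aliases.json"))
vectorizer = joblib.load("artifacts/sparql/tfidf_vectorizer.joblib")
alias_vectors = load_npz("artifacts/sparql/tfidf_vectors_sparse.npz")

# For sparse vectors, nmslib needs the data re-attached before the
# serialized HNSW graph is loaded.
index = nmslib.init(method="hnsw", space="cosinesimil_sparse",
                    data_type=nmslib.DataType.SPARSE_VECTOR)
index.addDataPointBatch(alias_vectors)
index.loadIndex("artifacts/sparql/nmslib_index.bin")

# Retrieve the k nearest alias candidates for a German mention.
query = vectorizer.transform(["Zuckerkrankheit"])
ids, distances = index.knnQueryBatch(query, k=5)[0]
print([aliases[i] for i in ids])
```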
Follow the steps in the scispacy.ipynb notebook to reproduce the results.
- SapBERT-DE: a fine-tuned version of the multilingual BEL model cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR. The model is fine-tuned on the German BEL KB UMLS-Wikidata, as described in the SapBERT repo. A usage sketch follows this list.
- jinaai/jina-embeddings-v2-base-de: a German/English bilingual text embedding model.
- BAAI/bge-m3: a multilingual model that can generate dense, sparse, and ColBERT-style embeddings. We only use the dense embeddings.
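As a quick start, mentions can be embedded with SapBERT-DE using the standard SapBERT pattern (the [CLS] token embedding from a Hugging Face transformers model); the mentions below are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("permediq/SapBERT-DE")
model = AutoModel.from_pretrained("permediq/SapBERT-DE")

mentions = ["Kopfschmerz", "Diabetes mellitus Typ 2"]  # illustrative mentions
toks = tokenizer(mentions, padding=True, truncation=True,
                 max_length=25, return_tensors="pt")
with torch.no_grad():
    # SapBERT conventionally uses the [CLS] token as the mention embedding.
    embeddings = model(**toks)[0][:, 0, :]
print(embeddings.shape)  # (2, hidden_size)
```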
Install the following in the same virtual environment.
pip install torch --index-url https://download.pytorch.org/whl/cu118
You may need to select the CUDA version of torch according to your GPU.
pip install -r requirements_encoders.txt
For the Faiss ANN index (a retrieval sketch follows):
pip install faiss-gpu
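A self-contained sketch of the FAISS retrieval step: random vectors stand in for real encoder outputs, and the dimensions and k are illustrative. With L2-normalised vectors, inner-product search is cosine-similarity search.

```python
import faiss
import numpy as np

# Stand-ins for encoder outputs: KB alias embeddings and mention embeddings.
kb_embeddings = np.random.rand(1000, 768).astype("float32")
mention_embeddings = np.random.rand(4, 768).astype("float32")
faiss.normalize_L2(kb_embeddings)       # in-place L2 normalisation
faiss.normalize_L2(mention_embeddings)

index = faiss.IndexFlatIP(kb_embeddings.shape[1])  # exact inner-product index
index.add(kb_embeddings)
scores, ids = index.search(mention_embeddings, 5)  # top-5 KB candidates per mention
print(ids.shape)  # (4, 5)
```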
Preferred: follow the steps in the embedding_encoders_ann.ipynb notebook to use the FAISS ANN index.
Alternative: follow the steps in the embedding_encoders.ipynb notebook to reproduce the results.
The exact dependencies used are listed in reference_requirements.txt.
Python 3.8.0
This repository is based on the scispaCy repository.
@inproceedings{mustafa-etal-2024-leveraging,
title = "Leveraging {W}ikidata for Biomedical Entity Linking in a Low-Resource Setting: A Case Study for {G}erman",
author = "Mustafa, Faizan E and
Dima, Corina and
Ochoa, Juan and
Staab, Steffen",
booktitle = "Proceedings of the 6th Clinical Natural Language Processing Workshop",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.clinicalnlp-1.17",
pages = "202--207",
}