This repository contains the source code for the paper: Leveraging Wikidata for Biomedical Entity Linking in a Low-Resource Setting: A Case Study for German
Finetuned SapBERT for German: https://huggingface.co/permediq/SapBERT-DE
German Biomedical Entity Linking Knowledge Base (UMLS-Wikidata): https://zenodo.org/records/11003203
- Update pip:
python -m pip install --upgrade pip
- Install scispacy from the repository's setup.py:
pip install -e .
The data folder has the following structure. Folders labeled (not in repo) can be generated as described in the scispacy.ipynb notebook. A short sketch for inspecting these files follows the tree.
│ de_1k_test_query.txt # mention dataset
│ de_wikimed_bel_dev_query.txt # mention dataset
│ de_wikimed_bel_test_query.txt # mention dataset
│ de_wikimed_bel_train_query.txt # mention dataset
│ qids_with_cui_kb.txt # UMLS_Wikidata KB
│ umls_onto_all_lang_cased_wikimed_only_399931.txt # UMLS_SapBERT KB
│
├───processed (not in repo)
│ └───kbs
│ kb_from_sapbert.jsonl
│ kb_from_wikidata_sparql.jsonl
│
└───raw
├───BEL-silver-standard
│ └───WikiMed-DE-BEL
│ dev_data_bel.json
│ test_data_bel.json
│ train_data_bel.json
│ WikiMed-DE-BEL.json
│
└───kbs (not in repo)
qids_with_cui.csv
qids_with_cui_output.jsonl
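To peek at these files, here is a minimal sketch. It assumes only the JSON Lines layout implied by the .jsonl extension (one JSON record per line); the per-record schema is repo-specific and not asserted here.

```python
import json

# Inspect a processed KB file (JSON Lines: one entity record per line).
# The path follows the tree above; the record fields are repo-specific.
with open("data/processed/kbs/kb_from_wikidata_sparql.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(len(records), "entries; example keys:", list(records[0].keys()))
```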
The artifacts folder contains the ANN indexes and vectorizers for the KBs.
├───sapbert
│ concept_aliases.json
│ nmslib_index.bin
│ tfidf_vectorizer.joblib
│ tfidf_vectors_sparse.npz
│
└───sparql
concept_aliases.json
nmslib_index.bin
tfidf_vectorizer.joblib
tfidf_vectors_sparse.npz
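These filenames match scispacy's candidate-generator layout (a character-ngram TF-IDF vectorizer plus an nmslib HNSW index over sparse alias vectors), so candidate retrieval can be sketched as below. This is an illustrative sketch under that assumption, not the exact code path used in the notebooks; the query mention is made up.

```python
import json
import joblib
import nmslib
from scipy.sparse import load_npz

# Load the artifacts for one KB (paths follow the tree above).
aliases = json.load(open("artifacts/sparql/concept_aliases.json"))
vectorizer = joblib.load("artifacts/sparql/tfidf_vectorizer.joblib")
alias_vectors = load_npz("artifacts/sparql/tfidf_vectors_sparse.npz")

# For sparse vectors, nmslib needs the data re-attached before the
# serialized HNSW graph is loaded.
index = nmslib.init(method="hnsw", space="cosinesimil_sparse",
                    data_type=nmslib.DataType.SPARSE_VECTOR)
index.addDataPointBatch(alias_vectors)
index.loadIndex("artifacts/sparql/nmslib_index.bin")

# Retrieve the k nearest alias candidates for a German mention.
query = vectorizer.transform(["Zuckerkrankheit"])
ids, distances = index.knnQueryBatch(query, k=5)[0]
print([aliases[i] for i in ids])
```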
Follow the steps in the scispacy.ipynb notebook to reproduce the results.
- SapBERT-DE: a fine-tuned version of the multilingual BEL model cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR. The model is fine-tuned on the German BEL KB UMLS-Wikidata, as described in the SapBERT repo. A usage sketch follows this list.
- jinaai/jina-embeddings-v2-base-de: a German/English bilingual text embedding model.
- BAAI/bge-m3: a multilingual model that can generate dense, sparse, and ColBERT-style embeddings. We only use the dense embeddings.
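As a quick start, mentions can be embedded with SapBERT-DE using the standard SapBERT pattern (the [CLS] token embedding from a Hugging Face transformers model); the mentions below are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("permediq/SapBERT-DE")
model = AutoModel.from_pretrained("permediq/SapBERT-DE")

mentions = ["Kopfschmerz", "Diabetes mellitus Typ 2"]  # illustrative mentions
toks = tokenizer(mentions, padding=True, truncation=True,
                 max_length=25, return_tensors="pt")
with torch.no_grad():
    # SapBERT conventionally uses the [CLS] token as the mention embedding.
    embeddings = model(**toks)[0][:, 0, :]
print(embeddings.shape)  # (2, hidden_size)
```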
Install the following in the same virtual environment.
pip install torch --index-url https://download.pytorch.org/whl/cu118
You may need to select the CUDA version of torch according to your GPU.
pip install -r requirements_encoders.txt
For the Faiss ANN index (a retrieval sketch follows):
pip install faiss-gpu
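A self-contained sketch of the FAISS retrieval step: random vectors stand in for real encoder outputs, and the dimensions and k are illustrative. With L2-normalised vectors, inner-product search is cosine-similarity search.

```python
import faiss
import numpy as np

# Stand-ins for encoder outputs: KB alias embeddings and mention embeddings.
kb_embeddings = np.random.rand(1000, 768).astype("float32")
mention_embeddings = np.random.rand(4, 768).astype("float32")
faiss.normalize_L2(kb_embeddings)       # in-place L2 normalisation
faiss.normalize_L2(mention_embeddings)

index = faiss.IndexFlatIP(kb_embeddings.shape[1])  # exact inner-product index
index.add(kb_embeddings)
scores, ids = index.search(mention_embeddings, 5)  # top-5 KB candidates per mention
print(ids.shape)  # (4, 5)
```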
Preferred: follow the steps in the embedding_encoders_ann.ipynb notebook to use the FAISS ANN index.
Alternative: follow the steps in the embedding_encoders.ipynb notebook to reproduce the results.
The exact dependencies used are listed in reference_requirements.txt.
Python 3.8.0
This repository is based on the scispaCy repository.
@inproceedings{mustafa-etal-2024-leveraging,
title = "Leveraging {W}ikidata for Biomedical Entity Linking in a Low-Resource Setting: A Case Study for {G}erman",
author = "Mustafa, Faizan E and
Dima, Corina and
Ochoa, Juan and
Staab, Steffen",
booktitle = "Proceedings of the 6th Clinical Natural Language Processing Workshop",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.clinicalnlp-1.17",
pages = "202--207",
}