PERMEDIQ/German-Bio-Entity-Linking

This repository contains the source code for the paper "Leveraging Wikidata for Biomedical Entity Linking in a Low-Resource Setting: A Case Study for German".

Resources:

Finetuned SapBERT for German: https://huggingface.co/permediq/SapBERT-DE

German Biomedical Entity Linking Knowledge Base (UMLS-Wikidata): https://zenodo.org/records/11003203

Scispacy

Installation:

  • Update pip
python -m pip install --upgrade pip
  • Install scispacy in editable mode using setup.py
pip install -e .

Folder structure:

The data folder has the following structure. Folders labeled (not in repo) can be generated as described in the scispacy.ipynb notebook.

│   de_1k_test_query.txt # mention dataset
│   de_wikimed_bel_dev_query.txt # mention dataset
│   de_wikimed_bel_test_query.txt # mention dataset
│   de_wikimed_bel_train_query.txt # mention dataset
│   qids_with_cui_kb.txt # UMLS_Wikidata KB
│   umls_onto_all_lang_cased_wikimed_only_399931.txt # UMLS_SapBERT KB
│
├───processed (not in repo)
│   └───kbs
│           kb_from_sapbert.jsonl
│           kb_from_wikidata_sparql.jsonl
│
└───raw 
    ├───BEL-silver-standard
    │   │   .DS_Store
    │   │
    │   └───WikiMed-DE-BEL
    │           .DS_Store
    │           dev_data_bel.json
    │           test_data_bel.json
    │           train_data_bel.json
    │           WikiMed-DE-BEL.json
    │
    └───kbs (not in repo)
            qids_with_cui.csv
            qids_with_cui_output.jsonl
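
Before running the notebook, it can help to check which of the files listed above are already present. The following is a minimal sketch, not part of the repository: the helper `missing_files` is hypothetical, the file names are copied from the listing above, and the location of the data folder (`data` in the current working directory) is an assumption.

```python
from pathlib import Path

# File names copied from the folder listing above, relative to the data folder.
IN_REPO = [
    "de_1k_test_query.txt",
    "de_wikimed_bel_dev_query.txt",
    "de_wikimed_bel_test_query.txt",
    "de_wikimed_bel_train_query.txt",
    "qids_with_cui_kb.txt",
    "umls_onto_all_lang_cased_wikimed_only_399931.txt",
    "raw/BEL-silver-standard/WikiMed-DE-BEL/dev_data_bel.json",
    "raw/BEL-silver-standard/WikiMed-DE-BEL/test_data_bel.json",
    "raw/BEL-silver-standard/WikiMed-DE-BEL/train_data_bel.json",
    "raw/BEL-silver-standard/WikiMed-DE-BEL/WikiMed-DE-BEL.json",
]

# Files in folders marked "(not in repo)"; these are generated by the notebook.
GENERATED = [
    "processed/kbs/kb_from_sapbert.jsonl",
    "processed/kbs/kb_from_wikidata_sparql.jsonl",
    "raw/kbs/qids_with_cui.csv",
    "raw/kbs/qids_with_cui_output.jsonl",
]


def missing_files(data_dir, expected):
    """Return the relative paths from `expected` that do not exist under `data_dir`."""
    root = Path(data_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    data_dir = "data"  # assumption: the data folder sits in the working directory
    for label, expected in (("in repo", IN_REPO), ("generated", GENERATED)):
        for rel in missing_files(data_dir, expected):
            print(f"missing ({label}): {rel}")
```

Files reported as missing under "generated" are expected on a fresh clone; files missing under "in repo" point to an incomplete checkout or a wrong data folder path.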

The artifacts folder contains the ANN indexes and vectorizers for the KBs.

├───sapbert
│       concept_aliases.json
│       nmslib_index.bin
│       tfidf_vectorizer.joblib
│       tfidf_vectors_sparse.npz
│
└───sparql
        concept_aliases.json
        nmslib_index.bin
        tfidf_vectorizer.joblib
        tfidf_vectors_sparse.npz
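
The artifact names suggest the usual scispaCy-style candidate generation: aliases are turned into character n-gram TF-IDF vectors and the nearest aliases to a mention are retrieved from an ANN index. The sketch below illustrates that idea in plain Python under stated assumptions: it uses character trigrams, replaces the nmslib ANN index with a brute-force cosine search, and runs on a made-up toy KB whose concept IDs (`CUI-A`, `CUI-B`) are placeholders. It is not the code used in this repository.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase the text, pad it with spaces, and return overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(aliases):
    """Smoothed inverse document frequency for every n-gram seen in the aliases."""
    doc_freq = Counter()
    for alias in aliases:
        doc_freq.update(set(char_ngrams(alias)))
    total = len(aliases)
    return {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def vectorize(text, idf):
    """L2-normalised sparse TF-IDF vector; n-grams unseen at fit time are dropped."""
    counts = Counter(g for g in char_ngrams(text) if g in idf)
    vec = {g: tf * idf[g] for g, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def top_k(mention, alias_to_concept, k=2):
    """Brute-force cosine search over all aliases (stand-in for the ANN index)."""
    aliases = list(alias_to_concept)
    idf = fit_idf(aliases)
    query = vectorize(mention, idf)
    scored = []
    for alias in aliases:
        cand = vectorize(alias, idf)
        score = sum(w * cand.get(g, 0.0) for g, w in query.items())
        scored.append((score, alias_to_concept[alias], alias))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy KB for illustration only; the concept IDs are placeholders, not real CUIs.
TOY_KB = {
    "Diabetes mellitus": "CUI-A",
    "Zuckerkrankheit": "CUI-A",
    "Bluthochdruck": "CUI-B",
    "Hypertonie": "CUI-B",
}

if __name__ == "__main__":
    for score, concept, alias in top_k("diabetes", TOY_KB):
        print(f"{score:.3f}  {concept}  {alias}")
```

A brute-force search is fine for a handful of aliases; an ANN index trades a small loss in recall for much faster lookups when the KB has hundreds of thousands of aliases.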

Reproduce Results

Follow the steps in the scispacy.ipynb notebook to reproduce the results.

Embedding-based models:

Installation:

Install these packages into the same virtual environment.

pip install torch --index-url https://download.pytorch.org/whl/cu118

Depending on your GPU, you may need to select a different CUDA version for torch.

pip install -r requirements_encoders.txt

For the FAISS ANN index:

pip install faiss-gpu
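
After installation, a quick check can confirm that torch was built for CUDA and that FAISS imports. This is a minimal sketch: the function `describe_gpu_stack` is a hypothetical helper, not part of the repository, and `faiss.get_num_gpus` is treated as optional because it may not be exposed by every FAISS build.

```python
import importlib


def describe_gpu_stack():
    """Report which GPU-related packages import and whether they see a GPU."""
    report = {
        "torch": None,             # torch version string, or None if not installed
        "torch_cuda_build": None,  # CUDA version torch was built for, None on CPU builds
        "cuda_available": False,   # whether torch can actually use a GPU
        "faiss": False,            # whether faiss imports
        "faiss_gpus": None,        # number of GPUs faiss reports, if it exposes that
    }

    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError):
        torch = None
    if torch is not None:
        report["torch"] = torch.__version__
        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = bool(torch.cuda.is_available())

    try:
        faiss = importlib.import_module("faiss")
    except (ImportError, OSError):
        faiss = None
    if faiss is not None:
        report["faiss"] = True
        # Looked up defensively: assumed to exist in GPU builds only.
        get_num_gpus = getattr(faiss, "get_num_gpus", None)
        if callable(get_num_gpus):
            report["faiss_gpus"] = int(get_num_gpus())

    return report


if __name__ == "__main__":
    for key, value in describe_gpu_stack().items():
        print(f"{key}: {value}")
```

If `torch_cuda_build` is set but `cuda_available` is False, the installed CUDA build probably does not match the GPU setup, and a different `--index-url` for torch may be needed.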

Reproduce Results

Preferred: follow the steps in embedding_encoders_ann.ipynb, which uses the FAISS ANN index.

Alternative: follow the steps in embedding_encoders.ipynb to reproduce the results.

Reference Dependencies

The exact dependency versions used are listed in reference_requirements.txt.

Python 3.8.0

References:

This repository is based on the ScispaCy repository.

BibTeX

@inproceedings{mustafa-etal-2024-leveraging,
    title = "Leveraging {W}ikidata for Biomedical Entity Linking in a Low-Resource Setting: A Case Study for {G}erman",
    author = "Mustafa, Faizan E  and
      Dima, Corina  and
      Ochoa, Juan  and
      Staab, Steffen",
    booktitle = "Proceedings of the 6th Clinical Natural Language Processing Workshop",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.clinicalnlp-1.17",
    pages = "202--207",
}
