DrNote is an open tagging tool for text annotation and entity linking based on OpenTapioca and Wikidata/Wikipedia. It provides an entity linking service with pre-trained data for medical annotations in multilingual settings. Both raw text and PDF files can be processed; PDF files are handled by a Tesseract backend.

DrNote

Accepted at PLOS DH:
https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000086 (or see citation)

DrNote is a simple yet effective annotation tool for various purposes.

The annotation method is based on the OpenTapioca codebase (GitHub) and provides named entity linking on unstructured text data.

The project leverages data from Wikidata and Wikipedia and does not require any commercial components.

The annotation service provides a web-based UI as well as API-based access.

The processing of PDF files is supported. Linked entities can be injected as hyperlinks into the uploaded PDF file.

Multiple languages (de, en, es, etc.) are supported.

Update on Results:
A bug was found in the evaluation pipeline that degraded the reported scores. See the updated scores in the Errata section.

Demo:
Our demo instance is available at:
https://drnote.misit-augsburg.de
Note: Upload of large PDF files is not supported. Uploaded data is discarded after processing.

Graphical Demo:
[Image: Annotation Demo]

CLI Demo:

# Enter text
text="Die Diagnosen sind Hypothyreose bei Autoimmunthyreoiditis, Diabetes mellitus mit diabetische Nephropathie und akutes Nierenversagen."
# Annotate
curl -k https://drnote.misit-augsburg.de/annotate \
  -F "inputType=plaintext" \
  -F "outputType=html" \
  -F \
"filterOptions={
  \"pipeline\": \"de_core_news_sm\",
  \"rules\": [
    \"any pos[NOUN,PROPN] require\",
    \"all non_stopwords require\"
  ]
}" \
  -F \
"plaintext=$text"

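The request above escapes the JSON quotes by hand. The following is a minimal sketch of a wrapper that builds the filterOptions document from arguments instead. The function names `build_filter_options` and `annotate_text` and the `DRNOTE_URL` variable are hypothetical and not part of DrNote; the form fields are the ones used in the demo request. It assumes the rule strings contain no double quotes or backslashes. `--form-string` is used for the text so that characters such as `;` or a leading `@` are sent literally, and `-k` (kept from the demo) skips TLS certificate verification.

```shell
# Hypothetical helper: print a filterOptions JSON document.
# $1 is the pipeline name; all remaining arguments are rule strings.
# Assumption: arguments contain no double quotes or backslashes.
build_filter_options() {
  local pipeline="$1"
  shift
  local rules="" rule
  for rule in "$@"; do
    # Append each rule as a quoted JSON string, comma-separated.
    rules="${rules:+$rules, }\"$rule\""
  done
  printf '{"pipeline": "%s", "rules": [%s]}\n' "$pipeline" "$rules"
}

# Hypothetical wrapper around the demo request.
# $1 is the text to annotate, $2 is the filterOptions JSON.
annotate_text() {
  local text="$1" options="$2"
  curl -k "${DRNOTE_URL:-https://drnote.misit-augsburg.de}/annotate" \
    -F "inputType=plaintext" \
    -F "outputType=html" \
    --form-string "filterOptions=$options" \
    --form-string "plaintext=$text"
}

# Usage (requires network access):
# options=$(build_filter_options de_core_news_sm \
#   "any pos[NOUN,PROPN] require" "all non_stopwords require")
# annotate_text "Diabetes mellitus mit diabetischer Nephropathie." "$options" > annotated.html
```

Setting `DRNOTE_URL` to the address of a self-hosted instance should direct the same request there, assuming the instance exposes the same /annotate path as the demo.
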
Errata

Detected issues:

  • For the GSC EMEA/Medline datasets, the labels were not correctly filtered for the CHEM label class in all instances.
  • Due to an overly strict regular expression, detected Chemical entries for PubTator were only considered if a MeSH code was given.
  • For the GSC EMEA/Medline datasets, the UMLS tags in the cTAKES outputs were wrongly used instead of the MedicationMentions tags.
  • The character spans reported by cTAKES may be broken due to unsupported umlaut characters. These spans are now fixed using a workaround.

The evaluation was re-run with the revised evaluation pipeline. However, because Wikidata changes continuously, the results may vary. For instance, due to substantial changes in the Wikidata graph structure, the SPARQL query for finding medication entities was changed from the previous query

(old SPARQL query)
SELECT DISTINCT ?entity WHERE
{
    {?entity wdt:P279+ wd:Q12140 .}
    UNION
    {?entity wdt:P31+ wd:Q12140 .}
}
to a query that additionally matches any entity with an ATC code
(new SPARQL query)
SELECT DISTINCT ?entity WHERE
{
    {?entity wdt:P279+ wd:Q12140 .}
    UNION
    {?entity wdt:P31+ wd:Q12140 .}
    UNION
    {?entity wdt:P267 ?atccode .}
}
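
To inspect which entities the revised query selects, it can be sent to the public Wikidata Query Service. The following is only a sketch: the function names and the User-Agent string are hypothetical, and it assumes curl and network access. The `wd:` and `wdt:` prefixes are predefined by that service, so the query is sent unchanged. Because the query follows transitive paths, it may run into the timeout of the public endpoint.

```shell
# Print the revised (ATC code-aware) query from the errata.
medication_query() {
  cat <<'EOF'
SELECT DISTINCT ?entity WHERE
{
    {?entity wdt:P279+ wd:Q12140 .}
    UNION
    {?entity wdt:P31+ wd:Q12140 .}
    UNION
    {?entity wdt:P267 ?atccode .}
}
EOF
}

# Send the query to the public SPARQL endpoint and print the JSON result.
# The User-Agent value is a placeholder; replace it with your own contact info.
run_medication_query() {
  curl -sS -G https://query.wikidata.org/sparql \
    -H "Accept: application/sparql-results+json" \
    -H "User-Agent: drnote-readme-example/0.1" \
    --data-urlencode "query=$(medication_query)"
}

# Usage (requires network access):
# run_medication_query > medication_entities.json
```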

For comparison, the (cached) original outputs from PubTator and cTAKES, and the original pre-trained DrNote model and index store were used. The cached set of UMLS entities was also used. The updated results (as of 31.07.2024) are as follows.

| Dataset     | Method            | Precision | Recall | F1 score |
|-------------|-------------------|-----------|--------|----------|
| GERNERMED   | cTAKES            | 0.858     | 0.512  | 0.641    |
| GERNERMED   | PubTator          | 0.760     | 0.481  | 0.590    |
| GERNERMED   | DrNote            | 0.935     | 0.624  | 0.749    |
| Medline GSC | cTAKES            | 0.806     | 0.307  | 0.444    |
| Medline GSC | PubTator          | 0.449     | 0.420  | 0.434    |
| Medline GSC | DrNote            | 0.693     | 0.139  | 0.232    |
| EMEA GSC    | cTAKES            | 0.834     | 0.357  | 0.500    |
| EMEA GSC    | PubTator          | 0.522     | 0.211  | 0.301    |
| EMEA GSC    | DrNote            | 0.833     | 0.172  | 0.285    |
| Medline GSC | DrNote (filtered) | 0.634     | 0.444  | 0.522    |
| EMEA GSC    | DrNote (filtered) | 0.604     | 0.636  | 0.620    |
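
As a quick plausibility check of the table, the F1 score can be recomputed from precision and recall. This sketch assumes the standard definition of F1 as the harmonic mean of the two values; the helper name `f1_score` is hypothetical and it requires `awk`.

```shell
# F1 as the harmonic mean of precision ($1) and recall ($2),
# rounded to three decimals. Assumes precision + recall > 0.
f1_score() {
  awk -v p="$1" -v r="$2" 'BEGIN { printf "%.3f\n", 2 * p * r / (p + r) }'
}

f1_score 0.858 0.512   # GERNERMED, cTAKES: prints 0.641
```

Because the tabulated precision and recall values are themselves rounded, a recomputed score can differ from the table in the last digit.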

How to Use

Spawn DrNote using Pre-trained Data

Steps to spawn the service using pre-trained data:

# Assumed: Docker, Docker-compose installed and user added to Docker group
# follow guide from https://docs.docker.com/engine/install/ubuntu/
# sudo apt-get install -y docker docker-compose
# sudo usermod -aG docker $USER

# Clone repository
git clone https://github.com/frankkramer-lab/DrNote
cd DrNote/

# Retrieve pre-trained data
wget -O build/pretrained_data.tar.gz https://myweb.rz.uni-augsburg.de/~freijoha/DrNote/pretrained_data.tar.gz

# Spawn annotation service
./04_start_annotation_service.sh

The annotation service should be available at:
https://<DOCKER_HOST>/

Build From Scratch and Spawn DrNote

Steps to automatically build the OpenTapioca data setup pipeline and spawn the annotation service.

Prestep: Set up the configuration:

  • Modify the file ./cfg/opentapioca_profile.json.
  • Modify the file ./cfg/load_config.json.
    Note: The language code should match the entry in ./cfg/opentapioca_profile.json.

Steps:

  1. Check dependencies:

    • Run ./01_checkDependencies.sh
  2. Generate the NIF file:

    • Run ./02_loadNIFFile.sh
  3. Generate the OpenTapioca data:

    • Run ./03_processForOpenTapioca.sh
  4. Spawn the MISIT annotation service:

    • Run ./04_start_annotation_service.sh

The annotation service should be available at:
https://<DOCKER_HOST>/
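
The four steps above can also be chained so that a failing step stops the run before the next one starts. This is a minimal sketch; `run_steps` is a hypothetical helper and not part of the repository, and the script names are the ones listed above.

```shell
# Hypothetical helper: run each given command in order
# and stop at the first one that fails.
run_steps() {
  local step
  for step in "$@"; do
    echo "==> $step"
    "$step" || { echo "step failed: $step" >&2; return 1; }
  done
}

# Usage (run from the repository root, after editing the files in ./cfg):
# run_steps ./01_checkDependencies.sh ./02_loadNIFFile.sh \
#           ./03_processForOpenTapioca.sh ./04_start_annotation_service.sh
```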

Citation

The paper is available at: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000086
If you use our work or want to reference it, use the following BibTeX entry:

@article{10.1371/journal.pdig.0000086,
    doi = {10.1371/journal.pdig.0000086},
    author = {Frei, Johann and Soto-Rey, Iñaki and Kramer, Frank},
    journal = {PLOS Digital Health},
    publisher = {Public Library of Science},
    title = {DrNote: An open medical annotation service},
    year = {2022},
    month = {08},
    volume = {1},
    url = {https://doi.org/10.1371/journal.pdig.0000086},
    pages = {1-18},
    abstract = {In the context of clinical trials and medical research medical text mining can provide broader insights for various research scenarios by tapping additional text data sources and extracting relevant information that is often exclusively present in unstructured fashion. Although various works for data like electronic health reports are available for English texts, only limited work on tools for non-English text resources has been published that offers immediate practicality in terms of flexibility and initial setup. We introduce DrNote, an open source text annotation service for medical text processing. Our work provides an entire annotation pipeline with its focus on a fast yet effective and easy to use software implementation. Further, the software allows its users to define a custom annotation scope by filtering only for relevant entities that should be included in its knowledge base. The approach is based on OpenTapioca and combines the publicly available datasets from WikiData and Wikipedia, and thus, performs entity linking tasks. In contrast to other related work our service can easily be built upon any language-specific Wikipedia dataset in order to be trained on a specific target language. We provide a public demo instance of our DrNote annotation service at https://drnote.misit-augsburg.de/.},
    number = {8},
}

Referenced Repositories

Not required for smaller queries:
