GGPONC 2.0 — The German Clinical Guideline Corpus for Oncology

This repository contains the code to reproduce results from the paper:

GGPONC 2.0 — The German Clinical Guideline Corpus for Oncology: Curation Workflow, Annotation Policy, Baseline NER Taggers (To appear at LREC '22)

Preparation

Get access to GGPONC following the instructions on the project homepage and place the contents of the 2.0 release (v2.0_2022_03_24) in the data folder.
Install the Python dependencies: pip install -r requirements.txt
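
Before opening the notebooks, it can help to confirm that the release was unpacked where the code expects it. The following is a minimal sketch and not part of the repository: the helper name missing_paths is hypothetical, and the two checked entries are only the paths this README mentions (data/plain_text and data/xml/cpg-corpus-cms.xml). The actual release may contain more files.

```python
from pathlib import Path

# Paths named in this README; the real release may contain more files.
EXPECTED = ("plain_text", "xml/cpg-corpus-cms.xml")


def missing_paths(data_dir, expected=EXPECTED):
    """Return the expected entries that are absent below data_dir."""
    root = Path(data_dir)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths("data")
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("data/ contains the paths mentioned in the README.")
```

Run it from the repository root so that the relative path data resolves correctly.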

Preprocessing (optional)

The preprocessed (i.e., sentence-split and tokenized) text portions are already included in the release (data/plain_text).

If you want to do the preprocessing yourself, build the jar file from https://github.com/hpi-dhc/ggponc_preprocessing and use it with data/xml/cpg-corpus-cms.xml

Notebooks

In notebooks, we provide the following Jupyter Notebooks to reproduce the results from the paper:

01_Statistics.ipynb
- Corpus Statistics and IAA Calculation with INCEpTALYTICS
02_NER_Baselines.ipynb
- NER Baselines using BERT / HuggingFace Transformers
03_NER_Analysis.ipynb
- Analysis of NER errors and coordination ellipses
new: 04_spaCy_Spancat.ipynb
- Alternative NER implementation using spaCy's SpanCategorizer feature, which can handle overlapping and nested mentions
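
To illustrate why overlapping and nested mentions call for a span-based model, the sketch below finds pairs of character spans that share text. A tag sequence with a single label per token cannot encode such pairs. This is a toy example, not code from the notebooks: the function name, the example phrase, and the labels are hypothetical and do not reflect the GGPONC annotation scheme.

```python
from itertools import combinations


def overlapping_pairs(spans):
    """Return all pairs of (start, end, label) spans sharing at least one character.

    Offsets are half-open: a span covers text[start:end].
    """
    pairs = []
    for a, b in combinations(sorted(spans), 2):
        # Two half-open intervals overlap iff each starts before the other ends.
        if a[0] < b[1] and b[0] < a[1]:
            pairs.append((a, b))
    return pairs


text = "Resektion des Tumors"
spans = [
    (0, 20, "PROCEDURE"),  # hypothetical label for the whole phrase
    (14, 20, "FINDING"),   # hypothetical label nested inside it
]

for a, b in overlapping_pairs(spans):
    print(repr(text[a[0]:a[1]]), "overlaps", repr(text[b[0]:b[1]]))
```

Spans that merely touch, such as (0, 5) and (5, 9), are not reported, because half-open intervals that share only a boundary have no character in common.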

Running NER Experiments with HuggingFace and Hydra

In experiments, we provide Hydra configurations for the different NER experiments with the best hyperparameters found through grid search. To run such an experiment, do:

cd experiments
python run_experiment.py -cn <experiment>.yaml cuda=<cuda devices>
- for instance: python run_experiment.py -cn 01_ggponc_coarse_short.yaml cuda=0

If you have installed and configured Weights & Biases, it will automatically sync your runs.

To run a hyperparameter sweep, pass the option -m to Hydra, e.g.:

python run_experiment.py -m -cn=01_ggponc_coarse_short.yaml cuda=0 learning_rate=1e-6,5e-6,1e-5,5e-5,1e-4 label_smoothing_factor=0.0,0.05,0.1,0.2 weight_decay=0.0,0.05,0.1
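
With Hydra's default sweeper, a multirun like the command above expands to the Cartesian product of all comma-separated values, so it is worth counting the runs before launching. The sketch below reproduces that expansion in plain Python for the values in the command. It does not call Hydra, and the function name expand_grid is hypothetical.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict of overrides per combination of the listed values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


# Values copied from the sweep command above.
grid = {
    "learning_rate": [1e-6, 5e-6, 1e-5, 5e-5, 1e-4],
    "label_smoothing_factor": [0.0, 0.05, 0.1, 0.2],
    "weight_decay": [0.0, 0.05, 0.1],
}

runs = expand_grid(grid)
print(len(runs))  # 5 * 4 * 3 = 60 combinations
print(runs[0])    # the first combination of overrides
```

Each of the 60 combinations corresponds to one full training run, so shrinking any one list reduces the total cost proportionally.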

Annotation Guide

Please refer to the annotation guide for a detailed description of the entity classes and rules.

Citing GGPONC

According to the terms of use of GGPONC, contributions which are based on the corpus must cite the following publication:

Florian Borchert, Christina Lohr, Luise Modersohn, Thomas Langer, Markus Follmann, Jan Philipp Sachs, Udo Hahn, and Matthieu-P. Schapranow. GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines. In Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, 38–48. Online: Association for Computational Linguistics, 2020.

BibTeX:

@inproceedings{borchert-etal-2020-ggponc,
    title = "{GGPONC}: A Corpus of {G}erman Medical Text with Rich Metadata Based on Clinical Practice Guidelines",
    author = "Borchert, Florian  and
      Lohr, Christina  and
      Modersohn, Luise  and
      Langer, Thomas  and
      Follmann, Markus  and
      Sachs, Jan Philipp  and
      Hahn, Udo  and
      Schapranow, Matthieu-P.",
    booktitle = "Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.louhi-1.5",
    doi = "10.18653/v1/2020.louhi-1.5",
    pages = "38--48",
  }

