An NLP pipeline for German literary texts, implemented in Python and spaCy (v3.5.2). Work in progress.
This pipeline implements several custom pipeline components using the spaCy API. Currently, the components perform:
- Tokenization and Sentence Splitting via SoMaJo (Proisl, Uhrig 2016). Version 2.4.
- POS tagging via SoMeWeTa (Proisl 2018). Version 1.8.1.
- Lemmatization and Morphological Analysis via RNNTagger (Schmid 2019). Version 1.4.1.
- Dependency Parsing via ParZu (Sennrich, Schneider, Volk, Warin 2009; Sennrich, Volk, Schneider 2013; Sennrich, Kunz 2014). Commit a15ae7f.
- Named Entity Recognition via FLERT (Schweter, Akbik 2021). Version 0.12.2.
- Recognition of References to literary Characters (proper nouns and common nouns, i.e. “Appellative”, cf. Krug et al., 2017) via a custom fine-tuned FLERT model; model aehrm/droc-character-recognizer.
- Tagging of German speech, thought and writing representation (STWR) via custom fine-tuned BERT embeddings, inspired by Brunner, Tu, Weimer, Jannidis (2020); models aehrm/redewiedergabe-direct, ....
- Segmentation into Scenes via BERT Embeddings, using a custom fine-tuned re-implementation of a model by Kurfalı and Wirén (2021); model aehrm/stss-scene-segmenter.
- Coreference Resolution via BERT Embeddings (Schröder, Hatzel, Biemann 2021). Commit f34a99e.
- Annotation of Event Types for verbal phrases via BERT Embeddings (Vauth, Hatzel, Gius, Biemann 2021). Version 0.2, Commit 25fdf7e.
See also the section about the Output Format for a description of the tabular output format.
usage: bin/llpro_cli.py [-h] [-v] [--no-normalize-tokens] [--tokenized]
                        [--sentencized] [--paragraph-pattern PAT]
                        [--section-pattern PAT] [--stdout | --writefiles DIR]
                        --infiles FILE [FILE ...]

NLP Pipeline for literary texts written in German.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose
  --no-normalize-tokens
                        Do not normalize tokens.
  --tokenized           Skip tokenization, and assume that tokens are
                        separated by whitespace.
  --sentencized         Skip sentence splitting, and assume that sentences are
                        separated by newline characters.
  --paragraph-pattern PAT
                        Optional paragraph separator pattern. Paragraph
                        separators are removed, and sentences always terminate
                        on paragraph boundaries. Performed before
                        tokenization/sentence splitting.
  --section-pattern PAT
                        Optional sectioning paragraph pattern. Paragraphs
                        fully matching the pattern are removed. Performed
                        before tokenization/sentence splitting.
  --stdout              Write all processed tokens to stdout.
  --writefiles DIR      For each input file, write processed tokens to a
                        separate file in DIR.
  --infiles FILE [FILE ...]
                        Input files, or directories.
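The effect of --paragraph-pattern and --section-pattern is easiest to see on a small example. The following Python sketch only illustrates the behavior described in the help text above; it is not LLpro's actual implementation, and the helper name split_paragraphs as well as the sample patterns are assumptions made for illustration.

```python
import re


def split_paragraphs(text, paragraph_pattern, section_pattern=None):
    """Illustrate the documented semantics of the two pattern options.

    Hypothetical helper, not part of LLpro: separators matching
    paragraph_pattern are removed, and paragraphs that fully match
    section_pattern are dropped.
    """
    # Splitting on the separator removes it from the text.
    paragraphs = [p for p in re.split(paragraph_pattern, text) if p.strip()]
    if section_pattern is not None:
        # Only paragraphs that match the pattern in full are removed.
        paragraphs = [p for p in paragraphs
                      if re.fullmatch(section_pattern, p.strip()) is None]
    return paragraphs


sample = "Erstes Kapitel\n\nEs war einmal ein König.\n\nEr hatte drei Töchter."
paragraphs = split_paragraphs(sample, r"\n\n+", r".* Kapitel")
print(paragraphs)
```

In this sketch, the chapter heading is removed because it matches the section pattern in full, while the two remaining paragraphs would each be passed on to tokenization and sentence splitting separately.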
Note: you can specify the resources directory (containing ParZu etc.) with the environment variable LLPRO_RESOURCES_ROOT, and the temporary workdir with the environment variable LLPRO_TEMPDIR.
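When the pipeline is called from a script, both variables can be set for the child process only. The sketch below assembles a command line and an environment using the flags and variable names documented above. The helper name build_llpro_invocation and the example paths are assumptions, and the actual call is left commented out because it requires a prepared installation.

```python
import os
import subprocess  # only needed for the commented-out call at the end


def build_llpro_invocation(infiles, outdir, resources_root=None, tempdir=None,
                           verbose=False, base_env=None):
    """Return (argv, env) for running bin/llpro_cli.py.

    Hypothetical helper: it only uses flags and environment variables
    named in the usage text above.
    """
    argv = ["python", "bin/llpro_cli.py"]
    if verbose:
        argv.append("-v")
    argv += ["--writefiles", str(outdir), "--infiles", *map(str, infiles)]

    # Copy the environment so the calling process is left unchanged.
    env = dict(os.environ if base_env is None else base_env)
    if resources_root is not None:
        env["LLPRO_RESOURCES_ROOT"] = str(resources_root)
    if tempdir is not None:
        env["LLPRO_TEMPDIR"] = str(tempdir)
    return argv, env


argv, env = build_llpro_invocation(
    ["files/in"], "files/out",
    resources_root="resources",  # example path, adjust to your setup
    tempdir="/tmp/llpro",        # example path, adjust to your setup
    verbose=True,
)
print(" ".join(argv))
# subprocess.run(argv, env=env, check=True)
```

For a local installation managed by poetry, the argument list would presumably need to be prefixed with poetry run so that the prepared virtual environment is used.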
The LLpro pipeline can be run either locally or as a Docker container. Running the pipeline using Docker is strongly recommended.
WINDOWS USERS: For building the Docker image, clone using
git clone https://github.com/aehrm/LLpro --config core.autocrlf=input
to preserve line endings.
With the provided Dockerfile, all dependencies and prerequisites are downloaded automatically.
cd LLpro
docker build --tag cophiwue/llpro .
# or, if you want experimental features enabled
# docker build --build-arg LLPRO_EXPERIMENTAL=1 --tag cophiwue/llpro-experimental .
After building, the Docker image can be run like this:
mkdir -p files/in files/out
chmod a+w files/out # make directory writeable from the Docker container
# copy files into ./files/in to be processed
# to select a specific GPU, replace "--gpus all" with, e.g., --gpus "device=0"
docker run \
--rm \
-e OMP_NUM_THREADS=4 \
--gpus all \
--interactive \
--tty \
-a stdout \
-a stderr \
-v "$(pwd)/files:/files" \
cophiwue/llpro -v --writefiles /files/out --infiles /files/in
# processed files are located in ./files/out
Verify that the following dependencies are installed:
- Python (tested on version 3.7)
- For RNNTagger:
  - CUDA (tested on version 11.4)
- For ParZu:
  - SWI-Prolog >= 5.6
  - SFST >= 1.4
Execute poetry install and ./prepare.sh. The script downloads all remaining prerequisites.
Example usage:
poetry install
./prepare.sh
# NOTICE: use the prepared poetry venv!
poetry run python ./bin/llpro_cli.py -v --writefiles files/out --infiles files/in
# if desired, run tests
poetry run pytest -vv
See the separate Developer Guide about the implemented spaCy components and how to access the assigned attributes.
See also the separate document about the tabular Output Format for a description of the output format and a reference of the used tagsets.
See the folder ./contrib for scripts to reproduce the fine-tuning of the custom models.
If you use the LLpro software for academic research, please consider citing the accompanying publication:
Ehrmanntraut, Anton, Leonard Konle, and Fotis Jannidis. 2023. “LLpro: A Literary Language Processing Pipeline for German Narrative Texts.” In Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023), edited by Munir Georges, Aaricia Herygers, Annemarie Friedrich, and Benjamin Roth, 28–39. Ingolstadt, Germany: Association for Computational Linguistics. https://aclanthology.org/2023.konvens-main.3/
@inproceedings{ehrmanntraut-etal-2023-llpro,
title = "{LL}pro: A Literary Language Processing Pipeline for {G}erman Narrative Texts",
author = "Ehrmanntraut, Anton and
Konle, Leonard and
Jannidis, Fotis",
editor = "Georges, Munir and
Herygers, Aaricia and
Friedrich, Annemarie and
Roth, Benjamin",
booktitle = "Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)",
date = "2023-09-18",
address = "Ingolstadt, Germany",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.konvens-main.3/",
pages = "28--39"
}
In accordance with the license terms of ParZu+Zmorge (GPL v2) and of SoMeWeTa (GPL v3), the LLpro pipeline is licensed under the terms of GPL v3. See LICENSE.
NOTICE: The code of the ParZu parser located in resources/ParZu has been modified to be compatible with LLpro. See git log -p df1e91a.. -- resources/ParZu for a summary of these changes.
NOTICE: Some subsystems and resources used by the LLpro pipeline have additional license terms:
- RNNTagger: see https://www.cis.uni-muenchen.de/~schmid/tools/RNNTagger/Tagger-Licence
- SoMeWeTa model german_web_social_media_2020-05-28.model: derived from the TIGER corpus; see https://www.ims.uni-stuttgart.de/documents/ressourcen/korpora/tiger-corpus/license/htmlicense.html
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
Akbik, Alan, Duncan Blythe, and Roland Vollgraf. 2018. “Contextual String Embeddings for Sequence Labeling.” In COLING 2018, 27th International Conference on Computational Linguistics, 1638–49.
Brunner, Annelen, Ngoc Duyen Tanja Tu, Lukas Weimer, and Fotis Jannidis. 2021. “To BERT or Not to BERT – Comparing Contextual Embeddings in a Deep Learning Architecture for the Automatic Recognition of Four Types of Speech, Thought and Writing Representation.” In Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS), 2624:11. CEUR Workshop Proceedings. Zurich, Switzerland. http://ceur-ws.org/Vol-2624/paper5.pdf.
Krug, Markus, Lukas Weimer, Isabella Reger, Luisa Macharowsky, Stephan Feldhaus, Frank Puppe, and Fotis Jannidis. 2017. “Description of a Corpus of Character References in German Novels - DROC [Deutsches ROman Corpus].” https://resolver.sub.uni-goettingen.de/purl?gro-2/108301.
Kurfalı, Murathan, and Mats Wirén. 2021. “Breaking the Narrative: Scene Segmentation Through Sequential Sentence Classification.” In Proceedings of the Shared Task on Scene Segmentation, edited by Albin Zehe, Leonard Konle, Lea Dümpelmann, Evelyn Gius, Svenja Guhr, Andreas Hotho, Fotis Jannidis, et al., 3001:49–53. CEUR Workshop Proceedings. Düsseldorf, Germany. http://ceur-ws.org/Vol-3001/#paper6.
Proisl, Thomas. 2018. “SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 665–70. Miyazaki, Japan: European Language Resources Association ELRA. http://www.lrec-conf.org/proceedings/lrec2018/pdf/49.pdf.
Proisl, Thomas, and Peter Uhrig. 2016. “SoMaJo: State-of-the-Art Tokenization for German Web and Social Media Texts.” In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, 57–62. Berlin, Germany: Association for Computational Linguistics (ACL). http://aclweb.org/anthology/W16-2607.
Schmid, Helmut. 2019. “Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts.” In DATeCH, Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, 133–37. Brussels, Belgium: Association for Computing Machinery. https://www.cis.uni-muenchen.de/~schmid/papers/Datech2019.pdf.
Schröder, Fynn, Hans Ole Hatzel, and Chris Biemann. 2021. “Neural End-to-End Coreference Resolution for German in Different Domains.” In Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021), 170–81. Düsseldorf, Germany: KONVENS 2021 Organizers. https://aclanthology.org/2021.konvens-1.15.
Schweter, Stefan, and Alan Akbik. 2021. “FLERT: Document-Level Features for Named Entity Recognition.” arXiv:2011.06993 [Cs], May. http://arxiv.org/abs/2011.06993.
Sennrich, Rico, and Beat Kunz. 2014. “Zmorge: A German Morphological Lexicon Extracted from Wiktionary.” In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 1063–7. Reykjavik, Iceland: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2014/pdf/116_Paper.pdf.
Sennrich, Rico, G. Schneider, M. Volk, and M. Warin. 2009. “A New Hybrid Dependency Parser for German.” In Proceedings of the GSCL Conference, edited by C. Chiarcos, Richard Eckart de Castilho, and Manfred Stede. Potsdam, Germany. https://doi.org/10.5167/UZH-25506.
Sennrich, Rico, Martin Volk, and Gerold Schneider. 2013. “Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-Tagging, and Morphological Analysis.” In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, 601–9. Hissar, Bulgaria: INCOMA Ltd. Shoumen, BULGARIA. https://www.aclweb.org/anthology/R13-1079.
Vauth, Michael, Hans Ole Hatzel, Evelyn Gius, and Chris Biemann. 2021. “Automated Event Annotation in Literary Texts.” In Proceedings of the Conference on Computational Humanities Research 2021, edited by Maud Ehrmann, Folgert Karsdorp, Melvin Wevers, Tara Lee Andrews, Manuel Burghardt, Mike Kestemont, Enrique Manjavacas, Michael Piotrowski, and Joris van Zundert, 2989:333–45. CEUR Workshop Proceedings. Amsterdam, the Netherlands. https://ceur-ws.org/Vol-2989/#short_paper18.