This repository hosts code and resources to encode the Karlsruhe Children's Text corpus and the H2, E2, ERK1 Children's Writing corpus into a computationally digestible format.
Karlsruhe Children's Text: https://catalog.ldc.upenn.edu/LDC2015T22
H2, E2, ERK1 Children's Writing: https://catalog.ldc.upenn.edu/LDC2018T05
The chosen encoding format is PAULA XML, a standoff XML format designed to represent a wide range of linguistically annotated textual and multi-modal corpora. PAULA stores each annotation layer in a separate XML file, and all of these files refer to the same raw data. Layers can therefore be added or revised independently, without touching the raw text or the other layers.
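The standoff idea can be illustrated with a small Python sketch. It is a deliberately simplified illustration, not the exact PAULA schema: the function names, the document id `doc1`, and the plain `start`/`length`/`base` attributes are assumptions made for readability, whereas real PAULA files express these references through XLink/XPointer. The sketch uses the standard library's `xml.etree.ElementTree` so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET


def build_text_layer(doc_id, raw_text):
    """Build the layer that holds the raw text (hypothetical helper)."""
    root = ET.Element("paula", version="1.1")
    ET.SubElement(root, "header", paula_id=f"{doc_id}.text", type="text")
    body = ET.SubElement(root, "body")
    body.text = raw_text
    return root


def build_token_layer(doc_id, raw_text):
    """Build a separate layer whose marks point into the raw text.

    Tokens are found by whitespace splitting; each mark stores a
    1-based start offset and a length instead of copying the text.
    """
    root = ET.Element("paula", version="1.1")
    ET.SubElement(root, "header", paula_id=f"{doc_id}.tok")
    # Simplified: "base" names the file this layer refers to.
    mark_list = ET.SubElement(
        root, "markList", {"type": "tok", "base": f"{doc_id}.text.xml"}
    )
    offset = 0
    for i, word in enumerate(raw_text.split(), start=1):
        start = raw_text.index(word, offset)
        ET.SubElement(
            mark_list,
            "mark",
            {"id": f"tok_{i}", "start": str(start + 1), "length": str(len(word))},
        )
        offset = start + len(word)
    return root


if __name__ == "__main__":
    text = "Der Hund bellt"
    print(ET.tostring(build_text_layer("doc1", text), encoding="unicode"))
    print(ET.tostring(build_token_layer("doc1", text), encoding="unicode"))
```

Because the token layer only stores offsets, a further layer (for example part-of-speech tags) could point at the token ids in the same way, leaving the raw text file unchanged.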
Some of the scripts in this repository rely on external libraries: spaCy, lxml, img2pdf and pyhunspell, a set of Python bindings for the Hunspell spellchecker engine. Run the following commands to install them:
pip install lxml
pip install -U pip setuptools wheel
pip install spacy
python -m spacy download de_dep_news_trf
pip install img2pdf
sudo apt-get install python3-dev
sudo apt-get install libhunspell-dev
pip install hunspell
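After running the commands above, a short check such as the following can confirm that the packages are visible to the Python interpreter. This is only a sketch: the helper name `check_dependencies` is hypothetical, and the import names are assumed to be `lxml`, `spacy`, `img2pdf`, `hunspell` and `de_dep_news_trf` (the spaCy model is installed as a regular package).

```python
import importlib.util

# Import names assumed to correspond to the pip packages listed above.
REQUIRED_MODULES = ["lxml", "spacy", "img2pdf", "hunspell", "de_dep_news_trf"]


def check_dependencies(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(f"{name}: {'ok' if available else 'MISSING'}")
```

Any module reported as missing points to the corresponding install command to rerun.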