piaschwarz/LingCorpusAnnotation_encodingKCT

Encoding Language Learner Corpora

This repository hosts code and resources to encode the Karlsruhe Children's Text corpus and the H2, E2, ERK1 Children's Writing corpus into a computationally digestible format.

Corpora

Karlsruhe Children's Text: https://catalog.ldc.upenn.edu/LDC2015T22
H2, E2, ERK1 Children's Writing: https://catalog.ldc.upenn.edu/LDC2018T05

Encoding format

The chosen encoding format is PAULA XML, a standoff XML format designed to represent a wide range of linguistically annotated textual and multimodal corpora. PAULA stores each annotation layer in a separate XML file, and all layers refer to the same raw data. As a result, a layer can be added or replaced without modifying the others, which keeps the corpus easy to upgrade and extend.
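The standoff idea can be illustrated with a minimal sketch. It is not the PAULA schema: the element and attribute names (`text`, `tokens`, `tok`, `base`, `start`, `end`), the file name `sample.text.xml`, and the helper functions are all hypothetical, chosen only to show one layer pointing into the raw text by character offsets. The sketch uses the standard library's `xml.etree.ElementTree` so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET


def build_text_layer(text):
    """Raw-text layer: holds the primary data and nothing else."""
    root = ET.Element("text")  # hypothetical element name
    root.text = text
    return root


def build_token_layer(text, base):
    """Token layer: stores only character offsets into the raw text."""
    root = ET.Element("tokens", {"base": base})  # 'base' names the raw-text file
    pos = 0
    for i, word in enumerate(text.split(), start=1):
        start = text.index(word, pos)  # search from the end of the previous token
        end = start + len(word)
        ET.SubElement(
            root, "tok", {"id": f"tok_{i}", "start": str(start), "end": str(end)}
        )
        pos = end
    return root


def resolve(text_layer, token_layer):
    """Recover the token strings by following the offsets back into the text."""
    text = text_layer.text
    return [
        text[int(tok.get("start")):int(tok.get("end"))]
        for tok in token_layer.iter("tok")
    ]


text_layer = build_text_layer("Der Hund belt laut")
token_layer = build_token_layer(text_layer.text, "sample.text.xml")
tokens = resolve(text_layer, token_layer)
print(ET.tostring(token_layer, encoding="unicode"))
print(tokens)
```

Because the token layer contains no text of its own, a further layer (for example, one with spelling corrections) could reference the same offsets or token ids without changing either of the two structures above.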

Requirements

Some of the scripts in this repository rely on external libraries: spaCy, lxml, img2pdf and pyhunspell (a set of Python bindings for the Hunspell spellchecker engine). Run the following commands to install them:

lxml

pip install lxml

spaCy

pip install -U pip setuptools wheel
pip install spacy
python -m spacy download de_dep_news_trf

img2pdf

pip install img2pdf

pyhunspell

sudo apt-get install python3-dev
sudo apt-get install libhunspell-dev
pip install hunspell
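After installation, a quick check can confirm that the four libraries are importable. The following is a minimal sketch, not part of the repository: the `REQUIRED` mapping and the `missing_modules` helper are hypothetical names, and the check only tests whether each top-level module can be found, not whether the spaCy model `de_dep_news_trf` has been downloaded.

```python
from importlib.util import find_spec

# Display name -> importable top-level module name.
# pyhunspell is installed as 'hunspell' and imported under that name.
REQUIRED = {
    "lxml": "lxml",
    "spaCy": "spacy",
    "img2pdf": "img2pdf",
    "pyhunspell": "hunspell",
}


def missing_modules(modules=REQUIRED):
    """Return the display names of libraries whose module cannot be found."""
    return [name for name, module in modules.items() if find_spec(module) is None]


missing = missing_modules()
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```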
