Encoding Language Learner Corpora

This repository hosts code and resources for encoding the Karlsruhe Children's Text corpus and the H2, E2, ERK1 Children's Writing corpus in a machine-readable format.

Corpora

Karlsruhe Children's Text: https://catalog.ldc.upenn.edu/LDC2015T22
H2, E2, ERK1 Children's Writing: https://catalog.ldc.upenn.edu/LDC2018T05

Encoding format

The chosen encoding format is PAULA XML, a standoff XML format designed to represent a wide range of linguistically annotated textual and multimodal corpora. PAULA stores each annotation layer in a separate XML file, and all layers refer to the same raw data. Because the layers are independent files, annotations can be added or replaced without touching the raw text or the other layers.
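The standoff idea can be illustrated with a minimal sketch: one layer marks tokens as character ranges in the raw text, and a second layer attaches labels to those tokens by ID. The element and attribute names (`markList`, `mark`, `featList`, `feat`), the file names `doc.text.xml` and `doc.tok.xml`, and the helper functions are assumptions loosely modeled on PAULA for illustration; the output is not validated against the PAULA DTDs and is not necessarily what the scripts in this repository produce. The sketch uses the standard library's `xml.etree.ElementTree` so it runs without extra installs; `lxml.etree` offers a largely compatible API.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes in ElementTree's "{uri}name" notation.
XLINK_NS = "{http://www.w3.org/1999/xlink}"
XML_NS = "{http://www.w3.org/XML/1998/namespace}"
ET.register_namespace("xlink", "http://www.w3.org/1999/xlink")


def token_spans(text):
    """Return (start, length) pairs for whitespace-separated tokens.

    Start positions are 1-based character offsets into the raw text.
    """
    spans = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        spans.append((start + 1, len(tok)))
        pos = start + len(tok)
    return spans


def build_token_layer(text, base="doc.text.xml"):
    """Build a token layer that points into the raw text file (hypothetical name)."""
    root = ET.Element("paula", {"version": "1.1"})
    marks = ET.SubElement(root, "markList", {"type": "tok", XML_NS + "base": base})
    for i, (start, length) in enumerate(token_spans(text), start=1):
        ET.SubElement(marks, "mark", {
            "id": f"tok_{i}",
            XLINK_NS + "href": f"#xpointer(string-range(//body,'',{start},{length}))",
        })
    return root


def build_annotation_layer(labels, base="doc.tok.xml", layer="pos"):
    """Build an annotation layer that refers to tokens by ID, not to the text."""
    root = ET.Element("paula", {"version": "1.1"})
    feats = ET.SubElement(root, "featList", {"type": layer, XML_NS + "base": base})
    for i, label in enumerate(labels, start=1):
        ET.SubElement(feats, "feat", {XLINK_NS + "href": f"#tok_{i}", "value": label})
    return root


if __name__ == "__main__":
    raw = "Der Hund belt"
    print(ET.tostring(build_token_layer(raw), encoding="unicode"))
    print(ET.tostring(build_annotation_layer(["DET", "NOUN", "VERB"]), encoding="unicode"))
```

Each layer is a separate tree that would be written to its own file, so a new layer (for example, spelling corrections) can be added later without rewriting the token layer or the raw text.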

Requirements

Some of the scripts in this repository rely on external libraries: spaCy, lxml, img2pdf, and pyhunspell (Python bindings for the Hunspell spellchecker engine). Run the following commands to install them:

lxml

pip install lxml

spaCy

pip install -U pip setuptools wheel
pip install spacy
python -m spacy download de_dep_news_trf

img2pdf

pip install img2pdf

pyhunspell

sudo apt-get install python3-dev
sudo apt-get install libhunspell-dev
pip install hunspell
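After installing, a quick check along the following lines can confirm that the libraries are importable. This is a minimal sketch, not part of the repository: the `REQUIRED` mapping and the `missing_modules` helper are hypothetical names, and the mapping assumes the usual import names (`spacy`, `lxml`, `img2pdf`, and `hunspell` for pyhunspell). It does not check that the `de_dep_news_trf` model has been downloaded.

```python
from importlib.util import find_spec

# Display name -> import name (assumed import names, see note above).
REQUIRED = {
    "lxml": "lxml",
    "spaCy": "spacy",
    "img2pdf": "img2pdf",
    "pyhunspell": "hunspell",
}


def missing_modules(required=REQUIRED):
    """Return the display names of libraries that cannot be found."""
    return [name for name, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```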

Repository: piaschwarz/LingCorpusAnnotation_encodingKCT