This repository contains the source code and documentation of selnolig-check. selnolig-check tests the German ligature suppression patterns of the LuaLaTeX package selnolig for morphological correctness and relative completeness, based on an extensive corpus.
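To illustrate what morphological correctness means here: a ligature such as fl should be suppressed when its letters belong to different morphemes, as in Auf|lage, but kept when it falls inside one morpheme, as in Pflanze. The following is a minimal sketch of that criterion only. The function name, the ligature list, and the `|`-segmented input format are assumptions made for illustration and are not taken from the programs in this repository.

```python
# Illustrative subset of f-ligatures; an assumption, not selnolig's full inventory.
LIGATURES = ("ffi", "ffl", "ff", "fi", "fl")


def ligatures_across_boundaries(segmented, boundary="|"):
    """Return (ligature, start) pairs whose letters span a morpheme boundary.

    `segmented` is a word with morpheme boundaries marked, e.g. "Auf|lage".
    This input format is hypothetical and chosen only for the example.
    """
    word = segmented.replace(boundary, "")

    # Offsets in the unsegmented word at which a new morpheme begins.
    cuts = set()
    offset = 0
    for morpheme in segmented.split(boundary)[:-1]:
        offset += len(morpheme)
        cuts.add(offset)

    hits = []
    for lig in LIGATURES:
        start = word.find(lig)
        while start != -1:
            # A cut strictly inside the ligature separates its letters.
            if any(start < cut < start + len(lig) for cut in cuts):
                hits.append((lig, start))
            start = word.find(lig, start + 1)
    return hits
```

Under these assumptions, a word for which the function returns a non-empty list is one where a suppression pattern would be expected to apply.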
We conducted the majority of this project as our final project for the class Introduction to Computational Linguistics at the University of Massachusetts at Amherst in the fall of 2012.
Running the programs requires two external resources that are not included in this repository:
- the SDeWaC corpus, licenses to be obtained from the Web-as-Corpus kool ynitiative: the untagged version (sdewac-v3.corpus), renamed corpus.raw and placed in the directory src/testing_dictionary/.
  M. Baroni, S. Bernardini, A. Ferraresi and E. Zanchetta. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation 43 (3): pp. 209–226.
- the morphological analyzer SMOR, licenses to be obtained from the Institut für Maschinelle Sprachverarbeitung at Universität Stuttgart, placed in a directory named 98-SMOR_binaries/ within the directory src/selnolig-check/.

  Helmut Schmid, Arne Fitschen and Ulrich Heid. 2004. SMOR: A German Computational Morphology Covering Derivation, Composition, and Inflection. Proceedings of the IVth International Conference on Language Resources and Evaluation (LREC 2004), pp. 1263–1266, Lisbon, Portugal.
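Before running anything, it can help to confirm that both resources are where the programs expect them. The following is a minimal sketch, not part of the repository: the function name `missing_resources` is hypothetical, and the paths are simply the ones given in the list above, taken relative to the repository root.

```python
from pathlib import Path

# Locations as spelled in the requirements list above, relative to the repository root.
REQUIRED = {
    "SDeWaC corpus": Path("src/testing_dictionary/corpus.raw"),
    "SMOR binaries": Path("src/selnolig-check/98-SMOR_binaries"),
}


def missing_resources(repo_root="."):
    """Return the names of required external resources that are not in place."""
    root = Path(repo_root)
    return [name for name, rel in REQUIRED.items() if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_resources()
    if missing:
        print("Missing external resources: " + ", ".join(missing))
    else:
        print("All external resources found.")
```

The check only tests for existence; it does not verify that the corpus is the untagged version or that the SMOR binaries are usable.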
The programs are supposed to be run in the following order:
- in src/testing_dictionary/:
  - corpus_to_words
  - words_to_ligs
  - ligs_to_ligdict
- in src/selnolig_check/:
  - ligdict_to_smor (this is just a script to call SMOR with the correct input and output files)
  - smor_to_morphemes
  - morphemes_to_analyses
  - analyses_to_errors
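The order above can be written down as data and walked through by a small driver. This is a sketch under stated assumptions, not a script shipped with the repository: the names `PIPELINE`, `planned_steps`, and `run_all` are hypothetical, and it assumes that each step is an executable file in its directory that takes no arguments, which this README does not state. By default it only prints the plan.

```python
import subprocess
from pathlib import Path

# Directories and program names exactly as listed above, in execution order.
PIPELINE = [
    ("src/testing_dictionary",
     ["corpus_to_words", "words_to_ligs", "ligs_to_ligdict"]),
    ("src/selnolig_check",
     ["ligdict_to_smor", "smor_to_morphemes",
      "morphemes_to_analyses", "analyses_to_errors"]),
]


def planned_steps(pipeline=PIPELINE):
    """Flatten the pipeline into ordered (directory, program) pairs."""
    return [(directory, program)
            for directory, programs in pipeline
            for program in programs]


def run_all(repo_root=".", dry_run=True):
    """Print the steps in order, or run them if dry_run is False."""
    for directory, program in planned_steps():
        workdir = Path(repo_root) / directory
        if dry_run:
            print(f"[dry run] in {workdir}: {program}")
            continue
        # Assumption: the program is directly executable and needs no arguments.
        # check=True stops the pipeline at the first failing step.
        subprocess.run([str((workdir / program).resolve())], cwd=workdir, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Stopping at the first failure matters here because each step consumes the output of the one before it.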
The code is licensed under a Simplified BSD License, to be viewed in the file LICENSE.md.
The documentation is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.