Add University of Leipzig French corpus to kalamine (#195)
* Add Leipzig University 2012 Corpus

* Fix shebangs in corpus generation binaries
gagbo authored Nov 10, 2024
1 parent 5ad53d1 commit 38da0a1
Showing 4 changed files with 6,113 additions and 3 deletions.
26 changes: 26 additions & 0 deletions kalamine/www/corpus/README.md
@@ -0,0 +1,26 @@
# Corpus for layout analysis

## `fr` / `en`

These corpora and stats come from Don Quixote.

## `fra_mixed-typical_2012_1M-sentences`

These stats come from the [University of Leipzig](https://wortschatz.uni-leipzig.de/en/download/French#fra_mixed_2012).

### Sources
The French Mixed-Typical 2012 file (1M sentences) was extracted, and the
sentence indices were stripped with `awk '!($1="")'
fra_mixed-typical_2012_1M/fra_mixed-typical_2012_1M-sentences.txt >
fra_mixed-typical_2012_1M-sentences.txt`.
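
The index-stripping step can also be sketched in Python. This is a minimal sketch, not part of the commit: `strip_index` and `strip_indices` are hypothetical names, and it assumes every line starts with a whitespace-delimited index followed by the sentence. The awk one-liner rebuilds each line with single spaces and leaves a leading space. This version instead keeps the sentence's internal whitespace and drops the leading separator.

```python
from typing import Iterable, Iterator


def strip_index(line: str) -> str:
    """Drop the first whitespace-delimited field (assumed to be the index)."""
    # Split once on the first run of whitespace: [index, rest of the line].
    parts = line.rstrip("\n").split(None, 1)
    # A line holding only an index (or nothing) yields an empty sentence.
    return parts[1] if len(parts) == 2 else ""


def strip_indices(lines: Iterable[str]) -> Iterator[str]:
    """Lazily strip the leading index from every line of a corpus."""
    for line in lines:
        yield strip_index(line)
```

Passing an open file object to `strip_indices` streams the corpus line by line, which avoids loading the whole 1M-sentence file into memory.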

### BibTeX

```tex
@misc{fra_mixed_2012,
author = {Leipzig Corpora Collection},
title = {French mixed corpus based on material from 2012},
howpublished = {https://corpora.uni-leipzig.de?corpusId=fra_mixed_2012},
note = {Accessed: 2024-11-09}
}
```
4 changes: 2 additions & 2 deletions kalamine/www/corpus/chardict.py
@@ -1,5 +1,5 @@
-#!/bin/env python3
-""" Turn corpus texts into dictionaries of symbols and digrams. """
+#!/usr/bin/env python3
+"""Turn corpus texts into dictionaries of symbols and digrams."""

import json
from os import listdir, path
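
The rest of `chardict.py` is collapsed in this view. As a rough illustration of what the docstring describes, symbol and digram counts could be built as follows. This is a sketch under assumptions, not the script's actual code: `count_symbols_and_digrams` is a hypothetical name, and the real script may normalize, filter, or convert counts to frequencies.

```python
from collections import Counter
from typing import Dict, Tuple


def count_symbols_and_digrams(text: str) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Count single characters and adjacent character pairs in `text`."""
    # Each character counts once as a symbol.
    symbols = Counter(text)
    # Pair each character with its successor to form overlapping digrams.
    digrams = Counter(a + b for a, b in zip(text, text[1:]))
    return dict(symbols), dict(digrams)
```

For example, the text `"abab"` contains the digrams `ab`, `ba`, `ab`, so `ab` is counted twice and `ba` once.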