Corpora: A collection of reformatted texts for use with CCR tools.

This document explains the procedure for cleaning and adding the corpora texts. For a list of included texts, please see INDEX.pdf.

Source texts

The sources of the English texts are the Gutenberg plain text UTF-8 files. We save the initial, unchanged versions, as downloaded from gutenberg.org, in a folder for the relevant corpus.

This process has been followed for the two most recent CLiC corpora, ChiLit and ArTs. The initial files are available from previous commits to this repository:

initial versions of ChiLit files added 2017-09-10
initial versions of ArTs files added 2017-10-26 (this corpus was originally called "Other")
initial versions of ArTs files added 2019-01-16 (as part of the ArTs expansion)

Also note that the initial file for gulliver is found in the initial downloads for ChiLit above; the book was later moved to the ArTs corpus.

The texts of the German “Deutsche Romane des 19. Jahrhunderts” (DE19) corpus originate from the ELTeC-deu collection. After a selection process aimed at ensuring that DE19 is comparable to the English 19C corpus in terms of size and gender balance, we converted the ELTeC XML files into plain text files, retaining chapter boundaries. Unlike in the English corpora, chapter titles in the German texts are preceded by ###. A corresponding chapter segmentation rule was added to clictagger so that the tagger does not become too language-specific.
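To illustrate the ### convention, here is a minimal Python sketch that splits a plain text into chapters at lines starting with ###. It is not the clictagger rule itself; the function name `split_chapters` and the sample text are made up for illustration.

```python
import re

# Hypothetical helper, not part of clictagger: a chapter starts at any
# line that begins with "###"; the rest of that line is the title.
CHAPTER_MARK = re.compile(r"^###[ \t]*(.*)$", re.MULTILINE)

def split_chapters(text):
    """Return a list of (title, body) pairs, one per ### line.

    Text before the first ### line (front matter) is ignored.
    """
    matches = list(CHAPTER_MARK.finditer(text))
    chapters = []
    for i, m in enumerate(matches):
        # A chapter body runs until the next marker, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append((m.group(1).strip(), text[m.end():end].strip()))
    return chapters

# Made-up sample text in the DE19 layout.
sample = "Vorwort\n\n### Erstes Kapitel\nEs war einmal.\n\n### Zweites Kapitel\nUnd dann.\n"
chapters = split_chapters(sample)
for title, body in chapters:
    print(title, "->", body)
```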

Maintaining the corpora repository

Texts added to the corpora repository should be prepared according to the notes in the clictagger documentation.

`.bib` file

We currently manage the bibliography in a shared Zotero folder. The important fields in the bib entries are:

The shorttitle field must match the filename of the relevant text file in the corpus folder.
The keywords field must contain the name of the corpus.
The title, author and date fields must be present.
The editor field is optional and refers to the people or group of people who transcribed/edited the text for publication on gutenberg.org. We add this manually based on any information in the initial text file from Project Gutenberg (not all text files contain this).

Example entry:

    @book{grahame_wind_1908,
        title = {The Wind in the Willows},
        url = {https://www.gutenberg.org/ebooks/289},
        shorttitle = {willows},      <<===  filename willows.txt
        author = {Grahame, Kenneth},
        editor = {Lough, Mike},
        urldate = {2017-06-28},
        date = {1908},
        keywords = {{ChiLit}}        <<===  corpus id
    }
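The field requirements above can be checked automatically before committing. The following Python sketch is only one possible approach: it uses a simplified regular-expression parser that assumes one `name = {value}` pair per line, as in the example entry, and the helper names `parse_fields` and `missing_fields` are hypothetical. It does not handle every construct that BibLaTeX allows.

```python
import re

# Fields that every text entry must have, according to the list above.
REQUIRED_BOOK_FIELDS = ("shorttitle", "keywords", "title", "author", "date")

# Simplified assumption: each field sits on its own line as `name = {value}`.
FIELD_LINE = re.compile(r"^[ \t]*(\w+)[ \t]*=[ \t]*\{(.*)\},?[ \t]*$", re.MULTILINE)

def parse_fields(entry):
    """Extract field names and values from one .bib entry."""
    fields = {}
    for name, value in FIELD_LINE.findall(entry):
        # Zotero wraps some values in a second pair of braces, e.g. {{ChiLit}}.
        if value.startswith("{") and value.endswith("}"):
            value = value[1:-1]
        fields[name.lower()] = value
    return fields

def missing_fields(entry, required=REQUIRED_BOOK_FIELDS):
    """Return the required field names that are absent or empty."""
    fields = parse_fields(entry)
    return [name for name in required if not fields.get(name)]

entry = """@book{grahame_wind_1908,
    title = {The Wind in the Willows},
    url = {https://www.gutenberg.org/ebooks/289},
    shorttitle = {willows},
    author = {Grahame, Kenneth},
    editor = {Lough, Mike},
    urldate = {2017-06-28},
    date = {1908},
    keywords = {{ChiLit}}
}"""

fields = parse_fields(entry)
problems = missing_fields(entry)
print(problems or "all required fields present")
```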

Make sure that the entries don't include extraneous information. For example, when using the Zotero Chrome add-on to export a citation from gutenberg.org, Zotero tends to save licensing information. This should be deleted from the Zotero entry.

For the date, we try to establish the date of first publication of the novel (or of the work as a whole, in the case of serialised works), using external information such as Wikipedia entries. We use this date of first publication, rather than the date of the edition, because it gives the main historical context of the novel. Although we do not explicitly record the edition of the book transcribed by Project Gutenberg, CLiC users can look for this information in the initial versions of the texts (see the commits listed under Source texts).

If you are adding a new corpus, you will also have to create a @book entry for the corpus. The important fields in the bib entries are:

The shorttitle field must match the corpus id used in book keywords.
The title field must be present.
The number field must be present, and is used to order the corpora in CLiC.
The keywords field must contain the keyword corpus.

Example entry:

    @book{cermakova_childrens_2017,
        location = {University of Birmingham, {UK}},
        title = {Children's Literature},
        series = {{CCR} Corpus},
        shorttitle = {{ChiLit}},
        number = {3},
        publisher = {Centre for Corpus Research},
        author = {Čermáková, A. and Mahlberg, M. and Wiegand, V.},
        date = {2017},
        keywords = {corpus}
    }
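As a rough illustration of how the number field can drive ordering, the Python sketch below sorts corpus entries numerically. How CLiC itself reads the field is not documented here, so numeric sorting is an assumption. The entries are given as already-parsed dictionaries, and "ExampleA" and "ExampleB" with their numbers are invented placeholders; only the ChiLit values come from the example above.

```python
def corpus_order(entries):
    """Return corpus ids (shorttitle) sorted by their numeric `number` field.

    Only entries whose keywords field is "corpus" are considered.
    """
    corpora = [e for e in entries if e.get("keywords") == "corpus"]
    # Convert to int so that "10" sorts after "3" (string sorting would not).
    return [e["shorttitle"] for e in sorted(corpora, key=lambda e: int(e["number"]))]

entries = [
    {"shorttitle": "ExampleB", "number": "10", "keywords": "corpus"},  # made-up corpus
    {"shorttitle": "ChiLit", "number": "3", "keywords": "corpus"},
    {"shorttitle": "willows", "keywords": "ChiLit"},  # a text entry, not a corpus
    {"shorttitle": "ExampleA", "number": "1", "keywords": "corpus"},  # made-up corpus
]

order = corpus_order(entries)
print(order)
```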

In order to export the required bib style from Zotero, choose "BibLaTeX" (not BibTeX!) in Preferences -> Export. It appears that different versions of Zotero export different sequences of .bib entries; please check before you update the file. If the sequence differs, new entries can be added manually instead of rewriting the entire .bib file.

Adding a new text to a corpus

Clean the text as described in Section 2.
Add entry to the .bib file; see Section 3.1.
Update repository tags; see Section 3.4.
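After adding a text, it can be useful to confirm that file names and shorttitle values still agree. The Python sketch below compares the .txt files in a corpus folder with a set of shorttitles. The helper name `compare_folder_to_bib` is hypothetical, the shorttitles are assumed to have been extracted from the .bib file already, and a temporary folder with placeholder files stands in for a real corpus folder.

```python
from pathlib import Path
import tempfile

def compare_folder_to_bib(folder, shorttitles):
    """Compare .txt file names (without extension) with .bib shorttitles.

    Returns two sorted lists: files without a .bib entry, and
    shorttitles without a matching file.
    """
    stems = {p.stem for p in Path(folder).glob("*.txt")}
    wanted = set(shorttitles)
    return sorted(stems - wanted), sorted(wanted - stems)

# Demonstration with a temporary folder; "newbook" is an invented file name.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("willows.txt", "newbook.txt"):
        (Path(tmp) / name).write_text("placeholder", encoding="utf-8")
    no_entry, no_file = compare_folder_to_bib(tmp, {"willows", "gulliver"})

print("files without a .bib entry:", no_entry)
print("entries without a file:", no_file)
```

Both lists should be empty once the folder and the .bib file are in sync.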

Adding a new corpus

Add a new folder to the corpus repository.
Add an entry to the .bib file for the corpus; see Section 3.1.
For each new corpus file:
1. Clean the text as described in Section 2.
2. Add entry to the .bib file; see Section 3.1.
Update repository tags; see Section 3.4.

Repository Tags

TODO
