Generating a COGS corpus

alexanderkoller edited this page Aug 20, 2022 · 1 revision

Obtaining the grammar

The reimplemented COGS grammar is available here: https://github.com/coli-saar/cogs-generator-alto

Generating a corpus

The corpus is generated in the variable-free format introduced by Qiu et al. 2022. Use the following command:

java -cp <alto.jar> de.up.ling.irtg.script.CogsCorpusGenerator [options] <grammar.irtg>

Here <alto.jar> is the path to the Alto JAR file, and <grammar.irtg> is the reimplemented COGS grammar. The options are as follows:

  • --count <N> sets the number of instances to generate to <N>.
  • --suppress-duplicates ensures that the same sentence is never generated twice.
  • --previous-instances <filename> reads a previously generated corpus from <filename>. Combined with --suppress-duplicates, it guarantees that no sentence from the old corpus is generated again.
  • --pp-depth <min>-<max> restricts the PP embedding depth to a minimum of <min> and a maximum of <max>. For instance, write --pp-depth 0-2 to generate instances with PP depth at most two.
  • --cp-depth <min>-<max> restricts the CP embedding depth in the same way. For instance, --cp-depth 3-12 generates instances with CP embedding depth three to twelve.
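
To show how the options combine with the command above, here is a minimal sketch of a shell wrapper. The function name cogs_generate and the default paths alto.jar and grammar.irtg are assumptions made for illustration; they are not part of Alto or the grammar repository.

```shell
# Placeholder paths (assumptions): point these at your Alto JAR file and the COGS grammar.
ALTO_JAR="${ALTO_JAR:-alto.jar}"
GRAMMAR="${GRAMMAR:-grammar.irtg}"

# cogs_generate is a hypothetical convenience wrapper, not part of Alto.
# All arguments are passed through as options and placed before the grammar file,
# matching the command-line shape shown above.
cogs_generate() {
  java -cp "$ALTO_JAR" de.up.ling.irtg.script.CogsCorpusGenerator "$@" "$GRAMMAR"
}
```

With this wrapper, `cogs_generate --count 1000 --suppress-duplicates --pp-depth 0-2 --cp-depth 3-12` would request 1000 distinct instances with PP depth at most two and CP depth between three and twelve.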

The corpus generator prints the new instances to stdout. It prints error messages and a progress report to stderr.
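
Because instances and diagnostics go to different streams, they can be redirected separately. The sketch below assumes a hypothetical helper named generate_batch and placeholder paths for the JAR file and grammar; none of these names come from Alto itself.

```shell
# Placeholder paths (assumptions): point these at your Alto JAR file and the COGS grammar.
ALTO_JAR="${ALTO_JAR:-alto.jar}"
GRAMMAR="${GRAMMAR:-grammar.irtg}"

# generate_batch OUT_FILE LOG_FILE [options...]
# Hypothetical helper: writes the generated instances (stdout) to OUT_FILE and
# the error messages and progress report (stderr) to LOG_FILE.
generate_batch() {
  out_file="$1"
  log_file="$2"
  shift 2   # the remaining arguments are generator options
  java -cp "$ALTO_JAR" de.up.ling.irtg.script.CogsCorpusGenerator "$@" "$GRAMMAR" \
    > "$out_file" 2> "$log_file"
}
```

A second batch that avoids sentences from the first could then be produced with `generate_batch batch2.txt batch2.log --count 500 --suppress-duplicates --previous-instances batch1.txt`, where the file names are again only examples.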