LIUM WMT17 Systems for News Translation Task

Below you will find the data and nmtpy configurations for LIUM's WMT17 News Translation systems (see paper):

@InProceedings{garciamartinez-EtAl:2017:WMT,
  author    = {Garc\'{i}a-Mart\'{i}nez, Mercedes  and
               Caglayan, Ozan  and  Aransa, Walid
               and  Bardet, Adrien  and  Bougares, Fethi
               and  Barrault, Lo\"{i}c},
  title     = {LIUM Machine Translation Systems for WMT17 News Translation Task},
  booktitle = {Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {288--295},
  url       = {http://www.aclweb.org/anthology/W17-4726.pdf}
}

En->Tr Systems

Data

(Note: The Turkish side of the corpora below is tokenized with a slightly modified version of the Moses tokenizer that handles apostrophes correctly for Turkish.)

  • Download (13M) our normalized, tokenized and length-filtered version of the officially provided SETIMES2 corpus (~200K sentences).

  • Download the joint BPE model (16K merge operations) trained on the bitext.

  • The exact incremental subsamples of 150K, 700K, 1M and 1.7M (~all news2016) sentences of the parallel back-translation corpora used in the paper. The target (TR) side is sampled from the monolingual Turkish data news.2016.shuffled, and the sentences are translated into EN with a single TR->EN NMT system (~14 BLEU on newstest2016):

  • Ready-to-use BPE-ized subsamples, exactly as used in the paper (cf. Table 3):

    • (System B0) BPE-ized, (only) SETIMES2-200K (~200K total) corpora (14M)
    • (System B1) BPE-ized, (only) BT-1M (~1M total) corpora (58M)
    • (System B2) BPE-ized, SETIMES2-200K+BT-150K (~350K total) corpora (21M)
    • (System B4) BPE-ized, SETIMES2-200K+BT-700K (~900K total) corpora (51M)
    • (System B6) BPE-ized, SETIMES2-200K+BT-1M (~1.2M total) corpora (72M)
    • (System B8) BPE-ized, SETIMES2-200K+BT-1.7M (~1.9M total) corpora (112M)
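
The "incremental" back-translation subsamples listed above can be pictured with a minimal sketch. This README does not describe how the subsamples were drawn, so the code below assumes that each smaller subsample is a prefix of the next larger one, taken from an already shuffled parallel corpus. The function name `incremental_subsamples` and the toy data are hypothetical and are not part of nmtpy or of the released files.

```python
from typing import Dict, List, Sequence, Tuple

# One parallel example: (back-translated EN sentence, original TR sentence).
Pair = Tuple[str, str]


def incremental_subsamples(
    pairs: Sequence[Pair], sizes: Sequence[int]
) -> Dict[int, List[Pair]]:
    """Return nested subsamples of an already shuffled parallel corpus.

    Assumption: "incremental" means every smaller subsample is a prefix
    of every larger one, so adding data never removes earlier sentences.
    """
    subsamples: Dict[int, List[Pair]] = {}
    for size in sorted(sizes):
        # Clamp to the corpus size so the largest request can mean "everything".
        subsamples[size] = list(pairs[: min(size, len(pairs))])
    return subsamples


if __name__ == "__main__":
    # Toy stand-in for the shuffled back-translated corpus.
    toy = [(f"en sentence {i}", f"tr sentence {i}") for i in range(10)]
    for size, sample in incremental_subsamples(toy, sizes=[2, 5, 8]).items():
        print(size, len(sample))
```

With nested subsamples of this kind, systems trained on more back-translated data see a superset of what the smaller systems saw, which makes comparisons across the B-systems easier to interpret. Whether the released files are strictly nested in this way should be checked against the files themselves.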