This repository contains manually curated parallel sentences for the Japanese--Russian, Japanese--English, and Russian--English language pairs in the news domain.
Japanese--Russian is one of the most distant language pairs and has only a limited quantity of parallel data for training machine translation (MT) systems. To promote research on low-resource MT, we curated parallel sentences that can be used as development and test data, through the following procedure:
- We downloaded the News Commentary data from OPUS: 586 sentence pairs for Japanese--Russian and 637 sentence pairs for Japanese--English.
- These Japanese--Russian and Japanese--English data share many lines on the Japanese side. We therefore first merged them into Russian--Japanese--English tri-text data.
- In each line, we identified the parts that correspond across languages and split off the unaligned parts into new lines.
- As a result, we obtained 1,654 lines of data comprising trilingual, bilingual, and monolingual segments (mainly sentences).
- For the sake of comparability, we randomly chose 600 trilingual sentences to create a test set, and combined the remaining trilingual sentences with the bilingual sentences to form the development set for each language pair.
Ru | Ja | En | #sent | Test | Dev |
---|---|---|---|---|---|
✓ | ✓ | ✓ | 913 | 600 | 313 |
✓ | ✓ | - | 173 | - | 173 |
- | ✓ | ✓ | 276 | - | 276 |
✓ | - | ✓ | 0 | - | - |
✓ | - | - | 4 | - | - |
- | ✓ | - | 287 | - | - |
- | - | ✓ | 1 | - | - |
Language pair | Development | Test |
---|---|---|
Japanese--Russian | 486 | 600 |
Japanese--English | 589 | 600 |
Russian--English | 313 | 600 |
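
The development and test sizes above follow from the segment counts in the first table. The following is a minimal sketch of that arithmetic, not the script actually used to build the data: the placeholder integer IDs, the function name `split_sizes`, and the seed are assumptions made for illustration only.

```python
import random

# Segment counts copied from the first table: (has_ru, has_ja, has_en) -> count
COUNTS = {
    (True, True, True): 913,
    (True, True, False): 173,
    (False, True, True): 276,
    (True, False, True): 0,
    (True, False, False): 4,
    (False, True, False): 287,
    (False, False, True): 1,
}

TEST_SIZE = 600
LANGS = ("ru", "ja", "en")
PAIRS = (("ja", "ru"), ("ja", "en"), ("ru", "en"))


def split_sizes(counts, test_size, seed=0):
    """Sketch of the random split; returns (test_ids, dev_sizes).

    The seed is an assumption: the seed of the original split is not known.
    """
    # Placeholder IDs stand in for the trilingual segments.
    trilingual = list(range(counts[(True, True, True)]))
    rng = random.Random(seed)
    test_ids = set(rng.sample(trilingual, test_size))
    remaining_trilingual = len(trilingual) - len(test_ids)

    dev_sizes = {}
    for a, b in PAIRS:
        # Bilingual-only segments that cover both languages of the pair.
        bilingual_only = 0
        for flags, n in counts.items():
            present = dict(zip(LANGS, flags))
            if present[a] and present[b] and not all(flags):
                bilingual_only += n
        dev_sizes[(a, b)] = remaining_trilingual + bilingual_only
    return test_ids, dev_sizes


test_ids, dev_sizes = split_sizes(COUNTS, TEST_SIZE)
print("total segments:", sum(COUNTS.values()))
print("test size:", len(test_ids))
print("dev sizes:", dev_sizes)
```

Under these assumptions, the counts sum to the 1,654 lines reported in the procedure, and the development sizes come out as 313 + 173 for Japanese--Russian, 313 + 276 for Japanese--English, and 313 for Russian--English.
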
Scoreboard (BLEU-cased)
System description | Resources Used | Ja-to-Ru | Ru-to-Ja |
---|---|---|---|
Uni-directional Transformer NMT | (a) | 0.70 | 1.96 |
Multi-to-multi Transformer NMT involving English | (a) | 3.72 | 8.35 |
Same as above, with multilingual multi-stage fine-tuning | (a) (b) (c) (d) | 7.49 | 12.10 |
The data used for the above systems are as follows:
(a) Global Voices parallel data retrieved from OPUS (v2015; included in this repository)
(b) ASPEC: Asian Scientific Paper Excerpt Corpus (out-of-domain Japanese--English parallel data)
(c) UN provided for WMT 18 (out-of-domain Russian--English parallel data)
(d) Yandex provided for WMT 18 (out-of-domain Russian--English parallel data)
- Aizhan Imankulova, Raj Dabre, Atsushi Fujita, and Kenji Imamura. Exploiting Out-of-Domain Parallel Data through Multilingual Transfer Learning for Low-Resource Neural Machine Translation. In Proceedings of the 17th Machine Translation Summit (MT Summit), Aug., 2019. (to appear). arXiv
- The National Institute of Information and Communications Technology (henceforth, NICT) has made the database publicly available under the license conditions specified below.
- NICT bears no responsibility for the contents of the database and assumes no liability for any direct or indirect damage or loss whatsoever that may be incurred as a result of using the database.
- If any copyright infringement or other problems are found in the database, please contact us at atsushi.fujita[at]nict[dot]go[dot]jp. We will review the issue and undertake appropriate measures when needed.
No claims of intellectual property are made on the work of preparing the corpus. See OPUS and/or CASMACAT for details.
The dataset was developed as part of the work at the Advanced Translation Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology.