Japanese--Russian--English News Commentary Parallel Data

Introduction

This repository contains manually curated parallel sentences for the Japanese--Russian, Japanese--English, and Russian--English language pairs in the news domain.

Japanese--Russian is one of the most distant language pairs, and only a limited quantity of parallel data is available for training machine translation (MT) systems. To promote research on low-resource MT, we curated parallel sentences for use as development and test data, following the procedure below:

We downloaded the OPUS News Commentary data for Japanese--Russian (586 sentence pairs) and Japanese--English (637 sentence pairs).
The Japanese--Russian and Japanese--English data share many lines on the Japanese side. We therefore first compiled Russian--Japanese--English tri-text data.
In each line, we identified the corresponding parts across languages and split off unaligned parts into new lines.
As a result, we obtained 1,654 lines of data comprising trilingual, bilingual, and monolingual segments (mainly sentences).
For the sake of comparability, we randomly chose 600 trilingual sentences to create a test set, and concatenated the remaining trilingual sentences with the bilingual sentences to form the development sets.
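
The automatic parts of this procedure can be illustrated with a minimal Python sketch. It is not the authors' actual tooling: the function names `build_tritext` and `split_test_dev`, the exact-match join on the Japanese side, and the fixed random seed are all assumptions made for illustration, and the manual step of identifying corresponding parts and splitting off unaligned parts is not modeled.

```python
import random


def build_tritext(ja_ru, ja_en):
    """Join two bitexts on their shared Japanese side (hypothetical helper).

    ja_ru holds (ja, ru) pairs and ja_en holds (ja, en) pairs.
    Returns (ru, ja, en) rows, with None for a missing side.
    """
    ru_by_ja = dict(ja_ru)
    en_by_ja = dict(ja_en)
    rows = []
    # dict.fromkeys keeps first-seen order and drops repeated Japanese lines.
    for ja in dict.fromkeys([ja for ja, _ in ja_ru] + [ja for ja, _ in ja_en]):
        rows.append((ru_by_ja.get(ja), ja, en_by_ja.get(ja)))
    return rows


def has_sides(row, ru, ja, en):
    """True if the row has exactly the requested sides present."""
    return tuple(side is not None for side in row) == (ru, ja, en)


def split_test_dev(rows, test_size, seed=0):
    """Draw a random trilingual test set; pool the rest into per-pair dev sets."""
    trilingual = [r for r in rows if has_sides(r, True, True, True)]
    random.Random(seed).shuffle(trilingual)  # seed is an assumption
    test, tri_dev = trilingual[:test_size], trilingual[test_size:]
    ru_ja_only = [r for r in rows if has_sides(r, True, True, False)]
    ja_en_only = [r for r in rows if has_sides(r, False, True, True)]
    ru_en_only = [r for r in rows if has_sides(r, True, False, True)]
    dev = {
        "ja-ru": [(ja, ru) for ru, ja, _ in tri_dev + ru_ja_only],
        "ja-en": [(ja, en) for _, ja, en in tri_dev + ja_en_only],
        "ru-en": [(ru, en) for ru, _, en in tri_dev + ru_en_only],
    }
    return test, dev
```

Because the test set is drawn only from trilingual rows, the same 600 segments serve all three language pairs, which is what makes results comparable across pairs.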

Distribution of tri-texts

Ru	Ja	En	#sent	Test	Dev
✓	✓	✓	913	600	313
✓	✓	-	173	-	173
-	✓	✓	276	-	276
✓	-	✓	0	-	-
✓	-	-	4	-	-
-	✓	-	287	-	-
-	-	✓	1	-	-

Development and test splits (available in this repository)

L1--L2	Development	Test
Japanese--Russian	486	600
Japanese--English	589	600
Russian--English	313	600
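
As a sanity check, the development-set sizes above can be derived from the distribution of tri-texts: each pair's development set is the sum of the Dev counts over all rows in which both languages are present. The short Python sketch below reproduces that arithmetic; the names `DISTRIBUTION` and `dev_size` are illustrative only, and a "-" in the table is encoded as 0.

```python
# Rows of the distribution table: (has_ru, has_ja, has_en, total, test, dev).
DISTRIBUTION = [
    (True, True, True, 913, 600, 313),
    (True, True, False, 173, 0, 173),
    (False, True, True, 276, 0, 276),
    (True, False, True, 0, 0, 0),
    (True, False, False, 4, 0, 0),
    (False, True, False, 287, 0, 0),
    (False, False, True, 1, 0, 0),
]
LANG_INDEX = {"ru": 0, "ja": 1, "en": 2}


def dev_size(l1, l2):
    """Sum the Dev column over rows where both languages are present."""
    i, j = LANG_INDEX[l1], LANG_INDEX[l2]
    return sum(row[5] for row in DISTRIBUTION if row[i] and row[j])


total_lines = sum(row[3] for row in DISTRIBUTION)
print(total_lines)           # 1654
print(dev_size("ja", "ru"))  # 486 = 313 + 173
print(dev_size("ja", "en"))  # 589 = 313 + 276
print(dev_size("ru", "en"))  # 313 = 313 + 0
```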

Benchmarking

Scoreboard (BLEU-cased)

System description	Resources Used	Ja-to-Ru	Ru-to-Ja
Uni-directional Transformer NMT	(a)	0.70	1.96
Multi-to-multi Transformer NMT involving English	(a)	3.72	8.35
Same but with multi-lingual multi-stage fine-tuning	(a) (b) (c) (d)	7.49	12.10

The data used for the above systems are as follows:

(a) Global Voices parallel data retrieved from OPUS (v2015; included in this repository)

(b) ASPEC: Asian Scientific Paper Excerpt Corpus (out-of-domain Japanese--English parallel data)

(c) UN provided for WMT 18 (out-of-domain Russian--English parallel data)

(d) Yandex provided for WMT 18 (out-of-domain Russian--English parallel data)

References

Aizhan Imankulova, Raj Dabre, Atsushi Fujita, and Kenji Imamura. Exploiting Out-of-Domain Parallel Data through Multilingual Transfer Learning for Low-Resource Neural Machine Translation. In Proceedings of the 17th Machine Translation Summit (MT Summit), Aug., 2019. (to appear). arXiv

Precautions

The National Institute of Information and Communications Technology (henceforth, NICT) has made the database publicly available under the license conditions specified below.
NICT bears no responsibility for the contents of the database and assumes no liability for any direct or indirect damage or loss whatsoever that may be incurred as a result of using the database.
If any copyright infringement or other problems are found in the database, please contact us at atsushi.fujita[at]nict[dot]go[dot]jp. We will review the issue and undertake appropriate measures when needed.

License

No claims of intellectual property are made on the work of preparing the corpus. See OPUS and/or CASMACAT for details.

Acknowledgments

The dataset was developed as part of work at the Advanced Translation Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology.
