
lowResourceNMT

Improve a baseline NMT system trained on a very small parallel corpus using either monolingual data or parallel data in other languages

Presentation

https://docs.google.com/presentation/d/1J6Xh0YfCSnQIcA6doUm7skZEG1ZqE50bUFehq72i6V4

Datasets:

https://yadi.sk/d/xUKsoX-G3T6ZYc

en-ru parallel data: https://translate.yandex.ru/corpus (just fill in the fields and you get the corpus immediately)

Workflow board:

https://trello.com/b/f3kcPkqm/low-resource-nmt

Tensor2Tensor

This project uses a forked Tensor2Tensor version.

Install it from a local checkout:

pip install -e tensor2tensor/

or install it directly from GitHub:

pip install git+https://github.com/AlAntonov/tensor2tensor
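
If the install succeeds, the usual Tensor2Tensor entry points should be available. A quick sanity check, assuming the fork keeps the standard t2t-trainer entry point:

```bash
# Confirm the package imports and the CLI entry point is on PATH
python -c "import tensor2tensor; print(tensor2tensor.__file__)"
t2t-trainer --help
```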

Relevant papers:

Attention Is All You Need

Unsupervised Neural Machine Translation Using Monolingual Corpora Only

Zero-shot translation

Dual Learning for Machine Translation

Transfer Learning for Low-Resource Neural Machine Translation

Adversarial Neural Machine Translation

Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets

On Using Monolingual Corpora in Neural Machine Translation

Improving Neural Machine Translation Models with Monolingual Data

Inducing Bilingual Lexica From Non-Parallel Data With Earth Mover’s Distance Regularization

Exploiting Source-side Monolingual Data in Neural Machine Translation

Unsupervised Pretraining for Sequence to Sequence Learning

Universal Neural Machine Translation for Extremely Low Resource Languages

Joint Training for Neural Machine Translation Models with Monolingual Data

Unsupervised Neural Machine Translation

Learning principled bilingual mappings of word embeddings while preserving monolingual invariance

Learning bilingual word embeddings with (almost) no bilingual data

Effective Domain Mixing for Neural Machine Translation

Phrase-Based & Neural Unsupervised Machine Translation

How to run training and evaluation on he-en:

  1. Place your data in data/t2t_data/ (en.train.txt and he.train.txt; the train, dev, and test sets are generated from these files).
  2. Run he-en_translation.sh (options set the train/dev/test sizes, etc.; check the script for details). See the sketch below.
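
A minimal sketch of both steps; the paths are placeholders, and the size flags are the ones shown in the Features section below (check he-en_translation.sh for the full option list):

```bash
# Step 1: put the raw parallel files where the script expects them
mkdir -p data/t2t_data
cp /path/to/he.train.txt /path/to/en.train.txt data/t2t_data/

# Step 2: run training and evaluation with explicit train/dev/test split sizes
./he-en_translation.sh --train_size 0.4 --dev_size 0.3 --test_size 0.3
```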

Features:

  1. Split sizes can be fractional, for example: he-en_translation.sh --train_size 0.4 --test_size 0.3 --dev_size 0.3
  2. compute_bleu.py with the --bootstrap flag returns a 95% confidence interval (see the example below).
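
An example invocation; only the --bootstrap flag is documented above, so the input-file arguments below are assumptions (check compute_bleu.py for its actual interface):

```bash
# --bootstrap is the documented flag; the hypothesis/reference file arguments are assumed
python compute_bleu.py --bootstrap decoded.he-en.txt en.test.txt
```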
