A Study on Errors in Multilingual Machine Translation

Authors: Stefania Radu, Lisa Koopmans, Alina Dima

This repository contains 2 python notebooks, one for the multilingual error analysis, and one for the editing effort prediction. It also contains the DivEMT dataset, annotated with new features for prediction.

These notebooks can be run in Google Collaboratory. You can also install all the requiremenets by running: pip install -r requirements.txt.

Multilingual error analysis

This notebook generates the plots for the following tasks:

high level analysis of the errors distribution across all the languages with respect to the BLEU, CHRF, CER and BAD rate. See the plot below as an example.

in depth analysis of the errors distribution across all languages and across different linguistic features:
- part-of-speech (POS)
- named-entity recognition (NER)
- dependency relations (deprel)

Editing effort prediction

In this notebook, we train 3 linear regression models to predict the HTER score and post-editing time (PET). The models have different configurations of features. See the paper for an explanation. You can also find the annotated dataset, including the features. We do not provide a pre-trained model, since the linear regression can be easily trained in Google Collaboratory.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

A Study on Errors in Multilingual Machine Translation

Authors: Stefania Radu, Lisa Koopmans, Alina Dima

Multilingual error analysis

Editing effort prediction

Files

README.md

Latest commit

History

README.md

File metadata and controls

A Study on Errors in Multilingual Machine Translation

Authors: Stefania Radu, Lisa Koopmans, Alina Dima

Multilingual error analysis

Editing effort prediction