A Study on Errors in Multilingual Machine Translation

Authors: Stefania Radu, Lisa Koopmans, Alina Dima

This repository contains two Python notebooks: one for the multilingual error analysis and one for the editing effort prediction. It also contains the DivEMT dataset, annotated with new features for prediction.

These notebooks can be run in Google Colaboratory. To run them elsewhere, install the requirements with: pip install -r requirements.txt

The error analysis notebook generates the plots for the following tasks:

  • high-level analysis of the error distribution across all languages with respect to BLEU, chrF, CER and BAD rate. See the plot below as an example.

[Figure: high-level analysis of the error distribution]

  • in-depth analysis of the error distribution across all languages and across different linguistic features:

    • part-of-speech (POS)
    • named-entity recognition (NER)
    • dependency relations (deprel)
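
As a rough illustration of one of the metrics named above, character error rate (CER) is commonly defined as the character-level edit distance between a hypothesis and a reference, divided by the reference length. The sketch below assumes that definition; the notebook may rely on a library implementation instead, and the function names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)
```

For example, `cer(mt_output, post_edited_text)` would give the fraction of reference characters that had to change, where both argument names are placeholders for strings taken from the dataset.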

In the editing effort prediction notebook, we train three linear regression models to predict the HTER score and the post-editing time (PET). The models use different feature configurations; see the paper for an explanation. The annotated dataset, including the features, is also provided. We do not provide a pre-trained model, since the linear regression models can be trained quickly in Google Colaboratory.
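
To give an idea of what training such a model involves, here is a minimal sketch of an ordinary least-squares linear regression fitted on two feature configurations. It assumes NumPy and uses synthetic data: the feature columns, the configuration names and the helper functions are hypothetical and do not correspond to the features or code in the notebook.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Ordinary least squares with an intercept; returns (intercept, coefs)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first parameter acts as the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    params, *_ = np.linalg.lstsq(design, y, rcond=None)
    return params[0], params[1:]


def predict(X, intercept, coefs):
    """Apply a fitted model to a feature matrix."""
    return intercept + np.asarray(X, dtype=float) @ coefs


# Synthetic stand-in for the annotated dataset (NOT real DivEMT values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # three hypothetical features
true_coefs = np.array([0.5, -0.2, 0.1])
y = 0.3 + X @ true_coefs                 # noise-free target, e.g. an effort score

# Hypothetical feature configurations: all columns vs. the first column only.
configs = {"all_features": [0, 1, 2], "first_feature_only": [0]}
for name, cols in configs.items():
    intercept, coefs = fit_linear_regression(X[:, cols], y)
    mse = float(np.mean((predict(X[:, cols], intercept, coefs) - y) ** 2))
    print(f"{name}: MSE = {mse:.4f}")
```

Comparing the error of each configuration on held-out data, rather than on the training data as in this sketch, would be the more informative check in practice.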