Calculates BLEU and COMET scores for Firefox Translations models using bergamot-translator and compares them to other translation systems.
We recommend running this on a Linux machine with at least one GPU, inside a Docker container.
If you intend to run it on macOS, run the `eval/evaluate.py` script standalone inside a virtualenv and skip the Start docker section below. You might need to manually install the packages listed in the Dockerfile on your system and in your virtual environment.
To use a GPU inside Docker, install the NVIDIA Container Toolkit:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

The recommended memory allocation for Docker is 16 GB.
Run from the repo root directory:
```bash
export MODELS=<absolute path to a local directory with models>

# Specify the Azure key and location if you want to add the Azure Translator API for comparison
export AZURE_TRANSLATOR_KEY=<Azure Translator resource API key>
# Optional, specify only if it differs from the default 'global'
export AZURE_LOCATION=<location>

# Specify the GCP credentials JSON path if you want to add the Google Translation API for comparison
export GCP_CREDS_PATH=<absolute path to .json>

# Build and run the Docker container
make build-docker
make start-docker
```
On completion, your terminal should be attached to the launched container.
From inside the Docker container, run:
```bash
python3 eval/evaluate.py \
  --translators=bergamot,microsoft,google \
  --pairs=all \
  --skip-existing \
  --gpus=1 \
  --evaluation-engine=comet,bleu \
  --models-dir=/models/models/prod \
  --results-dir=/models/evaluation/prod
```
If you don't have a GPU, use `0` in the `--gpus` argument.
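Before launching a long run, you can sanity-check from inside the container that PyTorch, which COMET runs on, actually sees the GPU:

```python
# Quick GPU visibility probe; PyTorch is already installed as a COMET dependency.
import torch

print(torch.cuda.is_available())  # True if CUDA is usable inside the container
print(torch.cuda.device_count())  # number of GPUs visible to PyTorch
```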
More options:

```bash
python3 eval/evaluate.py --help
```
- `install/install-bergamot-translator.sh` - clones and compiles bergamot-translator and marian (launched in the Docker image).
- `install/download-models.sh` - downloads the current Mozilla production models.
The COMET evaluation framework supports CUDA: you can enable it by setting the `--gpus` argument of the `eval/evaluate.py` script to the number of GPUs you wish to utilize (`0` disables it). If you use it, make sure the NVIDIA Container Toolkit is enabled in your Docker setup.
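For context, COMET exposes the GPU count directly in its Python scoring API. Below is a minimal standalone sketch of such a call; the checkpoint name `Unbabel/wmt22-comet-da` is an illustrative assumption, not necessarily the model `evaluate.py` uses:

```python
# Minimal COMET scoring sketch (unbabel-comet 2.x API); the checkpoint name
# is an assumption for illustration.
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": "Hallo Welt", "mt": "Hello world", "ref": "Hello world"}]
# gpus=0 runs on CPU, mirroring the --gpus argument of evaluate.py.
result = model.predict(data, batch_size=8, gpus=1)
print(result.system_score)
```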
The supported translators are:

- bergamot - uses the compiled bergamot-translator in WASM mode
- google - uses the Google Translation API
- microsoft - uses the Azure Cognitive Services Translator API
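For reference, the microsoft backend talks to the Azure Translator v3 REST endpoint. A minimal sketch of that kind of request (the text and language pair are arbitrary, and this is not necessarily the exact code path of `evaluate.py`):

```python
# Hedged sketch of an Azure Translator v3 request; the key and region come from
# the AZURE_TRANSLATOR_KEY / AZURE_LOCATION variables exported earlier.
import os
import requests

resp = requests.post(
    "https://api.cognitive.microsofttranslator.com/translate",
    params={"api-version": "3.0", "from": "en", "to": "de"},
    headers={
        "Ocp-Apim-Subscription-Key": os.environ["AZURE_TRANSLATOR_KEY"],
        "Ocp-Apim-Subscription-Region": os.environ.get("AZURE_LOCATION", "global"),
    },
    json=[{"Text": "Hello world"}],
)
print(resp.json()[0]["translations"][0]["text"])
```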
Use the `--skip-existing` option to reuse already calculated scores saved as `results/xx-xx/*.bleu` files. It is useful for continuing an evaluation that was interrupted, or for rebuilding a full report while re-evaluating only selected translators.
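Conceptually the check is just a file-existence test. A hypothetical sketch (the exact file-naming scheme inside `results/xx-xx/` is an assumption here):

```python
# Hypothetical sketch of the --skip-existing check: recompute a score only when
# its results file is missing. The file naming below is assumed, not verified.
from pathlib import Path

def needs_evaluation(results_dir: str, pair: str, score_name: str) -> bool:
    # Scores are stored under <results-dir>/<pair>/ as *.bleu files.
    return not (Path(results_dir) / pair / f"{score_name}.bleu").exists()
```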
The evaluation datasets are:

- SacreBLEU - all available datasets for a language pair are used for evaluation.
- Flores - a parallel evaluation dataset covering 101 languages.
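To see which SacreBLEU test sets exist for a pair, the `sacrebleu` package can enumerate them directly (a standalone sketch; `en-de` is just an example pair):

```python
# List the SacreBLEU test sets available for one language pair, i.e. the pool
# of datasets the evaluation can draw from.
import sacrebleu

pair = "en-de"
testsets = [
    ts for ts in sacrebleu.get_available_testsets()
    if pair in sacrebleu.get_langpairs_for_testset(ts)
]
print(testsets)
```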
With the option `--pairs=all`, language pairs are discovered in the specified models folder (option `--models-dir`) and evaluation runs for all of them. Results are written to the specified directory (option `--results-dir`).
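A hypothetical sketch of that discovery step, assuming each language pair corresponds to a subdirectory of the models folder (the layout is an assumption for illustration):

```python
# Hypothetical sketch of --pairs=all: infer language pairs from subdirectory
# names under --models-dir; the directory layout is assumed, not verified.
from pathlib import Path

def discover_pairs(models_dir: str) -> list[str]:
    return sorted(p.name for p in Path(models_dir).iterdir() if p.is_dir())

print(discover_pairs("/models/models/prod"))
```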