Every year the WMT News Translation task organizers collect thousands of quality annotations in the form of Direct Assessments (DA). Most COMET models use that data either as z-scores or as relative ranks (see the sketch after the table below).
I'll leave a table here with links to that data.
year | DA | relative ranks | paper |
---|---|---|---|
2017 | 🔗 | 🔗 | Results of the WMT17 Metrics Shared Task |
2018 | 🔗 | 🔗 | Results of the WMT18 Metrics Shared Task |
2019 | 🔗 | 🔗 | Results of the WMT19 Metrics Shared Task |
2020 | 🔗 | 🔗 | Results of the WMT20 Metrics Shared Task |
2021 | 🔗 | 🔗 | Results of the WMT21 Metrics Shared Task |
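Since the z-score form mentioned above is just a per-annotator standardization of the raw DA scores, here is a minimal sketch of that step. The column names (`annotator`, `raw_score`) are my own, not the schema of the released files.

```python
import pandas as pd

# Hypothetical example: standardize raw DA scores into per-annotator z-scores,
# which is how this data is commonly prepared before training.
df = pd.DataFrame({
    "annotator": ["a1", "a1", "a2", "a2"],
    "raw_score": [70.0, 85.0, 40.0, 90.0],
})

# z-score each annotation relative to that annotator's own mean/std,
# so that stricter and more lenient annotators become comparable.
df["z_score"] = df.groupby("annotator")["raw_score"].transform(
    lambda s: (s - s.mean()) / s.std()
)
print(df)
```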
In the most recent editions of the WMT Metrics shared task, the organizers decided to evaluate MT with Multidimensional Quality Metrics (MQM), following findings that crowd-sourced Direct Assessments are noisy and do not correlate well with expert annotations [Freitag et al., 2021]. A sketch of how MQM annotations are typically converted into segment-level scores follows the table below.
year | MQM | paper |
---|---|---|
2020 | 🔗 | A Large-Scale Study of Human Evaluation for Machine Translation |
2021 | 🔗 | Results of the WMT21 Metrics Shared Task |
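As a rough illustration of how MQM error annotations become segment-level scores, the sketch below uses the weighting described by Freitag et al. (2021): major errors count 5, minor errors count 1 (0.1 for minor fluency/punctuation), and "non-translation" counts 25. The data structure is hypothetical and does not match the released file format.

```python
from typing import List, Tuple

def mqm_segment_score(errors: List[Tuple[str, str]]) -> float:
    """Sum MQM error penalties for one segment; more negative means worse."""
    penalty = 0.0
    for severity, category in errors:
        if category == "non-translation":
            penalty += 25.0
        elif severity == "major":
            penalty += 5.0
        elif severity == "minor":
            # Minor fluency/punctuation errors are weighted much lower.
            penalty += 0.1 if category == "fluency/punctuation" else 1.0
    # Report as a negative penalty so that higher scores are better.
    return -penalty

# Example: one major accuracy error and one minor punctuation error.
print(mqm_segment_score([("major", "accuracy/mistranslation"),
                         ("minor", "fluency/punctuation")]))  # -> -5.1
```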