Commit fe834cd: Code refactor to add XCOMET

RicardoRei committed Oct 23, 2023
1 parent a31c2de commit fe834cd

Showing 22 changed files with 872 additions and 393 deletions.
4 changes: 4 additions & 0 deletions LICENSE.models.md
@@ -15,5 +15,9 @@ Starting at version 2.0 new models will be hosted on [Hugging Face Hub](https://
| [`Unbabel/wmt22-cometkiwi-da`](https://huggingface.co/Unbabel/wmt22-cometkiwi-da) | [CC-BY-NC-SA](https://huggingface.co/Unbabel/wmt22-cometkiwi-da/blob/main/LICENSE) |
| [`Unbabel/wmt23-cometkiwi-da-xl`](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xl) | [CC-BY-NC-SA](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xl/blob/main/LICENSE) |
| [`Unbabel/wmt23-cometkiwi-da-xxl`](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xxl) | [CC-BY-NC-SA](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xxl/blob/main/LICENSE) |
| [`Unbabel/unite-xl`](https://huggingface.co/Unbabel/unite-xl) | [CC-BY-NC-SA](https://huggingface.co/Unbabel/unite-xl/blob/main/LICENSE) |
| [`Unbabel/unite-xxl`](https://huggingface.co/Unbabel/unite-xxl) | [CC-BY-NC-SA](https://huggingface.co/Unbabel/unite-xxl/blob/main/LICENSE) |
| [`Unbabel/XCOMET-XL`](https://huggingface.co/Unbabel/XCOMET-XL) | [CC-BY-NC-SA](https://huggingface.co/Unbabel/XCOMET-XL/blob/main/LICENSE) |
| [`Unbabel/XCOMET-XXL`](https://huggingface.co/Unbabel/XCOMET-XXL) | [CC-BY-NC-SA](https://huggingface.co/Unbabel/XCOMET-XXL/blob/main/LICENSE) |

Please check the model licenses before using them.
3 changes: 3 additions & 0 deletions MODELS.md
@@ -4,6 +4,7 @@ Within COMET, there are several evaluation models available. The primary referen

- **Default Model:** [`Unbabel/wmt22-comet-da`](https://huggingface.co/Unbabel/wmt22-comet-da) - This model employs a reference-based regression approach and is built upon the XLM-R architecture. It has been trained on direct assessments from WMT17 to WMT20 and provides scores ranging from 0 to 1, where 1 signifies a perfect translation.
- **Reference-free Model:** [`Unbabel/wmt22-cometkiwi-da`](https://huggingface.co/Unbabel/wmt22-cometkiwi-da) - This reference-free model employs a regression approach and is built on top of InfoXLM. It has been trained using direct assessments from WMT17 to WMT20, as well as direct assessments from the MLQE-PE corpus. Similar to other models, it generates scores ranging from 0 to 1. For those interested, we also offer larger versions of this model: [`Unbabel/wmt23-cometkiwi-da-xl`](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xl) with 3.5 billion parameters and [`Unbabel/wmt23-cometkiwi-da-xxl`](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xxl) with 10.7 billion parameters.
- **eXplainable COMET (XCOMET):** [`Unbabel/XCOMET-XXL`](https://huggingface.co/Unbabel/XCOMET-XXL) - Our latest model is trained to identify error spans and assign a final quality score, resulting in an explainable neural metric. We offer this version in XXL with 10.7 billion parameters, as well as the XL variant with 3.5 billion parameters ([`Unbabel/XCOMET-XL`](https://huggingface.co/Unbabel/XCOMET-XL)). These models have demonstrated the highest correlation with MQM and are our best performing evaluation models.

If you intend to compare your results with papers published before 2022, it's likely that they used older evaluation models. In such cases, please refer to [Unbabel/wmt20-comet-da](https://huggingface.co/Unbabel/wmt20-comet-da) and [Unbabel/wmt20-comet-qe-da](https://huggingface.co/Unbabel/wmt20-comet-qe-da), which were the primary checkpoints used in previous versions (<2.0) of COMET.

@@ -13,6 +14,8 @@ If you intend to compare your results with papers published before 2022, it's li

- [`Unbabel/unite-mup`](https://huggingface.co/Unbabel/unite-mup) - This is the original UniTE Metric proposed in the [UniTE: Unified Translation Evaluation](https://aclanthology.org/2022.acl-long.558/) paper.
- [`Unbabel/wmt22-unite-da`](https://huggingface.co/Unbabel/wmt22-unite-da) - This model was trained for our paper [(Rei et al., ACL 2023)](https://aclanthology.org/2023.acl-short.94/) and uses the same data as [`Unbabel/wmt22-comet-da`](https://huggingface.co/Unbabel/wmt22-comet-da); thus, the output scores are between 0 and 1.
- [`Unbabel/unite-xxl`](https://huggingface.co/Unbabel/unite-xxl) - xCOMET models [(Guerreiro et al., 2023)](https://arxiv.org/pdf/2310.10482.pdf) are trained following a curriculum. The checkpoint resulting from the first phase of that curriculum (before the introduction of a sequence tagging task and MQM data) is a [UniTE model](https://aclanthology.org/2022.acl-long.558/). An [XL version](https://huggingface.co/Unbabel/unite-xl) is also available.


## Older Models:

40 changes: 34 additions & 6 deletions README.md
@@ -8,7 +8,11 @@
<a href="https://github.com/psf/black"><img alt="Code Style" src="https://img.shields.io/badge/code%20style-black-black" /></a>
</p>

**NEWS:** We release [CometKiwi -XL (3.5B)](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xl) and [-XXL (10.7B)](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xxl) QE models. These models were the best performing QE models on the WMT23 QE shared task. Please check all available models [here](https://github.com/Unbabel/COMET/blob/master/MODELS.md)
**NEWS:**
1) We released our new eXplainable COMET models ([XCOMET-XL](https://huggingface.co/Unbabel/XCOMET-XL) and [-XXL](https://huggingface.co/Unbabel/XCOMET-XXL)), which, along with quality scores, detect which errors in the translation are minor, major, or critical according to the MQM typology.
2) We released [CometKiwi-XL (3.5B)](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xl) and [-XXL (10.7B)](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xxl) QE models. These were the best performing QE models on the WMT23 QE shared task.

Please check all available models [here](https://github.com/Unbabel/COMET/blob/master/MODELS.md).

# Quick Installation

@@ -55,6 +59,12 @@ comet-score -s src.txt -t hyp1.txt -r ref.txt
```
> You can set the number of GPUs using `--gpus` (0 to test on CPU).

For better error analysis, you can use XCOMET models such as [`Unbabel/XCOMET-XL`](https://huggingface.co/Unbabel/XCOMET-XL); the identified errors can be exported with the `--to_json` flag:

```bash
comet-score -s src.txt -t hyp1.txt -r ref.txt --model Unbabel/XCOMET-XL --to_json output.json
```
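
If you want to post-process those errors, the JSON written by `--to_json` maps each translation file to its list of segments, with each segment carrying its `COMET` score and, for error-span models, an `errors` list. This layout is assumed from this commit's `comet/cli/score.py` (see the diff below), not from a documented API guarantee. A minimal sketch:

```python
import json

# Sketch: read the file written by --to_json. The layout (translation file
# -> list of segments with "COMET" and "errors" keys) is assumed from this
# commit's score.py.
with open("output.json") as f:
    results = json.load(f)

for filename, segments in results.items():
    for i, seg in enumerate(segments):
        print(f"{filename}\tsegment {i}\tscore: {seg['COMET']:.4f}")
        for error in seg.get("errors", []):
            print(f"  error span: {error}")
```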

Scoring multiple systems:
```bash
comet-score -s src.txt -t hyp1.txt hyp2.txt -r ref.txt
@@ -113,6 +123,9 @@ Within COMET, there are several evaluation models available. You can refer to th

- **Default Model:** [`Unbabel/wmt22-comet-da`](https://huggingface.co/Unbabel/wmt22-comet-da) - This model employs a reference-based regression approach and is built upon the XLM-R architecture. It has been trained on direct assessments from WMT17 to WMT20 and provides scores ranging from 0 to 1, where 1 signifies a perfect translation.
- **Reference-free Model:** [`Unbabel/wmt22-cometkiwi-da`](https://huggingface.co/Unbabel/wmt22-cometkiwi-da) - This reference-free model employs a regression approach and is built on top of InfoXLM. It has been trained using direct assessments from WMT17 to WMT20, as well as direct assessments from the MLQE-PE corpus. Similar to other models, it generates scores ranging from 0 to 1. For those interested, we also offer larger versions of this model: [`Unbabel/wmt23-cometkiwi-da-xl`](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xl) with 3.5 billion parameters and [`Unbabel/wmt23-cometkiwi-da-xxl`](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xxl) with 10.7 billion parameters.
- **eXplainable COMET (XCOMET):** [`Unbabel/XCOMET-XXL`](https://huggingface.co/Unbabel/XCOMET-XXL) - Our latest model is trained to identify error spans and assign a final quality score, resulting in an explainable neural metric. We offer this version in XXL with 10.7 billion parameters, as well as the XL variant with 3.5 billion parameters ([`Unbabel/XCOMET-XL`](https://huggingface.co/Unbabel/XCOMET-XL)). These models have demonstrated the highest correlation with MQM and are our best performing evaluation models.

Please be aware that different models may be subject to varying licenses. To learn more, kindly refer to the [MODELS](MODELS.md) and model licenses sections.

If you intend to compare your results with papers published before 2022, it's likely that they used older evaluation models. In such cases, please refer to [`Unbabel/wmt20-comet-da`](https://huggingface.co/Unbabel/wmt20-comet-da) and [`Unbabel/wmt20-comet-qe-da`](https://huggingface.co/Unbabel/wmt20-comet-qe-da), which were the primary checkpoints used in previous versions (<2.0) of COMET.

@@ -124,15 +137,15 @@ When using COMET to evaluate machine translation, it's important to understand h

In general, COMET models are trained to predict quality scores for translations. These scores are typically normalized using a [z-score transformation](https://simplypsychology.org/z-score.html) to account for individual differences among annotators. While the raw score itself does not have a direct interpretation, it is useful for ranking translations and systems according to their quality.
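
As a quick refresher, the z-score transformation standardizes each raw annotation $x$ using that annotator's mean $\mu$ and standard deviation $\sigma$:

$$z = \frac{x - \mu}{\sigma}$$

This puts all annotators on a comparable scale, even if some are systematically harsher than others.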

However, for the latest COMET models like [`Unbabel/wmt22-comet-da`](https://huggingface.co/Unbabel/wmt22-comet-da), we have introduced a new training approach that scales the scores between 0 and 1. This makes it easier to interpret the scores: a score close to 1 indicates a high-quality translation, while a score close to 0 indicates a translation that is no better than random chance.
However, since 2022 we have introduced a new training approach that scales the scores between 0 and 1. This makes it easier to interpret the scores: a score close to 1 indicates a high-quality translation, while a score close to 0 indicates a translation that is no better than random chance. Also, with the introduction of the XCOMET models, we can now analyse which text spans are part of minor, major, or critical errors according to the MQM typology.

It's worth noting that when using COMET to compare the performance of two different translation systems, it's important to run the `comet-compare` command to obtain statistical significance measures. This command compares the output of two systems using a statistical hypothesis test, providing an estimate of the probability that the observed difference in scores between the systems is due to chance. This is an important step to ensure that any differences in scores between systems are statistically significant.

Overall, the added interpretability of scores in the latest COMET models, combined with the ability to assess statistical significance between systems using `comet-compare`, make COMET a valuable tool for evaluating machine translation.

## Languages Covered:

All the above-mentioned models are built on top of XLM-R, which covers the following languages:
All the above-mentioned models are built on top of XLM-R (variants), which covers the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

@@ -143,8 +156,15 @@ Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, B
```python
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
# Choose your model from Hugging Face Hub
model_path = download_model("Unbabel/XCOMET-XXL")
# or for example:
# model_path = download_model("Unbabel/wmt22-comet-da")

# Load the model checkpoint:
model = load_from_checkpoint(model_path)

# Data must be in the following format:
data = [
    {
        "src": "10 到 15 分钟可以送到吗",
@@ -157,8 +177,14 @@ data = [
"ref": "Can it be delivered between 10 to 15 minutes?"
}
]
# Call predict method:
model_output = model.predict(data, batch_size=8, gpus=1)
print(model_output.to_tuple())
print(model_output)
print(model_output.scores) # sentence-level scores
print(model_output.system_score) # system-level score

# Not all COMET models return metadata with detected errors.
print(model_output.metadata.error_spans) # detected error spans
```
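
When scoring with an XCOMET checkpoint, each entry of `model_output.metadata.error_spans` holds the spans detected for the corresponding input segment. A minimal sketch of inspecting them, assuming each span is a dict with at least `text` and `severity` fields (the exact field names may vary between model versions):

```python
# Sketch: walk the detected error spans next to their segments.
# Assumes span dicts expose "text" and "severity"; field names are
# not guaranteed across model versions.
for segment, spans in zip(data, model_output.metadata.error_spans):
    print(f"MT: {segment['mt']}")
    for span in spans:
        print(f"  [{span['severity']}] {span['text']}")
```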

# Train your own Metric:
@@ -181,7 +207,7 @@ In order to run the toolkit tests you must run the following command:

```bash
poetry run coverage run --source=comet -m unittest discover
poetry run coverage report -m # Expected coverage 80%
poetry run coverage report -m # Expected coverage 76%
```

**Note:** Testing on CPU takes a long time.
@@ -190,6 +216,8 @@

If you use COMET please cite our work **and don't forget to say which model you used!**

- [xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection](https://arxiv.org/pdf/2310.10482.pdf)

- [Scaling up CometKiwi: Unbabel-IST 2023 Submission for the Quality Estimation Shared Task](https://arxiv.org/pdf/2309.11925.pdf)

- [CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task](https://aclanthology.org/2022.wmt-1.60/)
2 changes: 1 addition & 1 deletion comet/__init__.py
@@ -22,5 +22,5 @@
logger = logging.getLogger(__name__)


__version__ = "2.1.1"
__version__ = "2.2.0"
__copyright__ = "2020 Unbabel. All rights reserved."
14 changes: 13 additions & 1 deletion comet/cli/score.py
@@ -199,24 +199,31 @@ def score_command() -> None:
        length_batching=(not cfg.disable_length_batching),
    )
    seg_scores = outputs.scores
    if "metadata" in outputs and "error_spans" in outputs.metadata:
        errors = outputs.metadata.error_spans
    else:
        errors = []

    if len(cfg.translations) > 1:
        seg_scores = np.array_split(seg_scores, len(cfg.translations))
        sys_scores = [sum(split) / len(split) for split in seg_scores]
        data = np.array_split(data, len(cfg.translations))
        errors = np.array_split(outputs.metadata.errors, len(cfg.translations))
    else:
        sys_scores = [
            outputs.system_score,
        ]
        seg_scores = [
            seg_scores,
        ]
        errors = [errors, ]
        data = [
            np.array(data),
        ]
else:
    # If not using Multiple GPUs we will score each system independently
    # to maximize cache hits!
    seg_scores, sys_scores = [], []
    seg_scores, sys_scores, errors = [], [], []
    new_data = []
    for i in range(len(cfg.translations)):
        sys_data = {k: v[i] for k, v in data.items()}
@@ -233,13 +240,18 @@
        )
        seg_scores.append(outputs.scores)
        sys_scores.append(outputs.system_score)
        if "metadata" in outputs and "error_spans" in outputs.metadata:
            errors.append(outputs.metadata.error_spans)
    data = new_data

files = [path_fr.rel_path for path_fr in cfg.translations]
data = {file: system_data.tolist() for file, system_data in zip(files, data)}
for i in range(len(data[files[0]])):  # loop over (src, ref)
    for j in range(len(files)):  # loop over systems
        data[files[j]][i]["COMET"] = seg_scores[j][i]
        if errors:
            data[files[j]][i]["errors"] = errors[j][i]

if not cfg.only_system:
    print(
        "{}\tSegment {}\tscore: {:.4f}".format(
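
Putting the diff together, each segment entry that `score_command` writes out should end up looking roughly like this (an illustrative sketch: the `COMET` value is made up, and the fields inside `errors` come straight from the model's `metadata.error_spans`, so they depend on the checkpoint):

```python
# Illustrative shape of one scored segment; values are made up and the
# "errors" fields depend on the model checkpoint.
segment_entry = {
    "src": "10 到 15 分钟可以送到吗",
    "mt": "Can I receive my food in 10 to 15 minutes?",
    "ref": "Can it be delivered between 10 to 15 minutes?",
    "COMET": 0.8421,  # segment-level score
    "errors": [  # only present for error-span models such as XCOMET
        {"text": "receive my food", "severity": "major"},
    ],
}
```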