Commit d804922
replace load_metric with evaluate.load (#285)
* update `load_metric` refs to `evaluate.load`

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
lvwerra and lewtun authored Jul 21, 2022
1 parent e46ab85 commit d804922
Showing 39 changed files with 148 additions and 136 deletions.
8 changes: 4 additions & 4 deletions chapters/de/chapter3/3.mdx
@@ -110,12 +110,12 @@ import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
```

-Jetzt können wir diese Vorhersagen in `preds` mit den Labels vergleichen. Wir greifen auf die Metriken aus der 🤗 Bibliothek Datasets zurück, um unsere Funktion `compute_metric()` zu erstellen. Die mit dem MRPC-Datensatz verbundenen Metriken können genauso einfach geladen werden, wie wir den Datensatz geladen haben, diesmal mit der Funktion `load_metric()`. Das zurückgegebene Objekt verfügt über eine Berechnungsmethode, mit der wir die Metrik auswerten können:
+Jetzt können wir diese Vorhersagen in `preds` mit den Labels vergleichen. Wir greifen auf die Metriken aus der 🤗 Bibliothek [Evaluate](https://github.com/huggingface/evaluate/) zurück, um unsere Funktion `compute_metric()` zu erstellen. Die mit dem MRPC-Datensatz verbundenen Metriken können genauso einfach geladen werden, wie wir den Datensatz geladen haben, diesmal mit der Funktion `evaluate.load()`. Das zurückgegebene Objekt verfügt über eine Berechnungsmethode, mit der wir die Metrik auswerten können:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
```

Expand All @@ -129,7 +129,7 @@ Zusammenfassend ergibt das unsere Funktion `compute_metrics()`:

```py
def compute_metrics(eval_preds):
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
6 changes: 3 additions & 3 deletions chapters/de/chapter3/3_tf.mdx
@@ -172,12 +172,12 @@ print(preds.shape, class_preds.shape)
(408, 2) (408,)
```

-Nun können wir diese Vorhersagen in `preds` nutzen, um einige Metriken zu berechnen! Wir können die Metriken, die mit dem MRPC-Datensatz verbunden sind, genauso einfach laden, wie wir den Datensatz geladen haben, in diesem Fall mit der Funktion "load_metric()". Das zurückgegebene Objekt verfügt über eine Berechnungsmethode, mit der wir die Metrik berechnen können:
+Nun können wir diese Vorhersagen in `preds` nutzen, um einige Metriken zu berechnen! Wir können die Metriken, die mit dem MRPC-Datensatz verbunden sind, genauso einfach laden, wie wir den Datensatz geladen haben, in diesem Fall mit der Funktion "evaluate.load()". Das zurückgegebene Objekt verfügt über eine Berechnungsmethode, mit der wir die Metrik berechnen können:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])
```

6 changes: 3 additions & 3 deletions chapters/de/chapter3/4.mdx
@@ -171,12 +171,12 @@ Der Kern der Trainingsschleife sieht ähnlich aus wie in der Einleitung. Da wir

### Die Evaluationsschleife

-Wie schon zuvor verwenden wir eine Metrik, die von der 🤗 Datasets-Bibliothek bereitgestellt wird. Wir haben bereits die Methode `metric.compute()` gesehen, aber Metriken können auch Batches für uns akkumulieren, wenn wir die Vorhersageschleife mit der Methode `add_batch()` durchlaufen. Sobald wir alle Batches gesammelt haben, können wir das Endergebnis mit der Methode `metric.compute()` ermitteln. So implementierst du all das in eine Evaluationsschleife:
+Wie schon zuvor verwenden wir eine Metrik, die von der 🤗 Evaluate-Bibliothek bereitgestellt wird. Wir haben bereits die Methode `metric.compute()` gesehen, aber Metriken können auch Batches für uns akkumulieren, wenn wir die Vorhersageschleife mit der Methode `add_batch()` durchlaufen. Sobald wir alle Batches gesammelt haben, können wir das Endergebnis mit der Methode `metric.compute()` ermitteln. So implementierst du all das in eine Evaluationsschleife:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
8 changes: 4 additions & 4 deletions chapters/en/chapter3/3.mdx
@@ -110,12 +110,12 @@ import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
```

-We can now compare those `preds` to the labels. To build our `compute_metric()` function, we will rely on the metrics from the 🤗 Datasets library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `load_metric()` function. The object returned has a `compute()` method we can use to do the metric calculation:
+We can now compare those `preds` to the labels. To build our `compute_metric()` function, we will rely on the metrics from the 🤗 [Evaluate](https://github.com/huggingface/evaluate/) library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `evaluate.load()` function. The object returned has a `compute()` method we can use to do the metric calculation:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
```
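A minimal standalone sketch of the new API, assuming `evaluate` is installed and the GLUE metric files can be downloaded; for MRPC the returned dictionary contains accuracy and F1:

```py
import evaluate

# Toy sanity check with made-up predictions; "glue"/"mrpc" reports accuracy and F1.
metric = evaluate.load("glue", "mrpc")
print(metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
# {'accuracy': 0.75, 'f1': 0.6666666666666666}
```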

Expand All @@ -129,7 +129,7 @@ Wrapping everything together, we get our `compute_metrics()` function:

```py
def compute_metrics(eval_preds):
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
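The hunk is truncated here; for context, a sketch of how `compute_metrics()` is typically passed to the `Trainer` in this chapter (the `model`, `tokenizer`, `tokenized_datasets`, and `data_collator` objects are assumed to be defined as earlier in the chapter):

```py
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # the function defined above
)
trainer.train()
```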
6 changes: 3 additions & 3 deletions chapters/en/chapter3/3_tf.mdx
@@ -181,12 +181,12 @@ print(preds.shape, class_preds.shape)
(408, 2) (408,)
```

-Now, let's use those `preds` to compute some metrics! We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `load_metric()` function. The object returned has a `compute()` method we can use to do the metric calculation:
+Now, let's use those `preds` to compute some metrics! We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `evaluate.load()` function. The object returned has a `compute()` method we can use to do the metric calculation:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])
```
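For context, a short sketch of how `class_preds` is typically produced just above this hunk in the TF version of the chapter (assumes `model` and `tf_validation_dataset` exist as defined there):

```py
import numpy as np

# Predict logits over the validation set, then take the argmax per example.
preds = model.predict(tf_validation_dataset)["logits"]
class_preds = np.argmax(preds, axis=1)
print(preds.shape, class_preds.shape)
```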

6 changes: 3 additions & 3 deletions chapters/en/chapter3/4.mdx
@@ -172,12 +172,12 @@ You can see that the core of the training loop looks a lot like the one in the i

### The evaluation loop

-As we did earlier, we will use a metric provided by the 🤗 Datasets library. We've already seen the `metric.compute()` method, but metrics can actually accumulate batches for us as we go over the prediction loop with the method `add_batch()`. Once we have accumulated all the batches, we can get the final result with `metric.compute()`. Here's how to implement all of this in an evaluation loop:
+As we did earlier, we will use a metric provided by the 🤗 Evaluate library. We've already seen the `metric.compute()` method, but metrics can actually accumulate batches for us as we go over the prediction loop with the method `add_batch()`. Once we have accumulated all the batches, we can get the final result with `metric.compute()`. Here's how to implement all of this in an evaluation loop:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
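The hunk stops mid-loop; a minimal sketch of how the evaluation loop usually continues, accumulating batches with `add_batch()` and finishing with `compute()` (`model`, `eval_dataloader`, and `device` are assumed from the surrounding code):

```py
import torch

model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
```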
8 changes: 4 additions & 4 deletions chapters/en/chapter7/2.mdx
@@ -522,7 +522,7 @@ The traditional framework used to evaluate token classification prediction is [*
!pip install seqeval
```

-We can then load it via the `load_metric()` function like we did in [Chapter 3](/course/chapter3):
+We can then load it via the `evaluate.load()` function like we did in [Chapter 3](/course/chapter3):

{:else}

Expand All @@ -532,14 +532,14 @@ The traditional framework used to evaluate token classification prediction is [*
!pip install seqeval
```

-We can then load it via the `load_metric()` function like we did in [Chapter 3](/course/chapter3):
+We can then load it via the `evaluate.load()` function like we did in [Chapter 3](/course/chapter3):

{/if}

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("seqeval")
+metric = evaluate.load("seqeval")
```

This metric does not behave like the standard accuracy: it will actually take the lists of labels as strings, not integers, so we will need to fully decode the predictions and labels before passing them to the metric. Let's see how it works. First, we'll get the labels for our first training example:
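A tiny illustration of that behaviour, with made-up tag sequences: both predictions and references are lists of lists of string labels, and the result contains per-entity scores plus overall precision, recall, F1, and accuracy.

```py
import evaluate

metric = evaluate.load("seqeval")
# Hypothetical tag sequences for a single example.
predictions = [["O", "B-PER", "I-PER", "O"]]
references = [["O", "B-PER", "I-PER", "B-LOC"]]
print(metric.compute(predictions=predictions, references=references))
```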
8 changes: 4 additions & 4 deletions chapters/en/chapter7/4.mdx
@@ -53,7 +53,7 @@ To fine-tune or train a translation model from scratch, we will need a dataset s
As usual, we download our dataset using the `load_dataset()` function:

```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset

raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")
```
@@ -428,12 +428,12 @@ One weakness with BLEU is that it expects the text to already be tokenized, whic
!pip install sacrebleu
```

-We can then load it via `load_metric()` like we did in [Chapter 3](/course/chapter3):
+We can then load it via `evaluate.load()` like we did in [Chapter 3](/course/chapter3):

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("sacrebleu")
+metric = evaluate.load("sacrebleu")
```

This metric will take texts as inputs and targets. It is designed to accept several acceptable targets, as there are often multiple acceptable translations of the same sentence -- the dataset we're using only provides one, but it's not uncommon in NLP to find datasets that give several sentences as labels. So, the predictions should be a list of sentences, but the references should be a list of lists of sentences.
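A small illustration of that format with made-up sentences: one string per prediction, and a list of one or more reference strings for each prediction.

```py
import evaluate

metric = evaluate.load("sacrebleu")
# Hypothetical prediction and single reference translation.
predictions = ["This plugin lets you translate web pages between several languages automatically."]
references = [
    ["This plugin allows you to automatically translate web pages between several languages."]
]
print(metric.compute(predictions=predictions, references=references)["score"])
```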
4 changes: 2 additions & 2 deletions chapters/en/chapter7/5.mdx
@@ -352,9 +352,9 @@ Applying this to our verbose summary gives a precision of 6/10 = 0.6, which is
and then loading the ROUGE metric as follows:

```python
-from datasets import load_metric
+import evaluate

-rouge_score = load_metric("rouge")
+rouge_score = evaluate.load("rouge")
```

Then we can use the `rouge_score.compute()` function to calculate all the metrics at once:
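The hunk ends before that call; a minimal sketch of it with made-up summary strings (the exact structure of the returned scores depends on the version of the ROUGE backend):

```python
generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"
scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)
print(scores)
```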
6 changes: 3 additions & 3 deletions chapters/en/chapter7/7.mdx
@@ -670,12 +670,12 @@ for example in small_eval_set:
predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})
```

-The final format of the predicted answers is the one that will be expected by the metric we will use. As usual, we can load it with the help of the 🤗 Datasets library:
+The final format of the predicted answers is the one that will be expected by the metric we will use. As usual, we can load it with the help of the 🤗 Evaluate library:

```python
-from datasets import load_metric
+import evaluate

-metric = load_metric("squad")
+metric = evaluate.load("squad")
```

This metric expects the predicted answers in the format we saw above (a list of dictionaries with one key for the ID of the example and one key for the predicted text) and the theoretical answers in the format below (a list of dictionaries with one key for the ID of the example and one key for the possible answers):
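For illustration, a sketch of the two formats side by side (the ID, answer text, and offset below are hypothetical):

```python
# Predicted answers: one dict per example with the example ID and the predicted text.
predicted_answers = [{"id": "0001", "prediction_text": "Denver Broncos"}]
# Theoretical answers: one dict per example with the example ID and the possible answers.
theoretical_answers = [
    {"id": "0001", "answers": {"text": ["Denver Broncos"], "answer_start": [177]}}
]
metric.compute(predictions=predicted_answers, references=theoretical_answers)
```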
25 changes: 15 additions & 10 deletions chapters/en/chapter8/4.mdx
@@ -22,7 +22,8 @@ The best way to debug an error that arises in `trainer.train()` is to manually g
To demonstrate this, we will use the following script that (tries to) fine-tune a DistilBERT model on the [MNLI dataset](https://huggingface.co/datasets/glue):

```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
@@ -52,7 +53,7 @@ args = TrainingArguments(
weight_decay=0.01,
)

-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")


def compute_metrics(eval_pred):
@@ -98,7 +99,8 @@ Do you notice something wrong? This, in conjunction with the error message about
Why wasn't the data processed? We did use the `Dataset.map()` method on the datasets to apply the tokenizer on each sample. But if you look closely at the code, you will see that we made a mistake when passing the training and evaluation sets to the `Trainer`. Instead of using `tokenized_datasets` here, we used `raw_datasets` 🤦. So let's fix this!

```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
@@ -128,7 +130,7 @@ args = TrainingArguments(
weight_decay=0.01,
)

-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")


def compute_metrics(eval_pred):
@@ -291,7 +293,8 @@ So this is the `default_data_collator`, but that's not what we want in this case
The answer is because we did not pass the `tokenizer` to the `Trainer`, so it couldn't create the `DataCollatorWithPadding` we want. In practice, you should never hesitate to explicitly pass along the data collator you want to use, to make sure you avoid these kinds of errors. Let's adapt our code to do exactly that:

```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
@@ -322,7 +325,7 @@ args = TrainingArguments(
weight_decay=0.01,
)

-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")


def compute_metrics(eval_pred):
@@ -416,7 +419,8 @@ trainer.model.config.num_labels
With two labels, only 0s and 1s are allowed as targets, but according to the error message we got a 2. Getting a 2 is actually normal: if we remember the label names we extracted earlier, there were three, so we have indices 0, 1, and 2 in our dataset. The problem is that we didn't tell that to our model, which should have been created with three labels. So let's fix that!

```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
@@ -447,7 +451,7 @@ args = TrainingArguments(
weight_decay=0.01,
)

-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")


def compute_metrics(eval_pred):
@@ -626,7 +630,8 @@ For reference, here is the completely fixed script:
```py
import numpy as np
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
@@ -657,7 +662,7 @@ args = TrainingArguments(
weight_decay=0.01,
)
-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")
def compute_metrics(eval_pred):
3 changes: 2 additions & 1 deletion chapters/en/chapter8/4_tf.mdx
@@ -22,7 +22,8 @@ The best way to debug an error that arises in `model.fit()` is to manually go th
To demonstrate this, we will use the following script that (tries to) fine-tune a DistilBERT model on the [MNLI dataset](https://huggingface.co/datasets/glue):

```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
from transformers import (
AutoTokenizer,
TFAutoModelForSequenceClassification,
6 changes: 3 additions & 3 deletions chapters/es/chapter3/4.mdx
@@ -171,12 +171,12 @@ Puedes ver que la parte central del bucle de entrenamiento luce bastante como el

### El bucle de evaluación

-Como lo hicimos anteriormente, usaremos una métrica ofrecida por la libreria Datasets 🤗. Ya hemos visto el método `metric.compute()`, pero de hecho las métricas se pueden acumular sobre los lotes a medida que avanzamos en el bucle de predicción con el método `add_batch()`. Una vez que hemos acumulado todos los lotes, podemos obtener el resultado final con `metric.compute()`. Aquí se muestra como se puede implementar en un bucle de evaluación:
+Como lo hicimos anteriormente, usaremos una métrica ofrecida por la libreria 🤗 Evaluate. Ya hemos visto el método `metric.compute()`, pero de hecho las métricas se pueden acumular sobre los lotes a medida que avanzamos en el bucle de predicción con el método `add_batch()`. Una vez que hemos acumulado todos los lotes, podemos obtener el resultado final con `metric.compute()`. Aquí se muestra como se puede implementar en un bucle de evaluación:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
8 changes: 4 additions & 4 deletions chapters/fr/chapter3/3.mdx
@@ -110,12 +110,12 @@ import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
```

-Nous pouvons maintenant comparer ces `preds` aux étiquettes. Pour construire notre fonction `compute_metric()`, nous allons nous appuyer sur les métriques de la bibliothèque 🤗 *Datasets*. Nous pouvons charger les métriques associées au jeu de données MRPC aussi facilement que nous avons chargé le jeu de données, cette fois avec la fonction `load_metric()`. L'objet retourné possède une méthode `compute()` que nous pouvons utiliser pour effectuer le calcul de la métrique :
+Nous pouvons maintenant comparer ces `preds` aux étiquettes. Pour construire notre fonction `compute_metric()`, nous allons nous appuyer sur les métriques de la bibliothèque 🤗 [*Evaluate*](https://github.com/huggingface/evaluate/). Nous pouvons charger les métriques associées au jeu de données MRPC aussi facilement que nous avons chargé le jeu de données, cette fois avec la fonction `evaluate.load()`. L'objet retourné possède une méthode `compute()` que nous pouvons utiliser pour effectuer le calcul de la métrique :

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
```

Expand All @@ -129,7 +129,7 @@ En regroupant le tout, nous obtenons notre fonction `compute_metrics()` :

```py
def compute_metrics(eval_preds):
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
6 changes: 3 additions & 3 deletions chapters/fr/chapter3/3_tf.mdx
@@ -172,12 +172,12 @@ print(preds.shape, class_preds.shape)
(408, 2) (408,)
```

-Maintenant, utilisons ces `preds` pour calculer des métriques ! Nous pouvons charger les métriques associées au jeu de données MRPC aussi facilement que nous avons chargé le jeu de données, cette fois avec la fonction `load_metric()`. L'objet retourné a une méthode `compute()` que nous pouvons utiliser pour faire le calcul de la métrique :
+Maintenant, utilisons ces `preds` pour calculer des métriques ! Nous pouvons charger les métriques associées au jeu de données MRPC aussi facilement que nous avons chargé le jeu de données, cette fois avec la fonction `evaluate.load()`. L'objet retourné a une méthode `compute()` que nous pouvons utiliser pour faire le calcul de la métrique :

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])
```


0 comments on commit d804922
