Commit d804922
replace load_metric with evaluate.load (#285)
* update `load_metric` refs to `evaluate.load`

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
lvwerra and lewtun authored Jul 21, 2022
1 parent e46ab85 commit d804922
Showing 39 changed files with 148 additions and 136 deletions.
8 changes: 4 additions & 4 deletions chapters/de/chapter3/3.mdx
@@ -110,12 +110,12 @@ import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
```

-Jetzt können wir diese Vorhersagen in `preds` mit den Labels vergleichen. Wir greifen auf die Metriken aus der 🤗 Bibliothek Datasets zurück, um unsere Funktion `compute_metric()` zu erstellen. Die mit dem MRPC-Datensatz verbundenen Metriken können genauso einfach geladen werden, wie wir den Datensatz geladen haben, diesmal mit der Funktion `load_metric()`. Das zurückgegebene Objekt verfügt über eine Berechnungsmethode, mit der wir die Metrik auswerten können:
+Jetzt können wir diese Vorhersagen in `preds` mit den Labels vergleichen. Wir greifen auf die Metriken aus der 🤗 Bibliothek [Evaluate](https://github.com/huggingface/evaluate/) zurück, um unsere Funktion `compute_metric()` zu erstellen. Die mit dem MRPC-Datensatz verbundenen Metriken können genauso einfach geladen werden, wie wir den Datensatz geladen haben, diesmal mit der Funktion `evaluate.load()`. Das zurückgegebene Objekt verfügt über eine Berechnungsmethode, mit der wir die Metrik auswerten können:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
```

Expand All @@ -129,7 +129,7 @@ Zusammenfassend ergibt das unsere Funktion `compute_metrics()`:

```py
def compute_metrics(eval_preds):
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
6 changes: 3 additions & 3 deletions chapters/de/chapter3/3_tf.mdx
@@ -172,12 +172,12 @@ print(preds.shape, class_preds.shape)
(408, 2) (408,)
```

-Nun können wir diese Vorhersagen in `preds` nutzen, um einige Metriken zu berechnen! Wir können die Metriken, die mit dem MRPC-Datensatz verbunden sind, genauso einfach laden, wie wir den Datensatz geladen haben, in diesem Fall mit der Funktion "load_metric()". Das zurückgegebene Objekt verfügt über eine Berechnungsmethode, mit der wir die Metrik berechnen können:
+Nun können wir diese Vorhersagen in `preds` nutzen, um einige Metriken zu berechnen! Wir können die Metriken, die mit dem MRPC-Datensatz verbunden sind, genauso einfach laden, wie wir den Datensatz geladen haben, in diesem Fall mit der Funktion "evaluate.load()". Das zurückgegebene Objekt verfügt über eine Berechnungsmethode, mit der wir die Metrik berechnen können:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])
```

6 changes: 3 additions & 3 deletions chapters/de/chapter3/4.mdx
@@ -171,12 +171,12 @@ Der Kern der Trainingsschleife sieht ähnlich aus wie in der Einleitung. Da wir

### Die Evaluationsschleife

-Wie schon zuvor verwenden wir eine Metrik, die von der 🤗 Datasets-Bibliothek bereitgestellt wird. Wir haben bereits die Methode `metric.compute()` gesehen, aber Metriken können auch Batches für uns akkumulieren, wenn wir die Vorhersageschleife mit der Methode `add_batch()` durchlaufen. Sobald wir alle Batches gesammelt haben, können wir das Endergebnis mit der Methode `metric.compute()` ermitteln. So implementierst du all das in eine Evaluationsschleife:
+Wie schon zuvor verwenden wir eine Metrik, die von der 🤗 Evaluate-Bibliothek bereitgestellt wird. Wir haben bereits die Methode `metric.compute()` gesehen, aber Metriken können auch Batches für uns akkumulieren, wenn wir die Vorhersageschleife mit der Methode `add_batch()` durchlaufen. Sobald wir alle Batches gesammelt haben, können wir das Endergebnis mit der Methode `metric.compute()` ermitteln. So implementierst du all das in eine Evaluationsschleife:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
8 changes: 4 additions & 4 deletions chapters/en/chapter3/3.mdx
@@ -110,12 +110,12 @@ import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
```

-We can now compare those `preds` to the labels. To build our `compute_metric()` function, we will rely on the metrics from the 🤗 Datasets library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `load_metric()` function. The object returned has a `compute()` method we can use to do the metric calculation:
+We can now compare those `preds` to the labels. To build our `compute_metric()` function, we will rely on the metrics from the 🤗 [Evaluate](https://github.com/huggingface/evaluate/) library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `evaluate.load()` function. The object returned has a `compute()` method we can use to do the metric calculation:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
```
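A minimal standalone sketch of the new API, assuming `evaluate` is installed and the GLUE metric files can be downloaded; for MRPC the returned dictionary contains accuracy and F1:

```py
import evaluate

# Toy sanity check with made-up predictions; "glue"/"mrpc" reports accuracy and F1.
metric = evaluate.load("glue", "mrpc")
print(metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
# {'accuracy': 0.75, 'f1': 0.6666666666666666}
```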

Expand All @@ -129,7 +129,7 @@ Wrapping everything together, we get our `compute_metrics()` function:

```py
def compute_metrics(eval_preds):
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
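The hunk is truncated here; for context, a sketch of how `compute_metrics()` is typically passed to the `Trainer` in this chapter (the `model`, `tokenizer`, `tokenized_datasets`, and `data_collator` objects are assumed to be defined as earlier in the chapter):

```py
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # the function defined above
)
trainer.train()
```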
6 changes: 3 additions & 3 deletions chapters/en/chapter3/3_tf.mdx
@@ -181,12 +181,12 @@ print(preds.shape, class_preds.shape)
(408, 2) (408,)
```

-Now, let's use those `preds` to compute some metrics! We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `load_metric()` function. The object returned has a `compute()` method we can use to do the metric calculation:
+Now, let's use those `preds` to compute some metrics! We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `evaluate.load()` function. The object returned has a `compute()` method we can use to do the metric calculation:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])
```
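For context, a short sketch of how `class_preds` is typically produced just above this hunk in the TF version of the chapter (assumes `model` and `tf_validation_dataset` exist as defined there):

```py
import numpy as np

# Predict logits over the validation set, then take the argmax per example.
preds = model.predict(tf_validation_dataset)["logits"]
class_preds = np.argmax(preds, axis=1)
print(preds.shape, class_preds.shape)
```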

6 changes: 3 additions & 3 deletions chapters/en/chapter3/4.mdx
@@ -172,12 +172,12 @@ You can see that the core of the training loop looks a lot like the one in the i

### The evaluation loop

-As we did earlier, we will use a metric provided by the 🤗 Datasets library. We've already seen the `metric.compute()` method, but metrics can actually accumulate batches for us as we go over the prediction loop with the method `add_batch()`. Once we have accumulated all the batches, we can get the final result with `metric.compute()`. Here's how to implement all of this in an evaluation loop:
+As we did earlier, we will use a metric provided by the 🤗 Evaluate library. We've already seen the `metric.compute()` method, but metrics can actually accumulate batches for us as we go over the prediction loop with the method `add_batch()`. Once we have accumulated all the batches, we can get the final result with `metric.compute()`. Here's how to implement all of this in an evaluation loop:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
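The hunk stops mid-loop; a minimal sketch of how the evaluation loop usually continues, accumulating batches with `add_batch()` and finishing with `compute()` (`model`, `eval_dataloader`, and `device` are assumed from the surrounding code):

```py
import torch

model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
```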
8 changes: 4 additions & 4 deletions chapters/en/chapter7/2.mdx
@@ -522,7 +522,7 @@ The traditional framework used to evaluate token classification prediction is [*
!pip install seqeval
```

-We can then load it via the `load_metric()` function like we did in [Chapter 3](/course/chapter3):
+We can then load it via the `evaluate.load()` function like we did in [Chapter 3](/course/chapter3):

{:else}

Expand All @@ -532,14 +532,14 @@ The traditional framework used to evaluate token classification prediction is [*
!pip install seqeval
```

-We can then load it via the `load_metric()` function like we did in [Chapter 3](/course/chapter3):
+We can then load it via the `evaluate.load()` function like we did in [Chapter 3](/course/chapter3):

{/if}

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("seqeval")
+metric = evaluate.load("seqeval")
```

This metric does not behave like the standard accuracy: it will actually take the lists of labels as strings, not integers, so we will need to fully decode the predictions and labels before passing them to the metric. Let's see how it works. First, we'll get the labels for our first training example:
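A tiny illustration of that behaviour, with made-up tag sequences: both predictions and references are lists of lists of string labels, and the result contains per-entity scores plus overall precision, recall, F1, and accuracy.

```py
import evaluate

metric = evaluate.load("seqeval")
# Hypothetical tag sequences for a single example.
predictions = [["O", "B-PER", "I-PER", "O"]]
references = [["O", "B-PER", "I-PER", "B-LOC"]]
print(metric.compute(predictions=predictions, references=references))
```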
8 changes: 4 additions & 4 deletions chapters/en/chapter7/4.mdx
@@ -53,7 +53,7 @@ To fine-tune or train a translation model from scratch, we will need a dataset s
As usual, we download our dataset using the `load_dataset()` function:

```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset

raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")
```
@@ -428,12 +428,12 @@ One weakness with BLEU is that it expects the text to already be tokenized, whic
!pip install sacrebleu
```

-We can then load it via `load_metric()` like we did in [Chapter 3](/course/chapter3):
+We can then load it via `evaluate.load()` like we did in [Chapter 3](/course/chapter3):

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("sacrebleu")
+metric = evaluate.load("sacrebleu")
```

This metric will take texts as inputs and targets. It is designed to accept several acceptable targets, as there are often multiple acceptable translations of the same sentence -- the dataset we're using only provides one, but it's not uncommon in NLP to find datasets that give several sentences as labels. So, the predictions should be a list of sentences, but the references should be a list of lists of sentences.
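A small illustration of that format with made-up sentences: one string per prediction, and a list of one or more reference strings for each prediction.

```py
import evaluate

metric = evaluate.load("sacrebleu")
# Hypothetical prediction and single reference translation.
predictions = ["This plugin lets you translate web pages between several languages automatically."]
references = [
    ["This plugin allows you to automatically translate web pages between several languages."]
]
print(metric.compute(predictions=predictions, references=references)["score"])
```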
4 changes: 2 additions & 2 deletions chapters/en/chapter7/5.mdx
@@ -352,9 +352,9 @@ Applying this to our verbose summary gives a precision of 6/10 = 0.6, which is
and then loading the ROUGE metric as follows:

```python
-from datasets import load_metric
+import evaluate

-rouge_score = load_metric("rouge")
+rouge_score = evaluate.load("rouge")
```

Then we can use the `rouge_score.compute()` function to calculate all the metrics at once:
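The hunk ends before that call; a minimal sketch of it with made-up summary strings (the exact structure of the returned scores depends on the version of the ROUGE backend):

```python
generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"
scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)
print(scores)
```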
6 changes: 3 additions & 3 deletions chapters/en/chapter7/7.mdx
@@ -670,12 +670,12 @@ for example in small_eval_set:
predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})
```

-The final format of the predicted answers is the one that will be expected by the metric we will use. As usual, we can load it with the help of the 🤗 Datasets library:
+The final format of the predicted answers is the one that will be expected by the metric we will use. As usual, we can load it with the help of the 🤗 Evaluate library:

```python
-from datasets import load_metric
+import evaluate

-metric = load_metric("squad")
+metric = evaluate.load("squad")
```

This metric expects the predicted answers in the format we saw above (a list of dictionaries with one key for the ID of the example and one key for the predicted text) and the theoretical answers in the format below (a list of dictionaries with one key for the ID of the example and one key for the possible answers):
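For illustration, a sketch of the two formats side by side (the ID, answer text, and offset below are hypothetical):

```python
# Predicted answers: one dict per example with the example ID and the predicted text.
predicted_answers = [{"id": "0001", "prediction_text": "Denver Broncos"}]
# Theoretical answers: one dict per example with the example ID and the possible answers.
theoretical_answers = [
    {"id": "0001", "answers": {"text": ["Denver Broncos"], "answer_start": [177]}}
]
metric.compute(predictions=predicted_answers, references=theoretical_answers)
```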
25 changes: 15 additions & 10 deletions chapters/en/chapter8/4.mdx
@@ -22,7 +22,8 @@ The best way to debug an error that arises in `trainer.train()` is to manually g
To demonstrate this, we will use the following script that (tries to) fine-tune a DistilBERT model on the [MNLI dataset](https://huggingface.co/datasets/glue):

```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
@@ -52,7 +53,7 @@ args = TrainingArguments(
weight_decay=0.01,
)

-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")


def compute_metrics(eval_pred):
@@ -98,7 +99,8 @@ Do you notice something wrong? This, in conjunction with the error message about
Why wasn't the data processed? We did use the `Dataset.map()` method on the datasets to apply the tokenizer on each sample. But if you look closely at the code, you will see that we made a mistake when passing the training and evaluation sets to the `Trainer`. Instead of using `tokenized_datasets` here, we used `raw_datasets` 🤦. So let's fix this!

```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
@@ -128,7 +130,7 @@ args = TrainingArguments(
weight_decay=0.01,
)

-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")


def compute_metrics(eval_pred):
@@ -291,7 +293,8 @@ So this is the `default_data_collator`, but that's not what we want in this case
The answer is because we did not pass the `tokenizer` to the `Trainer`, so it couldn't create the `DataCollatorWithPadding` we want. In practice, you should never hesitate to explicitly pass along the data collator you want to use, to make sure you avoid these kinds of errors. Let's adapt our code to do exactly that:

```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
@@ -322,7 +325,7 @@ args = TrainingArguments(
weight_decay=0.01,
)

-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")


def compute_metrics(eval_pred):
@@ -416,7 +419,8 @@ trainer.model.config.num_labels
With two labels, only 0s and 1s are allowed as targets, but according to the error message we got a 2. Getting a 2 is actually normal: if we remember the label names we extracted earlier, there were three, so we have indices 0, 1, and 2 in our dataset. The problem is that we didn't tell that to our model, which should have been created with three labels. So let's fix that!

```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
@@ -447,7 +451,7 @@ args = TrainingArguments(
weight_decay=0.01,
)

-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")


def compute_metrics(eval_pred):
@@ -626,7 +630,8 @@ For reference, here is the completely fixed script:
```py
import numpy as np
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
@@ -657,7 +662,7 @@ args = TrainingArguments(
weight_decay=0.01,
)
-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")
def compute_metrics(eval_pred):
3 changes: 2 additions & 1 deletion chapters/en/chapter8/4_tf.mdx
@@ -22,7 +22,8 @@ The best way to debug an error that arises in `model.fit()` is to manually go th
To demonstrate this, we will use the following script that (tries to) fine-tune a DistilBERT model on the [MNLI dataset](https://huggingface.co/datasets/glue):

```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
from transformers import (
AutoTokenizer,
TFAutoModelForSequenceClassification,
6 changes: 3 additions & 3 deletions chapters/es/chapter3/4.mdx
@@ -171,12 +171,12 @@ Puedes ver que la parte central del bucle de entrenamiento luce bastante como el

### El bucle de evaluación

-Como lo hicimos anteriormente, usaremos una métrica ofrecida por la libreria Datasets 🤗. Ya hemos visto el método `metric.compute()`, pero de hecho las métricas se pueden acumular sobre los lotes a medida que avanzamos en el bucle de predicción con el método `add_batch()`. Una vez que hemos acumulado todos los lotes, podemos obtener el resultado final con `metric.compute()`. Aquí se muestra como se puede implementar en un bucle de evaluación:
+Como lo hicimos anteriormente, usaremos una métrica ofrecida por la libreria 🤗 Evaluate. Ya hemos visto el método `metric.compute()`, pero de hecho las métricas se pueden acumular sobre los lotes a medida que avanzamos en el bucle de predicción con el método `add_batch()`. Una vez que hemos acumulado todos los lotes, podemos obtener el resultado final con `metric.compute()`. Aquí se muestra como se puede implementar en un bucle de evaluación:

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
8 changes: 4 additions & 4 deletions chapters/fr/chapter3/3.mdx
@@ -110,12 +110,12 @@ import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
```

-Nous pouvons maintenant comparer ces `preds` aux étiquettes. Pour construire notre fonction `compute_metric()`, nous allons nous appuyer sur les métriques de la bibliothèque 🤗 *Datasets*. Nous pouvons charger les métriques associées au jeu de données MRPC aussi facilement que nous avons chargé le jeu de données, cette fois avec la fonction `load_metric()`. L'objet retourné possède une méthode `compute()` que nous pouvons utiliser pour effectuer le calcul de la métrique :
+Nous pouvons maintenant comparer ces `preds` aux étiquettes. Pour construire notre fonction `compute_metric()`, nous allons nous appuyer sur les métriques de la bibliothèque 🤗 [*Evaluate*](https://github.com/huggingface/evaluate/). Nous pouvons charger les métriques associées au jeu de données MRPC aussi facilement que nous avons chargé le jeu de données, cette fois avec la fonction `evaluate.load()`. L'objet retourné possède une méthode `compute()` que nous pouvons utiliser pour effectuer le calcul de la métrique :

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
```

Expand All @@ -129,7 +129,7 @@ En regroupant le tout, nous obtenons notre fonction `compute_metrics()` :

```py
def compute_metrics(eval_preds):
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
6 changes: 3 additions & 3 deletions chapters/fr/chapter3/3_tf.mdx
@@ -172,12 +172,12 @@ print(preds.shape, class_preds.shape)
(408, 2) (408,)
```

-Maintenant, utilisons ces `preds` pour calculer des métriques ! Nous pouvons charger les métriques associées au jeu de données MRPC aussi facilement que nous avons chargé le jeu de données, cette fois avec la fonction `load_metric()`. L'objet retourné a une méthode `compute()` que nous pouvons utiliser pour faire le calcul de la métrique :
+Maintenant, utilisons ces `preds` pour calculer des métriques ! Nous pouvons charger les métriques associées au jeu de données MRPC aussi facilement que nous avons chargé le jeu de données, cette fois avec la fonction `evaluate.load()`. L'objet retourné a une méthode `compute()` que nous pouvons utiliser pour faire le calcul de la métrique :

```py
-from datasets import load_metric
+import evaluate

-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])
```


0 comments on commit d804922
