diff --git a/chapters/de/chapter3/3.mdx b/chapters/de/chapter3/3.mdx
index 3189863c2..ee38b9c67 100644
--- a/chapters/de/chapter3/3.mdx
+++ b/chapters/de/chapter3/3.mdx
@@ -110,12 +110,12 @@ import numpy as np
 preds = np.argmax(predictions.predictions, axis=-1)
 ```
 
-Jetzt können wir diese Vorhersagen in `preds` mit den Labels vergleichen. Wir greifen auf die Metriken aus der 🤗 Bibliothek Datasets zurück, um unsere Funktion `compute_metric()` zu erstellen. Die mit dem MRPC-Datensatz verbundenen Metriken können genauso einfach geladen werden, wie wir den Datensatz geladen haben, diesmal mit der Funktion `load_metric()`. Das zurückgegebene Objekt verfügt über eine Berechnungsmethode, mit der wir die Metrik auswerten können:
+Jetzt können wir diese Vorhersagen in `preds` mit den Labels vergleichen. Wir greifen auf die Metriken aus der 🤗 Bibliothek [Evaluate](https://github.com/huggingface/evaluate/) zurück, um unsere Funktion `compute_metrics()` zu erstellen. Die mit dem MRPC-Datensatz verbundenen Metriken können genauso einfach geladen werden, wie wir den Datensatz geladen haben, diesmal mit der Funktion `evaluate.load()`. Das zurückgegebene Objekt verfügt über eine Berechnungsmethode, mit der wir die Metrik auswerten können:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=preds, references=predictions.label_ids)
 ```
 
@@ -129,7 +129,7 @@ Zusammenfassend ergibt das unsere Funktion `compute_metrics()`:
 
 ```py
 def compute_metrics(eval_preds):
-    metric = load_metric("glue", "mrpc")
+    metric = evaluate.load("glue", "mrpc")
     logits, labels = eval_preds
     predictions = np.argmax(logits, axis=-1)
     return metric.compute(predictions=predictions, references=labels)
diff --git a/chapters/de/chapter3/3_tf.mdx b/chapters/de/chapter3/3_tf.mdx
index dd1be7835..6290506eb 100644
--- a/chapters/de/chapter3/3_tf.mdx
+++ b/chapters/de/chapter3/3_tf.mdx
@@ -172,12 +172,12 @@ print(preds.shape, class_preds.shape)
 (408, 2) (408,)
 ```
 
-Nun können wir diese Vorhersagen in `preds` nutzen, um einige Metriken zu berechnen! Wir können die Metriken, die mit dem MRPC-Datensatz verbunden sind, genauso einfach laden, wie wir den Datensatz geladen haben, in diesem Fall mit der Funktion "load_metric()". Das zurückgegebene Objekt verfügt über eine Berechnungsmethode, mit der wir die Metrik berechnen können:
+Nun können wir diese Vorhersagen in `preds` nutzen, um einige Metriken zu berechnen! Wir können die Metriken, die mit dem MRPC-Datensatz verbunden sind, genauso einfach laden, wie wir den Datensatz geladen haben, in diesem Fall mit der Funktion `evaluate.load()`. Das zurückgegebene Objekt verfügt über eine Berechnungsmethode, mit der wir die Metrik berechnen können:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])
 ```
 
diff --git a/chapters/de/chapter3/4.mdx b/chapters/de/chapter3/4.mdx
index 1888bf2fa..c940e4030 100644
--- a/chapters/de/chapter3/4.mdx
+++ b/chapters/de/chapter3/4.mdx
@@ -171,12 +171,12 @@ Der Kern der Trainingsschleife sieht ähnlich aus wie in der Einleitung. Da wir
 
 ### Die Evaluationsschleife
 
-Wie schon zuvor verwenden wir eine Metrik, die von der 🤗 Datasets-Bibliothek bereitgestellt wird. Wir haben bereits die Methode `metric.compute()` gesehen, aber Metriken können auch Batches für uns akkumulieren, wenn wir die Vorhersageschleife mit der Methode `add_batch()` durchlaufen. Sobald wir alle Batches gesammelt haben, können wir das Endergebnis mit der Methode `metric.compute()` ermitteln. So implementierst du all das in eine Evaluationsschleife:
+Wie schon zuvor verwenden wir eine Metrik, die von der 🤗 Evaluate-Bibliothek bereitgestellt wird. Wir haben bereits die Methode `metric.compute()` gesehen, aber Metriken können auch Batches für uns akkumulieren, wenn wir die Vorhersageschleife mit der Methode `add_batch()` durchlaufen. Sobald wir alle Batches gesammelt haben, können wir das Endergebnis mit der Methode `metric.compute()` ermitteln. So implementierst du all das in eine Evaluationsschleife:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 model.eval()
 for batch in eval_dataloader:
     batch = {k: v.to(device) for k, v in batch.items()}
diff --git a/chapters/en/chapter3/3.mdx b/chapters/en/chapter3/3.mdx
index fb1665370..ebd301469 100644
--- a/chapters/en/chapter3/3.mdx
+++ b/chapters/en/chapter3/3.mdx
@@ -110,12 +110,12 @@ import numpy as np
 preds = np.argmax(predictions.predictions, axis=-1)
 ```
 
-We can now compare those `preds` to the labels. To build our `compute_metric()` function, we will rely on the metrics from the 🤗 Datasets library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `load_metric()` function. The object returned has a `compute()` method we can use to do the metric calculation:
+We can now compare those `preds` to the labels. To build our `compute_metrics()` function, we will rely on the metrics from the 🤗 [Evaluate](https://github.com/huggingface/evaluate/) library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `evaluate.load()` function. The object returned has a `compute()` method we can use to do the metric calculation:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=preds, references=predictions.label_ids)
 ```
 
@@ -129,7 +129,7 @@ Wrapping everything together, we get our `compute_metrics()` function:
 
 ```py
 def compute_metrics(eval_preds):
-    metric = load_metric("glue", "mrpc")
+    metric = evaluate.load("glue", "mrpc")
     logits, labels = eval_preds
     predictions = np.argmax(logits, axis=-1)
     return metric.compute(predictions=predictions, references=labels)
diff --git a/chapters/en/chapter3/3_tf.mdx b/chapters/en/chapter3/3_tf.mdx
index 2252a9613..6357be0b2 100644
--- a/chapters/en/chapter3/3_tf.mdx
+++ b/chapters/en/chapter3/3_tf.mdx
@@ -181,12 +181,12 @@ print(preds.shape, class_preds.shape)
 (408, 2) (408,)
 ```
 
-Now, let's use those `preds` to compute some metrics! We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `load_metric()` function. The object returned has a `compute()` method we can use to do the metric calculation:
+Now, let's use those `preds` to compute some metrics! We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `evaluate.load()` function. The object returned has a `compute()` method we can use to do the metric calculation:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])
 ```
 
diff --git a/chapters/en/chapter3/4.mdx b/chapters/en/chapter3/4.mdx
index 54563ea7c..a515ce2af 100644
--- a/chapters/en/chapter3/4.mdx
+++ b/chapters/en/chapter3/4.mdx
@@ -172,12 +172,12 @@ You can see that the core of the training loop looks a lot like the one in the i
 
 ### The evaluation loop
 
-As we did earlier, we will use a metric provided by the 🤗 Datasets library. We've already seen the `metric.compute()` method, but metrics can actually accumulate batches for us as we go over the prediction loop with the method `add_batch()`. Once we have accumulated all the batches, we can get the final result with `metric.compute()`. Here's how to implement all of this in an evaluation loop:
+As we did earlier, we will use a metric provided by the 🤗 Evaluate library. We've already seen the `metric.compute()` method, but metrics can actually accumulate batches for us as we go over the prediction loop with the method `add_batch()`. Once we have accumulated all the batches, we can get the final result with `metric.compute()`. Here's how to implement all of this in an evaluation loop:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 model.eval()
 for batch in eval_dataloader:
     batch = {k: v.to(device) for k, v in batch.items()}
diff --git a/chapters/en/chapter7/2.mdx b/chapters/en/chapter7/2.mdx
index 9d1ccc3b9..3eaba62c8 100644
--- a/chapters/en/chapter7/2.mdx
+++ b/chapters/en/chapter7/2.mdx
@@ -522,7 +522,7 @@ The traditional framework used to evaluate token classification prediction is [*
 !pip install seqeval
 ```
 
-We can then load it via the `load_metric()` function like we did in [Chapter 3](/course/chapter3):
+We can then load it via the `evaluate.load()` function like we did in [Chapter 3](/course/chapter3):
 
 {:else}
 
@@ -532,14 +532,14 @@ The traditional framework used to evaluate token classification prediction is [*
 !pip install seqeval
 ```
 
-We can then load it via the `load_metric()` function like we did in [Chapter 3](/course/chapter3):
+We can then load it via the `evaluate.load()` function like we did in [Chapter 3](/course/chapter3):
 
 {/if}
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("seqeval")
+metric = evaluate.load("seqeval")
 ```
 
 This metric does not behave like the standard accuracy: it will actually take the lists of labels as strings, not integers, so we will need to fully decode the predictions and labels before passing them to the metric. Let's see how it works. First, we'll get the labels for our first training example:
diff --git a/chapters/en/chapter7/4.mdx b/chapters/en/chapter7/4.mdx
index 5aa654ceb..e68bb376b 100644
--- a/chapters/en/chapter7/4.mdx
+++ b/chapters/en/chapter7/4.mdx
@@ -53,7 +53,7 @@ To fine-tune or train a translation model from scratch, we will need a dataset s
 As usual, we download our dataset using the `load_dataset()` function:
 
 ```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
 
 raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")
 ```
@@ -428,12 +428,12 @@ One weakness with BLEU is that it expects the text to already be tokenized, whic
 !pip install sacrebleu
 ```
 
-We can then load it via `load_metric()` like we did in [Chapter 3](/course/chapter3):
+We can then load it via `evaluate.load()` like we did in [Chapter 3](/course/chapter3):
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("sacrebleu")
+metric = evaluate.load("sacrebleu")
 ```
 
 This metric will take texts as inputs and targets. It is designed to accept several acceptable targets, as there are often multiple acceptable translations of the same sentence -- the dataset we're using only provides one, but it's not uncommon in NLP to find datasets that give several sentences as labels. So, the predictions should be a list of sentences, but the references should be a list of lists of sentences.
diff --git a/chapters/en/chapter7/5.mdx b/chapters/en/chapter7/5.mdx
index 958dc685d..e6df6fc31 100644
--- a/chapters/en/chapter7/5.mdx
+++ b/chapters/en/chapter7/5.mdx
@@ -352,9 +352,9 @@ Applying this to our verbose summary gives a precision of 6/10  = 0.6, which is
 and then loading the ROUGE metric as follows:
 
 ```python
-from datasets import load_metric
+import evaluate
 
-rouge_score = load_metric("rouge")
+rouge_score = evaluate.load("rouge")
 ```
 
 Then we can use the `rouge_score.compute()` function to calculate all the metrics at once:
diff --git a/chapters/en/chapter7/7.mdx b/chapters/en/chapter7/7.mdx
index d8e1942e4..d32fc7d8d 100644
--- a/chapters/en/chapter7/7.mdx
+++ b/chapters/en/chapter7/7.mdx
@@ -670,12 +670,12 @@ for example in small_eval_set:
     predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})
 ```
 
-The final format of the predicted answers is the one that will be expected by the metric we will use. As usual, we can load it with the help of the 🤗 Datasets library:
+The final format of the predicted answers is the one that will be expected by the metric we will use. As usual, we can load it with the help of the 🤗 Evaluate library:
 
 ```python
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("squad")
+metric = evaluate.load("squad")
 ```
 
 This metric expects the predicted answers in the format we saw above (a list of dictionaries with one key for the ID of the example and one key for the predicted text) and the theoretical answers in the format below (a list of dictionaries with one key for the ID of the example and one key for the possible answers):
diff --git a/chapters/en/chapter8/4.mdx b/chapters/en/chapter8/4.mdx
index 1cc9e4e51..54232cdc9 100644
--- a/chapters/en/chapter8/4.mdx
+++ b/chapters/en/chapter8/4.mdx
@@ -22,7 +22,8 @@ The best way to debug an error that arises in `trainer.train()` is to manually g
 To demonstrate this, we will use the following script that (tries to) fine-tune a DistilBERT model on the [MNLI dataset](https://huggingface.co/datasets/glue):
 
 ```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
 from transformers import (
     AutoTokenizer,
     AutoModelForSequenceClassification,
@@ -52,7 +53,7 @@ args = TrainingArguments(
     weight_decay=0.01,
 )
 
-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")
 
 
 def compute_metrics(eval_pred):
@@ -98,7 +99,8 @@ Do you notice something wrong? This, in conjunction with the error message about
 Why wasn't the data processed? We did use the `Dataset.map()` method on the datasets to apply the tokenizer on each sample. But if you look closely at the code, you will see that we made a mistake when passing the training and evaluation sets to the `Trainer`. Instead of using `tokenized_datasets` here, we used `raw_datasets` 🤦. So let's fix this!
 
 ```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
 from transformers import (
     AutoTokenizer,
     AutoModelForSequenceClassification,
@@ -128,7 +130,7 @@ args = TrainingArguments(
     weight_decay=0.01,
 )
 
-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")
 
 
 def compute_metrics(eval_pred):
@@ -291,7 +293,8 @@ So this is the `default_data_collator`, but that's not what we want in this case
 The answer is because we did not pass the `tokenizer` to the `Trainer`, so it couldn't create the `DataCollatorWithPadding` we want. In practice, you should never hesitate to explicitly pass along the data collator you want to use, to make sure you avoid these kinds of errors. Let's adapt our code to do exactly that:
 
 ```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
 from transformers import (
     AutoTokenizer,
     AutoModelForSequenceClassification,
@@ -322,7 +325,7 @@ args = TrainingArguments(
     weight_decay=0.01,
 )
 
-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")
 
 
 def compute_metrics(eval_pred):
@@ -416,7 +419,8 @@ trainer.model.config.num_labels
 With two labels, only 0s and 1s are allowed as targets, but according to the error message we got a 2. Getting a 2 is actually normal: if we remember the label names we extracted earlier, there were three, so we have indices 0, 1, and 2 in our dataset. The problem is that we didn't tell that to our model, which should have been created with three labels. So let's fix that!
 
 ```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
 from transformers import (
     AutoTokenizer,
     AutoModelForSequenceClassification,
@@ -447,7 +451,7 @@ args = TrainingArguments(
     weight_decay=0.01,
 )
 
-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")
 
 
 def compute_metrics(eval_pred):
@@ -626,7 +630,8 @@ For reference, here is the completely fixed script:
 
 ```py
 import numpy as np
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
 from transformers import (
     AutoTokenizer,
     AutoModelForSequenceClassification,
@@ -657,7 +662,7 @@ args = TrainingArguments(
     weight_decay=0.01,
 )
 
-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")
 
 
 def compute_metrics(eval_pred):
diff --git a/chapters/en/chapter8/4_tf.mdx b/chapters/en/chapter8/4_tf.mdx
index 4ba2f3b1c..6a241216d 100644
--- a/chapters/en/chapter8/4_tf.mdx
+++ b/chapters/en/chapter8/4_tf.mdx
@@ -22,7 +22,8 @@ The best way to debug an error that arises in `model.fit()` is to manually go th
 To demonstrate this, we will use the following script that (tries to) fine-tune a DistilBERT model on the [MNLI dataset](https://huggingface.co/datasets/glue):
 
 ```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
 from transformers import (
     AutoTokenizer,
     TFAutoModelForSequenceClassification,
diff --git a/chapters/es/chapter3/4.mdx b/chapters/es/chapter3/4.mdx
index c16fa7d58..b5bd3f7cf 100644
--- a/chapters/es/chapter3/4.mdx
+++ b/chapters/es/chapter3/4.mdx
@@ -171,12 +171,12 @@ Puedes ver que la parte central del bucle de entrenamiento luce bastante como el
 
 ### El bucle de evaluación
 
-Como lo hicimos anteriormente, usaremos una métrica ofrecida por la libreria Datasets 🤗. Ya hemos visto el método `metric.compute()`, pero de hecho las métricas se pueden acumular sobre los lotes a medida que avanzamos en el bucle de predicción con el método `add_batch()`. Una vez que hemos acumulado todos los lotes, podemos obtener el resultado final con `metric.compute()`. Aquí se muestra como se puede implementar en un bucle de evaluación:
+Como lo hicimos anteriormente, usaremos una métrica ofrecida por la librería 🤗 Evaluate. Ya hemos visto el método `metric.compute()`, pero de hecho las métricas se pueden acumular sobre los lotes a medida que avanzamos en el bucle de predicción con el método `add_batch()`. Una vez que hemos acumulado todos los lotes, podemos obtener el resultado final con `metric.compute()`. Aquí se muestra cómo se puede implementar en un bucle de evaluación:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 model.eval()
 for batch in eval_dataloader:
     batch = {k: v.to(device) for k, v in batch.items()}
diff --git a/chapters/fr/chapter3/3.mdx b/chapters/fr/chapter3/3.mdx
index 13563fb0d..2624cdd5a 100644
--- a/chapters/fr/chapter3/3.mdx
+++ b/chapters/fr/chapter3/3.mdx
@@ -110,12 +110,12 @@ import numpy as np
 preds = np.argmax(predictions.predictions, axis=-1)
 ```
 
-Nous pouvons maintenant comparer ces `preds` aux étiquettes. Pour construire notre fonction `compute_metric()`, nous allons nous appuyer sur les métriques de la bibliothèque 🤗 *Datasets*. Nous pouvons charger les métriques associées au jeu de données MRPC aussi facilement que nous avons chargé le jeu de données, cette fois avec la fonction `load_metric()`. L'objet retourné possède une méthode `compute()` que nous pouvons utiliser pour effectuer le calcul de la métrique :
+Nous pouvons maintenant comparer ces `preds` aux étiquettes. Pour construire notre fonction `compute_metrics()`, nous allons nous appuyer sur les métriques de la bibliothèque 🤗 [*Evaluate*](https://github.com/huggingface/evaluate/). Nous pouvons charger les métriques associées au jeu de données MRPC aussi facilement que nous avons chargé le jeu de données, cette fois avec la fonction `evaluate.load()`. L'objet retourné possède une méthode `compute()` que nous pouvons utiliser pour effectuer le calcul de la métrique :
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=preds, references=predictions.label_ids)
 ```
 
@@ -129,7 +129,7 @@ En regroupant le tout, nous obtenons notre fonction `compute_metrics()` :
 
 ```py
 def compute_metrics(eval_preds):
-    metric = load_metric("glue", "mrpc")
+    metric = evaluate.load("glue", "mrpc")
     logits, labels = eval_preds
     predictions = np.argmax(logits, axis=-1)
     return metric.compute(predictions=predictions, references=labels)
diff --git a/chapters/fr/chapter3/3_tf.mdx b/chapters/fr/chapter3/3_tf.mdx
index 6be37c3a3..9a84d533d 100644
--- a/chapters/fr/chapter3/3_tf.mdx
+++ b/chapters/fr/chapter3/3_tf.mdx
@@ -172,12 +172,12 @@ print(preds.shape, class_preds.shape)
 (408, 2) (408,)
 ```
 
-Maintenant, utilisons ces `preds` pour calculer des métriques ! Nous pouvons charger les métriques associées au jeu de données MRPC aussi facilement que nous avons chargé le jeu de données, cette fois avec la fonction `load_metric()`. L'objet retourné a une méthode `compute()` que nous pouvons utiliser pour faire le calcul de la métrique :
+Maintenant, utilisons ces `preds` pour calculer des métriques ! Nous pouvons charger les métriques associées au jeu de données MRPC aussi facilement que nous avons chargé le jeu de données, cette fois avec la fonction `evaluate.load()`. L'objet retourné a une méthode `compute()` que nous pouvons utiliser pour faire le calcul de la métrique :
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])
 ```
 
diff --git a/chapters/fr/chapter3/4.mdx b/chapters/fr/chapter3/4.mdx
index ce42e9b7f..e66caa6db 100644
--- a/chapters/fr/chapter3/4.mdx
+++ b/chapters/fr/chapter3/4.mdx
@@ -172,12 +172,12 @@ Vous pouvez voir que le cœur de la boucle d'entraînement ressemble beaucoup à
 
 ### La boucle d'évaluation
 
-Comme nous l'avons fait précédemment, nous allons utiliser une métrique fournie par la bibliothèque 🤗 *Datasets*. Nous avons déjà vu la méthode `metric.compute()`, mais les métriques peuvent en fait accumuler des batchs pour nous au fur et à mesure que nous parcourons la boucle de prédiction avec la méthode `add_batch()`. Une fois que nous avons accumulé tous les batchs, nous pouvons obtenir le résultat final avec `metric.compute()`. Voici comment implémenter tout cela dans une boucle d'évaluation :
+Comme nous l'avons fait précédemment, nous allons utiliser une métrique fournie par la bibliothèque 🤗 *Evaluate*. Nous avons déjà vu la méthode `metric.compute()`, mais les métriques peuvent en fait accumuler des batchs pour nous au fur et à mesure que nous parcourons la boucle de prédiction avec la méthode `add_batch()`. Une fois que nous avons accumulé tous les batchs, nous pouvons obtenir le résultat final avec `metric.compute()`. Voici comment implémenter tout cela dans une boucle d'évaluation :
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 model.eval()
 for batch in eval_dataloader:
     batch = {k: v.to(device) for k, v in batch.items()}
diff --git a/chapters/fr/chapter7/2.mdx b/chapters/fr/chapter7/2.mdx
index de780128e..921f9629f 100644
--- a/chapters/fr/chapter7/2.mdx
+++ b/chapters/fr/chapter7/2.mdx
@@ -522,7 +522,7 @@ Le *framework* traditionnel utilisé pour évaluer la prédiction de la classifi
 !pip install seqeval
 ```
 
-Nous pouvons ensuite le charger via la fonction `load_metric()` comme nous l'avons fait dans le [chapitre 3](/course/fr/chapter3) :
+Nous pouvons ensuite le charger via la fonction `evaluate.load()` comme nous l'avons fait dans le [chapitre 3](/course/fr/chapter3) :
 
 {:else}
 
@@ -532,14 +532,14 @@ Le *framework*  traditionnel utilisé pour évaluer la prédiction de la classif
 !pip install seqeval
 ```
 
-Nous pouvons ensuite le charger via la fonction `load_metric()` comme nous l'avons fait dans le [chapitre 3](/course/fr/chapter3) :
+Nous pouvons ensuite le charger via la fonction `evaluate.load()` comme nous l'avons fait dans le [chapitre 3](/course/fr/chapter3) :
 
 {/if}
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("seqeval")
+metric = evaluate.load("seqeval")
 ```
 
 Cette métrique ne se comporte pas comme la précision standard : elle prend les listes d'étiquettes comme des chaînes de caractères et non comme des entiers. Nous devrons donc décoder complètement les prédictions et les étiquettes avant de les transmettre à la métrique. Voyons comment cela fonctionne. Tout d'abord, nous allons obtenir les étiquettes pour notre premier exemple d'entraînement :
diff --git a/chapters/fr/chapter7/4.mdx b/chapters/fr/chapter7/4.mdx
index d7d868936..e28cf05d5 100644
--- a/chapters/fr/chapter7/4.mdx
+++ b/chapters/fr/chapter7/4.mdx
@@ -54,7 +54,7 @@ Pour *finetuner* ou entraîner un modèle de traduction à partir de zéro, nous
 Comme d'habitude, nous téléchargeons notre jeu de données en utilisant la fonction `load_dataset()` :
 
 ```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
 
 raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")
 ```
@@ -425,12 +425,12 @@ L'une des faiblesses de BLEU est qu'il s'attend à ce que le texte soit déjà t
 !pip install sacrebleu
 ```
 
-Nous pouvons ensuite charger ce score via `load_metric()` comme nous l'avons fait dans le [chapitre 3](/course/fr/chapter3) :
+Nous pouvons ensuite charger ce score via `evaluate.load()` comme nous l'avons fait dans le [chapitre 3](/course/fr/chapter3) :
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("sacrebleu")
+metric = evaluate.load("sacrebleu")
 ```
 
 Cette métrique prend des textes comme entrées et cibles. Elle est conçue pour accepter plusieurs cibles acceptables car il y a souvent plusieurs traductions possibles d'une même phrase. Le jeu de données que nous utilisons n'en fournit qu'une seule, mais en NLP, il n'est pas rare de trouver des jeux de données ayant plusieurs phrases comme étiquettes. Ainsi, les prédictions doivent être une liste de phrases mais les références doivent être une liste de listes de phrases.
diff --git a/chapters/fr/chapter7/5.mdx b/chapters/fr/chapter7/5.mdx
index 47428e425..c2177fb07 100644
--- a/chapters/fr/chapter7/5.mdx
+++ b/chapters/fr/chapter7/5.mdx
@@ -374,9 +374,9 @@ En appliquant cela à notre résumé verbeux, on obtient une précision de 6/10
 et ensuite charger la métrique ROUGE comme suit :
 
 ```python
-from datasets import load_metric
+import evaluate
 
-rouge_score = load_metric("rouge")
+rouge_score = evaluate.load("rouge")
 ```
 
 Ensuite, nous pouvons utiliser la fonction `rouge_score.compute()` pour calculer toutes les métriques en une seule fois :
diff --git a/chapters/fr/chapter7/7.mdx b/chapters/fr/chapter7/7.mdx
index 359e84211..b703523bd 100644
--- a/chapters/fr/chapter7/7.mdx
+++ b/chapters/fr/chapter7/7.mdx
@@ -691,12 +691,12 @@ for example in small_eval_set:
     predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})
 ```
 
-Le format final des réponses prédites est celui qui sera attendu par la métrique que nous allons utiliser. Comme d'habitude, nous pouvons la charger à l'aide de la bibliothèque 🤗 *Datasets* :
+Le format final des réponses prédites est celui qui sera attendu par la métrique que nous allons utiliser. Comme d'habitude, nous pouvons la charger à l'aide de la bibliothèque 🤗 *Evaluate* :
 
 ```python
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("squad")
+metric = evaluate.load("squad")
 ```
 
 Cette métrique attend les réponses prédites dans le format que nous avons vu ci-dessus (une liste de dictionnaires avec une clé pour l'identifiant de l'exemple et une clé pour le texte prédit) et les réponses théoriques dans le format ci-dessous (une liste de dictionnaires avec une clé pour l'identifiant de l'exemple et une clé pour les réponses possibles) :
diff --git a/chapters/fr/chapter8/4.mdx b/chapters/fr/chapter8/4.mdx
index 2ea4a9ece..7ac1272c0 100644
--- a/chapters/fr/chapter8/4.mdx
+++ b/chapters/fr/chapter8/4.mdx
@@ -22,7 +22,8 @@ La meilleure façon de déboguer une erreur qui survient dans `trainer.train()`
 Pour le démontrer, nous utiliserons le script suivant qui tente de *finetuner* un modèle DistilBERT sur le [jeu de données MNLI](https://huggingface.co/datasets/glue) :
 
 ```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
 from transformers import (
     AutoTokenizer,
     AutoModelForSequenceClassification,
@@ -52,7 +53,7 @@ args = TrainingArguments(
     weight_decay=0.01,
 )
 
-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")
 
 
 def compute_metrics(eval_pred):
@@ -98,7 +99,8 @@ Vous remarquez quelque chose d'anormal ? Ceci, en conjonction avec le message d'
 Pourquoi les données n'ont-elles pas été traitées ? Nous avons utilisé la méthode `Dataset.map()` sur les jeux de données pour appliquer le *tokenizer* sur chaque échantillon. Mais si vous regardez attentivement le code, vous verrez que nous avons fait une erreur en passant les ensembles d'entraînement et d'évaluation au `Trainer`. Au lieu d'utiliser `tokenized_datasets` ici, nous avons utilisé `raw_datasets` 🤦. Alors corrigeons ça !
 
 ```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
 from transformers import (
     AutoTokenizer,
     AutoModelForSequenceClassification,
@@ -128,7 +130,7 @@ args = TrainingArguments(
     weight_decay=0.01,
 )
 
-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")
 
 
 def compute_metrics(eval_pred):
@@ -291,7 +293,8 @@ C'est donc `default_data_collator`, mais ce n'est pas ce que nous voulons dans c
 La réponse est que nous n'avons pas passé le `tokenizer` au `Trainer`, donc il ne pouvait pas créer le `DataCollatorWithPadding` que nous voulons. En pratique, il ne faut jamais hésiter à transmettre explicitement l'assembleur de données que l'on veut utiliser pour être sûr d'éviter ce genre d'erreurs. Adaptons notre code pour faire exactement cela :
 
 ```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
 from transformers import (
     AutoTokenizer,
     AutoModelForSequenceClassification,
@@ -322,7 +325,7 @@ args = TrainingArguments(
     weight_decay=0.01,
 )
 
-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")
 
 
 def compute_metrics(eval_pred):
@@ -417,7 +420,8 @@ trainer.model.config.num_labels
 Avec deux étiquettes, seuls les 0 et les 1 sont autorisés comme cibles, mais d'après le message d'erreur, nous avons obtenu un 2. Obtenir un 2 est en fait normal : si nous nous souvenons des noms des étiquettes que nous avons extraits plus tôt, il y en avait trois, donc nous avons les indices 0, 1 et 2 dans notre jeu de données. Le problème est que nous n'avons pas indiqué cela à notre modèle, qui aurait dû être créé avec trois étiquettes. Alors, corrigeons cela !
 
 ```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
 from transformers import (
     AutoTokenizer,
     AutoModelForSequenceClassification,
@@ -448,7 +452,7 @@ args = TrainingArguments(
     weight_decay=0.01,
 )
 
-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")
 
 
 def compute_metrics(eval_pred):
@@ -627,7 +631,8 @@ Pour référence, voici le script complètement corrigé :
 
 ```py
 import numpy as np
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
 from transformers import (
     AutoTokenizer,
     AutoModelForSequenceClassification,
@@ -658,7 +663,7 @@ args = TrainingArguments(
     weight_decay=0.01,
 )
 
-metric = load_metric("glue", "mnli")
+metric = evaluate.load("glue", "mnli")
 
 
 def compute_metrics(eval_pred):
diff --git a/chapters/fr/chapter8/4_tf.mdx b/chapters/fr/chapter8/4_tf.mdx
index e178f6842..257dafe26 100644
--- a/chapters/fr/chapter8/4_tf.mdx
+++ b/chapters/fr/chapter8/4_tf.mdx
@@ -22,7 +22,8 @@ La meilleure façon de déboguer une erreur qui survient dans `trainer.train()`
 Pour le démontrer, nous utiliserons le script suivant qui tente de *finetuner* un modèle DistilBERT sur le [jeu de données MNLI](https://huggingface.co/datasets/glue) :
 
 ```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
+import evaluate
 from transformers import (
     AutoTokenizer,
     TFAutoModelForSequenceClassification,
diff --git a/chapters/hi/chapter3/3.mdx b/chapters/hi/chapter3/3.mdx
index 93bcdeb97..748824180 100644
--- a/chapters/hi/chapter3/3.mdx
+++ b/chapters/hi/chapter3/3.mdx
@@ -110,12 +110,12 @@ import numpy as np
 preds = np.argmax(predictions.predictions, axis=-1)
 ```
 
-अब हम उन `preds` की तुलना लेबल से कर सकते हैं। हमारे `compute_metric()` फ़ंक्शन को बनाने के लिए, हम 🤗 डेटासेट लाइब्रेरी के मेट्रिक्स पर निर्भर है। हम MRPC डेटासेट से जुड़े मेट्रिक्स को उतनी ही आसानी से लोड कर सकते हैं, जितनी आसानी से हमने डेटासेट लोड किया, इस बार `load_metric()` फ़ंक्शन के साथ। इसने एक वस्तु लौटाया जिसमे एक `compute()` विधि है जिसका उपयोग हम मीट्रिक गणना करने के लिए कर सकते हैं:
+अब हम उन `preds` की तुलना लेबल से कर सकते हैं। हमारे `compute_metric()` फ़ंक्शन को बनाने के लिए, हम 🤗 [Evaluate](https://github.com/huggingface/evaluate/) लाइब्रेरी के मेट्रिक्स पर निर्भर हैं। हम MRPC डेटासेट से जुड़े मेट्रिक्स को उतनी ही आसानी से लोड कर सकते हैं, जितनी आसानी से हमने डेटासेट लोड किया, इस बार `evaluate.load()` फ़ंक्शन के साथ। इसने एक वस्तु लौटाया जिसमे एक `compute()` विधि है जिसका उपयोग हम मीट्रिक गणना करने के लिए कर सकते हैं:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=preds, references=predictions.label_ids)
 ```
 
@@ -129,7 +129,7 @@ metric.compute(predictions=preds, references=predictions.label_ids)
 
 ```py
 def compute_metrics(eval_preds):
-    metric = load_metric("glue", "mrpc")
+    metric = evaluate.load("glue", "mrpc")
     logits, labels = eval_preds
     predictions = np.argmax(logits, axis=-1)
     return metric.compute(predictions=predictions, references=labels)
diff --git a/chapters/hi/chapter3/3_tf.mdx b/chapters/hi/chapter3/3_tf.mdx
index 84f022ead..837983be3 100644
--- a/chapters/hi/chapter3/3_tf.mdx
+++ b/chapters/hi/chapter3/3_tf.mdx
@@ -181,12 +181,12 @@ print(preds.shape, class_preds.shape)
 (408, 2) (408,)
 ```
 
-अब, कुछ मेट्रिक्स की गणना करने के लिए उन `preds` का उपयोग करते हैं! हम MRPC डेटासेट से जुड़े मेट्रिक्स को उतनी ही आसानी से लोड कर सकते हैं, जितनी आसानी से हमने डेटासेट लोड किया, इस बार `load_metric()` फ़ंक्शन के साथ। इसने एक वस्तु लौटाया जिसमे एक `compute()` विधि है जिसका उपयोग हम मीट्रिक गणना करने के लिए कर सकते हैं:
+अब, कुछ मेट्रिक्स की गणना करने के लिए उन `preds` का उपयोग करते हैं! हम MRPC डेटासेट से जुड़े मेट्रिक्स को उतनी ही आसानी से लोड कर सकते हैं, जितनी आसानी से हमने डेटासेट लोड किया, इस बार `evaluate.load()` फ़ंक्शन के साथ। इसने एक वस्तु लौटाया जिसमे एक `compute()` विधि है जिसका उपयोग हम मीट्रिक गणना करने के लिए कर सकते हैं:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])
 ```
 
diff --git a/chapters/hi/chapter3/4.mdx b/chapters/hi/chapter3/4.mdx
index 10e690a1b..181e3e0e5 100644
--- a/chapters/hi/chapter3/4.mdx
+++ b/chapters/hi/chapter3/4.mdx
@@ -172,12 +172,12 @@ for epoch in range(num_epochs):
 
 ### मूल्यांकन लूप
 
-जैसा कि हमने पहले किया था, हम 🤗 डेटासेट लाइब्रेरी द्वारा प्रदान किए गए मीट्रिक का उपयोग करेंगे। हम पहले ही `metric.compute()` विधि देख चुके हैं, लेकिन मेट्रिक्स वास्तव में हमारे लिए बैच जमा कर सकते हैं जब हम भविष्यवाणी लूप पर जाते हैं `add_batch()` विधि के साथ । एक बार जब हम सभी बैचों को जमा कर लेते हैं, तो हम `metric.compute()` के साथ अंतिम परिणाम प्राप्त कर सकते हैं। मूल्यांकन लूप में इन सभी को कार्यान्वित करने का तरीका यहां दिया गया है:
+जैसा कि हमने पहले किया था, हम 🤗 Evaluate लाइब्रेरी द्वारा प्रदान किए गए मीट्रिक का उपयोग करेंगे। हम पहले ही `metric.compute()` विधि देख चुके हैं, लेकिन मेट्रिक्स वास्तव में हमारे लिए बैच जमा कर सकते हैं जब हम भविष्यवाणी लूप पर जाते हैं `add_batch()` विधि के साथ। एक बार जब हम सभी बैचों को जमा कर लेते हैं, तो हम `metric.compute()` के साथ अंतिम परिणाम प्राप्त कर सकते हैं। मूल्यांकन लूप में इन सभी को कार्यान्वित करने का तरीका यहां दिया गया है:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 model.eval()
 for batch in eval_dataloader:
     batch = {k: v.to(device) for k, v in batch.items()}
diff --git a/chapters/ja/chapter7/2.mdx b/chapters/ja/chapter7/2.mdx
index c6fad9d03..efd90ad71 100644
--- a/chapters/ja/chapter7/2.mdx
+++ b/chapters/ja/chapter7/2.mdx
@@ -534,7 +534,7 @@ model.fit(
 !pip install seqeval
 ```
 
-そして、[第3章](/course/ja/chapter3) で行ったように `load_metric()` 関数で読み込むことができるようになります。
+そして、[第3章](/course/ja/chapter3) で行ったように `evaluate.load()` 関数で読み込むことができるようになります。
 
 {:else}
 
@@ -544,14 +544,14 @@ model.fit(
 !pip install seqeval
 ```
 
-そして、[第3章](/course/ja/chapter3) で行ったように `load_metric()` 関数で読み込むことができるようになります。
+そして、[第3章](/course/ja/chapter3) で行ったように `evaluate.load()` 関数で読み込むことができるようになります。
 
 {/if}
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("seqeval")
+metric = evaluate.load("seqeval")
 ```
 
 この指標は標準的な精度指標のように動作しません：実際にはラベルのリストを整数ではなく文字列として受け取るので、予測値とラベルを指標に渡す前に完全にデコードする必要があります。
diff --git a/chapters/ja/chapter7/4.mdx b/chapters/ja/chapter7/4.mdx
index 969647748..cadd8c24c 100644
--- a/chapters/ja/chapter7/4.mdx
+++ b/chapters/ja/chapter7/4.mdx
@@ -56,7 +56,7 @@
 いつものように、 `load_dataset()` 関数を使用してデータセットをダウンロードします。
 
 ```py
-from datasets import load_dataset, load_metric
+from datasets import load_dataset
 
 raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")
 ```
@@ -439,12 +439,12 @@ BLEUの弱点は、テキストがすでにトークン化されていること
 !pip install sacrebleu
 ```
 
-そして、[第3章](/course/ja/chapter3) で行ったように `load_metric()` で読み込むことができるようになります。
+そして、[第3章](/course/ja/chapter3) で行ったように `evaluate.load()` で読み込むことができるようになります。
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("sacrebleu")
+metric = evaluate.load("sacrebleu")
 ```
 
 この指標はテキストを入力とターゲットとして受け取ります。同じ文でも複数の翻訳があることが多いので、複数の翻訳を受け入れるように設計されています。私たちが使っているデータセットは1つしか提供していませんが、NLPでは複数の文をラベルとして与えるデータセットが珍しくありません。つまり、予測は文のリストであるべきですが、その参照は文のリストのリストであるべきなのです。
diff --git a/chapters/ja/chapter7/5.mdx b/chapters/ja/chapter7/5.mdx
index 3f83c6dad..13232100e 100644
--- a/chapters/ja/chapter7/5.mdx
+++ b/chapters/ja/chapter7/5.mdx
@@ -354,9 +354,9 @@ $$ \mathrm{Precision} = \frac{\mathrm{Number\,of\,overlapping\, words}}{\mathrm{
 そして、ROUGE指標を読み込みます。
 
 ```python
-from datasets import load_metric
+import evaluate
 
-rouge_score = load_metric("rouge")
+rouge_score = evaluate.load("rouge")
 ```
 
 そして、`rouge_score.compute()`関数を使って、すべての指標を一度に計算することができます。
diff --git a/chapters/ja/chapter7/7.mdx b/chapters/ja/chapter7/7.mdx
index e54482205..8ee20d14f 100644
--- a/chapters/ja/chapter7/7.mdx
+++ b/chapters/ja/chapter7/7.mdx
@@ -671,12 +671,12 @@ for example in small_eval_set:
     predicted_answers.append({"id": example_id, "prediction_text": best_answer["text"]})
 ```
 
-予測された答えの最終的なフォーマットは、私たちが使用する指標によって期待されるものです。いつものように、🤗 Datasetsライブラリの助けを借りて読み込むことができます。
+予測された答えの最終的なフォーマットは、私たちが使用する指標によって期待されるものです。いつものように、🤗 Evaluateライブラリの助けを借りて読み込むことができます。
 
 ```python
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("squad")
+metric = evaluate.load("squad")
 ```
 
 この指標は、上で見た形式の予測された答え（サンプルのIDと予測されたテキストの1つのキーを持つ辞書のリスト）と、下の形式の理論的な答え（サンプルのIDと可能な答えの1つのキーを持つ辞書のリスト）を期待するものです。
diff --git a/chapters/ru/chapter3/3.mdx b/chapters/ru/chapter3/3.mdx
index b68dbfa01..5ff79c1ed 100644
--- a/chapters/ru/chapter3/3.mdx
+++ b/chapters/ru/chapter3/3.mdx
@@ -113,12 +113,12 @@ import numpy as np
 preds = np.argmax(predictions.predictions, axis=-1)
 ```
 
-Теперь мы можем сравнить эти предсказания с лейблами. Для создания функции `compute_metric()` мы воспользуемся метриками из библиотеки 🤗 Datasets. Мы можем загрузить подходящие для датасета MRPC метрики так же просто, как мы загрузили датасет, но на этот раз с помощью функции `load_metric()`. Возвращаемый объект имеет метод `compute()`, который мы можем использовать для вычисления метрики: 
+Теперь мы можем сравнить эти предсказания с лейблами. Для создания функции `compute_metric()` мы воспользуемся метриками из библиотеки 🤗 [Evaluate](https://github.com/huggingface/evaluate/). Мы можем загрузить подходящие для датасета MRPC метрики так же просто, как мы загрузили датасет, но на этот раз с помощью функции `evaluate.load()`. Возвращаемый объект имеет метод `compute()`, который мы можем использовать для вычисления метрики: 
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=preds, references=predictions.label_ids)
 ```
 
@@ -132,7 +132,7 @@ metric.compute(predictions=preds, references=predictions.label_ids)
 
 ```py
 def compute_metrics(eval_preds):
-    metric = load_metric("glue", "mrpc")
+    metric = evaluate.load("glue", "mrpc")
     logits, labels = eval_preds
     predictions = np.argmax(logits, axis=-1)
     return metric.compute(predictions=predictions, references=labels)
diff --git a/chapters/ru/chapter3/3_tf.mdx b/chapters/ru/chapter3/3_tf.mdx
index 01f73e339..a3b1f7ef6 100644
--- a/chapters/ru/chapter3/3_tf.mdx
+++ b/chapters/ru/chapter3/3_tf.mdx
@@ -173,12 +173,12 @@ print(preds.shape, class_preds.shape)
 (408, 2) (408,)
 ```
 
-Теперь давайте используем эти `preds` для вычисления некоторых метрик! Мы можем загрузить метрики, связанные с датасетом MRPC, так же легко, как мы загрузили этот датасет, на этот раз с помощью функции `load_metric()`. Возвращаемый объект имеет метод `compute()`, который мы можем использовать для вычисления метрики:
+Теперь давайте используем эти `preds` для вычисления некоторых метрик! Мы можем загрузить метрики, связанные с датасетом MRPC, так же легко, как мы загрузили этот датасет, на этот раз с помощью функции `evaluate.load()`. Возвращаемый объект имеет метод `compute()`, который мы можем использовать для вычисления метрики:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])
 ```
 
diff --git a/chapters/ru/chapter3/4.mdx b/chapters/ru/chapter3/4.mdx
index c267ca864..15568df81 100644
--- a/chapters/ru/chapter3/4.mdx
+++ b/chapters/ru/chapter3/4.mdx
@@ -172,12 +172,12 @@ for epoch in range(num_epochs):
 
 ### Валидационный цикл
 
-Ранее мы использовали метрику, которую нам предоставляла библиотека 🤗 Datasets. Мы уже знаем, что есть метод `metric.compute()`, однако метрики могут накапливать значения в процессе итерирования по батчу, для этого есть метод `add_batch()`. После того, как мы пройдемся по всем батчам, мы сможем вычислить финальный результат с помощью `metric.compute()`. Вот пример того, как это можно сделать в цикле валидации:
+Ранее мы использовали метрику, которую нам предоставляла библиотека 🤗 Evaluate. Мы уже знаем, что есть метод `metric.compute()`, однако метрики могут накапливать значения в процессе итерирования по батчу, для этого есть метод `add_batch()`. После того, как мы пройдемся по всем батчам, мы сможем вычислить финальный результат с помощью `metric.compute()`. Вот пример того, как это можно сделать в цикле валидации:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 model.eval()
 for batch in eval_dataloader:
     batch = {k: v.to(device) for k, v in batch.items()}
diff --git a/chapters/th/chapter3/3.mdx b/chapters/th/chapter3/3.mdx
index 783b9325b..e510e443d 100644
--- a/chapters/th/chapter3/3.mdx
+++ b/chapters/th/chapter3/3.mdx
@@ -110,12 +110,12 @@ import numpy as np
 preds = np.argmax(predictions.predictions, axis=-1)
 ```
 
-ตอนนี้เราก็สามารถเปรียบเทียบ `preds` เหล่านี้กับ labels ของเราได้แล้ว เพื่อจะสร้างฟังก์ชั่น `compute_metric()` ของเรา เราจะยืม  metrics จากไลบรารี่ 🤗 Datasets มาใช้ เราสามารถโหลด metrics ที่เกี่ยวข้องกับ MRPC dataset ได้อย่างง่ายดายเหมือนกับที่เราโหลดชุดข้อมูล โดยการใช้ฟังก์ชั่น `load_metric()` โดยจะได้ผลลัพธ์เป็นออพเจ็กต์ที่มีเมธอด `compute()` ที่เราสามารถนำไปใช้ในการคำนวณ metric ได้:
+ตอนนี้เราก็สามารถเปรียบเทียบ `preds` เหล่านี้กับ labels ของเราได้แล้ว เพื่อจะสร้างฟังก์ชั่น `compute_metric()` ของเรา เราจะยืม  metrics จากไลบรารี่ 🤗 [Evaluate](https://github.com/huggingface/evaluate/) มาใช้ เราสามารถโหลด metrics ที่เกี่ยวข้องกับ MRPC dataset ได้อย่างง่ายดายเหมือนกับที่เราโหลดชุดข้อมูล โดยการใช้ฟังก์ชั่น `evaluate.load()` โดยจะได้ผลลัพธ์เป็นออพเจ็กต์ที่มีเมธอด `compute()` ที่เราสามารถนำไปใช้ในการคำนวณ metric ได้:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=preds, references=predictions.label_ids)
 ```
 
@@ -129,7 +129,7 @@ metric.compute(predictions=preds, references=predictions.label_ids)
 
 ```py
 def compute_metrics(eval_preds):
-    metric = load_metric("glue", "mrpc")
+    metric = evaluate.load("glue", "mrpc")
     logits, labels = eval_preds
     predictions = np.argmax(logits, axis=-1)
     return metric.compute(predictions=predictions, references=labels)
diff --git a/chapters/th/chapter3/3_tf.mdx b/chapters/th/chapter3/3_tf.mdx
index 6e6941754..e9ccd4611 100644
--- a/chapters/th/chapter3/3_tf.mdx
+++ b/chapters/th/chapter3/3_tf.mdx
@@ -179,12 +179,12 @@ print(preds.shape, class_preds.shape)
 (408, 2) (408,)
 ```
 
-ตอนนี้เรามาใช้ `preds` เพื่อคำนวณ metrics บางอย่างกันดีกว่า! เราสามารถโหลด metrics ที่เกี่ยวข้องกับ MRPC dataset ได้อย่างง่ายดายเหมือนกับที่เราโหลดชุดข้อมูล โดยการใช้ฟังก์ชั่น `load_metric()` โดยจะได้ผลลัพธ์เป็นออพเจ็กต์ที่มีเมธอด `compute()` ที่เราสามารถนำไปใช้ในการคำนวณ metric ได้:
+ตอนนี้เรามาใช้ `preds` เพื่อคำนวณ metrics บางอย่างกันดีกว่า! เราสามารถโหลด metrics ที่เกี่ยวข้องกับ MRPC dataset ได้อย่างง่ายดายเหมือนกับที่เราโหลดชุดข้อมูล โดยการใช้ฟังก์ชั่น `evaluate.load()` โดยจะได้ผลลัพธ์เป็นออพเจ็กต์ที่มีเมธอด `compute()` ที่เราสามารถนำไปใช้ในการคำนวณ metric ได้:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])
 ```
 
diff --git a/chapters/th/chapter3/4.mdx b/chapters/th/chapter3/4.mdx
index c8fcec348..d5f9f2f94 100644
--- a/chapters/th/chapter3/4.mdx
+++ b/chapters/th/chapter3/4.mdx
@@ -172,12 +172,12 @@ for epoch in range(num_epochs):
 
 ### ลูปในการประเมินผลโมเดล (evaluation loop)
 
-เหมือนกับที่เราได้ทำไว้ก่อนหน้านี้ เราสามารถเรียกใช้ metric จากไลบรารี่ 🤗 Datasets ได้เลย เราได้เห็นเมธอด `metric.compute() มาแล้ว แต่ metrics ยังสามารถรวบรวมผลมาเป็น batches ให้เราได้ด้วย โดยใช้เมธอด `add_batch()` โดยเมื่อเรารวบรวมผลมาจากทุก batches แล้ว เราก็จะคำนวณผลลัพธ์สุดท้ายได้โดยใช้เมธอด `metric.compute()` โค้ดข้างล่างนี้เป็นตัวอย่างการทำทุกอย่างที่เรากล่าวมานี้ในลูปสำหรับประเมินผลโมเดล:
+เหมือนกับที่เราได้ทำไว้ก่อนหน้านี้ เราสามารถเรียกใช้ metric จากไลบรารี่ 🤗 Evaluate ได้เลย เราได้เห็นเมธอด `metric.compute()` มาแล้ว แต่ metrics ยังสามารถรวบรวมผลมาเป็น batches ให้เราได้ด้วย โดยใช้เมธอด `add_batch()` โดยเมื่อเรารวบรวมผลมาจากทุก batches แล้ว เราก็จะคำนวณผลลัพธ์สุดท้ายได้โดยใช้เมธอด `metric.compute()` โค้ดข้างล่างนี้เป็นตัวอย่างการทำทุกอย่างที่เรากล่าวมานี้ในลูปสำหรับประเมินผลโมเดล:
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 model.eval()
 for batch in eval_dataloader:
     batch = {k: v.to(device) for k, v in batch.items()}
diff --git a/chapters/zh-CN/chapter3/3.mdx b/chapters/zh-CN/chapter3/3.mdx
index 1d452b8fe..de8018344 100644
--- a/chapters/zh-CN/chapter3/3.mdx
+++ b/chapters/zh-CN/chapter3/3.mdx
@@ -110,12 +110,12 @@ import numpy as np
 preds = np.argmax(predictions.predictions, axis=-1)
 ```
 
-现在建立我们的 **compute_metric()** 函数来较为直观地评估模型的好坏，我们将使用 🤗 Datasets 库中的指标。我们可以像加载数据集一样轻松加载与 MRPC 数据集关联的指标，这次使用 **load_metric()** 函数。返回的对象有一个 **compute()**方法我们可以用来进行度量计算的方法：
+现在建立我们的 **compute_metric()** 函数来较为直观地评估模型的好坏，我们将使用 🤗 [Evaluate](https://github.com/huggingface/evaluate/) 库中的指标。我们可以像加载数据集一样轻松加载与 MRPC 数据集关联的指标，这次使用 **evaluate.load()** 函数。返回的对象有一个 **compute()** 方法，我们可以用它来进行度量计算：
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=preds, references=predictions.label_ids)
 ```
 
@@ -129,7 +129,7 @@ metric.compute(predictions=preds, references=predictions.label_ids)
 
 ```py
 def compute_metrics(eval_preds):
-    metric = load_metric("glue", "mrpc")
+    metric = evaluate.load("glue", "mrpc")
     logits, labels = eval_preds
     predictions = np.argmax(logits, axis=-1)
     return metric.compute(predictions=predictions, references=labels)
diff --git a/chapters/zh-CN/chapter3/3_tf.mdx b/chapters/zh-CN/chapter3/3_tf.mdx
index 911e12a92..be3953a8c 100644
--- a/chapters/zh-CN/chapter3/3_tf.mdx
+++ b/chapters/zh-CN/chapter3/3_tf.mdx
@@ -172,12 +172,12 @@ print(preds.shape, class_preds.shape)
 (408, 2) (408,)
 ```
 
-现在，让我们使用这些 `preds` 来计算一些指标！ 我们可以像加载数据集一样轻松地加载与 MRPC 数据集相关的指标，这次使用的是 `load_metric()` 函数。 返回的对象有一个 `compute()` 方法，我们可以使用它来进行度量计算：
+现在，让我们使用这些 `preds` 来计算一些指标！ 我们可以像加载数据集一样轻松地加载与 MRPC 数据集相关的指标，这次使用的是 `evaluate.load()` 函数。 返回的对象有一个 `compute()` 方法，我们可以使用它来进行度量计算：
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"])
 ```
 
diff --git a/chapters/zh-CN/chapter3/4.mdx b/chapters/zh-CN/chapter3/4.mdx
index f1de4cc48..aab5f40a6 100644
--- a/chapters/zh-CN/chapter3/4.mdx
+++ b/chapters/zh-CN/chapter3/4.mdx
@@ -171,12 +171,12 @@ for epoch in range(num_epochs):
 
 ### 评估循环
 
-正如我们之前所做的那样，我们将使用 🤗 Datasets 库提供的指标。我们已经了解了 `metric.compute()` 方法，当我们使用 `add_batch()`方法进行预测循环时，实际上该指标可以为我们累积所有 `batch` 的结果。一旦我们累积了所有 `batch` ，我们就可以使用 `metric.compute()` 得到最终结果 .以下是在评估循环中实现所有这些的方法:
+正如我们之前所做的那样，我们将使用 🤗 Evaluate 库提供的指标。我们已经了解了 `metric.compute()` 方法，当我们使用 `add_batch()` 方法进行预测循环时，实际上该指标可以为我们累积所有 `batch` 的结果。一旦我们累积了所有 `batch`，我们就可以使用 `metric.compute()` 得到最终结果。以下是在评估循环中实现所有这些的方法：
 
 ```py
-from datasets import load_metric
+import evaluate
 
-metric = load_metric("glue", "mrpc")
+metric = evaluate.load("glue", "mrpc")
 model.eval()
 for batch in eval_dataloader:
     batch = {k: v.to(device) for k, v in batch.items()}
diff --git a/utils/generate_notebooks.py b/utils/generate_notebooks.py
index 875797744..bd41c5dcb 100644
--- a/utils/generate_notebooks.py
+++ b/utils/generate_notebooks.py
@@ -185,11 +185,11 @@ def build_notebook(fname, title, output_dir="."):
 
         nb_cells = [
             nb_cell(f"# {title}", code=False),
-            nb_cell("Install the Transformers and Datasets libraries to run this notebook.", code=False),
+            nb_cell("Install the Transformers, Datasets, and Evaluate libraries to run this notebook.", code=False),
         ]
 
         # Install cell
-        installs = ["!pip install datasets transformers[sentencepiece]"]
+        installs = ["!pip install datasets evaluate transformers[sentencepiece]"]
         if title in sections_with_accelerate:
             installs.append("!pip install accelerate")
             installs.append("# To run the training on TPU, you will need to uncomment the followin line:")
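
The hunks in this diff repeat the same three mechanical rewrites: the combined `datasets` import is split, the standalone `load_metric` import becomes `import evaluate`, and each `load_metric(...)` call becomes `evaluate.load(...)`. As a rough sketch only, the rewrite could be scripted as below. The `migrate` helper and its regex constants are hypothetical names introduced for illustration and are not part of this PR; the sketch assumes the imports appear exactly in the two forms seen in the hunks.

```py
import re

# Hypothetical helper (not part of this PR): applies the same textual
# rewrites that the hunks above perform by hand.
IMPORT_BOTH = re.compile(r"^from datasets import load_dataset, load_metric$", re.M)
IMPORT_ONLY = re.compile(r"^from datasets import load_metric$", re.M)
CALL = re.compile(r"\bload_metric\(")


def migrate(source: str) -> str:
    """Rewrite `datasets.load_metric` usage to `evaluate.load` in a text blob."""
    # Split the combined import into a datasets import plus `import evaluate`.
    source = IMPORT_BOTH.sub("from datasets import load_dataset\nimport evaluate", source)
    # Replace the standalone metric import.
    source = IMPORT_ONLY.sub("import evaluate", source)
    # Rewrite every call site (this also covers `load_metric()` mentions in prose).
    return CALL.sub("evaluate.load(", source)


old = 'from datasets import load_metric\n\nmetric = load_metric("glue", "mrpc")\n'
print(migrate(old))
```

A text-level rewrite like this would not catch aliased imports or multi-line import lists, so the result would still need a manual review, as was presumably done for this change.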