
Commit c433132

TF generation fixes (#344)
* Fixes to chapter 7

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
1 parent 6e652fe commit c433132

6 files changed: 82 additions, 51 deletions


chapters/en/chapter7/2.mdx

Lines changed: 2 additions & 2 deletions
@@ -371,7 +371,7 @@ As we can see, the second set of labels has been padded to the length of the fir
 
 {:else}
 
-Our data collator is ready to go! Now let's use it to make a `tf.data.Dataset` with the `to_tf_dataset()` method.
+Our data collator is ready to go! Now let's use it to make a `tf.data.Dataset` with the `to_tf_dataset()` method. You can also use `model.prepare_tf_dataset()` to do this with a bit less boilerplate code - you'll see this in some of the other sections of this chapter.
 
 ```py
 tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
@@ -616,7 +616,7 @@ import numpy as np
 all_predictions = []
 all_labels = []
 for batch in tf_eval_dataset:
-    logits = model.predict(batch)["logits"]
+    logits = model.predict_on_batch(batch)["logits"]
     labels = batch["labels"]
     predictions = np.argmax(logits, axis=-1)
     for prediction, label in zip(predictions, labels):
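
Two notes on this file's changes. First, the new sentence points readers at `model.prepare_tf_dataset()`; a rough sketch of the boilerplate it saves is below (the dataset, collator, and model names are reused from the chapter, and the comparison itself is illustrative, not part of the commit). Second, the swap from `model.predict()` to `model.predict_on_batch()` matters because `predict()` sets up a full Keras prediction loop on every call, while `predict_on_batch()` simply runs a forward pass on the batch it is handed, which is all this manual evaluation loop needs.

```python
# Illustrative comparison, assuming `tokenized_datasets`, `data_collator`, and
# `model` already exist as in the chapter.

# Dataset.to_tf_dataset() requires listing the columns explicitly:
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

# model.prepare_tf_dataset() asks the model which columns it accepts, so the
# `columns` argument can be dropped:
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)
```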

chapters/en/chapter7/3.mdx

Lines changed: 6 additions & 6 deletions
@@ -96,7 +96,6 @@ model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)
 We can see how many parameters this model has by calling the `summary()` method:
 
 ```python
-model(model.dummy_inputs)  # Build the model
 model.summary()
 ```
 
@@ -636,18 +635,18 @@ in your favorite terminal and log in there.
 
 {#if fw === 'tf'}
 
-Once we're logged in, we can create our `tf.data` datasets. We'll just use the standard data collator here, but you can also try the whole word masking collator and compare the results as an exercise:
+Once we're logged in, we can create our `tf.data` datasets. To do so, we'll use the `prepare_tf_dataset()` method, which uses our model to automatically infer which columns should go into the dataset. If you want to control exactly which columns to use, you can use the `Dataset.to_tf_dataset()` method instead. To keep things simple, we'll just use the standard data collator here, but you can also try the whole word masking collator and compare the results as an exercise:
 
 ```python
-tf_train_dataset = downsampled_dataset["train"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_train_dataset = model.prepare_tf_dataset(
+    downsampled_dataset["train"],
     collate_fn=data_collator,
     shuffle=True,
     batch_size=32,
 )
 
-tf_eval_dataset = downsampled_dataset["test"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_eval_dataset = model.prepare_tf_dataset(
+    downsampled_dataset["test"],
     collate_fn=data_collator,
     shuffle=False,
     batch_size=32,
@@ -675,6 +674,7 @@ model.compile(optimizer=optimizer)
 # Train in mixed-precision float16
 tf.keras.mixed_precision.set_global_policy("mixed_float16")
 
+model_name = model_checkpoint.split("/")[-1]
 callback = PushToHubCallback(
     output_dir=f"{model_name}-finetuned-imdb", tokenizer=tokenizer
 )
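
The `model_name` line added in the last hunk exists so that the Hub repository and output directory are named after the checkpoint alone, even when the checkpoint id carries a namespace. A small standalone illustration (the checkpoint strings here are examples, not taken from the commit):

```python
# Only the final path component should end up in "{model_name}-finetuned-imdb".
for checkpoint in ("distilroberta-base", "some-org/distilroberta-base"):
    model_name = checkpoint.split("/")[-1]
    print(f"{model_name}-finetuned-imdb")
# Both iterations print: distilroberta-base-finetuned-imdb
```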

chapters/en/chapter7/4.mdx

Lines changed: 31 additions & 17 deletions
@@ -378,14 +378,14 @@ We will pass this `data_collator` along to the `Seq2SeqTrainer`. Next, let's hav
 We can now use this `data_collator` to convert each of our datasets to a `tf.data.Dataset`, ready for training:
 
 ```python
-tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_train_dataset = model.prepare_tf_dataset(
+    tokenized_datasets["train"],
     collate_fn=data_collator,
     shuffle=True,
     batch_size=32,
 )
-tf_eval_dataset = tokenized_datasets["validation"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_eval_dataset = model.prepare_tf_dataset(
+    tokenized_datasets["validation"],
     collate_fn=data_collator,
     shuffle=False,
     batch_size=16,
@@ -495,28 +495,42 @@ The score can go from 0 to 100, and higher is better.
 
 {#if fw === 'tf'}
 
-To get from the model outputs to texts the metric can use, we will use the `tokenizer.batch_decode()` method. We just have to clean up all the `-100`s in the labels; the tokenizer will automatically do the same for the padding token. Let's define a function that takes our model and a dataset and computes metrics on it. Because generation of long sequences can be slow, we subsample the validation set to make sure this doesn't take forever:
+To get from the model outputs to texts the metric can use, we will use the `tokenizer.batch_decode()` method. We just have to clean up all the `-100`s in the labels; the tokenizer will automatically do the same for the padding token. Let's define a function that takes our model and a dataset and computes metrics on it. We're also going to use a trick that dramatically increases performance - compiling our generation code with [XLA](https://www.tensorflow.org/xla), TensorFlow's accelerated linear algebra compiler. XLA applies various optimizations to the model's computation graph, and results in significant improvements to speed and memory usage. As described in the Hugging Face [blog](https://huggingface.co/blog/tf-xla-generate), XLA works best when our input shapes don't vary too much. To handle this, we'll pad our inputs to multiples of 128, make a new dataset with the padding collator, and then apply the `@tf.function(jit_compile=True)` decorator to our generation function, which marks the whole function for compilation with XLA.
 
 ```py
 import numpy as np
+import tensorflow as tf
+from tqdm import tqdm
+
+generation_data_collator = DataCollatorForSeq2Seq(
+    tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128
+)
+
+tf_generate_dataset = model.prepare_tf_dataset(
+    tokenized_datasets["validation"],
+    collate_fn=generation_data_collator,
+    shuffle=False,
+    batch_size=8,
+)
+
+
+@tf.function(jit_compile=True)
+def generate_with_xla(batch):
+    return model.generate(
+        input_ids=batch["input_ids"],
+        attention_mask=batch["attention_mask"],
+        max_new_tokens=128,
+    )
 
 
 def compute_metrics():
     all_preds = []
     all_labels = []
-    sampled_dataset = tokenized_datasets["validation"].shuffle().select(range(200))
-    tf_generate_dataset = sampled_dataset.to_tf_dataset(
-        columns=["input_ids", "attention_mask", "labels"],
-        collate_fn=data_collator,
-        shuffle=False,
-        batch_size=4,
-    )
-    for batch in tf_generate_dataset:
-        predictions = model.generate(
-            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
-        )
+
+    for batch, labels in tqdm(tf_generate_dataset):
+        predictions = generate_with_xla(batch)
         decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
-        labels = batch["labels"].numpy()
+        labels = labels.numpy()
         labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
         decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
         decoded_preds = [pred.strip() for pred in decoded_preds]
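
The XLA paragraph added above hinges on keeping the set of input shapes small, which is what `pad_to_multiple_of=128` achieves. A quick standalone sketch of the effect (plain arithmetic, not code from the commit):

```python
# With pad_to_multiple_of=128, every batch's sequence length lands in one of a
# few buckets, so the @tf.function(jit_compile=True)-decorated generation
# function only recompiles the first time each bucket is seen.
import math


def padded_length(raw_length: int, multiple: int = 128) -> int:
    return math.ceil(raw_length / multiple) * multiple


print([padded_length(n) for n in (37, 100, 128, 129, 300)])
# [128, 128, 128, 256, 384]
```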

chapters/en/chapter7/5.mdx

Lines changed: 33 additions & 10 deletions
@@ -289,9 +289,10 @@ def preprocess_function(examples):
         max_length=max_input_length,
         truncation=True,
     )
-    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)
+    labels = tokenizer(
+        examples["review_title"], max_length=max_target_length, truncation=True
+    )
     model_inputs["labels"] = labels["input_ids"]
-    model_inputs["labels_mask"] = labels["attention_mask"]
     return model_inputs
 ```
 
@@ -673,14 +674,14 @@ To wrap up this section, let's take a look at how we can also fine-tune mT5 usin
 We're almost ready to train! We just need to convert our datasets to `tf.data.Dataset`s using the data collator we defined above, and then `compile()` and `fit()` the model. First, the datasets:
 
 ```python
-tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_train_dataset = model.prepare_tf_dataset(
+    tokenized_datasets["train"],
     collate_fn=data_collator,
     shuffle=True,
    batch_size=8,
 )
-tf_eval_dataset = tokenized_datasets["validation"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_eval_dataset = model.prepare_tf_dataset(
+    tokenized_datasets["validation"],
     collate_fn=data_collator,
     shuffle=False,
     batch_size=8,
@@ -727,18 +728,40 @@ model.fit(
 )
 ```
 
-We got some loss values during training, but really we'd like to see the ROUGE metrics we computed earlier. To get those metrics, we'll need to generate outputs from the model and convert them to strings. Let's build some lists of labels and predictions for the ROUGE metric to compare (note that if you get import errors for this section, you may need to `!pip install tqdm`):
+We got some loss values during training, but really we'd like to see the ROUGE metrics we computed earlier. To get those metrics, we'll need to generate outputs from the model and convert them to strings. Let's build some lists of labels and predictions for the ROUGE metric to compare (note that if you get import errors for this section, you may need to `!pip install tqdm`). We're also going to use a trick that dramatically increases performance - compiling our generation code with [XLA](https://www.tensorflow.org/xla), TensorFlow's accelerated linear algebra compiler. XLA applies various optimizations to the model's computation graph, and results in significant improvements to speed and memory usage. As described in the Hugging Face [blog](https://huggingface.co/blog/tf-xla-generate), XLA works best when our input shapes don't vary too much. To handle this, we'll pad our inputs with the padding collator, make a new dataset with it, and then apply the `@tf.function(jit_compile=True)` decorator to our generation function, which marks the whole function for compilation with XLA.
 
 ```python
 from tqdm import tqdm
 import numpy as np
 
+generation_data_collator = DataCollatorForSeq2Seq(
+    tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=320
+)
+
+tf_generate_dataset = model.prepare_tf_dataset(
+    tokenized_datasets["validation"],
+    collate_fn=generation_data_collator,
+    shuffle=False,
+    batch_size=8,
+    drop_remainder=True,
+)
+
+
+@tf.function(jit_compile=True)
+def generate_with_xla(batch):
+    return model.generate(
+        input_ids=batch["input_ids"],
+        attention_mask=batch["attention_mask"],
+        max_new_tokens=32,
+    )
+
+
 all_preds = []
 all_labels = []
-for batch in tqdm(tf_eval_dataset):
-    predictions = model.generate(**batch)
+for batch, labels in tqdm(tf_generate_dataset):
+    predictions = generate_with_xla(batch)
     decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
-    labels = batch["labels"].numpy()
+    labels = labels.numpy()
     labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
     decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]

chapters/en/chapter7/6.mdx

Lines changed: 6 additions & 6 deletions
@@ -379,17 +379,17 @@ We can see that the examples have been stacked and all the tensors have the same
 
 {#if fw === 'tf'}
 
-Now we can use the `to_tf_dataset()` method to convert our datasets to TensorFlow datasets with the data collator we created above:
+Now we can use the `prepare_tf_dataset()` method to convert our datasets to TensorFlow datasets with the data collator we created above:
 
 ```python
-tf_train_dataset = tokenized_dataset["train"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_train_dataset = model.prepare_tf_dataset(
+    tokenized_dataset["train"],
     collate_fn=data_collator,
     shuffle=True,
     batch_size=32,
 )
-tf_eval_dataset = tokenized_dataset["valid"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_eval_dataset = model.prepare_tf_dataset(
+    tokenized_dataset["valid"],
     collate_fn=data_collator,
     shuffle=False,
     batch_size=32,
@@ -515,7 +515,7 @@ model.fit(tf_train_dataset, validation_data=tf_eval_dataset, callbacks=[callback
 
 {:else}
 
-💡 If you have access to a machine with multiple GPUs, you can try using a `MirroredStrategy` context to substantially speed up training. You'll need to create a `tf.distribute.MirroredStrategy` object, and make sure that the `to_tf_dataset` commands as well as model creation and the call to `fit()` are all run in its `scope()` context. You can see documentation on this [here](https://www.tensorflow.org/guide/distributed_training#use_tfdistributestrategy_with_keras_modelfit).
+💡 If you have access to a machine with multiple GPUs, you can try using a `MirroredStrategy` context to substantially speed up training. You'll need to create a `tf.distribute.MirroredStrategy` object, and make sure that any `to_tf_dataset()` or `prepare_tf_dataset()` methods as well as model creation and the call to `fit()` are all run in its `scope()` context. You can see documentation on this [here](https://www.tensorflow.org/guide/distributed_training#use_tfdistributestrategy_with_keras_modelfit).
 
 {/if}
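
The updated tip is the only place this commit touches `MirroredStrategy`. Below is a minimal, self-contained sketch of the pattern it describes, using a toy model and dataset rather than the chapter's objects:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # In the chapter's setting, model creation, compile(), the
    # to_tf_dataset()/prepare_tf_dataset() calls, and fit() would all sit here.
    model = tf.keras.Sequential([tf.keras.Input(shape=(8,)), tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.random.normal((64, 8)), tf.random.normal((64, 1)))
    ).batch(16)
    model.fit(dataset, epochs=1, verbose=0)
```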

chapters/en/chapter7/7.mdx

Lines changed: 4 additions & 10 deletions
@@ -862,20 +862,14 @@ data_collator = DefaultDataCollator(return_tensors="tf")
 And now we create the datasets as usual.
 
 ```python
-tf_train_dataset = train_dataset.to_tf_dataset(
-    columns=[
-        "input_ids",
-        "start_positions",
-        "end_positions",
-        "attention_mask",
-        "token_type_ids",
-    ],
+tf_train_dataset = model.prepare_tf_dataset(
+    train_dataset,
     collate_fn=data_collator,
     shuffle=True,
     batch_size=16,
 )
-tf_eval_dataset = validation_dataset.to_tf_dataset(
-    columns=["input_ids", "attention_mask", "token_type_ids"],
+tf_eval_dataset = model.prepare_tf_dataset(
+    validation_dataset,
     collate_fn=data_collator,
     shuffle=False,
     batch_size=16,
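
For context, the datasets built here are then compiled and fitted roughly as in the sketch below; it assumes the chapter's `model` and the usual `create_optimizer` recipe, and is not code from this commit:

```python
# Sketch: typical training setup once tf_train_dataset exists.
from transformers import create_optimizer

num_train_epochs = 3
num_train_steps = len(tf_train_dataset) * num_train_epochs
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
)
model.compile(optimizer=optimizer)
model.fit(tf_train_dataset, epochs=num_train_epochs)
```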
