
Commit c433132

TF generation fixes (#344)
* Fixes to chapter 7

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
1 parent 6e652fe commit c433132

6 files changed: 82 additions, 51 deletions


chapters/en/chapter7/2.mdx

Lines changed: 2 additions & 2 deletions
@@ -371,7 +371,7 @@ As we can see, the second set of labels has been padded to the length of the fir
 
 {:else}
 
-Our data collator is ready to go! Now let's use it to make a `tf.data.Dataset` with the `to_tf_dataset()` method.
+Our data collator is ready to go! Now let's use it to make a `tf.data.Dataset` with the `to_tf_dataset()` method. You can also use `model.prepare_tf_dataset()` to do this with a bit less boilerplate code - you'll see this in some of the other sections of this chapter.
 
 ```py
 tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
@@ -616,7 +616,7 @@ import numpy as np
 all_predictions = []
 all_labels = []
 for batch in tf_eval_dataset:
-    logits = model.predict(batch)["logits"]
+    logits = model.predict_on_batch(batch)["logits"]
     labels = batch["labels"]
     predictions = np.argmax(logits, axis=-1)
     for prediction, label in zip(predictions, labels):
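
Two notes on this file's changes. First, the new sentence points readers at `model.prepare_tf_dataset()`; a rough sketch of the boilerplate it saves is below (the dataset, collator, and model names are reused from the chapter, and the comparison itself is illustrative, not part of the commit). Second, the swap from `model.predict()` to `model.predict_on_batch()` matters because `predict()` sets up a full Keras prediction loop on every call, while `predict_on_batch()` simply runs a forward pass on the batch it is handed, which is all this manual evaluation loop needs.

```python
# Illustrative comparison, assuming `tokenized_datasets`, `data_collator`, and
# `model` already exist as in the chapter.

# Dataset.to_tf_dataset() requires listing the columns explicitly:
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

# model.prepare_tf_dataset() asks the model which columns it accepts, so the
# `columns` argument can be dropped:
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)
```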

chapters/en/chapter7/3.mdx

Lines changed: 6 additions & 6 deletions
@@ -96,7 +96,6 @@ model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)
 We can see how many parameters this model has by calling the `summary()` method:
 
 ```python
-model(model.dummy_inputs)  # Build the model
 model.summary()
 ```
 
@@ -636,18 +635,18 @@ in your favorite terminal and log in there.
 
 {#if fw === 'tf'}
 
-Once we're logged in, we can create our `tf.data` datasets. We'll just use the standard data collator here, but you can also try the whole word masking collator and compare the results as an exercise:
+Once we're logged in, we can create our `tf.data` datasets. To do so, we'll use the `prepare_tf_dataset()` method, which uses our model to automatically infer which columns should go into the dataset. If you want to control exactly which columns to use, you can use the `Dataset.to_tf_dataset()` method instead. To keep things simple, we'll just use the standard data collator here, but you can also try the whole word masking collator and compare the results as an exercise:
 
 ```python
-tf_train_dataset = downsampled_dataset["train"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_train_dataset = model.prepare_tf_dataset(
+    downsampled_dataset["train"],
     collate_fn=data_collator,
     shuffle=True,
     batch_size=32,
 )
 
-tf_eval_dataset = downsampled_dataset["test"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_eval_dataset = model.prepare_tf_dataset(
+    downsampled_dataset["test"],
     collate_fn=data_collator,
     shuffle=False,
     batch_size=32,
@@ -675,6 +674,7 @@ model.compile(optimizer=optimizer)
 # Train in mixed-precision float16
 tf.keras.mixed_precision.set_global_policy("mixed_float16")
 
+model_name = model_checkpoint.split("/")[-1]
 callback = PushToHubCallback(
     output_dir=f"{model_name}-finetuned-imdb", tokenizer=tokenizer
 )
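
The `model_name` line added in the last hunk exists so that the Hub repository and output directory are named after the checkpoint alone, even when the checkpoint id carries a namespace. A small standalone illustration (the checkpoint strings here are examples, not taken from the commit):

```python
# Only the final path component should end up in "{model_name}-finetuned-imdb".
for checkpoint in ("distilroberta-base", "some-org/distilroberta-base"):
    model_name = checkpoint.split("/")[-1]
    print(f"{model_name}-finetuned-imdb")
# Both iterations print: distilroberta-base-finetuned-imdb
```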

chapters/en/chapter7/4.mdx

Lines changed: 31 additions & 17 deletions
@@ -378,14 +378,14 @@ We will pass this `data_collator` along to the `Seq2SeqTrainer`. Next, let's hav
 We can now use this `data_collator` to convert each of our datasets to a `tf.data.Dataset`, ready for training:
 
 ```python
-tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_train_dataset = model.prepare_tf_dataset(
+    tokenized_datasets["train"],
     collate_fn=data_collator,
     shuffle=True,
     batch_size=32,
 )
-tf_eval_dataset = tokenized_datasets["validation"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_eval_dataset = model.prepare_tf_dataset(
+    tokenized_datasets["validation"],
     collate_fn=data_collator,
     shuffle=False,
     batch_size=16,
@@ -495,28 +495,42 @@ The score can go from 0 to 100, and higher is better.
 
 {#if fw === 'tf'}
 
-To get from the model outputs to texts the metric can use, we will use the `tokenizer.batch_decode()` method. We just have to clean up all the `-100`s in the labels; the tokenizer will automatically do the same for the padding token. Let's define a function that takes our model and a dataset and computes metrics on it. Because generation of long sequences can be slow, we subsample the validation set to make sure this doesn't take forever:
+To get from the model outputs to texts the metric can use, we will use the `tokenizer.batch_decode()` method. We just have to clean up all the `-100`s in the labels; the tokenizer will automatically do the same for the padding token. Let's define a function that takes our model and a dataset and computes metrics on it. We're also going to use a trick that dramatically increases performance - compiling our generation code with [XLA](https://www.tensorflow.org/xla), TensorFlow's accelerated linear algebra compiler. XLA applies various optimizations to the model's computation graph, and results in significant improvements to speed and memory usage. As described in the Hugging Face [blog](https://huggingface.co/blog/tf-xla-generate), XLA works best when our input shapes don't vary too much. To handle this, we'll pad our inputs to multiples of 128, make a new dataset with the padding collator, and then apply the `@tf.function(jit_compile=True)` decorator to our generation function, which marks the whole function for compilation with XLA.
 
 ```py
 import numpy as np
+import tensorflow as tf
+from tqdm import tqdm
+
+generation_data_collator = DataCollatorForSeq2Seq(
+    tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128
+)
+
+tf_generate_dataset = model.prepare_tf_dataset(
+    tokenized_datasets["validation"],
+    collate_fn=generation_data_collator,
+    shuffle=False,
+    batch_size=8,
+)
+
+
+@tf.function(jit_compile=True)
+def generate_with_xla(batch):
+    return model.generate(
+        input_ids=batch["input_ids"],
+        attention_mask=batch["attention_mask"],
+        max_new_tokens=128,
+    )
 
 
 def compute_metrics():
     all_preds = []
     all_labels = []
-    sampled_dataset = tokenized_datasets["validation"].shuffle().select(range(200))
-    tf_generate_dataset = sampled_dataset.to_tf_dataset(
-        columns=["input_ids", "attention_mask", "labels"],
-        collate_fn=data_collator,
-        shuffle=False,
-        batch_size=4,
-    )
-    for batch in tf_generate_dataset:
-        predictions = model.generate(
-            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
-        )
+
+    for batch, labels in tqdm(tf_generate_dataset):
+        predictions = generate_with_xla(batch)
         decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
-        labels = batch["labels"].numpy()
+        labels = labels.numpy()
         labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
         decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
         decoded_preds = [pred.strip() for pred in decoded_preds]
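
The XLA paragraph added above hinges on keeping the set of input shapes small, which is what `pad_to_multiple_of=128` achieves. A quick standalone sketch of the effect (plain arithmetic, not code from the commit):

```python
# With pad_to_multiple_of=128, every batch's sequence length lands in one of a
# few buckets, so the @tf.function(jit_compile=True)-decorated generation
# function only recompiles the first time each bucket is seen.
import math


def padded_length(raw_length: int, multiple: int = 128) -> int:
    return math.ceil(raw_length / multiple) * multiple


print([padded_length(n) for n in (37, 100, 128, 129, 300)])
# [128, 128, 128, 256, 384]
```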

chapters/en/chapter7/5.mdx

Lines changed: 33 additions & 10 deletions
@@ -289,9 +289,10 @@ def preprocess_function(examples):
         max_length=max_input_length,
         truncation=True,
     )
-    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)
+    labels = tokenizer(
+        examples["review_title"], max_length=max_target_length, truncation=True
+    )
     model_inputs["labels"] = labels["input_ids"]
-    model_inputs["labels_mask"] = labels["attention_mask"]
     return model_inputs
 ```
 
@@ -673,14 +674,14 @@ To wrap up this section, let's take a look at how we can also fine-tune mT5 usin
 We're almost ready to train! We just need to convert our datasets to `tf.data.Dataset`s using the data collator we defined above, and then `compile()` and `fit()` the model. First, the datasets:
 
 ```python
-tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_train_dataset = model.prepare_tf_dataset(
+    tokenized_datasets["train"],
     collate_fn=data_collator,
     shuffle=True,
    batch_size=8,
 )
-tf_eval_dataset = tokenized_datasets["validation"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_eval_dataset = model.prepare_tf_dataset(
+    tokenized_datasets["validation"],
     collate_fn=data_collator,
     shuffle=False,
     batch_size=8,
@@ -727,18 +728,40 @@ model.fit(
 )
 ```
 
-We got some loss values during training, but really we'd like to see the ROUGE metrics we computed earlier. To get those metrics, we'll need to generate outputs from the model and convert them to strings. Let's build some lists of labels and predictions for the ROUGE metric to compare (note that if you get import errors for this section, you may need to `!pip install tqdm`):
+We got some loss values during training, but really we'd like to see the ROUGE metrics we computed earlier. To get those metrics, we'll need to generate outputs from the model and convert them to strings. Let's build some lists of labels and predictions for the ROUGE metric to compare (note that if you get import errors for this section, you may need to `!pip install tqdm`). We're also going to use a trick that dramatically increases performance - compiling our generation code with [XLA](https://www.tensorflow.org/xla), TensorFlow's accelerated linear algebra compiler. XLA applies various optimizations to the model's computation graph, and results in significant improvements to speed and memory usage. As described in the Hugging Face [blog](https://huggingface.co/blog/tf-xla-generate), XLA works best when our input shapes don't vary too much. To handle this, we'll pad our inputs with the padding collator, make a new dataset with it, and then apply the `@tf.function(jit_compile=True)` decorator to our generation function, which marks the whole function for compilation with XLA.
 
 ```python
 from tqdm import tqdm
 import numpy as np
 
+generation_data_collator = DataCollatorForSeq2Seq(
+    tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=320
+)
+
+tf_generate_dataset = model.prepare_tf_dataset(
+    tokenized_datasets["validation"],
+    collate_fn=generation_data_collator,
+    shuffle=False,
+    batch_size=8,
+    drop_remainder=True,
+)
+
+
+@tf.function(jit_compile=True)
+def generate_with_xla(batch):
+    return model.generate(
+        input_ids=batch["input_ids"],
+        attention_mask=batch["attention_mask"],
+        max_new_tokens=32,
+    )
+
+
 all_preds = []
 all_labels = []
-for batch in tqdm(tf_eval_dataset):
-    predictions = model.generate(**batch)
+for batch, labels in tqdm(tf_generate_dataset):
+    predictions = generate_with_xla(batch)
     decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
-    labels = batch["labels"].numpy()
+    labels = labels.numpy()
     labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
     decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]

chapters/en/chapter7/6.mdx

Lines changed: 6 additions & 6 deletions
@@ -379,17 +379,17 @@ We can see that the examples have been stacked and all the tensors have the same
 
 {#if fw === 'tf'}
 
-Now we can use the `to_tf_dataset()` method to convert our datasets to TensorFlow datasets with the data collator we created above:
+Now we can use the `prepare_tf_dataset()` method to convert our datasets to TensorFlow datasets with the data collator we created above:
 
 ```python
-tf_train_dataset = tokenized_dataset["train"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_train_dataset = model.prepare_tf_dataset(
+    tokenized_dataset["train"],
     collate_fn=data_collator,
     shuffle=True,
     batch_size=32,
 )
-tf_eval_dataset = tokenized_dataset["valid"].to_tf_dataset(
-    columns=["input_ids", "attention_mask", "labels"],
+tf_eval_dataset = model.prepare_tf_dataset(
+    tokenized_dataset["valid"],
     collate_fn=data_collator,
     shuffle=False,
     batch_size=32,
@@ -515,7 +515,7 @@ model.fit(tf_train_dataset, validation_data=tf_eval_dataset, callbacks=[callback
 
 {:else}
 
-💡 If you have access to a machine with multiple GPUs, you can try using a `MirroredStrategy` context to substantially speed up training. You'll need to create a `tf.distribute.MirroredStrategy` object, and make sure that the `to_tf_dataset` commands as well as model creation and the call to `fit()` are all run in its `scope()` context. You can see documentation on this [here](https://www.tensorflow.org/guide/distributed_training#use_tfdistributestrategy_with_keras_modelfit).
+💡 If you have access to a machine with multiple GPUs, you can try using a `MirroredStrategy` context to substantially speed up training. You'll need to create a `tf.distribute.MirroredStrategy` object, and make sure that any `to_tf_dataset()` or `prepare_tf_dataset()` methods as well as model creation and the call to `fit()` are all run in its `scope()` context. You can see documentation on this [here](https://www.tensorflow.org/guide/distributed_training#use_tfdistributestrategy_with_keras_modelfit).
 
 {/if}
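
The updated tip is the only place this commit touches `MirroredStrategy`. Below is a minimal, self-contained sketch of the pattern it describes, using a toy model and dataset rather than the chapter's objects:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # In the chapter's setting, model creation, compile(), the
    # to_tf_dataset()/prepare_tf_dataset() calls, and fit() would all sit here.
    model = tf.keras.Sequential([tf.keras.Input(shape=(8,)), tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.random.normal((64, 8)), tf.random.normal((64, 1)))
    ).batch(16)
    model.fit(dataset, epochs=1, verbose=0)
```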

chapters/en/chapter7/7.mdx

Lines changed: 4 additions & 10 deletions
@@ -862,20 +862,14 @@ data_collator = DefaultDataCollator(return_tensors="tf")
 And now we create the datasets as usual.
 
 ```python
-tf_train_dataset = train_dataset.to_tf_dataset(
-    columns=[
-        "input_ids",
-        "start_positions",
-        "end_positions",
-        "attention_mask",
-        "token_type_ids",
-    ],
+tf_train_dataset = model.prepare_tf_dataset(
+    train_dataset,
     collate_fn=data_collator,
     shuffle=True,
     batch_size=16,
 )
-tf_eval_dataset = validation_dataset.to_tf_dataset(
-    columns=["input_ids", "attention_mask", "token_type_ids"],
+tf_eval_dataset = model.prepare_tf_dataset(
+    validation_dataset,
     collate_fn=data_collator,
     shuffle=False,
     batch_size=16,
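
For context, the datasets built here are then compiled and fitted roughly as in the sketch below; it assumes the chapter's `model` and the usual `create_optimizer` recipe, and is not code from this commit:

```python
# Sketch: typical training setup once tf_train_dataset exists.
from transformers import create_optimizer

num_train_epochs = 3
num_train_steps = len(tf_train_dataset) * num_train_epochs
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
)
model.compile(optimizer=optimizer)
model.fit(tf_train_dataset, epochs=num_train_epochs)
```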
