
[Examples] TPU-based training of a language model using TensorFlow #21657

Merged 29 commits into main from examples/tf-tpu on Apr 14, 2023

Conversation

@sayakpaul (Member) commented Feb 16, 2023

This PR adds an example of performing (masked) language model training using TensorFlow and TPUs. The example is meant to act as a reference for the community on this topic. The following are the main components of the PR:

  • Tokenizer training script (for completeness)
  • TFRecords preparation script (recommended practice when using TPUs)
  • Training script
  • Evaluation / inference

The purpose of this separation (as opposed to having everything in a single script) is to allow the community to have isolated reference points for performing TPU-based training of our models, which I think is beneficial.

The artifacts produced during this project can be found here: https://huggingface.co/tf-tpu.

Cc: @Rocketknight1 @gante @amyeroberts

@sayakpaul added the TensorFlow, Examples, and TPU labels on Feb 16, 2023
)
parser.add_argument(
    "-vs",
    "--vocab_size",
Member Author
Maybe we should play around with this a bit to see if a multiple of 64 actually helps improve the efficiency. Reference: https://twitter.com/karpathy/status/1621578354024677377?s=20

Member
I think we can just use a multiple of 64 anyway, it's not really a big change! The next multiple of 64 after 10000 is 10048.
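
For reference, rounding an arbitrary target vocabulary size up to the next multiple of 64 is a one-liner; a minimal sketch (the helper name is illustrative, not from the scripts in this PR):

```python
def next_multiple(value: int, multiple: int = 64) -> int:
    """Round `value` up to the nearest multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple

# 10000 -> 10048, the next multiple of 64 mentioned above.
print(next_multiple(10000))
```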

Member Author
Do you want me to retrain the tokenizer and redo the TFRecords with it?

@HuggingFaceDocBuilderDev commented Feb 16, 2023

The documentation is not available anymore as the PR was closed or merged.

Comment on lines 42 to 47
parser.add_argument(
    "--shard_size",
    type=int,
    default=1000,
    help="Number of entries to go in a single shard.",
)
Member Author
We should likely follow the advice in this guide to decide on this number when running things at full scale.
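
For context, here is a minimal sketch of how a --shard_size value like this is typically consumed when writing TFRecord shards; the feature name, filename pattern, and helper functions are illustrative rather than the exact code in this PR:

```python
import tensorflow as tf

def serialize_example(input_ids):
    # Store one tokenized sequence as a single int64 feature.
    feature = {
        "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids))
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

def write_shards(tokenized_sequences, shard_size=1000, prefix="train"):
    # Slice the data into shards of `shard_size` entries; the last shard may be smaller.
    for start in range(0, len(tokenized_sequences), shard_size):
        shard = tokenized_sequences[start : start + shard_size]
        filename = f"{prefix}-{start // shard_size:05d}-{len(shard)}.tfrecord"
        with tf.io.TFRecordWriter(filename) as writer:
            for seq in shard:
                writer.write(serialize_example(seq))
```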

@sayakpaul (Member Author)

@Rocketknight1 I incorporated the group_texts() utility that we discussed over Slack. Let me know if the changes look good to you. Most of it is copy-pasted from here.

Here's a Colab notebook where I verified these changes.
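
For readers who haven't seen it, group_texts() concatenates the tokenized texts in a batch and re-slices them into fixed-length blocks so no tokens are wasted on padding. A minimal sketch along the lines of the existing run_mlm examples (the block size of 512 matches the TFRecords mentioned later in this thread):

```python
def group_texts(examples, block_size=512):
    # Concatenate every field (input_ids, attention_mask, ...) across the batch.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder so that every block is exactly `block_size` tokens long.
    total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
```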

@sayakpaul (Member Author)

@Rocketknight1 I took a deeper look into the TFRecord preparation script. I don't understand why there's a discrepancy in the following.

While serializing the TFRecords, I make sure each TFRecord shard has a specific number of samples. It's fine when a shard ends up with fewer samples than the specified amount.

But when I load the TFRecords back and create a tf.data.Dataset out of them, the number of entries in the dataset (before batching) is much smaller.

Here is a minimal Colab Notebook that demonstrates the issue: https://colab.research.google.com/gist/sayakpaul/b4b02f3f656c0041c93f6ba78c8e65fd/scratchpad.ipynb.

When you get a moment, could you take a look?
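
One way to sanity-check both sides is to count the raw serialized records before any parsing or batching; a minimal sketch (the file pattern is a placeholder, not necessarily the layout used in this PR):

```python
import tensorflow as tf

def count_records(file_pattern):
    # TFRecordDataset yields exactly one element per serialized record,
    # so counting here is unaffected by parsing or batching.
    files = tf.io.gfile.glob(file_pattern)
    return sum(1 for _ in tf.data.TFRecordDataset(files))

print(count_records("gs://tf-tpu-training-resources/train/*.tfrecord"))
```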

@sayakpaul (Member Author)

Thanks @Rocketknight1 for your help in debugging #21657 (comment) (discussed internally via Slack). I am currently regenerating the TFRecord shards. I will update here once that's done.

@sayakpaul (Member Author)

@Rocketknight1 the corrected TFRecord shards have been pushed to gs://tf-tpu-training-resources.

Here are the record counts per split:

  • Train: 300917
  • Validation: 626
  • Test: 722

The TFRecords were generated with a block size of 512.
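
Since the shard filenames encode their sample counts (see the "Read sample counts from filenames" commits in the log below), totals like the ones above can be recovered without scanning the records. A minimal sketch, assuming each shard name ends in its record count (e.g. train-00012-1000.tfrecord, an illustrative pattern):

```python
import re
import tensorflow as tf

def total_samples(file_pattern):
    # Assumes the trailing integer in each shard name is its record count.
    total = 0
    for path in tf.io.gfile.glob(file_pattern):
        match = re.search(r"-(\d+)\.tfrecord$", path)
        if match:
            total += int(match.group(1))
    return total
```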

@sayakpaul (Member Author)

@Rocketknight1 the training code looks good to me, except for a few things:

  • Maybe we should scale the LR with the batch size? (See the sketch after this comment.)
  • Take mlm_probability as a CLI arg?
  • Modularize the dataset preparation code a bit?

But all of these are non-blockers. Let's do 4-5 training runs varying the number of epochs and the learning rate.
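
On the learning-rate point above, a common heuristic is the linear scaling rule, i.e. growing the LR in proportion to the global batch size. A minimal sketch (the base values are illustrative, not the ones used for this PR's runs):

```python
def scaled_learning_rate(per_replica_batch_size, num_replicas,
                         base_lr=1e-4, base_batch_size=256):
    # Linear scaling rule: LR grows proportionally to the global batch size.
    global_batch_size = per_replica_batch_size * num_replicas
    return base_lr * global_batch_size / base_batch_size

# e.g. per-replica batch size 64 on 8 TPU cores -> global batch size 512.
print(scaled_learning_rate(64, 8))  # 2e-4 with the illustrative base values
```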

@Rocketknight1 (Member)

@sayakpaul MLM probability added as an arg and I modularized the loading!

@sayakpaul (Member Author) commented Mar 25, 2023

@Rocketknight1 started a training run with:

python3 train_model.py \
  --tokenizer tf-tpu/unigram-tokenizer-wikitext \
  --per_replica_batch_size 64 \
  --tpu_name local --tpu_zone us-central1 --gcp_project huggingface-ml --bfloat16 \
  --train_dataset gs://tf-tpu-training-resources/train --eval_dataset gs://tf-tpu-training-resources/validation \
  --num_epochs 100 \
  --output_dir roberta-base-epochs-100 --hub_model_id tf-tpu/roberta-base-epochs-100

@sayakpaul (Member Author) commented Mar 26, 2023

@Rocketknight1 here's the final model trained with the command from here:

https://huggingface.co/tf-tpu/roberta-base-epochs-100

When trying out examples in the widget on the model page above, pass [MASK] instead of the default <mask>. The results are far from perfect, though, as is evident from the validation accuracy.
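
For querying the model outside the widget, a minimal sketch using the fill-mask pipeline with the [MASK] token (the example sentence is illustrative):

```python
from transformers import pipeline

# This tokenizer uses "[MASK]" rather than RoBERTa's usual "<mask>".
fill_mask = pipeline("fill-mask", model="tf-tpu/roberta-base-epochs-100", framework="tf")
print(fill_mask("Paris is the [MASK] of France."))
```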

@sayakpaul (Member Author)

@Rocketknight1 could you review this PR?

@sayakpaul sayakpaul marked this pull request as ready for review April 12, 2023 05:59
@sayakpaul sayakpaul requested review from gante and sgugger April 12, 2023 06:01
@sgugger (Collaborator) left a comment
Thanks a lot for working on this! I left a couple of comments.

examples/tensorflow/tpu/language-modeling/README.md (outdated review comment, resolved)
Comment on lines 257 to 261
special_tokens_mask = (
    ~tf.cast(batch["attention_mask"], tf.bool)
    | (batch["input_ids"] == tokenizer.cls_token_id)
    | (batch["input_ids"] == tokenizer.sep_token_id)
)
Collaborator
Why not have the tokenizer return the special_tokens_mask instead of computing it manually here?
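
For context, the tokenizer can produce this mask at encoding time via return_special_tokens_mask=True; a minimal sketch of the suggestion (the example text is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tf-tpu/unigram-tokenizer-wikitext")
encoded = tokenizer(
    "TPU-based language model training with TensorFlow.",
    return_special_tokens_mask=True,  # 1 for special tokens (CLS/SEP/PAD), 0 otherwise
)
print(encoded["special_tokens_mask"])
```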

Member
I thought I was being clever by not storing all that data in the TFRecords, but you're right that it's probably just extra complexity. Let me fix it!

Member
Hm, on second thought, fixing it would require regenerating and re-uploading the whole dataset and then updating the training loop too. Do you think it's worth it?

Collaborator
Not necessarily, but it would be cleaner if you ever do a v2.

Member
Noted, will do!

examples/tensorflow/tpu/language-modeling/train_unigram.py (outdated review comment, resolved)
@sayakpaul (Member Author)

@sgugger thanks!

I addressed your comments. For #21657 (comment), I will defer to @Rocketknight1.

@gante (Member) left a comment
🔥

@sayakpaul (Member Author)

Merging since the failing tests are unrelated.

@sayakpaul sayakpaul merged commit 390e121 into main Apr 14, 2023
@sayakpaul sayakpaul deleted the examples/tf-tpu branch April 14, 2023 05:11
novice03 pushed a commit to novice03/transformers that referenced this pull request on Jun 23, 2023:
[Examples] TPU-based training of a language model using TensorFlow (huggingface#21657)

* add: tokenizer training script for TF TPU LM training.

* add: script for preparing the TFRecord shards.

* add: sequence of execution to readme.

* remove limit from the tfrecord shard name.

* Add initial train_model.py

* Add basic training arguments and model init

* Get up to the point of writing the data collator

* Pushing progress so far!

* Complete first draft of model training code

* feat: grouping of texts efficiently.

Co-authored-by: Matt <rocketknight1@gmail.com>

* Add proper masking collator and get training loop working

* fix: things.

* Read sample counts from filenames

* Read sample counts from filenames

* Draft README

* Improve TPU warning

* Use distribute instead of distribute.experimental

* Apply suggestions from code review

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

* Modularize loading and add MLM probability as arg

* minor refactoring to better use the cli args.

* readme fillup.

* include tpu and inference sections in the readme.

* table of contents.

* parallelize maps.

* polish readme.

* change script name to run_mlm.py

* address PR feedback (round I).

---------

Co-authored-by: Matt <rocketknight1@gmail.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>