[Examples] TPU-based training of a language model using TensorFlow #21657
Conversation
)
parser.add_argument(
    "-vs",
    "--vocab_size",
Maybe we should play around with this a bit to see if a multiple of 64 actually helps improve the efficiency. Reference: https://twitter.com/karpathy/status/1621578354024677377?s=20
I think we can just use a multiple of 64 anyway, it's not really a big change! The next multiple of 64 after 10000 is 10048.
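For illustration, here is a tiny sketch of rounding the vocabulary size up to the next multiple of 64 before training the tokenizer; the helper name is hypothetical, not part of the scripts in this PR:

```python
def round_up_to_multiple(value: int, multiple: int = 64) -> int:
    """Round `value` up to the nearest multiple of `multiple` (hypothetical helper)."""
    return ((value + multiple - 1) // multiple) * multiple

# 10000 rounds up to 10048, which maps more cleanly onto TPU/GPU tile sizes.
vocab_size = round_up_to_multiple(10_000, 64)
print(vocab_size)  # 10048
```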
Do we want me to train the tokenizer and redo the TFRecords with that?
parser.add_argument(
    "--shard_size",
    type=int,
    default=1000,
    help="Number of entries to go in a single shard.",
)
We should likely follow the advice from this guide to decide this number when running things at full scale.
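As a rough illustration of what `--shard_size` controls, here is a minimal sketch of slicing examples into TFRecord shards; the `serialize_example` helper and the filename pattern (which embeds the per-shard sample count) are assumptions, not the script's exact code:

```python
import tensorflow as tf

def serialize_example(input_ids):
    # Hypothetical serializer: wrap the token ids of one example in a tf.train.Example.
    feature = {"input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids))}
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

def write_shards(examples, shard_size=1000, prefix="wikitext-train"):
    # Slice the examples into chunks of `shard_size`; the last shard may be smaller.
    for start in range(0, len(examples), shard_size):
        shard = examples[start : start + shard_size]
        # Assumed naming convention: embed the actual sample count in the filename.
        filename = f"{prefix}-{start // shard_size:05d}-{len(shard)}.tfrecord"
        with tf.io.TFRecordWriter(filename) as writer:
            for input_ids in shard:
                writer.write(serialize_example(input_ids))
```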
Co-authored-by: Matt <rocketknight1@gmail.com>
@Rocketknight1 I incorporated the changes. Here's the Colab Notebook where I verified them.
@Rocketknight1 I took a deeper look into the TFRecord preparation script, and I don't understand why there's a discrepancy in the following. While serializing the TFRecords, I make sure each TFRecord shard gets a specific number of samples (it's fine when a shard ends up with fewer samples than the specified amount). But when I load the TFRecords back and create a dataset from them, the sample counts don't match. Here is a minimal Colab Notebook that demonstrates the issue: https://colab.research.google.com/gist/sayakpaul/b4b02f3f656c0041c93f6ba78c8e65fd/scratchpad.ipynb. When you get a moment, could you take a look?
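One way to sanity-check such a discrepancy is to count the serialized records straight from the shards and compare that against the expected per-shard counts; a minimal sketch, with the file paths as placeholders:

```python
import tensorflow as tf

def count_records(tfrecord_paths):
    # Count the raw serialized examples without parsing them.
    dataset = tf.data.TFRecordDataset(tfrecord_paths)
    return int(dataset.reduce(0, lambda count, _: count + 1))

# Placeholder paths; compare the total against shard_size * full_shards + remainder.
print(count_records(["shard-00000.tfrecord", "shard-00001.tfrecord"]))
```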
Thanks @Rocketknight1 for your help in debugging #21657 (comment) (discussed internally via Slack). I am currently regenerating the TFRecord shards. I will update here once that's done.
@Rocketknight1 the corrected TFRecord shards have been pushed. Here are the record counts per split:

The TFRecords were generated with a block size of 512.
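For context, a block size of 512 means the tokenized text is concatenated and cut into fixed-length chunks. Here is a minimal sketch of that grouping step, assuming the inputs are plain lists of token ids; the actual script may implement this differently:

```python
def group_texts(token_id_lists, block_size=512):
    # Concatenate all token ids, then split them into full blocks of `block_size`.
    # Any leftover tokens that don't fill a complete block are dropped.
    concatenated = [tok for ids in token_id_lists for tok in ids]
    total_length = (len(concatenated) // block_size) * block_size
    return [concatenated[i : i + block_size] for i in range(0, total_length, block_size)]
```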
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
@Rocketknight1 the training code looks good to me, except for a few things:
But all these are non-blockers. Let's do 4-5 training runs varying the number of epochs and the learning rate.
@sayakpaul MLM probability added as an arg and I modularized the loading!
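For reference, one way to expose the MLM probability as a CLI argument is to feed it into a masking collator. This is only an illustrative sketch using the TF output mode of transformers' `DataCollatorForLanguageModeling`, not necessarily how the script wires it up; the tokenizer path is taken from the training command below:

```python
import argparse

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

parser = argparse.ArgumentParser()
parser.add_argument(
    "--mlm_probability",
    type=float,
    default=0.15,
    help="Fraction of tokens to mask for the MLM objective.",
)
args = parser.parse_args()

tokenizer = AutoTokenizer.from_pretrained("tf-tpu/unigram-tokenizer-wikitext")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm_probability=args.mlm_probability,
    return_tensors="tf",  # produce TF tensors for the TPU training loop
)
```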
@Rocketknight1 started a training run with:

python3 train_model.py \
    --tokenizer tf-tpu/unigram-tokenizer-wikitext \
    --per_replica_batch_size 64 \
    --tpu_name local --tpu_zone us-central1 --gcp_project huggingface-ml --bfloat16 \
    --train_dataset gs://tf-tpu-training-resources/train --eval_dataset gs://tf-tpu-training-resources/validation \
    --num_epochs 100 \
    --output_dir roberta-base-epochs-100 --hub_model_id tf-tpu/roberta-base-epochs-100
@Rocketknight1 here's the final model trained with the command from above: https://huggingface.co/tf-tpu/roberta-base-epochs-100. When you try out examples in the widget of the model page, pass
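If you'd rather test the checkpoint outside the widget, a quick fill-mask sketch along these lines should work; it uses the pipeline's own mask token so no particular mask string is assumed:

```python
from transformers import pipeline

# Load the final checkpoint from the Hub.
fill_mask = pipeline("fill-mask", model="tf-tpu/roberta-base-epochs-100")

# Build the prompt with the tokenizer's actual mask token.
prompt = f"Paris is the {fill_mask.tokenizer.mask_token} of France."
for prediction in fill_mask(prompt, top_k=5):
    print(prediction["token_str"], prediction["score"])
```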
@Rocketknight1 could you review this PR?
Thanks a lot for working on this! I left a couple of comments.
special_tokens_mask = (
    ~tf.cast(batch["attention_mask"], tf.bool)
    | (batch["input_ids"] == tokenizer.cls_token_id)
    | (batch["input_ids"] == tokenizer.sep_token_id)
)
Why not have the tokenizer return the special_tokens_mask instead of computing it manually here?
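For reference, the tokenizer can emit this mask directly at encoding time via `return_special_tokens_mask=True`; a small sketch, with the checkpoint name as an example only:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # example checkpoint

encoded = tokenizer(
    "TPUs like padded, fixed-shape inputs.",
    padding="max_length",
    max_length=16,
    return_special_tokens_mask=True,
)
# 1 marks special tokens (<s>, </s>, and padding); 0 marks regular tokens.
print(encoded["special_tokens_mask"])
```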
I thought I was being clever by not storing all that data in the TFRecords, but you're right that it's probably just extra complexity. Let me fix it!
Hm, on second thoughts, fixing it would require regenerating and reuploading the whole dataset and then updating the training loop too. Think it's worth it?
Not necessarily, but it would be cleaner if you ever do a v2.
Noted, will do!
@sgugger thanks! I addressed your comments. For #21657 (comment), I will defer to @Rocketknight1.
🔥
Merging since the failing tests are unrelated.
…uggingface#21657)

* add: tokenizer training script for TF TPU LM training.
* add: script for preparing the TFRecord shards.
* add: sequence of execution to readme.
* remove limit from the tfrecord shard name.
* Add initial train_model.py
* Add basic training arguments and model init
* Get up to the point of writing the data collator
* Pushing progress so far!
* Complete first draft of model training code
* feat: grouping of texts efficiently.
  Co-authored-by: Matt <rocketknight1@gmail.com>
* Add proper masking collator and get training loop working
* fix: things.
* Read sample counts from filenames
* Read sample counts from filenames
* Draft README
* Improve TPU warning
* Use distribute instead of distribute.experimental
* Apply suggestions from code review
  Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
* Modularize loading and add MLM probability as arg
* minor refactoring to better use the cli args.
* readme fillup.
* include tpu and inference sections in the readme.
* table of contents.
* parallelize maps.
* polish readme.
* change script name to run_mlm.py
* address PR feedback (round I).

---------

Co-authored-by: Matt <rocketknight1@gmail.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
This PR adds an example of performing (masked) language model training using TensorFlow and TPUs. The example is meant to act as a reference for the community on this topic. The following are the main components of the PR:
The purpose of this separation (as opposed to having everything in a single script) is to allow the community to have isolated reference points for performing TPU-based training of our models, which I think is beneficial.
The artifacts produced during this project can be found here: https://huggingface.co/tf-tpu.
Cc: @Rocketknight1 @gante @amyeroberts