Improve T5 docs #13240
Conversation
docs/source/model_doc/t5.rst (outdated):

- T5 models need a slightly higher learning rate than the default one set in the :obj:`Trainer`. Typically, 1e-4 and
  3e-4 work well for almost all problems (classification, summarization, translation, question answering, question
  generation).
Not sure if this is true (I found this on the forum), cc @patrickvonplaten @patil-suraj. Are the summarization and translation scripts just using AdamW with learning rate 5e-5? Is AdaFactor better?
In the original T5 codebase, the adafactor accumulator variables get loaded from the pre-trained checkpoint, and we always use a learning rate of 1e-3, and it always works fine. I'm not sure what a good learning rate would be for Adam-based training of T5, or even adafactor when the accumulator variables aren't loaded from the checkpoint (as they aren't in HF).
AdaFactor requires much less memory and it worked very well for me for Flax T5 pretraining, but I haven't done too much T5 fine-tuning in PyTorch => so I can't say really here...
In my T5 fine-tuning experiments, using a higher lr than the default 5e-5 with Adam worked better for most of the tasks. But I'm not sure if this applies to every task.
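For readers following along, here is a rough sketch of the two set-ups discussed in this thread. The `t5-small` checkpoint and the exact learning-rate values are placeholders for illustration, not a recommendation:

```python
import torch
from transformers import T5ForConditionalGeneration
from transformers.optimization import Adafactor

model = T5ForConditionalGeneration.from_pretrained("t5-small")  # placeholder checkpoint

# Option 1: AdamW with a learning rate above the Trainer default of 5e-5
adamw_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Option 2: Adafactor with a constant learning rate of 1e-3
# (relative_step/warmup_init are disabled so that the explicit lr is actually used)
adafactor_optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
```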
docs/source/model_doc/t5.rst (outdated):

  batches. This entails that we must pad/truncate examples to the same length. For encoder-decoder models, one
  typically defines a :obj:`max_source_length` and :obj:`max_target_length`, which determine the length of the input
  and output sequences respectively. For summarization and translation, a good setting is setting
  :obj:`max_source_length` to 1024 and :obj:`max_target_length` to 128. In addition, we must make sure that padding
I think for translation it's more common for inputs/targets to be the same length, right? I would use like 128/128 for translation. For summarization, indeed the inputs need to be a lot longer than targets.
Yes I thought that too, not sure why the `max_source_length` and `max_target_length` are set to those values in the translation script.
Think we can change the wording here a bit: `max_source_length` doesn't really define the length of the input, but rather defines the maximum input length. I usually use `padding="longest"` (which is the default I think), so having `max_source_length` set to 1024 doesn't mean that the input is padded to 1024, it just means that an input longer than 1024 is cut to 1024.
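A small sketch of that behaviour, with placeholder texts and checkpoint:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
texts = [
    "translate English to German: The house is wonderful.",
    "translate English to German: Hello.",
]

# padding="longest": pad to the longest example in the batch;
# max_length only truncates inputs that exceed 1024 tokens.
batch = tokenizer(texts, padding="longest", truncation=True, max_length=1024, return_tensors="pt")
print(batch["input_ids"].shape)  # (2, length of the longest example), not (2, 1024)

# padding="max_length": every example is padded/truncated to exactly 1024 tokens.
batch = tokenizer(texts, padding="max_length", truncation=True, max_length=1024, return_tensors="pt")
print(batch["input_ids"].shape)  # (2, 1024)
```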
Also, in my experiments I always used 512 as the T5 maximum length - if I recall correctly this was used for CNN summarization.
=> Overall I'm not really sure it's a good idea to state actual numbers like 1024 and 128 here. Especially the `max_target_length` strongly depends on the dataset. E.g. if one would train T5 on extreme summarization like XSum (https://huggingface.co/datasets/xsum), 128 is not a good max target length. If one has a lot of memory and wants to try T5 on long-range summarization, 128 is also not a good max target length... => so I would rather leave the numbers out here and instead write something like: "Depending on the dataset T5 is fine-tuned on, one should carefully select :obj:`max_source_length` and :obj:`max_target_length`."

Max source length defaults to 512 in T5 I think, and the target length should always be carefully chosen depending on the dataset IMO - there is no golden number I think. So maybe we can write here that with `padding="longest"` the `max_source_length` usually just makes sure that the model doesn't OOM, and with `padding="max_length"` padding/truncation is applied. The target length has to be picked by the user depending on the task IMO.
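To illustrate the dataset-dependent target length, a hypothetical sketch; the value of `max_target_length` below is made up and would have to be chosen per dataset:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")  # placeholder checkpoint
summaries = ["A one-sentence summary.", "Another very short summary."]

# Hypothetical choice: short-target tasks (e.g. extreme summarization) need a much
# smaller value than long-form summarization; there is no universal number.
max_target_length = 64

# Tokenize the targets, truncating anything longer than max_target_length
with tokenizer.as_target_tokenizer():
    labels = tokenizer(
        summaries,
        padding="longest",
        truncation=True,
        max_length=max_target_length,
        return_tensors="pt",
    )
```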
This is really great work @NielsRogge !!! Thanks a lot for looking into it and making the docs much nicer. I especially like that you make it very clear in the beginning how T5v1_1, MT5 and ByT5 are related to the original T5.
I left a couple of nits that could maybe improve / extend the explanation a bit, but overall this looks good to me!
Thanks for digging into this!
What does this PR do?
This PR aims to clarify and explain some of the magic that's happening when using T5 for training/inference. It includes:
- how the `decoder_input_ids` are created based on the `labels` (a lot of people were still confused by this, see e.g. T5-Training Arguments #11977 and Questions on generating using encoder-decoder models #13213)
- the `decoder_attention_mask` (to skip padding tokens), and code examples for inference (both batched/not batched)

In addition, I've added T5v1.1 to the main README as well, making it have its own documentation page, and some more info to the mT5 docs.
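As a quick, minimal sketch of the behaviour described above (checkpoint and sentences are placeholders): when `labels` are passed, the model creates the `decoder_input_ids` itself by shifting the labels to the right, and label positions set to -100 are ignored by the loss:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer(
    ["translate English to German: The house is wonderful."], return_tensors="pt"
)
targets = tokenizer(["Das Haus ist wunderbar."], return_tensors="pt")

labels = targets["input_ids"]
# Replace padding token ids in the labels by -100 so they are ignored by the loss
labels[labels == tokenizer.pad_token_id] = -100

# No decoder_input_ids are passed: the model creates them from the labels internally
outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    labels=labels,
)
print(outputs.loss)
```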