Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve T5 docs #13240

Merged
merged 14 commits into from
Sep 1, 2021
Merged

Improve T5 docs #13240

merged 14 commits into from
Sep 1, 2021

Conversation

NielsRogge
Copy link
Contributor

@NielsRogge NielsRogge commented Aug 24, 2021

What does this PR do?

This PR aims to clarify and explain some of the magic that's happening when using T5 for training/inference. It includes:

  • explaining that the model automatically creates the decoder_input_ids based on the labels (a lot of people were still confused by this, see e.g. T5-Training Arguments #11977, Questions on generating using encoder-decoder models #13213
  • added code examples, to show a basic forward pass that includes the fact that padding token ids of the labels should be replaced by -100 (at least, for PyTorch, I see that for FLAX one uses the decoder_attention_mask to skip padding tokens), and code examples for inference (both batched/not batched)
  • adding info about T5's variants, including T5v1.1, mT5 and byT5, with links to their docs.
  • additional tips & tricks, based on what I found on the forum (learning rate, training on TPU, etc).

In addition, I've added T5v1.1 to the main README as well making it have its own documentation page, and some more info to the mT5 docs.

Comment on lines 183 to 206
- T5 models need a slightly higher learning rate than the default one set in the :obj:`Trainer`. Typically, 1e-4 and
3e-4 work well for almost all problems (classification, summarization, translation, question answering, question
generation).
Copy link
Contributor Author

@NielsRogge NielsRogge Aug 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is true (I found this on the forum), cc @patrickvonplaten @patil-suraj. Are the summarization and translation scripts just using AdamW with learning rate 5e-5? Is AdaFactor better?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the original T5 codebase, the adafactor accumulator variables get loaded from the pre-trained checkpoint, and we always use a learning rate of 1e-3, and it always works fine. I'm not sure what a good learning rate would be for Adam-based training of T5, or even adafactor when the accumulator variables aren't loaded from the checkpoint (as they aren't in HF).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AdaFactor requires much less memory and it worked very well for me for Flax T5 pretraining, but I haven't done too much T5 fine-tuning in PyTorch => so I can't say really here...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my T5 fine-tuning experiments for most of the tasks using higher lr than the default 5e-5 with Adam worked better. But I'm not sure if this should apply to every task.

@NielsRogge NielsRogge marked this pull request as ready for review August 26, 2021 13:10
README.md Outdated Show resolved Hide resolved
docs/source/index.rst Outdated Show resolved Hide resolved
docs/source/model_doc/byt5.rst Outdated Show resolved Hide resolved
docs/source/model_doc/byt5.rst Outdated Show resolved Hide resolved
docs/source/model_doc/t5.rst Outdated Show resolved Hide resolved
docs/source/model_doc/t5.rst Outdated Show resolved Hide resolved
batches. This entails that we must pad/truncate examples to the same length. For encoder-decoder models, one
typically defines a :obj:`max_source_length` and :obj:`max_target_length`, which determine the length of the input
and output sequences respectively. For summarization and translation, a good setting is setting
:obj:`max_source_length` to 1024 and :obj:`max_target_length` to 128. In addition, we must make sure that padding
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for translation it's more common for inputs/targets to be the same length, right? I would use like 128/128 for translation. For summarization, indeed the inputs need to be a lot longer than targets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I thought that too, not sure why the max_source_length and max_target_length are set to those values in the translation script.

Copy link
Contributor

@patrickvonplaten patrickvonplaten Aug 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we can change the wording here a bit max_source_length doesn't really define the length of the input, but rather defines the maximum input length. I usually use padding="longest" (which is the default I think), so having max_source_length set to 1024 doesn't mean that the input is padded to 1024, it just means that an input longer than 1024 is cut to 1024.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in my experiments I always used 512 as T5 maximum length - if I recall correctly this was used for CNN summarization

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

=> Overall I'm not really sure it's a good idea to state actual numbers like 1024 and 128 here. Especially the max_target_length strongly depends on the dataset. E.g. if one would train T5 on extreme summarization like XSum: https://huggingface.co/datasets/xsum 128 is not a good max target length. If one has a lot of memory and wants to try T5 on long-range summarization 128 is also not a good max target length...=> so I would rather leave the numbers out here and instead write something like:

Depending on the dataset T5 is fine-tuned on, one should carefully select :obj:max_source_length and :obj:max_target_length.

Max source length default to 512 in T5 I think and the target length should always be carefully chosen depending on the dataset IMO - there is no golden number I think.
So maybe we can write here that if padding="longest" the max_source_length usually just makes sure that the model doesn't OOM and if padding="max_length" padding/truncation is applied.
The target length has to be picked by the user depending on the task IMO

docs/source/model_doc/t5.rst Outdated Show resolved Hide resolved
Comment on lines 183 to 206
- T5 models need a slightly higher learning rate than the default one set in the :obj:`Trainer`. Typically, 1e-4 and
3e-4 work well for almost all problems (classification, summarization, translation, question answering, question
generation).
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the original T5 codebase, the adafactor accumulator variables get loaded from the pre-trained checkpoint, and we always use a learning rate of 1e-3, and it always works fine. I'm not sure what a good learning rate would be for Adam-based training of T5, or even adafactor when the accumulator variables aren't loaded from the checkpoint (as they aren't in HF).

docs/source/model_doc/t5v1.1.rst Outdated Show resolved Hide resolved
Copy link
Contributor

@patrickvonplaten patrickvonplaten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great work @NielsRogge !!! Thanks a lot for looking into it and making the docs much nicer. I especially like that you make it very clear in the beginning how T5v1_1, MT5 and ByT5 are related to the original T5.

I left a couple of nits that could maybe improve / extend the explanation a bit, but overall this looks good to me!

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into this!

docs/source/model_doc/t5.rst Outdated Show resolved Hide resolved
docs/source/model_doc/t5.rst Show resolved Hide resolved
@NielsRogge NielsRogge changed the title [WIP] Improve T5 docs Improve T5 docs Sep 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants