Improve T5 docs #13240
Conversation
docs/source/model_doc/t5.rst (outdated):

- T5 models need a slightly higher learning rate than the default one set in the :obj:`Trainer`. Typically, 1e-4 and
  3e-4 work well for almost all problems (classification, summarization, translation, question answering, question
  generation).
Not sure if this is true (I found this on the forum), cc @patrickvonplaten @patil-suraj. Are the summarization and translation scripts just using AdamW with learning rate 5e-5? Is AdaFactor better?
In the original T5 codebase, the adafactor accumulator variables get loaded from the pre-trained checkpoint, and we always use a learning rate of 1e-3, and it always works fine. I'm not sure what a good learning rate would be for Adam-based training of T5, or even adafactor when the accumulator variables aren't loaded from the checkpoint (as they aren't in HF).
AdaFactor requires much less memory and it worked very well for me for Flax T5 pretraining, but I haven't done too much T5 fine-tuning in PyTorch => so I can't say really here...
In my T5 fine-tuning experiments, using a higher lr than the default 5e-5 with Adam worked better for most of the tasks. But I'm not sure if this applies to every task.
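For readers following along, here is a rough sketch of the two set-ups discussed in this thread. The `t5-small` checkpoint and the exact learning-rate values are placeholders for illustration, not a recommendation:

```python
import torch
from transformers import T5ForConditionalGeneration
from transformers.optimization import Adafactor

model = T5ForConditionalGeneration.from_pretrained("t5-small")  # placeholder checkpoint

# Option 1: AdamW with a learning rate above the Trainer default of 5e-5
adamw_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Option 2: Adafactor with a constant learning rate of 1e-3
# (relative_step/warmup_init are disabled so that the explicit lr is actually used)
adafactor_optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
```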
docs/source/model_doc/t5.rst (outdated):

  batches. This entails that we must pad/truncate examples to the same length. For encoder-decoder models, one
  typically defines a :obj:`max_source_length` and :obj:`max_target_length`, which determine the length of the input
  and output sequences respectively. For summarization and translation, a good setting is setting
  :obj:`max_source_length` to 1024 and :obj:`max_target_length` to 128. In addition, we must make sure that padding
I think for translation it's more common for inputs/targets to be the same length, right? I would use like 128/128 for translation. For summarization, indeed the inputs need to be a lot longer than targets.
Yes I thought that too, not sure why the `max_source_length` and `max_target_length` are set to those values in the translation script.
Think we can change the wording here a bit: `max_source_length` doesn't really define the length of the input, but rather defines the maximum input length. I usually use `padding="longest"` (which is the default I think), so having `max_source_length` set to 1024 doesn't mean that the input is padded to 1024, it just means that an input longer than 1024 is cut to 1024.
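A small sketch of that behaviour, with placeholder texts and checkpoint:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
texts = [
    "translate English to German: The house is wonderful.",
    "translate English to German: Hello.",
]

# padding="longest": pad to the longest example in the batch;
# max_length only truncates inputs that exceed 1024 tokens.
batch = tokenizer(texts, padding="longest", truncation=True, max_length=1024, return_tensors="pt")
print(batch["input_ids"].shape)  # (2, length of the longest example), not (2, 1024)

# padding="max_length": every example is padded/truncated to exactly 1024 tokens.
batch = tokenizer(texts, padding="max_length", truncation=True, max_length=1024, return_tensors="pt")
print(batch["input_ids"].shape)  # (2, 1024)
```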
Also, in my experiments I always used 512 as the T5 maximum length - if I recall correctly this was used for CNN summarization.
=> Overall I'm not really sure it's a good idea to state actual numbers like 1024 and 128 here. Especially the `max_target_length` strongly depends on the dataset. E.g. if one would train T5 on extreme summarization like XSum (https://huggingface.co/datasets/xsum), 128 is not a good max target length. If one has a lot of memory and wants to try T5 on long-range summarization, 128 is also not a good max target length... => so I would rather leave the numbers out here and instead write something like: "Depending on the dataset T5 is fine-tuned on, one should carefully select :obj:`max_source_length` and :obj:`max_target_length`."

Max source length defaults to 512 in T5 I think, and the target length should always be carefully chosen depending on the dataset IMO - there is no golden number I think. So maybe we can write here that with `padding="longest"` the `max_source_length` usually just makes sure that the model doesn't OOM, and with `padding="max_length"` padding/truncation is applied. The target length has to be picked by the user depending on the task IMO.
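To illustrate the dataset-dependent target length, a hypothetical sketch; the value of `max_target_length` below is made up and would have to be chosen per dataset:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")  # placeholder checkpoint
summaries = ["A one-sentence summary.", "Another very short summary."]

# Hypothetical choice: short-target tasks (e.g. extreme summarization) need a much
# smaller value than long-form summarization; there is no universal number.
max_target_length = 64

# Tokenize the targets, truncating anything longer than max_target_length
with tokenizer.as_target_tokenizer():
    labels = tokenizer(
        summaries,
        padding="longest",
        truncation=True,
        max_length=max_target_length,
        return_tensors="pt",
    )
```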
This is really great work @NielsRogge !!! Thanks a lot for looking into it and making the docs much nicer. I especially like that you make it very clear in the beginning how T5v1_1, MT5 and ByT5 are related to the original T5.
I left a couple of nits that could maybe improve / extend the explanation a bit, but overall this looks good to me!
Thanks for digging into this!
What does this PR do?
This PR aims to clarify and explain some of the magic that's happening when using T5 for training/inference. It includes:
- how the `decoder_input_ids` are created based on the `labels` (a lot of people were still confused by this, see e.g. T5-Training Arguments #11977 and Questions on generating using encoder-decoder models #13213)
- the `decoder_attention_mask` (to skip padding tokens), and code examples for inference (both batched/not batched)

In addition, I've added T5v1.1 to the main README as well, making it have its own documentation page, and some more info to the mT5 docs.
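As a quick, minimal sketch of the behaviour described above (checkpoint and sentences are placeholders): when `labels` are passed, the model creates the `decoder_input_ids` itself by shifting the labels to the right, and label positions set to -100 are ignored by the loss:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer(
    ["translate English to German: The house is wonderful."], return_tensors="pt"
)
targets = tokenizer(["Das Haus ist wunderbar."], return_tensors="pt")

labels = targets["input_ids"]
# Replace padding token ids in the labels by -100 so they are ignored by the loss
labels[labels == tokenizer.pad_token_id] = -100

# No decoder_input_ids are passed: the model creates them from the labels internally
outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    labels=labels,
)
print(outputs.loss)
```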