Seq2seq trainer #9241
Conversation
Oh, and this PR does not delete the old …
The porting part looks good, with a few suggestions. I re-ran the tests and found an issue that predates this PR - I will post it separately.
I haven't delved into label smoothing yet, so I've skipped over that part's logic.
examples/seq2seq/finetune_trainer.py
Outdated
test_output = trainer.predict(
    test_dataset=test_dataset,
    metric_key_prefix="test",
    max_target_length=data_args.test_max_target_length,
Not sure why GitHub doesn't let me comment on non-modified code in PRs - it makes no sense, since added code in one place could impact code that wasn't changed.
So we need to change the `test_max_target_length` help to indicate that it will now be used during `predict` (and not `val_max_target_length`, as it was until now).
So 2 places to modify:
- for `val_max_target_length`:
- " This argument is also used to override the ``max_length`` param of ``model.generate``, which is used during ``evaluate`` and ``predict``"
+ " This argument is also used to override the ``max_length`` param of ``model.generate``, which is used during ``evaluate``"
- and then the help for `test_max_target_length`, same as above, but the last word is `predict`.
And it's a breaking change - you mentioned the new args, but not that `predict` now uses `test_max_target_length` for `generate`'s `max_length` arg.
Perhaps more importantly, if we are adding new args, why not match the core functions' naming? It's somewhat confusing that `generate` and `model.config` use the `max_length` arg/attribute but `trainer.predict|evaluate` use `max_target_length`.
But I can see where one would say the opposite - that it should have been `max_target_length` everywhere all along.
(And there is no `min_length` override, but that's another story, unrelated to this PR.)
For s2s, `max_target_length` makes sense IMO, since here we have two seqs: a source seq and a target seq. Just `max_length` might create some confusion.
Just to clarify, I'm talking about the arg name of `predict`, which then calls `generate`. In no way was I proposing to change the cl args. Are we on the same page?
Also, do we really need `val_max_length` and `test_max_length`? Does it make much sense to have different values for those two? I can see why one would have `data_args.val_beams` smaller than `data_args.eval_beams`, but for `max_length` as well?
It was added for convenience but isn't super important; it makes sense to keep just one `max_target_length`.
I haven't touched anything in the script except using `test_max_length` here, which was never used before - I thought it was just a mistake. Since there is a lot of debate on this and I'm not sure I understand what you want me to do, I will revert that and leave it as it was before. Then we can do some renaming in a separate PR (or please be clearer on what you want me to do :-) ).
Oh, I think I get it: so all `max_target_length` -> `max_length`, and use one `eval_max_length`, like `eval_num_beams`. Will push that.
exactly, sorry for the confusing comments ;)
This looks good to me, thanks Sylvain!
src/transformers/trainer_seq2seq.py
Outdated
class Seq2SeqTrainer(Trainer):
    def __init__(self, *args, ignore_pad_token_for_loss=None, **kwargs):
        super().__init__(*args, **kwargs)
        # TODO: When BART uses -100 everywhere, this can be removed entirely.
BART now works with -100; the issue was with label smoothing (which you fixed) and the `shift_tokens_right` function, which is fixed in #9135.
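For context, a rough sketch of the kind of fix meant here (not the exact code from #9135, just the idea): when the decoder inputs are built by shifting the labels, any -100 loss-masking value has to be replaced by the pad token, since -100 is not a real token id.

```python
import torch

def shift_tokens_right(labels: torch.Tensor, pad_token_id: int, decoder_start_token_id: int) -> torch.Tensor:
    """Shift labels one position to the right to build decoder inputs (sketch)."""
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    # -100 only marks positions ignored by the loss; swap it for the pad token
    # so the decoder never sees an invalid id.
    shifted.masked_fill_(shifted == -100, pad_token_id)
    return shifted
```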
From the `src/transformers` side, we can completely remove the deprecation warning and the `ignore_pad_token_for_loss` arg IMO. On the other hand, it seemed like `ignore_pad_token_for_loss` was the default case for all training code in `examples/seq2seq`, so it might be good to keep it as it is now for some time.
Will adapt the TODO and the warning then, and check all mentions in the examples. Once we're sure it's not used anywhere, I will remove the argument. But let's try to have it removed before the next release, so removing that arg is never a real "breaking change".
On second thought, I will have to change all the scripts for the `max_target_length` -> `max_length` rename anyway, so I'll remove those `ignore_pad_token_for_loss` in passing. (I just realized it is never actually used inside the code and the tests are still passing.)
src/transformers/trainer_pt_utils.py
Outdated
epsilon (:obj:`float`, `optional`, defaults to 0.1):
    The label smoothing factor.
ignore_index (:obj:`int`, `optional`, defaults to -100):
    The index in the labels to ignore when computing the loss. Set to :obj:`None` to ignore none of the
(question) Not sure if the "Set to :obj:`None` to ignore none of the indices" part is really required. For me, the default case should be -100, but if the user does not want to ignore any indices, there should simply be no -100 in the labels, no? Or is there a case where one would have -100 in the labels but still want to take those ignored labels into account for label smoothing?
Yeah, the docstring is just badly worded.
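For reference, a minimal sketch of the kind of computation under discussion (assumed shapes and names, not the library's exact implementation): positions whose label equals `ignore_index` are dropped from both the NLL term and the smoothing term, and the result is averaged over the remaining tokens.

```python
import torch
import torch.nn.functional as F

def label_smoothed_loss(logits, labels, epsilon=0.1, ignore_index=-100):
    """Label-smoothed cross entropy for logits (batch, seq, vocab) and labels (batch, seq)."""
    log_probs = -F.log_softmax(logits, dim=-1)
    padding_mask = labels.eq(ignore_index)
    # Clamp so gather() never sees -100; masked positions are zeroed out below anyway.
    nll_loss = log_probs.gather(dim=-1, index=labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    smoothed_loss = log_probs.mean(dim=-1)  # loss against a uniform distribution over the vocab
    nll_loss = nll_loss.masked_fill(padding_mask, 0.0)
    smoothed_loss = smoothed_loss.masked_fill(padding_mask, 0.0)
    num_active = (~padding_mask).sum()  # tokens that actually carry a label
    return ((1 - epsilon) * nll_loss.sum() + epsilon * smoothed_loss.sum()) / num_active
```

With `epsilon=0` this reduces to the usual mean-reduced NLL over the non-ignored tokens, which is what keeps the smoothed and non-smoothed losses on the same scale.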
src/transformers/trainer_seq2seq.py
Outdated
eval_dataset: Optional[Dataset] = None,
ignore_keys: Optional[List[str]] = None,
metric_key_prefix: str = "eval",
max_target_length: Optional[int] = None,
Would also prefer the arg to be named `max_length` instead of `max_target_length`.
@@ -56,3 +59,31 @@ def test_distributed_tensor_gatherer(self):
self.assertTrue(np.array_equal(result[0], predictions))
self.assertTrue(np.array_equal(result[1][0], predictions))
self.assertTrue(np.array_equal(result[1][1], predictions))

def test_label_smoothing(self):
nice test!
Great! Mostly nits from my side. Two things I think are worth considering:
- change the naming from `max_target_length` -> `max_length`. I think it's clearer this way that the args of `predict` and `eval` correspond 1-to-1 to the `max_length` of `generate()` (see the sketch after this list)
- since Seq2SeqTrainer is in `src/transformers` now, it might be worth also adding a test for it in `tests/test_trainer.py` IMO
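To make the 1-to-1 correspondence concrete, here is a small hedged sketch (a standalone helper, not the code added by this PR) of how evaluation-time `max_length`/`num_beams` can share names with, and fall back to, the `model.config` values that `generate()` uses:

```python
import torch

def generate_predictions(model, tokenizer, texts, max_length=None, num_beams=None):
    """Sketch: the knobs exposed to the caller keep the names of model.generate()."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        generated = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            # Same names and same fallbacks as generate() itself.
            max_length=max_length if max_length is not None else model.config.max_length,
            num_beams=num_beams if num_beams is not None else model.config.num_beams,
        )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```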
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
@@ -86,28 +93,21 @@ class DataTrainingArguments:
    "than this will be truncated, sequences shorter will be padded."
    },
)
- max_target_length: Optional[int] = field(
+ max_length: Optional[int] = field(
Why was this done at the last second? This and all the renames below just broke all the scripts that were relying on the --max_target_length and --val_max_target_length args.
The discussion and the agreement, as I understood it, was to rename only the new arguments you added to `evaluate` + `predict` and leave everything else as before. We need source and target for datasets.
Sorry, I should have been a bit more careful on my second review of the PR... I'd probably prefer replacing all `--val_max_target_length` and `test_max_target_length` with `--max_length`, but I'm also fine with reverting the change on `DataTrainingArguments`.
Thank you for clarifying - so indeed you didn't suggest changing the cl args, as I had read it.
I guess @sgugger read your mind then. I have no problem with the last-minute change this PR introduced, other than that such changes need a deprecation cycle, because it really interferes with my work integrating / benchmarking / testing deepspeed and fairscale, as all my scripts use those args.
Then if we are now switching to refactoring the cl args, there should be no `eval_max_length` either, no?
IMO, one `max_length` should be enough. It does not make much sense to tokenize longer than the model is able to generate, and the other way around would result in errors. So IMO the best approach is to have one `max_length` referring to the maximum length that the model should be able to generate.
The question then is whether we need a `max_source_length` or `max_seq_length` (as it's called in `run_mlm.py`) or not. I think you guys have more experience here.
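A hedged sketch of the two distinct knobs being weighed here (the names `max_source_length` / `max_length` are the ones floated in this thread, not a settled API): the source-side length only governs input truncation at tokenization time, while a single target-side `max_length` can bound both the tokenized labels and generation.

```python
def preprocess(examples, tokenizer, max_source_length=512, max_length=128):
    """Sketch of seq2seq preprocessing with one length per side (assumed column names)."""
    # Source side: max_source_length only controls how the inputs are truncated.
    model_inputs = tokenizer(examples["source"], max_length=max_source_length, truncation=True)
    # Target side: the same max_length would later be passed to generate() at eval time.
    labels = tokenizer(examples["target"], max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```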
@patrickvonplaten - did you mean to suggest to change the new arguments to evaluate/predict that this PR adds, or to rename …
I was more referring to the args of the functions, but more generally I think it would actually be better if there were only one …
I think it's better to keep … We can get rid of … Sorry about the miscommunication.
I don't know whether this is a good time, but should we add …
The best way to decide about the naming is to show a few use cases - then it's clear whether these are needed or not. Please don't forget that Sam, or whoever added those in the first place, had a good reason for it, so it'd be hard to make a decision in a void. Perhaps such a decision shouldn't be rushed, but made into an RFC, inviting input from users who have many more use cases?
* Add label smoothing in Trainer
* Add options for scheduler and Adafactor in Trainer
* Put Seq2SeqTrainer in the main lib
* Apply suggestions from code review
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Address review comments and adapt scripts
* Documentation
* Move test not using script to tests folder
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
What does this PR do?
This PR graduates `Seq2SeqTrainer` and moves it inside the transformers library. By doing so, it moves some of the features of the `Seq2SeqTrainer` inside the main `Trainer` and leaves some in the subclass. More precisely, the following features will be available in the general `Trainer`:
- label smoothing, moved into `Trainer` (so it can be used in all classification problems), with an easier API and bug fixes (the current implementation did not work with -100 as an ignore index for the loss, now it does; also, the current implementation returns a loss that is not averaged and thus too big, see below)
- options for the scheduler and the Adafactor optimizer

The sortish-sampling and predict with generate are left in the subclass.
There are also a few breaking changes in the `Seq2SeqTrainer` API to make its init match the one of `Trainer`: mainly, the init does not take a `config` and a `DataArguments`; instead:
- the generation parameters (maximum target length and number of beams) are passed directly to `evaluate` or `predict`
- `ignore_pad_token_for_loss` is passed in the init but is deprecated, since it should be removed once BART and its subclasses use -100.

About label smoothing and the mean vs sum: the current implementation takes the sum over the batch size and sequence length (only counting tokens that have a label != -100). This gives something that does not have the same scale as the usual cross-entropy loss (which is the mean over the same dimensions) and thus would require a special learning rate to be useful. With the fix, the label-smoothed loss has the same scale as the non-smoothed loss, which means the same command with the same learning rate should produce comparable results.
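To illustrate the scale argument with a toy, hedged example (random tensors, just to show orders of magnitude): summing per-token losses grows with batch size and sequence length, while taking the mean keeps the value on the same scale as the standard cross-entropy loss, so the usual learning rates stay applicable.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab = 8, 128, 1000
logits = torch.randn(batch, seq_len, vocab)
labels = torch.randint(0, vocab, (batch, seq_len))

per_token = F.cross_entropy(logits.view(-1, vocab), labels.view(-1), reduction="none")
print(per_token.sum().item())   # grows with batch * seq_len: in the thousands here
print(per_token.mean().item())  # same scale as the usual mean-reduced loss (roughly log(vocab))
```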