Seq2seq trainer #9241
Conversation
Oh, and this PR does not delete the old …
The porting part looks good, with a few suggestions. I re-ran the tests and found an issue that predates this PR - I will post it separately.
I haven't delved into label smoothing yet, so I've skipped over that part's logic.
examples/seq2seq/finetune_trainer.py
Outdated
test_output = trainer.predict(
    test_dataset=test_dataset,
    metric_key_prefix="test",
    max_target_length=data_args.test_max_target_length,
Not sure why GitHub doesn't let me comment on non-modified code in PRs - it makes no sense, since added code in one place could impact code that wasn't changed.
So we need to change the `test_max_target_length` help to indicate that it will now be used during `predict` (and not `val_max_target_length`, as it was until now).
So 2 places to modify:
- for `val_max_target_length`:
- " This argument is also used to override the ``max_length`` param of ``model.generate``, which is used during ``evaluate`` and ``predict``"
+ " This argument is also used to override the ``max_length`` param of ``model.generate``, which is used during ``evaluate``"
- and then the help for `test_max_target_length`, same as above, but the last word is `predict`.
And it's a breaking change - you mentioned the new args, but not that `predict` now uses `test_max_target_length` for `generate`'s `max_length` arg.
Perhaps more importantly, if we are adding new args, why not match the core functions' naming? It's somewhat confusing that `generate` and `model.config` use the `max_length` arg/attribute but `trainer.predict|evaluate` use `max_target_length`.
But I can see where one would say the opposite - that it should have been `max_target_length` everywhere all along.
(And there is no `min_length` override, but that's another story, unrelated to this PR.)
For s2s, `max_target_length` makes sense IMO, since here we have two seqs: a source seq and a target seq. Just `max_length` might create some confusion.
Just to clarify, I'm talking about the arg name of `predict`, which then calls `generate`. In no way was I proposing to change the cl args. Are we on the same page?
Also, do we really need `val_max_length` and `test_max_length`? Does it make much sense to have different values for those two? I can see why one would have `data_args.val_beams` smaller than `data_args.eval_beams`, but for `max_length` as well?
It was added for convenience but isn't super important; it makes sense to keep just one `max_target_length`.
I haven't touched anything in the script except using `test_max_length` here, which was never used before - I thought it was just a mistake. Since there is a lot of debate on this and I'm not sure I understand what you want me to do, I will revert that and leave it as it was before. Then we can do some renaming in a separate PR (or please be clearer on what you want me to do :-) ).
Oh, I think I get it: so all `max_target_length` -> `max_length`, and use one `eval_max_length`, like `eval_num_beams`. Will push that.
exactly, sorry for the confusing comments ;)
This looks good to me, thanks Sylvain!
src/transformers/trainer_seq2seq.py
Outdated
class Seq2SeqTrainer(Trainer):
    def __init__(self, *args, ignore_pad_token_for_loss=None, **kwargs):
        super().__init__(*args, **kwargs)
        # TODO: When BART uses -100 everywhere, this can be removed entirely.
BART now works with -100; the issue was with label smoothing (which you fixed) and the `shift_tokens_right` function, which is fixed in #9135.
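For context, a rough sketch of the kind of fix meant here (not the exact code from #9135, just the idea): when the decoder inputs are built by shifting the labels, any -100 loss-masking value has to be replaced by the pad token, since -100 is not a real token id.

```python
import torch

def shift_tokens_right(labels: torch.Tensor, pad_token_id: int, decoder_start_token_id: int) -> torch.Tensor:
    """Shift labels one position to the right to build decoder inputs (sketch)."""
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    # -100 only marks positions ignored by the loss; swap it for the pad token
    # so the decoder never sees an invalid id.
    shifted.masked_fill_(shifted == -100, pad_token_id)
    return shifted
```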
From the `src/transformers` side, we can completely remove the deprecation warning and the `ignore_pad_token_for_loss` arg IMO. On the other hand, it seemed like `ignore_pad_token_for_loss` was the default case for all training code in `examples/seq2seq`, so it might be good to keep it as it is now for some time.
Will adapt the TODO and the warning then, and check all mentions in the examples. Once we're sure it's not used anywhere, I will remove the argument. But let's try to have it removed before the next release, so removing that arg is never a real "breaking change".
On second thought, I will have to change all the scripts for the `max_target_length` -> `max_length` rename anyway, so I'll remove those `ignore_pad_token_for_loss` in passing. (I just realized it is never actually used inside the code and the tests are still passing.)
src/transformers/trainer_pt_utils.py
Outdated
epsilon (:obj:`float`, `optional`, defaults to 0.1):
    The label smoothing factor.
ignore_index (:obj:`int`, `optional`, defaults to -100):
    The index in the labels to ignore when computing the loss. Set to :obj:`None` to ignore none of the
(question) Not sure if the "Set to :obj:`None` to ignore none of the indices" part is really required. For me, the default case should be -100, but if the user does not want to ignore any indices, there should simply be no -100 in the labels, no? Or is there a case where one would have -100 in the labels but still want to take those ignored labels into account for label smoothing?
Yeah, the docstring is just badly worded.
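For reference, a minimal sketch of the kind of computation under discussion (assumed shapes and names, not the library's exact implementation): positions whose label equals `ignore_index` are dropped from both the NLL term and the smoothing term, and the result is averaged over the remaining tokens.

```python
import torch
import torch.nn.functional as F

def label_smoothed_loss(logits, labels, epsilon=0.1, ignore_index=-100):
    """Label-smoothed cross entropy for logits (batch, seq, vocab) and labels (batch, seq)."""
    log_probs = -F.log_softmax(logits, dim=-1)
    padding_mask = labels.eq(ignore_index)
    # Clamp so gather() never sees -100; masked positions are zeroed out below anyway.
    nll_loss = log_probs.gather(dim=-1, index=labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    smoothed_loss = log_probs.mean(dim=-1)  # loss against a uniform distribution over the vocab
    nll_loss = nll_loss.masked_fill(padding_mask, 0.0)
    smoothed_loss = smoothed_loss.masked_fill(padding_mask, 0.0)
    num_active = (~padding_mask).sum()  # tokens that actually carry a label
    return ((1 - epsilon) * nll_loss.sum() + epsilon * smoothed_loss.sum()) / num_active
```

With `epsilon=0` this reduces to the usual mean-reduced NLL over the non-ignored tokens, which is what keeps the smoothed and non-smoothed losses on the same scale.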
src/transformers/trainer_seq2seq.py
Outdated
eval_dataset: Optional[Dataset] = None,
ignore_keys: Optional[List[str]] = None,
metric_key_prefix: str = "eval",
max_target_length: Optional[int] = None,
Would also prefer the arg to be named `max_length` instead of `max_target_length`.
@@ -56,3 +59,31 @@ def test_distributed_tensor_gatherer(self):
self.assertTrue(np.array_equal(result[0], predictions))
self.assertTrue(np.array_equal(result[1][0], predictions))
self.assertTrue(np.array_equal(result[1][1], predictions))

def test_label_smoothing(self):
nice test!
Great! Mostly nits from my side. Two things I think are worth considering:
- change the naming from `max_target_length` -> `max_length`. I think it's clearer this way that the args of `predict` and `eval` correspond 1-to-1 to the `max_length` of `generate()` (see the sketch after this list)
- since Seq2SeqTrainer is in `src/transformers` now, it might be worth also adding a test for it in `tests/test_trainer.py` IMO
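To make the 1-to-1 correspondence concrete, here is a small hedged sketch (a standalone helper, not the code added by this PR) of how evaluation-time `max_length`/`num_beams` can share names with, and fall back to, the `model.config` values that `generate()` uses:

```python
import torch

def generate_predictions(model, tokenizer, texts, max_length=None, num_beams=None):
    """Sketch: the knobs exposed to the caller keep the names of model.generate()."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        generated = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            # Same names and same fallbacks as generate() itself.
            max_length=max_length if max_length is not None else model.config.max_length,
            num_beams=num_beams if num_beams is not None else model.config.num_beams,
        )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```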
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
@@ -86,28 +93,21 @@ class DataTrainingArguments:
    "than this will be truncated, sequences shorter will be padded."
    },
)
- max_target_length: Optional[int] = field(
+ max_length: Optional[int] = field(
Why was this done at the last second? This and all the renames below just broke all the scripts that were relying on the --max_target_length and --val_max_target_length args.
The discussion and the agreement, as I understood it, was to rename only the new arguments you added to `evaluate` + `predict` and leave everything else as before. We need source and target for datasets.
Sorry, I should have been a bit more careful on my second review of the PR... I'd probably prefer replacing all `--val_max_target_length` and `test_max_target_length` with `--max_length`, but I'm also fine with reverting the change on `DataTrainingArguments`.
Thank you for clarifying - so indeed you didn't suggest changing the cl args, as I had read it.
I guess @sgugger read your mind then. I have no problem with the last-minute change this PR introduced, other than that such changes need a deprecation cycle, because it really interferes with my work integrating / benchmarking / testing deepspeed and fairscale, as all my scripts use those args.
Then if we are now switching to refactoring the cl args, there should be no `eval_max_length` either, no?
IMO, one `max_length` should be enough. It does not make much sense to tokenize longer than the model is able to generate, and the other way around would result in errors. So IMO the best approach is to have one `max_length` referring to the maximum length that the model should be able to generate.
The question then is whether we need a `max_source_length` or `max_seq_length` (as it's called in `run_mlm.py`) or not. I think you guys have more experience here.
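A hedged sketch of the two distinct knobs being weighed here (the names `max_source_length` / `max_length` are the ones floated in this thread, not a settled API): the source-side length only governs input truncation at tokenization time, while a single target-side `max_length` can bound both the tokenized labels and generation.

```python
def preprocess(examples, tokenizer, max_source_length=512, max_length=128):
    """Sketch of seq2seq preprocessing with one length per side (assumed column names)."""
    # Source side: max_source_length only controls how the inputs are truncated.
    model_inputs = tokenizer(examples["source"], max_length=max_source_length, truncation=True)
    # Target side: the same max_length would later be passed to generate() at eval time.
    labels = tokenizer(examples["target"], max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```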
@patrickvonplaten - did you mean to suggest to change the new arguments to evaluate/predict that this PR adds, or to rename …
I was more referring to the args of the functions, but more generally I think it would actually be better if there were only one …
I think it's better to keep … We can get rid of … Sorry about the miscommunication.
I don't know whether this is a good time, but should we add …
The best way to decide about the naming is to show a few use cases - then it's clear whether these are needed or not. Please don't forget that Sam, or whoever added those in the first place, had a good reason for it, so it'd be hard to make a decision in a void. Perhaps such a decision shouldn't be rushed, but made into an RFC, inviting input from users who have many more use cases?
* Add label smoothing in Trainer
* Add options for scheduler and Adafactor in Trainer
* Put Seq2SeqTrainer in the main lib
* Apply suggestions from code review
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Address review comments and adapt scripts
* Documentation
* Move test not using script to tests folder
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
What does this PR do?
This PR graduates `Seq2SeqTrainer` and moves it inside the transformers library. By doing so, it moves some of the features of the `Seq2SeqTrainer` inside the main `Trainer` and leaves some in the subclass. More precisely, the following features will be available in the general `Trainer`:
- label smoothing, moved into `Trainer` (so it can be used in all classification problems), with an easier API and bug fixes (the current implementation did not work with -100 as an ignore index for the loss, now it does; also, the current implementation returns a loss that is not averaged and thus too big, see below)
- options for the scheduler and the Adafactor optimizer

The sortish-sampling and predict with generate are left in the subclass.
There are also a few breaking changes in the `Seq2SeqTrainer` API to make its init match the one of `Trainer`: mainly, the init does not take a `config` and a `DataArguments`; instead:
- the generation parameters (maximum target length and number of beams) are passed directly to `evaluate` or `predict`
- `ignore_pad_token_for_loss` is passed in the init but is deprecated, since it should be removed once BART and its subclasses use -100.

About label smoothing and the mean vs sum: the current implementation takes the sum over the batch size and sequence length (only counting tokens that have a label != -100). This gives something that does not have the same scale as the usual cross-entropy loss (which is the mean over the same dimensions) and thus would require a special learning rate to be useful. With the fix, the label-smoothed loss has the same scale as the non-smoothed loss, which means the same command with the same learning rate should produce comparable results.
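To illustrate the scale argument with a toy, hedged example (random tensors, just to show orders of magnitude): summing per-token losses grows with batch size and sequence length, while taking the mean keeps the value on the same scale as the standard cross-entropy loss, so the usual learning rates stay applicable.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab = 8, 128, 1000
logits = torch.randn(batch, seq_len, vocab)
labels = torch.randint(0, vocab, (batch, seq_len))

per_token = F.cross_entropy(logits.view(-1, vocab), labels.view(-1), reduction="none")
print(per_token.sum().item())   # grows with batch * seq_len: in the thousands here
print(per_token.mean().item())  # same scale as the usual mean-reduced loss (roughly log(vocab))
```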