Seq2seq trainer #9241

Merged · 7 commits into master on Dec 22, 2020

Conversation

@sgugger (Collaborator) commented on Dec 21, 2020

What does this PR do?

This PR graduates Seq2SeqTrainer and moves it inside the transformers library. By doing so, it moves some of the features of the Seq2SeqTrainer inside the main Trainer and leaves some in the subclass. More precisely, the following features will be available in the general Trainer:

  • label smoothing is passed to the main Trainer (so it can be used in all classification problems), with an easier API and bug fixes (the current implementation did not work with -100 as an ignore index for the loss, now it does; it also returns a loss that is not averaged and is thus too large, see below)
  • the ability to pick any scheduler
  • the ability to use Adafactor instead of AdamW

The sortish-sampling and predict with generate are left in the subclass.
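
A rough sketch of how the graduated features fit together (illustrative only: the flag names lr_scheduler_type, adafactor, and label_smoothing_factor are assumptions about this PR's additions to TrainingArguments, and model/train_dataset are assumed to be defined elsewhere):

```python
from transformers import Trainer, TrainingArguments

# Sketch of the features graduated into the main Trainer; flag names are
# assumptions based on the PR description, not a verified signature.
args = TrainingArguments(
    output_dir="out",
    lr_scheduler_type="cosine",   # pick any supported scheduler
    adafactor=True,               # use Adafactor instead of AdamW
    label_smoothing_factor=0.1,   # label smoothing now in the main Trainer
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
```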

There are also a few breaking changes in the Seq2SeqTrainer API to make its init match that of Trainer: mainly, the init no longer takes a config and a DataArguments. Instead:

  • the token IDs are taken from the tokenizer
  • the arguments for generation are passed to evaluate or predict (see the sketch after this list)
  • the ignore_pad_token_for_loss is passed in the init but is deprecated, since it should be removed once BART and its subclasses use -100.
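
A minimal sketch of the resulting API (illustrative; the max_length/num_beams keyword names follow the final naming settled on below, and model, datasets, and tokenizer are assumed to be defined):

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(output_dir="out", predict_with_generate=True)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,  # token IDs for generation come from the tokenizer
)
# Generation arguments are passed to evaluate/predict directly:
metrics = trainer.evaluate(max_length=128, num_beams=4)
predictions = trainer.predict(test_dataset, max_length=128, num_beams=4)
```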

About label smoothing and the mean vs. sum: the current implementation takes the sum over the batch size and sequence length (only counting tokens that have a label != -100). This does not have the same scale as the usual cross-entropy loss (which is the mean over the same dimensions) and thus would require a special learning rate to be useful. With the fix, the label-smoothed loss has the same scale as the non-smoothed loss, which means the same command with the same learning rate should produce comparable results.
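
A minimal sketch of the fixed computation (illustrative, not the PR's exact code; the names epsilon and ignore_index follow the LabelSmoother docstring quoted further down):

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll_loss(logits, labels, epsilon=0.1, ignore_index=-100):
    """Label-smoothed loss averaged (not summed) over non-ignored tokens."""
    log_probs = F.log_softmax(logits, dim=-1)
    padding_mask = labels.eq(ignore_index)
    # Replace ignored positions with a valid index so gather doesn't fail,
    # then zero out their contribution.
    safe_labels = labels.clamp(min=0)
    nll_loss = -log_probs.gather(dim=-1, index=safe_labels.unsqueeze(-1)).squeeze(-1)
    smooth_loss = -log_probs.mean(dim=-1)  # expected loss under a uniform target
    nll_loss = nll_loss.masked_fill(padding_mask, 0.0)
    smooth_loss = smooth_loss.masked_fill(padding_mask, 0.0)
    # Averaging over tokens with label != ignore_index keeps the usual
    # cross-entropy scale (the old implementation summed instead).
    num_active = (~padding_mask).sum()
    return ((1 - epsilon) * nll_loss.sum() + epsilon * smooth_loss.sum()) / num_active
```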

@sgugger (Collaborator, Author) commented on Dec 21, 2020

Oh and this PR does not delete the old Seq2SeqTrainer just yet since I'm planning to merge it just before my vacation. That way if something goes horribly wrong, people can still use the old Seq2SeqTrainer. The plan would be to delete it in January after the new Seq2SeqTrainer has been tested.

@stas00 (Contributor) left a comment

The porting part looks good, with a few suggestions. I re-ran the tests and found an issue that predates this PR; I will post it separately.

I haven't delved into label smoothing yet, so I've skipped over the logic in those parts.

    test_output = trainer.predict(
        test_dataset=test_dataset,
        metric_key_prefix="test",
        max_target_length=data_args.test_max_target_length,

@stas00 (Contributor) commented on Dec 22, 2020

Not sure why GitHub doesn't let me comment on unmodified code in PRs; it makes no sense, since code added in one place can impact code that wasn't changed.

So we need to change the test_max_target_length help to indicate that it will now be used during predict (and not val_max_target_length, as it was until now).

So there are two places to modify:

  1. for val_max_target_length:
- " This argument is also used to override the ``max_length`` param of ``model.generate``, which is used during ``evaluate`` and ``predict``"
+ " This argument is also used to override the ``max_length`` param of ``model.generate``, which is used during ``evaluate``"
  2. the help for test_max_target_length: same as above, but the last word is ``predict``.

@stas00 (Contributor) commented on Dec 22, 2020

And it's a breaking change: you mentioned the new args, but not that predict now uses test_max_target_length for generate's max_length arg.

@stas00 (Contributor) commented on Dec 22, 2020

Perhaps more importantly if we are adding new args, why not match the core functions' naming?

It's somewhat confusing that generate and model.config use the max_length arg/attribute, while trainer.predict and trainer.evaluate use max_target_length.

But I can see where one would say the opposite - that it should have been max_target_length everywhere all along.

(and there is no min_length override, but that's another story, unrelated to this PR)

A contributor commented

For s2s, max_target_length makes sense IMO, since here we have two seqs: a source seq and a target seq. Just max_length might create some confusion.

@stas00 (Contributor) commented on Dec 22, 2020

Just to clarify: I'm talking about the argument name of predict, which then calls generate.

In no way was I proposing to change the cl args.

Are we on the same page?

A contributor commented

Also, do we really need val_max_length and test_max_length? Does it make much sense to have different values for those two? I can see why one would have data_args.val_beams smaller than data_args.eval_beams, but for max_length as well?

A contributor commented

It was added for convenience, but it's not super important; it makes sense to keep just one max_target_length.

@sgugger (Collaborator, Author) commented

I haven't touched anything in the script except using test_max_length here, which was never used before; I thought that was just a mistake. Since there is a lot of debate on this and I'm not sure I understand what you want me to do, I will revert that and leave it as it was before. Then we can do some renaming in a separate PR (or please be clearer about what you want me to do :-) ).

@sgugger (Collaborator, Author) commented

Oh, I think I get it: all max_target_length -> max_length, and use one eval_max_length, like eval_num_beams. Will push that.

A contributor commented

exactly, sorry for the confusing comments ;)

src/transformers/optimization.py — outdated review thread, resolved
src/transformers/training_args_seq2seq.py — outdated review thread, resolved

@patil-suraj (Contributor) left a comment


This looks good to me, thanks Sylvain!

    class Seq2SeqTrainer(Trainer):
        def __init__(self, *args, ignore_pad_token_for_loss=None, **kwargs):
            super().__init__(*args, **kwargs)
            # TODO: When BART uses -100 everywhere, this can be removed entirely.

A contributor commented


BART now works with -100; the issue was with label smoothing (which you fixed) and the shift_tokens_right function, which is fixed in #9135.
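
For context, a sketch of the kind of fix #9135 applies (illustrative; see that PR for the exact code): labels are shifted right to build decoder inputs, and any -100 values are replaced with the pad token so the decoder never sees an invalid ID:

```python
import torch

def shift_tokens_right(input_ids, pad_token_id, decoder_start_token_id):
    """Shift labels one position to the right to form decoder inputs."""
    shifted = input_ids.new_zeros(input_ids.shape)
    shifted[:, 1:] = input_ids[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    # Replace any -100 in the labels with the pad token id.
    shifted.masked_fill_(shifted == -100, pad_token_id)
    return shifted
```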

@patrickvonplaten (Contributor) commented on Dec 22, 2020

From the src/transformers side, we can completely remove the deprecation warning and the ignore_pad_token_for_loss arg IMO. On the other hand, it seems ignore_pad_token_for_loss was the default case for all training code in examples/seq2seq, so it might be good to keep it as it is now for some time.
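
(For illustration, the kind of deprecation path being discussed, as a sketch with assumed wording:)

```python
import warnings

# Hypothetical deprecation shim in Seq2SeqTrainer.__init__; the exact
# message and timeline are up to the PR authors.
if ignore_pad_token_for_loss is not None:
    warnings.warn(
        "`ignore_pad_token_for_loss` is deprecated and will be removed once "
        "BART and its subclasses use -100 as the ignored loss index.",
        FutureWarning,
    )
```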

@sgugger (Collaborator, Author) commented

Will adapt the TODO and the warning then, and check all mentions in the examples. Once we're sure it's not used anywhere, I will remove the argument. Let's try to have it removed before the next release, so removing that arg is never a real "breaking change".

@sgugger (Collaborator, Author) commented

On second thought, I will have to change all the scripts for the max_target_length -> max_length rename anyway, so I'll remove those ignore_pad_token_for_loss uses in passing. (I just realized it is never actually used inside the code and the tests still pass.)

    epsilon (:obj:`float`, `optional`, defaults to 0.1):
        The label smoothing factor.
    ignore_index (:obj:`int`, `optional`, defaults to -100):
        The index in the labels to ignore when computing the loss. Set to :obj:`None` to ignore none of the
        indices.

@patrickvonplaten (Contributor) commented on Dec 22, 2020


(question) Not sure the "Set to :obj:`None` to ignore none of the indices" part is really required. For me, the default case should be -100, but if the user does not want to ignore any indices, there should simply be no -100 in the labels, no? Or is there a case where one would have -100 in the labels but still want to take those ignored labels into account for label smoothing?

@sgugger (Collaborator, Author) commented

Yeah, the docstring is just badly worded.

    eval_dataset: Optional[Dataset] = None,
    ignore_keys: Optional[List[str]] = None,
    metric_key_prefix: str = "eval",
    max_target_length: Optional[int] = None,

@patrickvonplaten (Contributor) commented on Dec 22, 2020


Would also prefer the arg to be named max_length instead of max_target_length.

@@ -56,3 +59,31 @@ def test_distributed_tensor_gatherer(self):
        self.assertTrue(np.array_equal(result[0], predictions))
        self.assertTrue(np.array_equal(result[1][0], predictions))
        self.assertTrue(np.array_equal(result[1][1], predictions))

    def test_label_smoothing(self):

A contributor commented


nice test!
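
(A standalone sanity check in the spirit of that test, showing the scale issue described in the PR summary; illustrative only, not the PR's actual test code:)

```python
import torch
import torch.nn.functional as F

# A summed loss grows with the number of active tokens, while the mean
# keeps the usual cross-entropy scale; the fix averages instead of summing.
logits = torch.randn(4, 10, 32)        # (batch, seq_len, vocab)
labels = torch.randint(0, 32, (4, 10))
labels[:, -2:] = -100                  # some ignored positions
flat = logits.transpose(1, 2)          # cross_entropy wants (batch, vocab, seq)
mean_loss = F.cross_entropy(flat, labels, ignore_index=-100)
sum_loss = F.cross_entropy(flat, labels, ignore_index=-100, reduction="sum")
num_active = (labels != -100).sum()
assert torch.allclose(sum_loss / num_active, mean_loss)
```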

@patrickvonplaten (Contributor) left a comment

Great! Mostly nits from my side. Two things I think are worth considering:

  • change naming from max_target_length -> max_length. I think it's clearer this way that the args of predict and evaluate correspond 1-to-1 to the max_length of generate()
  • since Seq2SeqTrainer is in src/transformers now, it might be worth also adding a test for it in tests/test_trainer.py IMO

sgugger and others added 4 commits December 22, 2020 10:06
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
@sgugger merged commit 490b39e into master on Dec 22, 2020
@sgugger deleted the seq2seq_trainer branch on December 22, 2020 at 16:33
@@ -86,28 +93,21 @@ class DataTrainingArguments:
            "than this will be truncated, sequences shorter will be padded."
        },
    )
-   max_target_length: Optional[int] = field(
+   max_length: Optional[int] = field(

A contributor commented


Why was this done at the last second? This and all the renames below just broke all the scripts that were relying on the --max_target_length and --val_max_target_length args.

@stas00 (Contributor) commented on Dec 22, 2020

The discussion and the agreement, as I understood it, was to rename only the new arguments you added to evaluate/predict, leaving everything else as before. We need source and target for datasets.

A contributor commented

Sorry, I should have been a bit more careful on my second review of the PR... I'd probably prefer replacing all --val_max_target_length and --test_max_target_length args with --max_length, but I'm also fine with reverting the change to DataTrainingArguments.

@stas00 (Contributor) commented on Dec 22, 2020

Thank you for clarifying; so indeed you didn't suggest changing the cl args, as I had read it.

I guess @sgugger read your mind then. I have no problem with the last-minute change this PR introduced, other than that such changes need a deprecation cycle; it really interferes with my work on integrating/benchmarking/testing deepspeed and fairscale, as all my scripts use those args.

Then, if we are now switching to refactoring the cl args, there should be no eval_max_length either, no?

@patrickvonplaten (Contributor) commented on Dec 22, 2020

IMO, one max_length should be enough. It does not make much sense to tokenize longer than the model is able to generate, and the other way around would result in errors. So IMO the best approach is to have one max_length referring to the maximum length the model should be able to generate.

The question then is whether we need a max_source_length (or max_seq_length, as it's called in run_mlm.py) or not. I think you guys have more experience here.

@stas00 (Contributor) commented on Dec 22, 2020

> change naming from max_target_length -> max_length. Think it's clearer this way that the args of predict and eval correspond 1-to-1 to the max_length of generate()

@patrickvonplaten - did you mean to suggest changing the new arguments to evaluate/predict that this PR adds, or renaming the --max_target_length and --val_max_target_length cl args?

@patrickvonplaten (Contributor) commented on Dec 22, 2020

> suggest

I was referring more to the args of the functions, but more generally I think it would actually be better if there were only one max_length in the data_args: leave the max_length that we have now and completely remove the source_max_length argument. IMO, source_max_length should be fully defined by the tokenizer of the model. I don't really see a need to let the user define the maximum input length, but this is probably better done in a separate PR. On the other hand, we also have a max_seq_length argument in run_mlm.py, so I'm not 100% sure what's best here... What is your opinion, @stas00 @sgugger @patil-suraj?

@patil-suraj (Contributor) commented

I think it's better to keep max_source_length and max_target_length, since in some cases the input length can be way shorter than the tokenizer's or model's max length, and these two can be used to truncate the text.

We can get rid of val_max_target_length and test_max_target_length in DataTrainingArguments, since in almost all scripts we use the same length for all three arguments (max_target_length, val_max_target_length, test_max_target_length). Then we could pass the same max_target_length to both the evaluate and predict methods (see the sketch below).

Sorry about the miscommunication.
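
(A minimal sketch of why separate source/target lengths are useful: each side is truncated independently. Illustrative only; source_texts/target_texts and the data_args fields are assumed, the example scripts' actual preprocessing differs, and target tokenization details vary by model:)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
# Truncate the source and target sides with independent length limits.
inputs = tokenizer(source_texts, max_length=data_args.max_source_length, truncation=True)
targets = tokenizer(target_texts, max_length=data_args.max_target_length, truncation=True)
inputs["labels"] = targets["input_ids"]
```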

@stas00 (Contributor) commented on Dec 22, 2020

I don't know whether this is a good time, but should we add min_length as well while this part of the API is being redesigned? Surely, if generate has min_length, someone might need to redefine it too? But I'm totally fine with deferring this until/unless someone asks for it.

@stas00 (Contributor) commented on Dec 22, 2020

The best way to decide about the naming is to show a few use cases; then it's clear whether these are needed or not.

Please don't forget that Sam or whoever added those in the first place had a good reason for it, so it'd be hard to make a decision in a void.

Perhaps such a decision shouldn't be rushed, but made into an RFC, inviting input from users who have a lot more use cases?

thevasudevgupta pushed a commit to thevasudevgupta/transformers that referenced this pull request Dec 23, 2020
* Add label smoothing in Trainer

* Add options for scheduler and Adafactor in Trainer

* Put Seq2SeqTrainer in the main lib

* Apply suggestions from code review

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Address review comments and adapt scripts

* Documentation

* Move test not using script to tests folder

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
guyrosin pushed a commit to guyrosin/transformers that referenced this pull request Jan 15, 2021 (same commit message as above)
@sgugger restored the seq2seq_trainer branch on January 20, 2021 at 03:05
@LysandreJik deleted the seq2seq_trainer branch on April 27, 2022 at 15:57