
[Seq2Seq] Allow EncoderDecoderModels to be trained with Seq2Seq #7809

Merged

Conversation

Contributor

@patrickvonplaten patrickvonplaten commented Oct 15, 2020

What does this PR do?

This PR changes the Seq2Seq Trainer a bit to:

  1. Make it work with EncoderDecoder
  2. Align its API more with the general Trainer

@sshleifer @patil-suraj @sgugger - it would be great if you could take a look and give your general opinion on it :-)
If this is OK with you, I will fix the examples test.

@@ -41,12 +41,13 @@


class Seq2SeqTrainer(Trainer):
def __init__(self, config, data_args, *args, **kwargs):
def __init__(self, *args, **kwargs):
Contributor Author

@patil-suraj @sshleifer - I think it would be better to align the init of Seq2SeqTrainer 100% with Trainer.
Is there a reason why we would pass in config instead of using the model's config?

Also, I don't really think the variable data_args is necessary. Both max_length and num_beams can be defined in the config and don't have to be force-passed to the generate() method.

Contributor

@patil-suraj patil-suraj Oct 15, 2020

model.config breaks under DistributedDataParallel, so we decided to pass it explicitly. See #7461 and #7460.

If the default num_beams and max_length are too high, they'll slow down evaluation, so we allow the user to control them during training. And we're not overriding the config, since the defaults will be needed for inference after training.

Contributor Author

Okay, I see! I'm a bit confused why Trainer does not break with DistributedDataParallel when only using model.config..., but Seq2SeqTrainer does. Do you guys know why?

Contributor Author

eval_beams/eval_max_gen_length reasoning:
@patil-suraj said exactly this LOL, but in my words:
users are not good at modifying configs locally. We want to have a way to run num_beams=2 during the generation step, but then end up with a trained model with the default # beams. In general, we try not to manipulate config attributes that would only be desired during training.

I mean, modifying the config locally is as simple as config.num_beams = 4, and I would think one wants to evaluate a model during training with exactly the beam size and max_length that are stored in the config (changing the beam size and max_length does not simply reduce time, it also changes the output...). But I can see the use case where people want to tweak max_length and num_beams without changing the config. Would it be fine to make data_args optional and call them generation_args, so that they are just passed as **generation_args to the generate function?

Contributor Author

Oops, that was supposed to land further below, not here. @sshleifer for reference.

Contributor

@sshleifer sshleifer Oct 16, 2020

data_args -> generation_kwargs seems like a good change (at least in seq2seq_trainer.py), but the CLI naming has a purpose:
It wouldn't have been obvious to me that passing --min_length 32 would affect generation, rather than truncating source docs. That's why the eval_ prefix was added.

if self.args.label_smoothing == 0:
# Same behavior as modeling_bart.py
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=self.config.pad_token_id)
Contributor Author

This does not seem to work for all models (EncoderDecoderModel does not work with it) -> Let's instead use the loss function of each model here.
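
For context, a minimal sketch of what "use the loss function of each model" could look like inside compute_loss, assuming inputs already contains labels; this is an illustration, not the exact code of this PR:

def compute_loss(model, inputs):
    # Let the model compute its own loss by passing `labels`; when labels are
    # provided, transformers models return the loss as the first output.
    outputs = model(**inputs)  # `inputs` is assumed to contain "labels"
    return outputs[0]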

Contributor

The models' loss functions use -100 as ignore_index, so we will also need to replace pad tokens in the labels with -100.

Contributor Author

I usually do this manually before -> should that be the role of the Seq2SeqTrainer? Trainer also does not have this feature

Contributor

Ignoring pad_token_id confused lots of people and helps metrics, so we automated it.
Related: #7828

Contributor

I usually do this manually before

We could do this in the collator, but we won't need to if #7828 is merged.

Contributor

@sshleifer sshleifer Oct 16, 2020

We will still need to cover FSMT/T5.
I would definitely not make this change right now; it works as is and is much easier than checking that every model ignores padding.

Contributor Author

PyTorch's CE loss function has -100 as the default ignore_index, and from what I understood it is the default behavior of the library to ignore tokens when they have the index -100, not when they are equal to the padding token (often we set padding token == -100): https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html

It would require models to manually replace tokens with -100, but I think that's how it should be done in general in the library. How would we handle models that don't have a padding_token or want to disregard the loss of more than just the padding token? For such cases I think it can be quite handy if the user overwrites all labels they do not want to consider with -100.
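
A minimal PyTorch illustration of that point (values are made up; pad_token_id is an assumption for the example):

import torch

pad_token_id = 0                                  # assumed for this example
logits = torch.randn(2, 5, 100)                   # (batch, seq_len, vocab_size)
labels = torch.randint(1, 100, (2, 5))
labels[:, -2:] = pad_token_id                     # pretend the last positions are padding

labels[labels == pad_token_id] = -100             # mask padding so the loss ignores it
loss_fct = torch.nn.CrossEntropyLoss()            # ignore_index defaults to -100
loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))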

Contributor

we will discuss on zoom!

# in case the batch is shorter than max length, the output should be padded
generated_tokens = self._pad_tensors_to_max_len(generated_tokens, self.model.config.max_length)

# compute loss on predict data
with torch.no_grad():
Contributor Author

generate() is always in torch.no_grad() context.

Collaborator

@sgugger sgugger left a comment

Looks good to me, but I'm not an expert on seq2seq.
I'm all for aligning the signatures with the general Trainer, though; thanks for doing that! We can have a Seq2SeqTrainingArguments that subclasses TrainingArguments if that helps.


attention_mask=inputs["attention_mask"],
use_cache=True,
num_beams=self.data_args.eval_beams,
max_length=self.max_gen_length,
Contributor

We need this if eval_beams and max_length are different from the defaults.
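
A sketch of what passing these overrides to generate() looks like; the attribute names follow the data args discussed in this thread, and the fallbacks to the model config are an assumption, not the verbatim PR code:

generated_tokens = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    num_beams=data_args.eval_beams or model.config.num_beams,
    max_length=data_args.eval_max_gen_length or model.config.max_length,
)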

@patil-suraj
Contributor

LGTM, thanks for aligning it! We just need some way to pass eval_beams and max_gen_length.

We can have a Seq2SeqTrainingArguments that subclasses TrainingArguments if that helps.

@sgugger we do have Seq2SeqTrainingArguments class

class Seq2SeqTrainingArguments(TrainingArguments):

@sgugger
Collaborator

sgugger commented Oct 15, 2020

@sgugger we do have Seq2SeqTrainingArguments class

Ah, had forgotten about that :-)

@sshleifer
Contributor

sshleifer commented Oct 15, 2020

eval_beams/eval_max_gen_length reasoning:
@patil-suraj said exactly this LOL, but in my words:
users are not good at modifying configs locally. We want to have a way to run num_beams=2 during the generation step, but then end up with a trained model with the default # beams. In general, we try not to manipulate config attributes that would only be desired during training.

@sshleifer
Contributor

Also would <3 an encoder decoder test in examples/seq2seq/test_finetune_trainer.py.

@patrickvonplaten patrickvonplaten changed the title [Examples] Align Seq2Seq Trainer with Trainer [Examples] Allow EncoderDecoderModels to be trained with Seq2Seq Oct 18, 2020
@@ -230,7 +233,7 @@ def main():
freeze_params(model.get_encoder())
assert_all_frozen(model.get_encoder())

dataset_class = Seq2SeqDataset if hasattr(tokenizer, "prepare_seq2seq_batch") else LegacySeq2SeqDataset
Contributor Author

prepare_seq2seq_batch is now defined as a method on PreTrainedTokenizer, so this can never be False.
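
Rough usage of prepare_seq2seq_batch around the time of this PR; the exact signature and returned keys may differ between versions, so treat this as a sketch:

batch = tokenizer.prepare_seq2seq_batch(
    src_texts=["A long news article ..."],
    tgt_texts=["A short summary ..."],
    max_length=512,
    max_target_length=64,
    return_tensors="pt",
)
# returns a BatchEncoding with the tokenized source inputs and target ids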

Contributor

great catch

@@ -137,6 +136,10 @@ class DataTrainingArguments:
src_lang: Optional[str] = field(default=None, metadata={"help": "Source language id for translation."})
tgt_lang: Optional[str] = field(default=None, metadata={"help": "Target language id for translation."})
eval_beams: Optional[int] = field(default=None, metadata={"help": "# num_beams to use for evaluation."})
ignore_pad_token_for_loss: bool = field(
Contributor Author

Set to True for backward compatibility.

Contributor

thanks!

@patrickvonplaten
Contributor Author

After discussion with @sshleifer - changed the Seq2SeqTrainer to be fully backwards compatible and to work with EncoderDecoder.
@sshleifer - I cannot add an EncDec test yet because the complete command line setup is too constrained (it requires prepare_seq2seq_batch to be defined for all tokenizers, etc...) => will see how to add this in the future.

@sshleifer , @patil-suraj - could you do another review please? :-)

Contributor

@sshleifer sshleifer left a comment

I think if we are trying to show that config.pad_token_id is not mandatory, we should add a test, even if that test does not use the command line interface. Sorry for being difficult.

**gen_kwargs,
)
# in case the batch is shorter than max length, the output should be padded
if self.config.pad_token_id is not None:
Contributor

I would expect this case to break. _pad_tensors_to_max_len is needed for some sort of Trainer/consistent shapes reason @patil-suraj.

Contributor

Yes, Trainer expects all returned preds to be of the same shape, since it concatenates them across eval batches.
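
For reference, a minimal sketch of the idea behind _pad_tensors_to_max_len (not the exact implementation): right-pad every batch of generated ids to the same width so Trainer can concatenate them.

import torch

def pad_to_max_len(tensor: torch.Tensor, max_length: int, pad_token_id: int) -> torch.Tensor:
    # Create a (batch_size, max_length) tensor filled with the pad id and copy
    # the generated ids into its left part.
    padded = tensor.new_full((tensor.shape[0], max_length), pad_token_id)
    padded[:, : tensor.shape[-1]] = tensor
    return padded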

Contributor Author

I don't get it -> if config.pad_token_id is not defined we cannot run _pad_tensors_to_max_len. How is this breaking anything? I am running all my experiments with no pad_token_id defined, so this case works.

Contributor

Since Trainer concatenates the preds, I assume they should be of the same length across batches. It was breaking in my last experiment when not using _pad_tensors_to_max_len.

Contributor Author

I think this is fine -> see the test I added for bert2bert. Such a model does not have a self.config.pad_token_id defined and still works.

@@ -41,12 +41,21 @@


class Seq2SeqTrainer(Trainer):
def __init__(self, config, data_args, *args, **kwargs):
def __init__(self, config=None, data_args=None, *args, **kwargs):
Contributor Author

Make those variables optional to align better with Trainer and to keep 100% backwards compatibility
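
A simplified sketch of how such an optional signature can stay backwards compatible while falling back to the model's config (assumed, not the verbatim PR code):

from transformers import Trainer

class Seq2SeqTrainer(Trainer):
    def __init__(self, config=None, data_args=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Fall back to the model's config when no config is passed explicitly.
        self.config = config if config is not None else self.model.config
        self.data_args = data_args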

# set all ids to -100 to be ignored
if self.data_args is not None and self.data_args.ignore_pad_token_for_loss:
assert self.config.pad_token_id >= 0, "Make sure that `config.pad_token_id` is correctly defined"
inputs["labels"][inputs["labels"] == self.config.pad_token_id] = -100
Contributor Author

This keeps 100% backwards compatibility

Contributor

Will this cause a TPU issue @LysandreJik ?

Contributor Author

Why would this cause a TPU issue? All of our models work with -100 to ignore CE loss

Contributor

I think some tensor manipulations/assignments on TPU require sending the tensor back to the CPU to do the op and then returning it to the TPU. Lys told me it was bad to assign -100 into inputs['labels'] for that reason. In that case we could do this in the collator, I guess.

Contributor

@sshleifer sshleifer left a comment

Thanks for the test!
Two comments about moving asserts to __init__ for quicker failure.
Most important is whether the line I tagged Lysandre on causes a TPU slowdown.



@patrickvonplaten
Contributor Author

Should be good - I don't really see how -100 would slow down the TPU, but let's wait for @LysandreJik's opinion here.

@sgugger
Collaborator

sgugger commented Oct 22, 2020

Can't seem to reply to the comment, but yes, the line @sshleifer is pointing at will slow down on TPU since it's probably using a torch.where behind the scenes, which does not have an XLA operation AFAIK.

@patrickvonplaten
Contributor Author

Can't seem to reply to the comment, but yes, the line @sshleifer is pointing at will slow down on TPU since it's probably using a torch.where behind the scenes, which does not have an XLA operation AFAIK.

Okay, I see -> let's move back to the old CE loss function then to keep backward compatibility!

@sshleifer - one last review please :-)

else:
# compute label smoothed loss
labels = inputs.pop("labels")
logits = model(**inputs, use_cache=False)[0]
Contributor

I think use_cache=False everywhere or nowhere

Contributor Author

Removed it - I think it's better this way, so as not to give the false impression that use_cache=True will break training. All models have use_cache=True by default and training works by default. It's all about whether past_key_values are inserted or not.

Contributor Author

Oh this actually breaks a test - it shouldn't. This is related to this Bart bug we never solved: #6353 :-/

Contributor Author

@patrickvonplaten patrickvonplaten Oct 23, 2020

will add use_cache=False again for now and remove it when fixing the bug in Bart.

bert2bert.config.vocab_size = bert2bert.config.encoder.vocab_size
bert2bert.config.decoder_start_token_id = tokenizer.cls_token_id

train_dataset = datasets.load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")
Contributor

This is cool.
cc @stas00 if you ever want to add more training data to a unit test.
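
For reference, the datasets split-slicing syntax makes it easy to keep such a test small; an absolute slice works just as well as a percentage (illustrative values):

import datasets

# Both percentage and absolute slices are valid split syntax.
train_dataset = datasets.load_dataset("cnn_dailymail", "3.0.0", split="train[:32]")
val_dataset = datasets.load_dataset("cnn_dailymail", "3.0.0", split="validation[:8]")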


return batch

def _compute_metrics(pred):
Contributor

FYI, by default you will get rouge1, rouge2 and rougeL (if you don't override compute_metrics).
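
A sketch of a custom _compute_metrics using the rouge metric from the datasets library; tokenizer is assumed to be in scope and to have a pad token, and this is not the exact test code:

import datasets

rouge = datasets.load_metric("rouge")

def _compute_metrics(pred):
    labels = pred.label_ids
    labels[labels == -100] = tokenizer.pad_token_id  # undo ignore-index masking, if any
    pred_str = tokenizer.batch_decode(pred.predictions, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(labels, skip_special_tokens=True)
    scores = rouge.compute(predictions=pred_str, references=label_str)
    return {"rouge2_fmeasure": round(scores["rouge2"].mid.fmeasure, 4)}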

@patrickvonplaten patrickvonplaten merged commit 3c682ea into huggingface:master Oct 23, 2020
@patrickvonplaten patrickvonplaten deleted the adapt_seq2seq_trainer branch October 23, 2020 21:06
@patrickvonplaten patrickvonplaten changed the title [Examples] Allow EncoderDecoderModels to be trained with Seq2Seq [Seq2Seq] Allow EncoderDecoderModels to be trained with Seq2Seq Oct 26, 2020