From a59f217919b07e7154856cd17fb681ae2612d32b Mon Sep 17 00:00:00 2001
From: Stas Bekman
Date: Sat, 7 Nov 2020 18:32:28 -0800
Subject: [PATCH] [s2s test_finetune_trainer] failing test

Sam,

```
RUN_SLOW=1 pytest examples/seq2seq/test_finetune_trainer.py::TestFinetuneTrainer::test_finetune_trainer_slow
```

fails for me - not learning anything:

```
>       assert first_step_stats["eval_bleu"] < last_step_stats["eval_bleu"]  # model learned nothing
E       AssertionError: assert 0.0 < 0.0
```

Looking at the logs, it gains some knowledge in the first half of the epochs and then drops back to 0.00 in the last ones.

Changing to lr 3e-3 (this PR) seems to make it more stable, but it could be a card-specific thing - this is with an rtx3090.

Alternatively, should the test compare something more flexible than just the first and last metrics? Otherwise it feels too dependent on the card/config - perhaps a longer-term approach to making it more resilient is to feed it more than 8 records.

@sshleifer
---
 examples/seq2seq/test_finetune_trainer.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/seq2seq/test_finetune_trainer.py b/examples/seq2seq/test_finetune_trainer.py
index 6da0e240c41959..399c1b6c047e8c 100644
--- a/examples/seq2seq/test_finetune_trainer.py
+++ b/examples/seq2seq/test_finetune_trainer.py
@@ -177,7 +177,7 @@ def run_trainer(self, eval_steps: int, max_len: str, model_name: str, num_train_
             --num_train_epochs {str(num_train_epochs)}
             --per_device_train_batch_size 4
             --per_device_eval_batch_size 4
-            --learning_rate 3e-4
+            --learning_rate 3e-3
             --warmup_steps 8
             --evaluate_during_training
             --predict_with_generate
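
As a sketch of the "more flexible" comparison floated in the message above (hypothetical, not part of this patch): instead of asserting `last eval_bleu > first eval_bleu`, the test could require that the best score after the first evaluation beats the first one, which would tolerate the late-training regression observed here. `model_learned_something` is an illustrative helper name, not an existing function in the repo.

```python
# Hypothetical helper (not in this patch): a looser "model learned something"
# check that tolerates a late-training drop back toward 0.0.
def model_learned_something(eval_bleu_history):
    """True if any evaluation after the first beats the first score."""
    first, later = eval_bleu_history[0], eval_bleu_history[1:]
    return bool(later) and max(later) > first

# The observed failure mode: gains mid-training, then a drop at the end.
print(model_learned_something([0.0, 2.1, 3.4, 0.0]))  # True: mid-run gain counts
print(model_learned_something([0.0, 0.0]))            # False: never improved
```

This keeps the test's intent (catch a run that learns nothing) while being less sensitive to which checkpoint happens to be last on a given card/config.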