From a59f217919b07e7154856cd17fb681ae2612d32b Mon Sep 17 00:00:00 2001
From: Stas Bekman
Date: Sat, 7 Nov 2020 18:32:28 -0800
Subject: [PATCH] [s2s test_finetune_trainer] failing test

Sam,

```
RUN_SLOW=1 pytest examples/seq2seq/test_finetune_trainer.py::TestFinetuneTrainer::test_finetune_trainer_slow
```

fails for me - not learning anything:

```
>       assert first_step_stats["eval_bleu"] < last_step_stats["eval_bleu"]  # model learned nothing
E       AssertionError: assert 0.0 < 0.0
```

Looking at the logs, it gains some knowledge in the first half of the epochs and then drops back to 0.00 in the last ones.

Changing to lr 3e-3 (this PR) seems to make it more stable, but it could be a card-specific thing - this is with an rtx3090.

Alternatively, should the test compare something more flexible than just the first and last metrics? Otherwise it feels too dependent on the card/config - perhaps a longer-term approach to making it more resilient is to feed it more than 8 records.

@sshleifer
---
 examples/seq2seq/test_finetune_trainer.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/seq2seq/test_finetune_trainer.py b/examples/seq2seq/test_finetune_trainer.py
index 6da0e240c41959..399c1b6c047e8c 100644
--- a/examples/seq2seq/test_finetune_trainer.py
+++ b/examples/seq2seq/test_finetune_trainer.py
@@ -177,7 +177,7 @@ def run_trainer(self, eval_steps: int, max_len: str, model_name: str, num_train_
             --num_train_epochs {str(num_train_epochs)}
             --per_device_train_batch_size 4
             --per_device_eval_batch_size 4
-            --learning_rate 3e-4
+            --learning_rate 3e-3
             --warmup_steps 8
             --evaluate_during_training
             --predict_with_generate
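
As a sketch of the "more flexible" comparison floated in the message above (hypothetical, not part of this patch): instead of asserting `last eval_bleu > first eval_bleu`, the test could require that the best score after the first evaluation beats the first one, which would tolerate the late-training regression observed here. `model_learned_something` is an illustrative helper name, not an existing function in the repo.

```python
# Hypothetical helper (not in this patch): a looser "model learned something"
# check that tolerates a late-training drop back toward 0.0.
def model_learned_something(eval_bleu_history):
    """True if any evaluation after the first beats the first score."""
    first, later = eval_bleu_history[0], eval_bleu_history[1:]
    return bool(later) and max(later) > first

# The observed failure mode: gains mid-training, then a drop at the end.
print(model_learned_something([0.0, 2.1, 3.4, 0.0]))  # True: mid-run gain counts
print(model_learned_something([0.0, 0.0]))            # False: never improved
```

This keeps the test's intent (catch a run that learns nothing) while being less sensitive to which checkpoint happens to be last on a given card/config.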