
examples/seq2seq/test_bash_script.py :: actually learn something #6049

Closed
sshleifer opened this issue Jul 27, 2020 · 23 comments · Fixed by #8318

@sshleifer
Contributor

sshleifer commented Jul 27, 2020

At the moment validation BLEU barely gets above zero in the tests, so they don't really prove much about our code.

We could use a larger model like sshleifer/student_marian_6_3, more data, and train for 10 minutes. This would allow us to test whether changing default parameters/batching techniques obviously degrades performance.

The GitHub Actions CI reuses its own disk, so this will only run there and hopefully not have super slow downloads.

@sshleifer sshleifer self-assigned this Jul 27, 2020
@sshleifer sshleifer added Tests Related to tests Help wanted Extra attention is needed, help appreciated labels Jul 27, 2020
@stale

stale bot commented Sep 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Sep 26, 2020
@sshleifer
Contributor Author

This is still an issue :)

@stale stale bot removed the wontfix label Sep 26, 2020
@sshleifer
Contributor Author

sshleifer commented Nov 4, 2020

This command runs in 3 minutes (not including downloads):

# export WANDB_PROJECT=dmar
export MAX_LEN=64
python finetune.py \
  --learning_rate=3e-4 \
  --do_train \
  --do_predict \
  --fp16 \
  --val_check_interval 0.25 --n_train 100000 --n_val 500 --n_test 500 \
  --data_dir wmt_en_ro \
  --max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
  --freeze_encoder --freeze_embeds \
  --train_batch_size=64 --eval_batch_size=64 \
  --tokenizer_name Helsinki-NLP/opus-mt-en-ro \
  --model_name_or_path sshleifer/mar_enro_6_3_student \
  --warmup_steps 500 --sortish_sampler \
  --gpus 1 --fp16_opt_level=O1 --task translation --num_sanity_val_steps=0 --output_dir dmar_utest_1gpu --num_train_epochs=1 \
  --overwrite_output_dir

Test results

cat dmar_utest_1gpu/test_results.txt
bs: 32.000000
loss: 1.122035
src_pad_frac: 0.322266
src_pad_tok: 330.000000
step_count: 5.000000
test_avg_bleu: 20.660713
test_avg_gen_len: 63.750000
test_avg_gen_time: 0.033025
test_avg_loss: 2.215564
test_bleu: 20.660713
test_loss: 2.215564
tpb: 1475.000000
val_avg_bleu: 20.099000
val_avg_gen_len: 63.125000
val_avg_gen_time: 0.031409
val_avg_loss: 2.265883
val_bleu: 20.099001
val_loss: 2.265883

The validation BLEU also improves over the course of training:

 cat dmar_utest_1gpu/metrics.json | grep bleu
            "val_avg_bleu": 16.3561625,
            "val_avg_bleu": 19.0204625,
            "val_avg_bleu": 19.704875,
            "val_avg_bleu": 20.099,
            "test_avg_bleu": 20.660712500000002

So this would be a good template for the test.

Spec

  • Convert the command into a unit test (with programmatic download of the right amount of data, possibly via a new S3 .tgz file).
  • Replace the existing test_bash_script Marian test.
  • Try to cut n_train further.
  • Anything less than 15 minutes is fine, but the faster the better.
  • Minimum learning requirement 1: BLEU improves over the course of training by more than 2 pts.
  • Minimum learning requirement 2: BLEU finishes above 17.
  • Minimum learning requirement 3: test BLEU and val BLEU within 1 pt.

(This command meets all 3 learning requirements; a rough sketch of how a test could assert them is below.)
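A minimal sketch of asserting the three requirements, assuming the metrics.json layout shown above (one val_avg_bleu entry per validation run plus a final test_avg_bleu); the function name and default output dir are illustrative, not an existing helper:

import re
from pathlib import Path

def check_learning_requirements(output_dir="dmar_utest_1gpu"):
    text = Path(output_dir, "metrics.json").read_text()
    val_bleu = [float(v) for v in re.findall(r'"val_avg_bleu":\s*([\d.]+)', text)]
    test_bleu = [float(v) for v in re.findall(r'"test_avg_bleu":\s*([\d.]+)', text)]
    # requirement 1: BLEU improves by more than 2 pts over the course of training
    assert val_bleu[-1] - val_bleu[0] > 2
    # requirement 2: BLEU finishes above 17
    assert val_bleu[-1] > 17
    # requirement 3: test BLEU and val BLEU agree within 1 pt
    assert abs(test_bleu[-1] - val_bleu[-1]) < 1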

Wdyt @stas00?

@stas00
Contributor

stas00 commented Nov 4, 2020

I will work on that, thank you.

@sshleifer sshleifer assigned stas00 and unassigned sshleifer Nov 4, 2020
@stas00
Contributor

stas00 commented Nov 4, 2020

@sshleifer, could you please validate that this is the command you ran?

I get very different (bad) results:

bs: 9.000000
loss: 6.701375
src_pad_frac: 0.118056
src_pad_tok: 68.000000
step_count: 2.000000
test_avg_bleu: 0.021700
test_avg_gen_len: 512.000000
test_avg_gen_time: 0.439663
test_avg_loss: 5.679669
test_bleu: 0.021700
test_loss: 5.679669
tpb: 1025.000000
val_avg_bleu: 0.082700
val_avg_gen_len: 512.000000
val_avg_gen_time: 0.483860
val_avg_loss: 5.404536
val_bleu: 0.082700
val_loss: 5.404536

In the OP you mentioned "sshleifer/student_marian_6_3", but here you used "sshleifer/mar_enro_6_3_student" - not sure if that's the difference.

@stas00
Contributor

stas00 commented Nov 4, 2020

Also, for the second time you used wmt_en_ro instead of test_data/wmt_en_ro - do you use a different dataset?

@stas00
Contributor

stas00 commented Nov 4, 2020

Your spec on timing could be a small issue: what takes 3 min on your hardware runs in 33 sec on mine (unoptimized RTX 3090), so we might have to re-test on CI. But again, I'm not sure we are testing against the same dataset, since my results are terrible.

@stas00
Contributor

stas00 commented Nov 4, 2020

Retested with sshleifer/student_marian_en_ro_6_3 and 5 epochs - still under 1 BLEU - so this is probably an issue of insufficient data and you must be using a different dataset.

@sshleifer
Contributor Author

I am using the full dataset (as in README.md).

@stas00
Contributor

stas00 commented Nov 4, 2020

Ah, that explains it.

So run the slow test with the full dataset downloaded at runtime, right?

@stas00
Contributor

stas00 commented Nov 4, 2020

OK, I was able to reproduce your results with the full dataset - slightly under 3 min and slightly better BLEU scores.

@stas00
Contributor

stas00 commented Nov 4, 2020

Not sure if there is a point to it, but 7zip shaves off about 35% in download size (but CI might not have it installed).

-rw-rw-r--  1 stas stas  58M Nov  4 11:40 wmt_en_ro.tar.gz
-rw-rw-r--  1 stas stas  37M Nov  4 11:39 wmt_en_ro.7z

@sshleifer
Contributor Author

Another way to save download time would be to only zip up 100k (or fewer) training examples, 500 val examples, and 500 test examples. Those are all we use given the --n_train/--n_val/--n_test flags.
I would also check whether 10k/25k/50k meet the learning requirements.
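A minimal sketch of carving out such a subset, assuming the usual seq2seq data layout of line-aligned train/val/test .source/.target files; the function name, counts and paths are illustrative only:

from itertools import islice
from pathlib import Path

def make_subset(src_dir, dst_dir, n_train=100_000, n_val=500, n_test=500):
    sizes = {"train": n_train, "val": n_val, "test": n_test}
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for split, n in sizes.items():
        for ext in ("source", "target"):
            with open(Path(src_dir) / f"{split}.{ext}") as fin, \
                 open(Path(dst_dir) / f"{split}.{ext}", "w") as fout:
                # keep only the first n line-aligned examples of this split
                fout.writelines(islice(fin, n))

Tar-ing up dst_dir then gives the smaller archive to host on S3.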

@stas00
Contributor

stas00 commented Nov 5, 2020

While trying to match the suggested hparams to the ones in train_mbart_cc25_enro.sh I've been wondering - I think I'm missing the point of this whole thing: if the intention is to test a bash script with specific fixed hparams, but the test replaces half of these presets and adds quite a few new ones, how are we testing this script?

@stas00
Contributor

stas00 commented Nov 5, 2020

Why do we use "--foo=bar" and "--foo bar" seemingly at random - half the args are set the first way, the other half the other way?

@stas00
Contributor

stas00 commented Nov 5, 2020

Question: do you want this as a new test, or should I modify the existing test_train_mbart_cc25_enro_script? I'm trying to do the latter at the moment - perhaps that's why I'm questioning what we are testing here.

@sshleifer
Contributor Author

The high level goal originally was to test that the bash scripts we check in work.
I have a secondary goal of making sure the training code is actually good at training models.
I am fine with any setup that accomplishes both of those goals, with bonus points for enough traceback control that a developer could tell that they have made performance vs. speed worse or some such.

As I said on Slack, we want a test that detects whether we've regressed the training code. For example, if you set dropout=0.95, freeze all parameters, set the LR too low, or mess up the special-tokens logic, the test should fail. Does that make sense? I didn't test all of these for my command line, but it would be nice.
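A hypothetical sketch of what such checks could look like; train_short_run_and_get_bleu and the override dicts are placeholders for whatever harness the test ends up with, not existing helpers in examples/seq2seq:

import pytest

SABOTAGE_OVERRIDES = [
    {"dropout": 0.95},                                # near-total dropout
    {"learning_rate": 1e-9},                          # LR far too low to learn
    {"freeze_encoder": True, "freeze_embeds": True},  # freeze (almost) everything
]

@pytest.mark.parametrize("overrides", SABOTAGE_OVERRIDES)
def test_sabotaged_run_fails_learning_check(overrides):
    # placeholder: run the short finetuning command with the overrides applied
    first_bleu, last_bleu = train_short_run_and_get_bleu(**overrides)
    # a sabotaged run should not meet the minimum learning requirements
    assert not (last_bleu - first_bleu > 2 and last_bleu > 17)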

Relatedly, we just had a 2x slowdown in the finetune_trainer.py code that was not detected by unit-tests.

I know this is something of a scope expansion, so feel free to break it up/ignore parts as you see fit. I trust you to make reasonable decisions.

@stas00
Contributor

stas00 commented Nov 5, 2020

Thank you for this useful brain dump. Let's take it point by point.

  1. the bash scripts

    If a test rewrites the script's guts before doing the testing, should we not just modify those scripts themselves? We want to test that the script works, so we should test it as is, with the only changes allowed being some numerical settings to make the tests faster.

    If we want different presets for different purposes - then have a set of scripts rather than a do-it-all one?

  2. The best regression tests are written when an actual regression is discovered, because then you know exactly which side of things to put under the "magnifying glass". When another regression is discovered down the road, a new test should be made that focuses just on that part. Over time one ends up with great coverage and the test suite becomes strong. Trying to accomplish all of this in one test will, over time, lose the very specific setups that exposed very specific sides of things. It also helps to annotate that a test solves a regression in a given git SHA, so future developers know not to refactor it, slip in extra checks, or slightly change things in the existing test.

    It's very difficult to think of all the possible things that could regress in a void, but this is surely a great start.

Relatedly, we just had a 2x slowdown in the finetune_trainer.py code that was not detected by unit-tests.

That's great! Let's find a git sha before and after and write a test that detects that regression.

I hope this approach makes sense?

@sshleifer
Contributor Author

sshleifer commented Nov 5, 2020

Yeah you are right, let me try to isolate the bad commit https://github.com/huggingface/transformers/commits/master/examples/seq2seq

related issue: #8154

@sshleifer
Contributor Author

I don't think there was an actual regression, I think my command lines are subtly different.
I still think the current test in the linked PR is more aggressive/better and should be added to the test suite in some form, but I am open to other opinions.

@stas00
Contributor

stas00 commented Nov 6, 2020

Edit: reusing the same output_dir during debugging is a terrible idea - it gives totally bogus test results - it basically remembers the very first run and generates test reports based on it on all subsequent runs, ignoring the actual test results. Why is that?

I am growing to dislike --overwrite_output_dir - it's so ambiguous - but I guess it was never meant to be used as a debug flag.

This works for debug:

        if DEBUG:
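            # before=True wipes the fixed dir at the start of each run;
            # after=False keeps it around afterwards for inspection
            # (assuming the usual get_auto_remove_tmp_dir semantics)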
            output_dir = self.get_auto_remove_tmp_dir("./xxx", before=True, after=False)

So after re-evaluating:

Try to cut n_train further

40k works.

25k w/ 2 epochs is almost there, but it's slower than adding a bit more data, so I went with 40k.

Going with a subset "tr40k-va0.5k-te0.5k".

@stas00
Contributor

stas00 commented Nov 6, 2020

Created https://cdn-datasets.huggingface.co/translation/wmt_en_ro-tr40k-va0.5k-te0.5k.tar.gz - hope the name is intuitive/self-documenting. It's just 3.6M (vs 56M for the original).

I made it using this script:
https://github.com/stas00/porting/blob/master/transformers/translation/make-wmt_en_ro-subset.md

@stas00
Contributor

stas00 commented Nov 6, 2020

In all these tests where we measure relatively exact quality metrics - should we use a fixed seed?
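If we do, a minimal sketch of pinning the RNGs, assuming transformers' set_seed (which seeds python, numpy and torch) covers what matters for this test; 42 is an arbitrary choice:

from transformers import set_seed

set_seed(42)  # call before building the model and dataloaders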
