
examples/seq2seq/test_bash_script.py :: actually learn something #6049

Closed
sshleifer opened this issue Jul 27, 2020 · 23 comments · Fixed by #8318

@sshleifer
Contributor

sshleifer commented Jul 27, 2020

At the moment validation BLEU barely gets above zero in the tests, so they don't really prove much about our code.

We could use a larger model like sshleifer/student_marian_6_3, more data, and train for 10 minutes. This would allow us to test whether changing default parameters/batching techniques obviously degrades performance.

The GitHub Actions CI reuses its own disk, so this will only run there and hopefully not have super slow downloads.

@sshleifer sshleifer self-assigned this Jul 27, 2020
@sshleifer sshleifer added Tests Related to tests Help wanted Extra attention is needed, help appreciated labels Jul 27, 2020
@stale

stale bot commented Sep 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Sep 26, 2020
@sshleifer
Contributor Author

This is still an issue :)

@stale stale bot removed the wontfix label Sep 26, 2020
@sshleifer
Contributor Author

sshleifer commented Nov 4, 2020

This command runs in 3 minutes (not including downloads):

# export WANDB_PROJECT=dmar
export MAX_LEN=64
python finetune.py \
  --learning_rate=3e-4 \
  --do_train \
  --do_predict \
  --fp16 \
  --val_check_interval 0.25 --n_train 100000 --n_val 500 --n_test 500 \
  --data_dir wmt_en_ro \
  --max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
  --freeze_encoder --freeze_embeds \
  --train_batch_size=64 --eval_batch_size=64 \
  --tokenizer_name Helsinki-NLP/opus-mt-en-ro \
  --model_name_or_path sshleifer/mar_enro_6_3_student \
  --warmup_steps 500 --sortish_sampler \
  --gpus 1 --fp16_opt_level=O1 --task translation --num_sanity_val_steps=0 --output_dir dmar_utest_1gpu --num_train_epochs=1 \
  --overwrite_output_dir

Test results

cat dmar_utest_1gpu/test_results.txt
bs: 32.000000
loss: 1.122035
src_pad_frac: 0.322266
src_pad_tok: 330.000000
step_count: 5.000000
test_avg_bleu: 20.660713
test_avg_gen_len: 63.750000
test_avg_gen_time: 0.033025
test_avg_loss: 2.215564
test_bleu: 20.660713
test_loss: 2.215564
tpb: 1475.000000
val_avg_bleu: 20.099000
val_avg_gen_len: 63.125000
val_avg_gen_time: 0.031409
val_avg_loss: 2.265883
val_bleu: 20.099001
val_loss: 2.265883

The validation BLEU also improves over the course of training:

 cat dmar_utest_1gpu/metrics.json | grep bleu
            "val_avg_bleu": 16.3561625,
            "val_avg_bleu": 19.0204625,
            "val_avg_bleu": 19.704875,
            "val_avg_bleu": 20.099,
            "test_avg_bleu": 20.660712500000002

So this would be a good template for the test.

Spec

  • Convert the command into a unit test (with programmatic download of the right amount of data, possibly via a new S3 .tgz file).
  • Replace the existing test_bash_script Marian test.
  • Try to cut n_train further.
  • Anything less than 15 minutes is fine, but the faster the better.
  • Minimum learning requirement 1: BLEU improves over the course of training by more than 2 pts.
  • Minimum learning requirement 2: BLEU finishes above 17.
  • Minimum learning requirement 3: test BLEU and val BLEU within 1 pt.

(This command meets all 3 learning requirements; a rough sketch of how a test could assert them is below.)
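A minimal sketch of asserting the three requirements, assuming the metrics.json layout shown above (one val_avg_bleu entry per validation run plus a final test_avg_bleu); the function name and default output dir are illustrative, not an existing helper:

import re
from pathlib import Path

def check_learning_requirements(output_dir="dmar_utest_1gpu"):
    text = Path(output_dir, "metrics.json").read_text()
    val_bleu = [float(v) for v in re.findall(r'"val_avg_bleu":\s*([\d.]+)', text)]
    test_bleu = [float(v) for v in re.findall(r'"test_avg_bleu":\s*([\d.]+)', text)]
    # requirement 1: BLEU improves by more than 2 pts over the course of training
    assert val_bleu[-1] - val_bleu[0] > 2
    # requirement 2: BLEU finishes above 17
    assert val_bleu[-1] > 17
    # requirement 3: test BLEU and val BLEU agree within 1 pt
    assert abs(test_bleu[-1] - val_bleu[-1]) < 1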

Wdyt @stas00?

@stas00
Contributor

stas00 commented Nov 4, 2020

I will work on that, thank you.

@sshleifer sshleifer assigned stas00 and unassigned sshleifer Nov 4, 2020
@stas00
Contributor

stas00 commented Nov 4, 2020

@sshleifer, could you please validate that this is the command you ran?

I get very different (bad) results:

bs: 9.000000
loss: 6.701375
src_pad_frac: 0.118056
src_pad_tok: 68.000000
step_count: 2.000000
test_avg_bleu: 0.021700
test_avg_gen_len: 512.000000
test_avg_gen_time: 0.439663
test_avg_loss: 5.679669
test_bleu: 0.021700
test_loss: 5.679669
tpb: 1025.000000
val_avg_bleu: 0.082700
val_avg_gen_len: 512.000000
val_avg_gen_time: 0.483860
val_avg_loss: 5.404536
val_bleu: 0.082700
val_loss: 5.404536

In the OP you mentioned "sshleifer/student_marian_6_3", but here you used "sshleifer/mar_enro_6_3_student" - not sure if that's the difference.

@stas00
Contributor

stas00 commented Nov 4, 2020

Also, for the second time you used wmt_en_ro instead of test_data/wmt_en_ro - do you use a different dataset?

@stas00
Contributor

stas00 commented Nov 4, 2020

Your spec on timing could be a small issue: what takes 3 min on your hardware runs in 33 sec on mine (unoptimized RTX 3090), so we might have to re-test on CI. But again, I'm not sure we are testing against the same dataset, since my results are terrible.

@stas00
Contributor

stas00 commented Nov 4, 2020

Retested with sshleifer/student_marian_en_ro_6_3 and 5 epochs - still under 1 BLEU - so this is probably an issue of insufficient data and you must be using a different dataset.

@sshleifer
Contributor Author

I am using the full dataset (as in README.md).

@stas00
Contributor

stas00 commented Nov 4, 2020

Ah, that explains it.

So run the slow test with the full dataset downloaded at runtime, right?

@stas00
Contributor

stas00 commented Nov 4, 2020

OK, I was able to reproduce your results with the full dataset - slightly under 3 min and slightly better BLEU scores.

@stas00
Contributor

stas00 commented Nov 4, 2020

Not sure if there is a point to it, but 7zip shaves off about 35% in download size (but CI might not have it installed).

-rw-rw-r--  1 stas stas  58M Nov  4 11:40 wmt_en_ro.tar.gz
-rw-rw-r--  1 stas stas  37M Nov  4 11:39 wmt_en_ro.7z

@sshleifer
Contributor Author

Another way to save download time would be to only zip up 100k (or fewer) training examples, 500 val examples, and 500 test examples. Those are all we use given the --n_train/--n_val/--n_test flags.
I would also check whether 10k/25k/50k meet the learning requirements.
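A minimal sketch of carving out such a subset, assuming the usual seq2seq data layout of line-aligned train/val/test .source/.target files; the function name, counts and paths are illustrative only:

from itertools import islice
from pathlib import Path

def make_subset(src_dir, dst_dir, n_train=100_000, n_val=500, n_test=500):
    sizes = {"train": n_train, "val": n_val, "test": n_test}
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for split, n in sizes.items():
        for ext in ("source", "target"):
            with open(Path(src_dir) / f"{split}.{ext}") as fin, \
                 open(Path(dst_dir) / f"{split}.{ext}", "w") as fout:
                # keep only the first n line-aligned examples of this split
                fout.writelines(islice(fin, n))

Tar-ing up dst_dir then gives the smaller archive to host on S3.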

@stas00
Contributor

stas00 commented Nov 5, 2020

While trying to match the suggested hparams to the ones in train_mbart_cc25_enro.sh I've been wondering - I think I'm missing the point of this whole thing: if the intention is to test a bash script with specific fixed hparams, but the test replaces half of these presets and adds quite a few new ones, how are we testing this script?

@stas00
Contributor

stas00 commented Nov 5, 2020

Why do we use "--foo=bar" and "--foo bar" seemingly at random - half the args are set the first way, the other half the other way?

@stas00
Contributor

stas00 commented Nov 5, 2020

Question: do you want this as a new test, or should I modify the existing test_train_mbart_cc25_enro_script? I'm trying to do the latter at the moment - perhaps that's why I'm questioning what we are testing here.

@sshleifer
Contributor Author

The high level goal originally was to test that the bash scripts we check in work.
I have a secondary goal of making sure the training code is actually good at training models.
I am fine with any setup that accomplishes both of those goals, with bonus points for enough traceback control that a developer could tell that they have made performance vs. speed worse or some such.

As I said on Slack, we want a test that detects whether we've regressed the training code. For example, if you set dropout=0.95, freeze all parameters, set the LR too low, or mess up the special-tokens logic, the test should fail. Does that make sense? I didn't test all of these for my command line, but it would be nice.
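A hypothetical sketch of what such checks could look like; train_short_run_and_get_bleu and the override dicts are placeholders for whatever harness the test ends up with, not existing helpers in examples/seq2seq:

import pytest

SABOTAGE_OVERRIDES = [
    {"dropout": 0.95},                                # near-total dropout
    {"learning_rate": 1e-9},                          # LR far too low to learn
    {"freeze_encoder": True, "freeze_embeds": True},  # freeze (almost) everything
]

@pytest.mark.parametrize("overrides", SABOTAGE_OVERRIDES)
def test_sabotaged_run_fails_learning_check(overrides):
    # placeholder: run the short finetuning command with the overrides applied
    first_bleu, last_bleu = train_short_run_and_get_bleu(**overrides)
    # a sabotaged run should not meet the minimum learning requirements
    assert not (last_bleu - first_bleu > 2 and last_bleu > 17)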

Relatedly, we just had a 2x slowdown in the finetune_trainer.py code that was not detected by unit-tests.

I know this is something of a scope expansion, so feel free to break it up/ignore parts as you see fit. I trust you to make reasonable decisions.

@stas00
Contributor

stas00 commented Nov 5, 2020

Thank you for this useful brain dump. Let's take it point by point.

  1. the bash scripts

    If a test rewrites the script's guts before doing the testing, should we not just modify those scripts themselves? We want to test that the script works, so we should test it as is, with the only changes allowed being some numerical settings to make the tests faster.

    If we want different presets for different purposes - then have a set of scripts rather than a do-it-all one?

  2. The best regression tests are written when an actual regression is discovered, because then you know exactly which side of things to put under the "magnifying glass". When another regression is discovered down the road, a new test should be made that focuses just on that part. Over time one ends up with great coverage and the test suite becomes strong. Trying to accomplish all of this in one test will, over time, lose the very specific setups that exposed very specific sides of things. It also helps to annotate that a test solves a regression in a given git SHA, so future developers know not to refactor it, slip in extra checks, or slightly change things in the existing test.

    It's very difficult to think of all the possible things that could regress in a void, but this is surely a great start.

Relatedly, we just had a 2x slowdown in the finetune_trainer.py code that was not detected by unit-tests.

That's great! Let's find a git sha before and after and write a test that detects that regression.

I hope this approach makes sense?

@sshleifer
Contributor Author

sshleifer commented Nov 5, 2020

Yeah you are right, let me try to isolate the bad commit https://github.com/huggingface/transformers/commits/master/examples/seq2seq

related issue: #8154

@sshleifer
Contributor Author

I don't think there was an actual regression, I think my command lines are subtly different.
I still think the current test in the linked PR is more aggressive/better and should be added to the test suite in some form, but I am open to other opinions.

@stas00
Contributor

stas00 commented Nov 6, 2020

Edit: reusing the same output_dir during debugging is a terrible idea - it gives totally bogus test results - it basically remembers the very first run and generates test reports based on it on all subsequent runs, ignoring the actual test results. Why is that?

I am growing to dislike --overwrite_output_dir - it's so ambiguous - but I guess it was never meant to be used as a debug flag.

This works for debug:

        if DEBUG:
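            # before=True wipes the fixed dir at the start of each run;
            # after=False keeps it around afterwards for inspection
            # (assuming the usual get_auto_remove_tmp_dir semantics)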
            output_dir = self.get_auto_remove_tmp_dir("./xxx", before=True, after=False)

So after re-evaluating:

Try to cut n_train further

40k works.

25k w/ 2 epochs is almost there, but it's slower than adding a bit more data, so I went with 40k.

Going with a subset "tr40k-va0.5k-te0.5k".

@stas00
Contributor

stas00 commented Nov 6, 2020

Created https://cdn-datasets.huggingface.co/translation/wmt_en_ro-tr40k-va0.5k-te0.5k.tar.gz - hope the name is intuitive/self-documenting. It's just 3.6M (vs 56M for the original).

I made it using this script:
https://github.com/stas00/porting/blob/master/transformers/translation/make-wmt_en_ro-subset.md

@stas00
Contributor

stas00 commented Nov 6, 2020

In all these tests where we measure relatively exact quality metrics - should we use a fixed seed?
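If we do, a minimal sketch of pinning the RNGs, assuming transformers' set_seed (which seeds python, numpy and torch) covers what matters for this test; 42 is an arbitrary choice:

from transformers import set_seed

set_seed(42)  # call before building the model and dataloaders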
