examples/seq2seq/test_bash_script.py :: actually learn something #6049
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is still an issue :)
This command runs in 3 mins (without including downloads):

# export WANDB_PROJECT=dmar
export MAX_LEN=64
python finetune.py \
--learning_rate=3e-4 \
--do_train \
--do_predict \
--fp16 \
--val_check_interval 0.25 --n_train 100000 --n_val 500 --n_test 500 \
--data_dir wmt_en_ro \
--max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
--freeze_encoder --freeze_embeds \
--train_batch_size=64 --eval_batch_size=64 \
--tokenizer_name Helsinki-NLP/opus-mt-en-ro \
--model_name_or_path sshleifer/mar_enro_6_3_student \
--warmup_steps 500 --sortish_sampler \
--gpus 1 --fp16_opt_level=O1 --task translation --num_sanity_val_steps=0 --output_dir dmar_utest_1gpu --num_train_epochs=1 \
--overwrite_output_dir

Test results: cat dmar_utest_1gpu/test_results.txt
The validation BLEU also improves over the course of training:
So this would be a good template for the test spec (this command meets all 3 learning requirements). Wdyt @stas00?
I will work on that, thank you.
@sshleifer, could you please validate that this is the command you run? I get very different (bad) results:
In the OP you mentioned "sshleifer/student_marian_6_3" but here you used "sshleifer/mar_enro_6_3_student" - not sure if that's the difference.
Also for the second time you use
Your spec on timing could be a small issue: what takes 3 min on your hardware finishes in 33 secs for me (unoptimized RTX 3090), so we might have to re-test on CI. But again, I'm not sure we are testing against the same dataset, since my results are terrible.
Retested with
I am using the full dataset (as in README.md).
Ah, that explains it. So run the slow test with the full dataset downloaded at runtime, right?
OK, I was able to reproduce your results with the full dataset: slightly under 3 min and slightly better BLEU scores.
Not sure if there is a point to it, but 7zip shaves off about 35% in download size (but CI might not have it installed).
Another way to save download time would be to only zip up 100k (or fewer) training examples, 500 val examples, and 500 test examples. Those are all we use given the --n_train 100000 --n_val 500 --n_test 500 flags above.
While trying to match the suggested hparams to the ones in
Why do we use "--foo=bar" and "--foo bar" seemingly at random? Half the args are set the first way, the other half the other way.
Question: do you want this as a new test, or should I modify the existing
The high level goal originally was to test that the bash scripts we check in work. As I said on Slack, we also want a test that detects whether we've regressed the training code. For example, if you set dropout=0.95, or freeze all parameters, or set the LR too low, or mess up the special tokens logic, the test should fail. Does that make sense? I didn't test all of these for my command line, but it would be nice. Relatedly, we just had a 2x slowdown in the
I know this is something of a scope expansion, so feel free to break it up/ignore parts as you see fit. I trust you to make reasonable decisions.
Thank you for this useful brain dump. Let's take it point by point.
That's great! Let's find a git sha from before and after, and write a test that detects that regression. I hope this approach makes sense.
Yeah, you are right. Let me try to isolate the bad commit: https://github.com/huggingface/transformers/commits/master/examples/seq2seq (related issue: #8154)
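For reference, isolating such a commit can be automated with git bisect. This is only a sketch: the good sha is a placeholder, and run_short_finetune.sh stands for a hypothetical wrapper around the short command above that exits non-zero when validation BLEU stays near zero.

```bash
# Sketch only: <good-sha> is a placeholder for the last commit known to train well;
# run_short_finetune.sh is a hypothetical wrapper script, not one that exists in the repo.
git bisect start
git bisect bad HEAD
git bisect good <good-sha>
# git bisect treats exit code 0 as "good" and non-zero (except 125) as "bad".
git bisect run bash examples/seq2seq/run_short_finetune.sh
git bisect reset
```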
I don't think there was an actual regression; I think my command lines are subtly different.
edit: reusing the same output_dir during debug is a terrible idea - it gives totally bogus test results - it basically remembers the very first run and generates test reports based on it on all subsequent runs, ignoring the actual test results. Why is that? I am growing to dislike
This works for debug:
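One workaround (a sketch, not necessarily what was used here) is to never reuse the output dir while debugging:

```bash
# Create a throwaway output dir per debug run and pass it via --output_dir
# instead of reusing dmar_utest_1gpu.
OUTPUT_DIR=$(mktemp -d /tmp/dmar_utest.XXXXXX)
echo "writing to $OUTPUT_DIR"

# Alternatively, if a fixed name must be reused, wipe it before each run:
rm -rf dmar_utest_1gpu
```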
So after re-evaluating:
40k works. 25k w/ 2 epochs is almost there, but it's slower than adding a bit more data, so I went with 40k, using a subset "tr40k-va0.5k-te0.5k".
Created https://cdn-datasets.huggingface.co/translation/wmt_en_ro-tr40k-va0.5k-te0.5k.tar.gz - hope the name is intuitive - self-documenting. It's just 3.6M (vs 56M original). I made it using this script:
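The script itself isn't shown above; a sketch of how such a subset could be built, assuming the train/val/test .source/.target layout that the seq2seq examples read, might look like:

```bash
# Sketch: build a tr40k-va0.5k-te0.5k subset from the full wmt_en_ro download.
# Assumes the train.source/train.target/val.*/test.* file layout used by finetune.py.
SRC=wmt_en_ro
DST=wmt_en_ro-tr40k-va0.5k-te0.5k
mkdir -p "$DST"
for ext in source target; do
  head -n 40000 "$SRC/train.$ext" > "$DST/train.$ext"
  head -n 500   "$SRC/val.$ext"   > "$DST/val.$ext"
  head -n 500   "$SRC/test.$ext"  > "$DST/test.$ext"
done
tar -czf "$DST.tar.gz" "$DST"
```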
In all these tests where we measure relatively exact quality metrics - should we use a fixed seed?
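If a fixed seed does help, it would likely be a single extra argument, assuming finetune.py exposes the usual --seed flag from lightning_base (an assumption, not confirmed in this thread); COMMON_ARGS below is a hypothetical shell variable standing in for the hparams of the command above.

```bash
# Hedged sketch: --seed is assumed to be accepted by finetune.py (lightning_base defines one);
# COMMON_ARGS is a hypothetical variable holding the hparams from the command above.
COMMON_ARGS="--data_dir wmt_en_ro --task translation --do_train --num_train_epochs=1"
python finetune.py --seed 42 $COMMON_ARGS --output_dir dmar_seed_check --overwrite_output_dir
```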
At the moment validation bleu barely gets above zero in the tests, so they don't really prove much about our code.
we could use a larger model like sshleifer/student_marian_6_3, and more data, and train for 10 minutes. This would allow us to test whether changing default parameters/batch techniques obviously degrades performance.
The github actions CI reuses its own disk, so this will only run there and hopefully not have super slow downloads.
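To make the "actually learns something" requirement concrete, the check could sit on top of the metrics file the run already writes. This is a hedged sketch: it assumes test_results.txt contains a BLEU line, and the 17.0 threshold is illustrative, not a number from this thread.

```bash
# Hedged sketch: assumes test_results.txt holds key/value lines including a BLEU entry;
# the threshold is illustrative only.
BLEU=$(grep -i bleu dmar_utest_1gpu/test_results.txt | grep -oE '[0-9]+(\.[0-9]+)?' | head -n 1)
awk -v b="$BLEU" 'BEGIN { exit !(b > 17.0) }' \
  && echo "learned something: BLEU=$BLEU" \
  || { echo "regression: BLEU=$BLEU"; exit 1; }
```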