This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

[train] New training options for logging/validation based on number of steps #3379

Merged
stephenroller merged 12 commits into master from vstep_tstep_lstep_olay on Mar 8, 2021

Conversation


@klshuster klshuster commented Jan 13, 2021

Are you here because you got a warning?

You have been warned for setting one of the following options during a distributed training run (multiprocessing_train or distributed_train):

  • Stopping criteria: --num-epochs/-eps, --max-train-time/-ttim
  • Logging criteria: --log-every-n-secs/-ltim
  • Validation criteria: --validation-every-n-sec/-vtim, --validation-every-n-epochs/-veps

For any of these options to work in distributed training, we must synchronize status across all workers on every single training batch. This is a significant communication cost and it slows down training. You may still use these options, but your training will be slower than necessary.

If you instead limit yourself to the step-based options below, you may see as much as a 5-15% speedup in your training:

  • Stopping criteria: --max-train-steps/-tstep (too dependent on data/model to recommend values)
  • Logging criteria: --log-every-n-steps/-lstep (good values are 10, 50, or 100)
  • Validation criteria: --validation-every-n-steps/-vstep (good values are 100, 500, or 1000)
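To see why the step-based criteria are cheaper, here is a minimal sketch (illustrative Python, not ParlAI code): a wall-clock criterion depends on per-worker state, so workers must combine their local flags every batch (an all-reduce in real distributed training), whereas an optimizer step counter is identical on every worker and can be checked locally with no communication.

```python
# Illustrative sketch, NOT ParlAI code: why step-based criteria need no per-batch sync.

def should_stop_time_based(worker_elapsed_secs, max_train_time):
    # Wall-clock time differs across workers, so each worker's local flag must be
    # combined with everyone else's (an all-reduce) to keep workers in lockstep.
    return any(t >= max_train_time for t in worker_elapsed_secs)

def should_stop_step_based(step, max_train_steps):
    # The optimizer step counter is identical on every worker after each update,
    # so this check is purely local: no communication required.
    return step >= max_train_steps
```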

Patch description

To add to the litany of options for controlling logging, validation, and training, we now have 3 brand new, shiny options:

  • -lstep, --log-every-n-steps - log every n training updates
  • -vstep, --validation-every-n-steps - run validation every n training updates
  • -tstep, --max-train-steps - train for no more than n training updates

As a result, --max-lr-steps is now deprecated (it is still respected, but discouraged). Additionally, the step limit now applies to ALL learning rate steps; prior to this change, --max-lr-steps only applied to steps post-warmup.
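A hypothetical invocation combining the new flags (the task and model values below are placeholders for illustration, not taken from this PR):

```shell
# Illustrative only: --task/--model values are placeholders, not from the PR.
parlai multiprocessing_train \
  --task convai2 \
  --model transformer/generator \
  --log-every-n-steps 50 \
  --validation-every-n-steps 500 \
  --max-train-steps 10000
```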

Speed test

I tested a 1000-step training run of a 287M-parameter transformer/generator with the GPT2 tokenizer, with no validation, comparing syncing turned on vs. off.

[image: benchmark results]

Testing steps

Manual testing, new CI tests, and manual benchmarking.

Correctness

Added new tests to test_train_model.py

```
$ pytest -k TestTrainModel
.
.
.
tests/test_train_model.py ...... [100%]

==== slowest 10 durations ====
5.85s call     tests/test_train_model.py::TestTrainModel::test_opt_step
5.06s call     tests/test_train_model.py::TestTrainModel::test_opt_step_update_freq_2
1.74s call     tests/test_train_model.py::TestTrainModel::test_multitasking_metrics_micro
1.71s call     tests/test_train_model.py::TestTrainModel::test_multitasking_metrics_macro
0.30s call     tests/test_train_model.py::TestTrainModel::test_fast_final_eval
0.01s call     tests/test_train_model.py::TestTrainModel::test_multitasking_id_overlap

(4 durations < 0.005s hidden.  Use -vv to show these durations.)
==== 6 passed, 865 deselected, 2 warnings in 20.49s ====
```

@stephenroller stephenroller left a comment


I still think we need to remove the metadata syncs at the top of the while True loop, if we're only using these new options.

```python
max_lr_steps = opt.get('max_train_steps', -1)
deprecated_max_lr_steps = opt.get('max_lr_steps', -1)
if deprecated_max_lr_steps > 0:
    logging.warn(
```

Let's bump this to an error. We already have enough warnings that people ignore.
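A hedged sketch of what the suggested change might look like (illustrative, not the merged ParlAI code): reject the deprecated flag outright instead of warning.

```python
# Illustrative sketch, not the merged ParlAI code: fail fast on the deprecated flag
# rather than emitting a warning that users tend to ignore.

def resolve_max_train_steps(opt):
    if opt.get('max_lr_steps', -1) > 0:
        raise ValueError(
            '--max-lr-steps is deprecated; please use --max-train-steps instead.'
        )
    return opt.get('max_train_steps', -1)
```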

@klshuster
Contributor Author

i think you're right, as my distributed sweeps are failing atm

@emilydinan emilydinan left a comment


lgtm!

@github-actions

This PR has not had activity in 30 days. Closing due to staleness.

@github-actions github-actions bot added the stale label Feb 26, 2021
@klshuster klshuster added the donotreap Avoid automatically marking as stale. label Feb 26, 2021
@klshuster

i've just started a sweep, hopefully will have results soon

@github-actions github-actions bot removed the stale label Feb 27, 2021
@stephenroller stephenroller left a comment


I reviewed and can't see a reason why it would have the inverse effect we predicted. Can we rerun our benchmark?

@stephenroller

Maybe your benchmark has a max train time? Or maybe our sweeper adds one?

@stephenroller

We've validated the speedups from this change in offline tests.

@stephenroller stephenroller changed the title Vstep, Tstep, Lstep [train] Logging/validation/training limit based on number of SGD steps Mar 8, 2021
@stephenroller stephenroller changed the title [train] Logging/validation/training limit based on number of SGD steps [train] New training options for logging/validation based on optimizer steps Mar 8, 2021
@stephenroller stephenroller changed the title [train] New training options for logging/validation based on optimizer steps [train] New training options for logging/validation based on number of steps Mar 8, 2021
@stephenroller stephenroller merged commit 394e568 into master Mar 8, 2021
@stephenroller stephenroller deleted the vstep_tstep_lstep_olay branch March 8, 2021 23:35
Labels
CLA Signed donotreap Avoid automatically marking as stale.