This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

[train] New training options for logging/validation based on number of steps #3379

Merged
stephenroller merged 12 commits into master from vstep_tstep_lstep_olay on Mar 8, 2021

Conversation


@klshuster klshuster commented Jan 13, 2021

Are you here because you got a warning?

You have been warned for setting one of the following options during a distributed training run (multiprocessing_train or distributed_train):

  • Stopping criteria: --num-epochs/-eps, --max-train-time/-ttim
  • Logging criteria: --log-every-n-secs/-ltim
  • Validation criteria: --validation-every-n-sec/-vtim, --validation-every-n-epochs/-veps

For any of these options to work in distributed training, we must synchronize status across all workers on every single training batch. This is a significant communication cost and it slows down training. You may still use these options, but your training will be slower than necessary.

If you instead limit yourself to the step-based options below, you may see as much as a 5-15% speedup in your training:

  • Stopping criteria: --max-train-steps/-tstep (too dependent on data/model to recommend values)
  • Logging criteria: --log-every-n-steps/-lstep (good values are 10, 50, or 100)
  • Validation criteria: --validation-every-n-steps/-vstep (good values are 100, 500, or 1000)
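To see why the step-based criteria are cheaper, here is a minimal sketch (illustrative Python, not ParlAI code): a wall-clock criterion depends on per-worker state, so workers must combine their local flags every batch (an all-reduce in real distributed training), whereas an optimizer step counter is identical on every worker and can be checked locally with no communication.

```python
# Illustrative sketch, NOT ParlAI code: why step-based criteria need no per-batch sync.

def should_stop_time_based(worker_elapsed_secs, max_train_time):
    # Wall-clock time differs across workers, so each worker's local flag must be
    # combined with everyone else's (an all-reduce) to keep workers in lockstep.
    return any(t >= max_train_time for t in worker_elapsed_secs)

def should_stop_step_based(step, max_train_steps):
    # The optimizer step counter is identical on every worker after each update,
    # so this check is purely local: no communication required.
    return step >= max_train_steps
```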

Patch description

To add to the litany of options for controlling logging, validation, and training, we now have 3 brand new, shiny options:

  • -lstep, --log-every-n-steps - log every n training updates
  • -vstep, --validation-every-n-steps - run validation every n training updates
  • -tstep, --max-train-steps - train for no more than n training updates

As a result, --max-lr-steps is now deprecated (it is still respected, but discouraged). Additionally, the step limit now applies to ALL learning rate steps; prior to this change, --max-lr-steps only applied to steps post-warmup.
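A hypothetical invocation combining the new flags (the task and model values below are placeholders for illustration, not taken from this PR):

```shell
# Illustrative only: --task/--model values are placeholders, not from the PR.
parlai multiprocessing_train \
  --task convai2 \
  --model transformer/generator \
  --log-every-n-steps 50 \
  --validation-every-n-steps 500 \
  --max-train-steps 10000
```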

Speed test

I tested a 1000-step training run of a 287M-parameter transformer/generator with the GPT2 tokenizer, with no validation, comparing syncing turned on vs. off.

[image: benchmark results]

Testing steps

Manual testing, new CI tests, and manual benchmarking.

Correctness

Added new tests to test_train_model.py

```
$ pytest -k TestTrainModel
.
.
.
tests/test_train_model.py ...... [100%]

==== slowest 10 durations ====
5.85s call     tests/test_train_model.py::TestTrainModel::test_opt_step
5.06s call     tests/test_train_model.py::TestTrainModel::test_opt_step_update_freq_2
1.74s call     tests/test_train_model.py::TestTrainModel::test_multitasking_metrics_micro
1.71s call     tests/test_train_model.py::TestTrainModel::test_multitasking_metrics_macro
0.30s call     tests/test_train_model.py::TestTrainModel::test_fast_final_eval
0.01s call     tests/test_train_model.py::TestTrainModel::test_multitasking_id_overlap

(4 durations < 0.005s hidden.  Use -vv to show these durations.)
==== 6 passed, 865 deselected, 2 warnings in 20.49s ====
```

@stephenroller stephenroller left a comment


I still think we need to remove the metadata syncs at the top of the while True loop, if we're only using these new options.

```python
max_lr_steps = opt.get('max_train_steps', -1)
deprecated_max_lr_steps = opt.get('max_lr_steps', -1)
if deprecated_max_lr_steps > 0:
    logging.warn(
```

Let's bump this to an error. We already have enough warnings that people ignore.
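A hedged sketch of what the suggested change might look like (illustrative, not the merged ParlAI code): reject the deprecated flag outright instead of warning.

```python
# Illustrative sketch, not the merged ParlAI code: fail fast on the deprecated flag
# rather than emitting a warning that users tend to ignore.

def resolve_max_train_steps(opt):
    if opt.get('max_lr_steps', -1) > 0:
        raise ValueError(
            '--max-lr-steps is deprecated; please use --max-train-steps instead.'
        )
    return opt.get('max_train_steps', -1)
```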

@klshuster
Contributor Author

i think you're right, as my distributed sweeps are failing atm

@emilydinan emilydinan left a comment


lgtm!

@github-actions

This PR has not had activity in 30 days. Closing due to staleness.

@github-actions github-actions bot added the stale label Feb 26, 2021
@klshuster klshuster added the donotreap Avoid automatically marking as stale. label Feb 26, 2021
@klshuster

i've just started a sweep, hopefully will have results soon

@github-actions github-actions bot removed the stale label Feb 27, 2021
@stephenroller stephenroller left a comment


I reviewed and can't see a reason why it would have the inverse effect we predicted. Can we rerun our benchmark?

@stephenroller

Maybe your benchmark has a max train time? Or maybe our sweeper adds one?

@stephenroller

We've validated the speedups from this change in offline tests.

@stephenroller stephenroller changed the title Vstep, Tstep, Lstep [train] Logging/validation/training limit based on number of SGD steps Mar 8, 2021
@stephenroller stephenroller changed the title [train] Logging/validation/training limit based on number of SGD steps [train] New training options for logging/validation based on optimizer steps Mar 8, 2021
@stephenroller stephenroller changed the title [train] New training options for logging/validation based on optimizer steps [train] New training options for logging/validation based on number of steps Mar 8, 2021
@stephenroller stephenroller merged commit 394e568 into master Mar 8, 2021
@stephenroller stephenroller deleted the vstep_tstep_lstep_olay branch March 8, 2021 23:35
Labels
CLA Signed donotreap Avoid automatically marking as stale.