[train] New training options for logging/validation based on number of steps #3379
Conversation
I still think we need to remove the metadata syncs at the top of the while True loop, if we're only using these new options.
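A rough sketch of the gating being suggested, assuming a hypothetical helper outside ParlAI's actual TrainLoop (only the option names are real):

```python
def needs_per_batch_sync(opt: dict) -> bool:
    """Return True if any time/epoch-based limit is set.

    Those limits require all workers to agree on elapsed time/examples, so
    the train loop must sync metadata on every batch; step-based limits only
    depend on each worker's own update counter and need no sync.
    """
    time_or_epoch_keys = (
        'num_epochs',
        'max_train_time',
        'log_every_n_secs',
        'validation_every_n_secs',
        'validation_every_n_epochs',
    )
    return any(opt.get(key, -1) > 0 for key in time_or_epoch_keys)
```

If this returns False, the metadata sync at the top of the `while True` loop could be skipped entirely.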
parlai/nn/lr_scheduler.py (outdated):

```python
max_lr_steps = opt.get('max_train_steps', -1)
deprecated_max_lr_steps = opt.get('max_lr_steps', -1)
if deprecated_max_lr_steps > 0:
    logging.warn(
```
Let's bump this to an error. We already have enough warns that people ignore.
i think you're right, as my distributed sweeps are failing atm
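A minimal sketch of what promoting the deprecation warning to a hard error might look like, mirroring the excerpt above (the error type and message text are illustrative, not the exact patch):

```python
max_lr_steps = opt.get('max_train_steps', -1)
deprecated_max_lr_steps = opt.get('max_lr_steps', -1)
if deprecated_max_lr_steps > 0:
    # Hard error instead of a warning, so misconfigured runs fail fast.
    raise ValueError(
        '--max-lr-steps is deprecated; please use --max-train-steps instead.'
    )
```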
lgtm!
This PR has not had activity in 30 days. Closing due to staleness.
i've just started a sweep, hopefully will have results soon
I reviewed and can't see a reason why it would have the inverse effect we predicted. Can we rerun our benchmark?
Maybe your benchmark has a max train time? Or maybe our sweeper adds one?
We've validated the speedups from this change in offline tests.
Are you here because you got a warning?
You have been warned for setting one of the following options during a distributed training run (multiprocessing_train or distributed_train):

- `--num-epochs` / `-eps`
- `--max-train-time` / `-ttim`
- `--log-every-n-secs` / `-ltim`
- `--validation-every-n-secs` / `-vtim`
- `--validation-every-n-epochs` / `-veps`
In order for any of these options to work in distributed training, we must perform a status synchronization between all workers on every single training batch. This is a significant communication cost and slows down training. You may still use these options, but your training will be slower than it needs to be.

If you instead limit yourself to the step-based options below, you may see as much as a 5-15% speedup in your training (a rough sketch of the per-batch sync they avoid follows the list):
- `--max-train-steps` / `-tstep` (too dependent on data/model to recommend values)
- `--log-every-n-steps` / `-lstep` (good values are 10, 50, or 100)
- `--validation-every-n-steps` / `-vstep` (good values are 100, 500, or 1000)
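To make the cost concrete, here is an illustrative sketch (not ParlAI's actual implementation) of the difference between the two kinds of checks, assuming an initialized `torch.distributed` process group; every name below is hypothetical except the option names:

```python
import torch
import torch.distributed as dist


def sync_should_stop(local_exs: int, local_elapsed: float, opt: dict) -> bool:
    """Time/epoch-based limits: all workers must agree on totals every batch."""
    stats = torch.tensor([float(local_exs), local_elapsed])
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)  # communication on every single batch
    total_exs, total_time = stats.tolist()
    total_time /= dist.get_world_size()  # average elapsed wall-clock time

    if 0 < opt.get('max_train_time', -1) <= total_time:
        return True
    # ... similar checks would cover --num-epochs, -vtim, -ltim, and -veps,
    # using total_exs to track epoch progress.
    return False


def should_stop_by_steps(num_updates: int, opt: dict) -> bool:
    """Step-based limits: each worker checks its own update counter.

    No cross-worker communication is needed, because every worker performs
    the same number of updates.
    """
    max_steps = opt.get('max_train_steps', -1)
    return 0 < max_steps <= num_updates
```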
Patch description

To add to the litany of options for controlling logging, validation, and training, we now have 3 brand new, shiny options:

- `-lstep, --log-every-n-steps`: log every n training updates
- `-vstep, --validation-every-n-steps`: run validation every n training updates
- `-tstep, --max-train-steps`: train for no more than n training updates
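For example, a step-based training run could be launched through ParlAI's Python script API roughly as follows; the task, model, paths, and values here are illustrative placeholders rather than a recommended configuration:

```python
from parlai.scripts.train_model import TrainModel

TrainModel.main(
    task='convai2',                   # illustrative task
    model='transformer/generator',    # illustrative model
    model_file='/tmp/step_based_run/model',
    log_every_n_steps=50,             # -lstep
    validation_every_n_steps=500,     # -vstep
    max_train_steps=20000,            # -tstep
)
```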
As a result, `--max-lr-steps` is now deprecated (it's still respected, but discouraged). Additionally, `--max-lr-steps` now applies to **all** learning rate steps; prior to this change, the option only applied to steps post-warmup.

Speed test
We tested a 1000-step training run of a 287M-parameter transformer/generator with the GPT2 tokenizer, with no validation, comparing runs with syncing turned on and off.
Testing steps

Manual testing and new CI; manual benchmarking.

Correctness

Added new tests to `test_train_model.py`.
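As an illustration of the kind of coverage added, here is a hypothetical sketch of such a test; the task, model, and option values are placeholders, and the PR's real tests presumably assert on the exact number of updates performed:

```python
import unittest

import parlai.utils.testing as testing_utils


class TestStepBasedOptions(unittest.TestCase):
    def test_max_train_steps(self):
        # Train a tiny model for a fixed number of updates rather than epochs/time.
        valid, test = testing_utils.train_model(
            dict(
                task='integration_tests',     # assumed lightweight test task
                model='test_agents/unigram',  # assumed lightweight test agent
                max_train_steps=10,           # -tstep
                validation_every_n_steps=5,   # -vstep
                log_every_n_steps=2,          # -lstep
            )
        )
        # The run should finish and produce validation/test reports.
        self.assertIsInstance(valid, dict)
        self.assertIsInstance(test, dict)


if __name__ == '__main__':
    unittest.main()
```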