Roll-forward with fixes: Fix interaction between scheduler.step() and gradient accumulation steps, refactor schedulers to use LambdaLR, and add cosine annealing LR scheduler as a decay method.
#3555
Summarizing findings with the LR scheduler, with proposed fixes:
1. Reliance on default argument values with a broken resetting mechanism.
The main issue with the cosine decay LR PR is the incorporation of default values in the construction of the scheduler.
`step_info.num_warmup_steps` is set incorrectly because the `LRScheduler` constructor relies on arbitrary default argument values (`steps_per_checkpoint=1000`, `total_steps=10000`). `step_info.num_warmup_steps` is set to correct values only on the first call to `scheduler.reset()`, which happens at the start of the primary train loop. The consequences depend on how each scheduler consumes the value:

(a) Schedulers that read `step_info.num_warmup_steps` by reference: the subsequent call to `scheduler.reset()` (which updates `step_info.num_warmup_steps`) does use the correct number of warmup steps.

(b) The `LambdaLR` object captures `step_info.num_warmup_steps` by value: the subsequent call to `scheduler.reset()` (which updates `step_info.num_warmup_steps`) doesn't update/reconstruct `LambdaLR`.

This is why we haven't seen any issues w.r.t. regular training even though the default arguments have been there for at least 8 months.
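The by-value vs. by-reference distinction above can be illustrated with a minimal, self-contained sketch. `StepInfo`, `lr_by_value`, and `lr_by_reference` are hypothetical stand-ins for illustration, not the actual project classes:

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Hypothetical stand-in for the trainer's step bookkeeping."""
    num_warmup_steps: int


def lr_by_value(step_info: StepInfo):
    """Bakes the *current* value into the closure at construction time."""
    warmup = step_info.num_warmup_steps  # copied once, here
    return lambda step: min(1.0, step / max(1, warmup))


def lr_by_reference(step_info: StepInfo):
    """Reads the attribute on every call, so later updates are seen."""
    return lambda step: min(1.0, step / max(1, step_info.num_warmup_steps))


# Scheduler built from arbitrary defaults: say warmup comes out to 100 steps.
info = StepInfo(num_warmup_steps=100)
by_value = lr_by_value(info)
by_ref = lr_by_reference(info)

# A later reset() corrects the value, but only the by-reference lambda notices.
info.num_warmup_steps = 1000
print(by_value(100))  # 1.0 -- still thinks warmup is 100 steps
print(by_ref(100))    # 0.1 -- uses the corrected 1000 warmup steps
```

This is the failure mode: the `LambdaLR`-style closure keeps scaling by the stale default-derived warmup count even after `reset()` fixes `step_info`.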
Proposal:

- Use sentinel defaults (`steps_per_checkpoint=0`, `total_steps=0`) at Trainer init time.
- Once the correct `steps_per_checkpoint` and `total_steps` are known at the start of the train loop, construct a new `LRScheduler` object with these values instead of calling `scheduler.reset()`.
- This way, `LambdaLR` will be initialized with the correct `step_info.num_warmup_steps`.
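A sketch of the proposed flow, assuming a hypothetical `LRScheduler` that derives its warmup count from `total_steps` (the real class and its warmup formula may differ):

```python
class LRScheduler:
    """Hypothetical stand-in: derives warmup from the supplied step counts."""

    def __init__(self, steps_per_checkpoint: int, total_steps: int,
                 warmup_fraction: float = 0.1):
        self.steps_per_checkpoint = steps_per_checkpoint
        self.total_steps = total_steps
        # Warmup is computed once, at construction time.
        self.num_warmup_steps = int(warmup_fraction * total_steps)


# Trainer init time: sentinel values, so no warmup count is derived from
# made-up defaults.
scheduler = LRScheduler(steps_per_checkpoint=0, total_steps=0)

# Start of the train loop: the real counts are now known, so rebuild the
# scheduler rather than calling reset() on the stale one.
steps_per_checkpoint, total_steps = 500, 20_000
scheduler = LRScheduler(steps_per_checkpoint=steps_per_checkpoint,
                        total_steps=total_steps)
print(scheduler.num_warmup_steps)  # 2000
```

Because the object is reconstructed, any closures it builds (the `LambdaLR` case) are built from the correct values, with no reliance on a reset mechanism propagating into them.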
2. Effect of `gradient_accumulation_steps` on `scheduler.step()`'s control flow

This has been an issue independent of the cosine decay PR.
Currently, `scheduler.step()` is called only when `step % gradient_accumulation_steps == 0` or `step == is_checkpoint_step`. On every call to `.step()`, the torch scheduler's internal current step is incremented by 1, regardless of how many training steps have elapsed since we last called `.step()` on them. As a result, configuring an `LRScheduler` correctly requires pre-calculating when `scheduler.step()` will be called, so that we have an accurate mapping between the scheduler's current step and the actual number of training steps that have occurred.

The formula for this pre-calculation is possible, but complex, as it involves consolidating both `gradient_accumulation_steps` and `steps_per_checkpoint`
, and passing this map to each scheduler.

Proposal:

- Decouple `scheduler.step()` from the gradient update cadence by simply calling `scheduler.step()` every training step.
- This way, the number of times `scheduler.step()` has been called is synchronized with the number of training steps.
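The desynchronization, and how per-step calling fixes it, can be shown with a minimal counting sketch (`CountingScheduler` is a hypothetical stand-in that only tracks how often `.step()` is invoked, mirroring the torch scheduler's internal step counter):

```python
class CountingScheduler:
    """Hypothetical stand-in tracking how many times .step() was called."""

    def __init__(self):
        self.current_step = 0

    def step(self):
        # Torch schedulers likewise increment an internal counter by 1
        # on every .step() call, regardless of elapsed training steps.
        self.current_step += 1


gradient_accumulation_steps = 4
total_training_steps = 10

# Old control flow: step the scheduler only on optimizer-update steps.
old = CountingScheduler()
for step in range(1, total_training_steps + 1):
    if step % gradient_accumulation_steps == 0:
        old.step()

# Proposed control flow: step the scheduler on every training step.
new = CountingScheduler()
for step in range(1, total_training_steps + 1):
    new.step()

print(old.current_step)  # 2  -- out of sync with the 10 training steps
print(new.current_step)  # 10 -- synchronized with training steps
```

With the per-step scheme, schedules defined in terms of training steps (warmup, cosine decay) read the scheduler's counter directly, and no mapping formula is needed.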