Add cosine_with_min_lr_schedule_with_warmup_lr_rate scheduler in Trainer #31870
What does this PR do?
Adds the `cosine_with_min_lr_schedule_with_warmup_lr_rate` scheduler to `Trainer`, building on #29341.
As mentioned in the previous PR, the existing scheduler implements "a warmup period during which it increases linearly between 0 and the initial learning rate set in the optimizer." While recently investigating the DeepSpeed framework, I noticed that its scheduler supports an additional feature: a `warmup_min_ratio` parameter (see https://github.com/richardodliu/DeepSpeed/blob/master/deepspeed/runtime/lr_schedules.py#L774), which allows the warmup to start from a learning rate ratio other than 0. Since DeepSpeed is a crucial companion framework for Transformers, this PR aims to keep the learning rate scheduler implementations of the two consistent, to prevent potential confusion for users who employ both frameworks simultaneously.
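For intuition, here is roughly what the warmup ratio changes (a minimal illustrative sketch, not DeepSpeed's actual code; only the `warmup_min_ratio` and `warmup_num_steps` names are taken from the linked file):

```python
def warmup_lr(step, warmup_num_steps, base_lr, warmup_min_ratio=0.0):
    # Linear warmup that starts from warmup_min_ratio * base_lr instead of 0.
    # With warmup_min_ratio == 0 this reduces to the usual 0 -> base_lr ramp.
    start_lr = warmup_min_ratio * base_lr
    progress = min(step / max(1, warmup_num_steps), 1.0)
    return start_lr + progress * (base_lr - start_lr)
```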
Our implementation builds on the previous PR and offers several benefits:

- The method can be reused without modifying any input parameters; in that case, the behavior is equivalent to setting `warmup_lr_rate` to `1/warmup_steps`. If you prefer an explicit value, simply pass `warmup_lr_rate` as an argument; it specifies the ratio between the warmup start learning rate and the initial learning rate.
- Since it is recommended to call `optimizer.step()` before `lr_scheduler.step()`, the learning rate for the batch of the first step was zero under the previous implementation, so the parameters were not updated for that batch and it was effectively wasted. Our method fixes this. The effect is negligible for large datasets, but becomes significant for small datasets where the total number of steps is limited.
- Our implementation also ensures that the final small learning rate is actually reached during training, rather than only being set after training completes.

Overall, our approach is better suited as an improvement to the existing method than as a complete overhaul, so we implemented it as a new function, leaving users free to choose whichever implementation they prefer; see the sketch below.
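Below is a minimal sketch of the resulting schedule, expressed as a PyTorch `LambdaLR` multiplier. Names such as `warmup_lr_rate`, `min_lr_rate`, and the helper `get_schedule` are illustrative, following this PR's description rather than the merged code:

```python
import math
from functools import partial

from torch.optim.lr_scheduler import LambdaLR


def _lr_lambda(current_step, num_warmup_steps, num_training_steps,
               min_lr_rate, warmup_lr_rate):
    if current_step < num_warmup_steps:
        # Ramp linearly from warmup_lr_rate to 1.0; at step 0 the factor is
        # warmup_lr_rate, so the very first optimizer step sees a non-zero LR.
        return warmup_lr_rate + (1.0 - warmup_lr_rate) * (
            current_step / max(1, num_warmup_steps))
    # Cosine decay from the initial LR down to min_lr_rate * initial_lr,
    # reaching the minimum at the final step instead of only after training.
    progress = (current_step - num_warmup_steps) / max(
        1, num_training_steps - num_warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr_rate + (1.0 - min_lr_rate) * cosine


def get_schedule(optimizer, num_warmup_steps, num_training_steps,
                 min_lr_rate=0.0, warmup_lr_rate=None):
    if warmup_lr_rate is None:
        # Default described above: equivalent to setting
        # warmup_lr_rate = 1 / warmup_steps.
        warmup_lr_rate = 1.0 / num_warmup_steps
    return LambdaLR(
        optimizer,
        partial(_lr_lambda,
                num_warmup_steps=num_warmup_steps,
                num_training_steps=num_training_steps,
                min_lr_rate=min_lr_rate,
                warmup_lr_rate=warmup_lr_rate),
    )
```

For example, `get_schedule(optimizer, num_warmup_steps=100, num_training_steps=1000, min_lr_rate=0.1)` would warm up from 1% of the initial learning rate (the `1/warmup_steps` default) and decay to 10% of it by the last step.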
Fixes: the warmup start learning rate ratio supported by DeepSpeed's `WarmupCosineLR` was missing in Transformers.
Before submitting

- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@muellerzr and @SunMarc