deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler #25863
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks a lot for tackling the issue you described and conducting (+ showing) the experiments you ran to prove that it works. Personally, I lack the experience with DeepSpeed required to understand the bigger picture, so I cannot provide a full review, only some small comments.
Looks good to me, thanks! Let's definitely keep an eye out for pickle problems, and be prepared to move that to a util if needed
deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler (#25863)
* Add support for deepspeed optimizer and HF scheduler
* fix bug
* fix the import
* fix issue with deepspeed scheduler saving for hf optim + hf scheduler scenario
* fix loading of hf scheduler when loading deepspeed checkpoint
* fix import of `DeepSpeedSchedulerWrapper`
* add tests
* add the comment and skip the failing tests
* address comment
What does this PR do?
This PR fixes resuming from a DeepSpeed checkpoint and adds support for using the DeepSpeed optimizer together with an HF `LRScheduler`. It should be merged after accelerate#1909 (Add support for deepspeed optimizer and custom scheduler).

Below we run the 4 combinations of optimizer and scheduler for the `run_glue.py` transformers example.

Initial setup:
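As a rough, assumed setup (not verbatim from the PR): install both libraries from source so that the changes in this PR and in accelerate#1909 are available, plus the requirements for `run_glue.py`:

```bash
# Illustrative setup (assumed): transformers and accelerate installed from source
git clone https://github.com/huggingface/transformers
cd transformers
pip install -e .
pip install git+https://github.com/huggingface/accelerate
# dependencies for the text-classification example (run_glue.py)
pip install -r examples/pytorch/text-classification/requirements.txt
```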
a. HF Optimizer + HF Scheduler Case:
   i. ds config `ds_config_z3_hf_optim_hf_scheduler.json` (a config sketch is shown after this list):
   ii. command to run (a command sketch is shown after this list). Kill the process after epoch 1, then run the above command again with `--resume_from_checkpoint` as below:
   iii. Plots of loss and learning rate:
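A minimal sketch of what a ZeRO-3 config with neither an `optimizer` nor a `scheduler` section might look like, so that the HF optimizer and HF scheduler are used (the exact fields are assumptions; `"auto"` values are filled in from the Trainer arguments):

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

And a hypothetical launch plus resume command (model, task and hyperparameters are placeholders, not necessarily the ones used for the plots):

```bash
# initial run; checkpoints are saved at the end of every epoch
deepspeed examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path bert-base-cased --task_name mrpc \
  --do_train --do_eval --max_seq_length 128 \
  --per_device_train_batch_size 16 --learning_rate 2e-5 --num_train_epochs 3 \
  --save_strategy epoch --output_dir /tmp/mrpc_out \
  --deepspeed ds_config_z3_hf_optim_hf_scheduler.json

# after killing the run post epoch 1, rerun the same command and resume
# from the epoch-1 checkpoint directory (placeholder path)
deepspeed examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path bert-base-cased --task_name mrpc \
  --do_train --do_eval --max_seq_length 128 \
  --per_device_train_batch_size 16 --learning_rate 2e-5 --num_train_epochs 3 \
  --save_strategy epoch --output_dir /tmp/mrpc_out \
  --deepspeed ds_config_z3_hf_optim_hf_scheduler.json \
  --resume_from_checkpoint /tmp/mrpc_out/checkpoint-<step>
```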
b. DS Optimizer + DS Scheduler Case:
   i. ds config `ds_config_z3_ds_optim_ds_scheduler.json` (a config sketch is shown below); rest of the steps as above. Plots:
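A sketch of the corresponding config with both DeepSpeed `optimizer` and `scheduler` sections (values are assumptions; `"auto"` defers to the Trainer arguments):

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto",
      "total_num_steps": "auto"
    }
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```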
c. HF Optimizer + DS Scheduler Case:
   i. ds config `ds_config_z3_hf_optim_ds_scheduler.json` (a config sketch is shown below); rest of the steps as above. Plots:
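A sketch of the config for this case: only a `scheduler` section and no `optimizer` section, so the HF optimizer is paired with the DeepSpeed scheduler (values assumed):

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto",
      "total_num_steps": "auto"
    }
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```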
d. DS Optimizer + HF Scheduler Case:
   i. ds config `ds_config_z3_ds_optim_hf_scheduler.json` (a config sketch is shown below); rest of the steps as above. Plots:
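A sketch of the config for the combination this PR adds support for: an `optimizer` section but no `scheduler` section, so the DeepSpeed optimizer runs with the HF scheduler (values assumed):

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```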