Resume training from a checkpoint #23
base: main
Conversation
Thanks for your great work. I implemented the option to resume training from a checkpoint. It can be used as follows, for example:
torchrun --nnodes=1 --nproc_per_node=8 train.py --model DiT-XL/2 --data-path /path/to/imagenet/train --resume results/000/checkpoints/0100000.pt
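For reference, here is a minimal sketch (not the exact diff in this PR) of how such a `--resume` option can be wired into train.py, assuming the checkpoint dictionary layout that train.py already saves ("model", "ema", "opt", "args") and a step counter recovered from the file name; the actual changes in this PR may differ.

```python
# Hedged sketch of a --resume flag; assumes the checkpoint layout
# {"model": ..., "ema": ..., "opt": ..., "args": ...} saved by train.py.
import argparse
import os

import torch


def add_resume_arg(parser: argparse.ArgumentParser) -> None:
    # Hypothetical helper; the PR may add the argument inline instead.
    parser.add_argument("--resume", type=str, default=None,
                        help="Path to a checkpoint (.pt) to resume training from")


def maybe_resume(args, model, ema, opt):
    """Restore model / EMA / optimizer state and recover the step counter."""
    train_steps = 0
    if args.resume is not None:
        ckpt = torch.load(args.resume, map_location="cpu")
        model.load_state_dict(ckpt["model"])  # weights saved without the DDP "module." prefix
        ema.load_state_dict(ckpt["ema"])
        opt.load_state_dict(ckpt["opt"])
        # e.g. results/000/checkpoints/0100000.pt -> 100000
        train_steps = int(os.path.basename(args.resume).split(".")[0])
    return train_steps
```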
Hi @yukang2017! Thank you for your pull request and welcome to our community.
Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged accordingly. If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Thank you very much for this resume feature. I encountered an error here. What is `_ddp_dict`? Thank you!
Hi, it should be as follows:
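(The code snippet from this reply was not preserved in the thread. Purely as a guess at what a helper named `_ddp_dict` might do, here is a hedged sketch that adds the "module." prefix DistributedDataParallel expects when loading a plain state dict into a DDP-wrapped model; the actual helper in the PR may be different.)

```python
def _ddp_dict(state_dict, prefix="module."):
    """Hypothetical helper: return a copy of state_dict with keys prefixed
    so it can be loaded into a DDP-wrapped model."""
    return {(k if k.startswith(prefix) else prefix + k): v
            for k, v in state_dict.items()}

# Usage (hypothetical):
# model.load_state_dict(_ddp_dict(ckpt["model"]))
```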
Thanks! I wish you a happy life and work!
This looks really awesome! @yukang2017 I was doing something similar, but when I resume from a checkpoint, the loss does not continue going down from the point where training stopped. Do you observe a similar phenomenon?
@yukang2017 this is great and a much needed feature to be added. I tried your modifications to resume from a checkpoint. The loss in the beginning was around 0.21 and after 1M iterations was about 0.14. But upon restarting from the checkpoint with your modifications, the loss goes back to the starting value (0.21). I believe there could be a bug that resets the loss value? I also checked that you save the optimizer state, so I am not sure what this is about. It would be great if you could please investigate.
@yukang2017 I observe the exact same issue as mentioned above. The loss goes back up. I wonder if this may be due to the EMA weights?
@achen46 I am also wondering whether this is due to the EMA weights, but I thought the EMA weights have been stored, right?
@NathanYanJing I believe so, as the saved checkpoint is quite large (~9 GB). But it could also be that we overwrite them again, hence the loss going back to its starting value. It would be great to hear @yukang2017's opinion.
Hi @achen46 @Littleor @NathanYanJing, did you find the reason the loss goes back to its initial value? I am encountering the same issue.
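For what it's worth, here is a hedged debugging sketch (not from this PR): one way to check whether the resume path actually restored the weights is to compare state-dict keys and a sample tensor against the checkpoint, since a silent key mismatch (e.g. a missing or extra "module." prefix loaded with `strict=False`) would explain the loss jumping back to its initial value. It assumes the checkpoint stores the model weights under the "model" key.

```python
import torch


def check_restored(model, ckpt_path):
    """Compare the live model's state dict against a saved checkpoint."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    saved = ckpt["model"]            # assumed checkpoint layout
    current = model.state_dict()
    # Any key mismatch means parameters may have been silently skipped on load.
    missing = [k for k in saved if k not in current]
    unexpected = [k for k in current if k not in saved]
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    # Spot-check that one tensor actually matches the checkpoint.
    k = next(iter(saved))
    print("first tensor restored:", torch.equal(saved[k], current[k].cpu()))
```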