
Resume training from a checkpoint #23

Open · wants to merge 2 commits into main

Conversation

yukang2017

Thanks for your great work. I implemented the option to resume training. It can be used as follows.

For example,

torchrun --nnodes=1 --nproc_per_node=8 train.py --model DiT-XL/2 --data-path /path/to/imagenet/train --resume results/000/checkpoints/0100000.pt
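For readers curious what a `--resume` flag like this has to restore, here is a minimal sketch. The checkpoint keys `"model"`, `"ema"`, and `"opt"` follow the format DiT's `train.py` saves, and the step count is read from the checkpoint filename (e.g. `0100000.pt`); the function names `steps_from_filename` and `load_checkpoint` are hypothetical, not part of this PR.

```python
import os


def steps_from_filename(path):
    # e.g. "results/000/checkpoints/0100000.pt" -> 100000
    return int(os.path.splitext(os.path.basename(path))[0])


def load_checkpoint(path, model, ema, opt):
    # torch is imported locally so the filename helper works on its own.
    import torch
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])  # raw (non-DDP) model weights
    ema.load_state_dict(ckpt["ema"])      # exponential-moving-average weights
    opt.load_state_dict(ckpt["opt"])      # optimizer state (e.g. Adam moments)
    return steps_from_filename(path)      # resume the global step counter
```

Restoring the optimizer state and the step counter matters as much as the weights: without them, Adam's moment estimates and any step-dependent schedule restart from scratch.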

@facebook-github-bot

Hi @yukang2017!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 16, 2023
@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!


@Gongrunlin

Thank you very much for the resume feature. I encountered an error here. What is `_ddp_dict`? Thank you!

line 203 in train.py
NameError: name '_ddp_dict' is not defined

@yukang2017
Author

Hi,

It should be as follows.

def _ddp_dict(_dict):
    # Prefix each key with "module." so weights saved from the raw model
    # load into the DistributedDataParallel-wrapped model.
    new_dict = {}
    for k in _dict:
        new_dict['module.' + k] = _dict[k]
    return new_dict

@Gongrunlin

Thanks! I wish you a happy life and work!

@NathanYanJing

NathanYanJing commented Mar 19, 2023

This looks really awesome! @yukang2017 I was doing something similar, but when I resume from a checkpoint, the loss does not continue down from the point where it stopped. Do you observe a similar phenomenon?

@achen46

achen46 commented Mar 21, 2023

@yukang2017 this is great and a much-needed feature. I tried your modifications to resume from a checkpoint. The loss at the beginning of training was around 0.21, and after 1M iterations it was about 0.14. But upon restarting from the checkpoint with your modifications, the loss goes back to the starting value (0.21).

I believe there could be a bug that resets the loss value? I also checked that you save the optimizer state, so I'm not sure what this is about.

It would be great if you could please investigate.

@achen46

achen46 commented Mar 21, 2023

@yukang2017 I observe the exact same issue as you mentioned. The loss goes back up. I wonder if this may be due to the EMA weights?

@NathanYanJing

@achen46 I am also wondering whether this is due to the EMA weights, but I thought the EMA weights have been stored, right?

@achen46

achen46 commented Mar 22, 2023

@NathanYanJing I believe so, as the saved model is quite large (~9 GB). But it could also be that we overwrite them again on resume, hence the loss going back to its initial value.

It would be great to know @yukang2017's opinion.
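One way the overwrite scenario could play out: if resuming re-seeds the EMA weights from the freshly loaded model instead of restoring them from the checkpoint, the long-run average is silently discarded. A minimal illustration of the standard EMA update on plain floats (the function name and values are hypothetical; real code operates on parameter tensors):

```python
def update_ema(ema, params, decay=0.9999):
    # Standard exponential moving average: ema <- decay*ema + (1-decay)*params.
    return {k: decay * ema[k] + (1 - decay) * params[k] for k in ema}


params = {"w": 0.5}  # current model weights loaded from the checkpoint
ema = {"w": 0.1}     # averaged weights, also stored in the checkpoint

# On resume, restore BOTH dicts from the checkpoint. Re-seeding the EMA from
# the current weights (ema = dict(params)) throws away the accumulated average.
ema = update_ema(ema, params)
print(ema["w"])  # ~0.10004: the average moves only slightly toward params
```

Note the EMA copy is what DiT samples from, so losing it degrades sample quality even if the raw training weights are restored correctly.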

@Littleor

@achen46 I’ve encountered the exact same issue you described earlier, and as a result, I’ve created a new pull request #36. I’ve tested it on my end, and it works for me. Hopefully, it will be helpful to you as well.

@achen46

achen46 commented Mar 29, 2023

> @achen46 I’ve encountered the exact same issue you described earlier, and as a result, I’ve created a new pull request #36. I’ve tested it on my end, and it works for me. Hopefully, it will be helpful to you as well.

Hi @Littleor thanks a lot. I will check it out and verify.

@haohang96

Hi @achen46 @Littleor @NathanYanJing, did you find the reason the loss goes back to its initial value? I ran into the same issue.
