
Resume training from a checkpoint #23

Open · wants to merge 2 commits into main

Conversation

yukang2017

Thanks for your great work. I implemented the option to resume training. It can be used as follows.

For example,

torchrun --nnodes=1 --nproc_per_node=8 train.py --model DiT-XL/2 --data-path /path/to/imagenet/train --resume results/000/checkpoints/0100000.pt
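For readers curious what a `--resume` flag like this has to restore, here is a minimal sketch. The checkpoint keys `"model"`, `"ema"`, and `"opt"` follow the format DiT's `train.py` saves, and the step count is read from the checkpoint filename (e.g. `0100000.pt`); the function names `steps_from_filename` and `load_checkpoint` are hypothetical, not part of this PR.

```python
import os


def steps_from_filename(path):
    # e.g. "results/000/checkpoints/0100000.pt" -> 100000
    return int(os.path.splitext(os.path.basename(path))[0])


def load_checkpoint(path, model, ema, opt):
    # torch is imported locally so the filename helper works on its own.
    import torch
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])  # raw (non-DDP) model weights
    ema.load_state_dict(ckpt["ema"])      # exponential-moving-average weights
    opt.load_state_dict(ckpt["opt"])      # optimizer state (e.g. Adam moments)
    return steps_from_filename(path)      # resume the global step counter
```

Restoring the optimizer state and the step counter matters as much as the weights: without them, Adam's moment estimates and any step-dependent schedule restart from scratch.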

@facebook-github-bot

Hi @yukang2017!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 16, 2023
@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!


@Gongrunlin

Thank you very much for the resume feature. I encountered an error here. What is `_ddp_dict`? Thank you!

line 203 in train.py
NameError: name '_ddp_dict' is not defined

@yukang2017
Author

Hi,

It should be as follows.

def _ddp_dict(_dict):
    # Prefix each key with "module." so weights saved from the raw model
    # load into the DistributedDataParallel-wrapped model.
    new_dict = {}
    for k in _dict:
        new_dict['module.' + k] = _dict[k]
    return new_dict

@Gongrunlin

Thanks! I wish you a happy life and work!

@NathanYanJing

NathanYanJing commented Mar 19, 2023

This looks really awesome! @yukang2017 I was doing something similar, but when I resume from a checkpoint, the loss does not continue down from the point where it stopped. Do you observe a similar phenomenon?

@achen46

achen46 commented Mar 21, 2023

@yukang2017 this is great and a much-needed feature. I tried your modifications to resume from a checkpoint. The loss at the beginning of training was around 0.21, and after 1M iterations it was about 0.14. But upon restarting from the checkpoint with your modifications, the loss goes back to the starting value (0.21).

I believe there could be a bug that resets the loss value? I also checked that you save the optimizer state, so I'm not sure what this is about.

It would be great if you could please investigate.

@achen46

achen46 commented Mar 21, 2023

@yukang2017 I observe the exact same issue as you mentioned. The loss goes back up. I wonder if this may be due to the EMA weights?

@NathanYanJing

@achen46 I am also wondering whether this is due to the EMA weights, but I thought the EMA weights have been stored, right?

@achen46

achen46 commented Mar 22, 2023

@NathanYanJing I believe so, as the saved model is quite large (~9 GB). But it could also be that we overwrite them again on resume, hence the loss going back to its initial value.

It would be great to know @yukang2017's opinion.
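One way the overwrite scenario could play out: if resuming re-seeds the EMA weights from the freshly loaded model instead of restoring them from the checkpoint, the long-run average is silently discarded. A minimal illustration of the standard EMA update on plain floats (the function name and values are hypothetical; real code operates on parameter tensors):

```python
def update_ema(ema, params, decay=0.9999):
    # Standard exponential moving average: ema <- decay*ema + (1-decay)*params.
    return {k: decay * ema[k] + (1 - decay) * params[k] for k in ema}


params = {"w": 0.5}  # current model weights loaded from the checkpoint
ema = {"w": 0.1}     # averaged weights, also stored in the checkpoint

# On resume, restore BOTH dicts from the checkpoint. Re-seeding the EMA from
# the current weights (ema = dict(params)) throws away the accumulated average.
ema = update_ema(ema, params)
print(ema["w"])  # ~0.10004: the average moves only slightly toward params
```

Note the EMA copy is what DiT samples from, so losing it degrades sample quality even if the raw training weights are restored correctly.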

@Littleor

@achen46 I’ve encountered the exact same issue you described earlier, and as a result, I’ve created a new pull request #36. I’ve tested it on my end, and it works for me. Hopefully, it will be helpful to you as well.

@achen46

achen46 commented Mar 29, 2023

> @achen46 I’ve encountered the exact same issue you described earlier, and as a result, I’ve created a new pull request #36. I’ve tested it on my end, and it works for me. Hopefully, it will be helpful to you as well.

Hi @Littleor thanks a lot. I will check it out and verify.

@haohang96

Hi @achen46 @Littleor @NathanYanJing, did you find the reason the loss goes back to its initial value? I ran into the same issue.
