
fix resume fsdp #23111

Merged (4 commits into huggingface:main, May 4, 2023)

Conversation

qywu
Contributor

@qywu qywu commented May 2, 2023

What does this PR do?

Fixes #23034

When training a model with FSDP, the checkpoint is not saved and loaded correctly. Only rank 0's optimizer state dict is saved. This PR fixes this issue.
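
For context, a minimal sketch of the gather-then-save pattern involved, assuming PyTorch's FSDP.full_optim_state_dict API; the helper name save_fsdp_optimizer and the file name are illustrative, not the Trainer's actual code:

```python
# Illustrative sketch only, not the Trainer implementation.
import os
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def save_fsdp_optimizer(model, optimizer, output_dir, rank):
    # Under FSDP, optimizer.state_dict() only holds the local shard, so the
    # full state has to be gathered across all ranks before it is written out.
    full_osd = FSDP.full_optim_state_dict(model, optimizer)
    if rank == 0:
        # Only one process needs to write the gathered state dict to disk.
        torch.save(full_osd, os.path.join(output_dir, "optimizer.pt"))
```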

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@pacman100

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 2, 2023

The documentation is not available anymore as the PR was closed or merged.

Contributor

@pacman100 pacman100 left a comment

Thank you @qywu for the super quick fix, LGTM! 🤗

@pacman100
Contributor

Please run make style and make quality to fix the quality issues

Contributor

@pacman100 pacman100 left a comment

Hello, I went over it again and noticed that you aren't saving and loading the optimizer state only on rank 0. Please do that. Refer to the implementation in accelerate for reference: https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/dataclasses.py#L924-L952

@qywu
Contributor Author

qywu commented May 3, 2023

I have fixed the issues. The optimizer saving had no problems. When using scatter_full_optim_state_dict, loading the full state dict on rank 0 alone is indeed enough, which saves CPU memory.
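
For reference, a minimal sketch of the rank-0-only load path being described here, assuming PyTorch's FSDP.scatter_full_optim_state_dict API; the helper name load_fsdp_optimizer and the file name are illustrative, not the Trainer's actual code:

```python
# Illustrative sketch only, not the Trainer implementation.
import os
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def load_fsdp_optimizer(model, optimizer, output_dir, rank):
    full_osd = None
    if rank == 0:
        # Only rank 0 materializes the full optimizer state dict in CPU memory.
        full_osd = torch.load(os.path.join(output_dir, "optimizer.pt"), map_location="cpu")
    # scatter_full_optim_state_dict reads the full dict on rank 0 and shards the
    # relevant pieces back out to every rank; other ranks may pass None.
    sharded_osd = FSDP.scatter_full_optim_state_dict(full_osd, model)
    optimizer.load_state_dict(sharded_osd)
```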

@qywu qywu requested a review from pacman100 May 3, 2023 20:22
@@ -2388,7 +2394,11 @@ def _save_checkpoint(self, model, trial, metrics=None):
                     torch.save(self.scaler.state_dict(), os.path.join(output_dir, SCALER_NAME))
         elif self.args.should_save and not self.deepspeed:
             # deepspeed.save_checkpoint above saves model/optim/sched
-            torch.save(self.optimizer.state_dict(), os.path.join(output_dir, OPTIMIZER_NAME))
+            if self.fsdp:
+                torch.save(full_osd, os.path.join(output_dir, OPTIMIZER_NAME))
Contributor

saving on rank 0 should be efficient and enough, right?

Contributor Author

I believe self.args.should_save in this case already handles saving only on rank 0.

Contributor

oh, okay, got it, thank you!

@qywu qywu requested a review from pacman100 May 3, 2023 22:35
Contributor

@pacman100 pacman100 left a comment

Thank you @qywu for iterating 🤗

@pacman100
Contributor

cc @sgugger for a second look

Collaborator

@sgugger sgugger left a comment

Thanks for the fix!

@sgugger sgugger merged commit adb0760 into huggingface:main May 4, 2023
@wentinghome

thanks for the fix!

gojiteji pushed a commit to gojiteji/transformers that referenced this pull request Jun 5, 2023
* fix resume fsdp

* fix rank 0 loading

* fix style and quality
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
* fix resume fsdp

* fix rank 0 loading

* fix style and quality

Successfully merging this pull request may close these issues.

Cannot resume FSDP optimizer state