fix resume fsdp #23111
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thank you @qywu for the super quick fix, LGTM! 🤗
Please run the style and quality checks.
Hello, I went over it again and noticed that you aren't saving and loading the optimizer state only on rank 0. Please do that. Refer to the implementation in Accelerate for reference: https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/dataclasses.py#L924-L952
I have fixed the issues. The optimizer saving had no problems. When using scatter_full_optim_state_dict, loading on rank 0 alone is indeed enough, which saves CPU memory.
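For context, a minimal sketch of the pattern discussed above: load the full optimizer state dict only on rank 0 and scatter it to the other ranks with FSDP.scatter_full_optim_state_dict. The helper name and path argument are made up for illustration; this is not the actual Trainer code.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def load_fsdp_optimizer(model, optimizer, optimizer_path):
    # Hypothetical helper, not the Trainer implementation.
    full_osd = None
    if dist.get_rank() == 0:
        # Only rank 0 materializes the full (unsharded) state dict,
        # keeping CPU memory usage low on the other ranks.
        full_osd = torch.load(optimizer_path, map_location="cpu")
    # scatter_full_optim_state_dict shards the rank-0 state and sends each rank its piece.
    sharded_osd = FSDP.scatter_full_optim_state_dict(full_osd, model)
    optimizer.load_state_dict(sharded_osd)
```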
@@ -2388,7 +2394,11 @@ def _save_checkpoint(self, model, trial, metrics=None):
                torch.save(self.scaler.state_dict(), os.path.join(output_dir, SCALER_NAME))
        elif self.args.should_save and not self.deepspeed:
            # deepspeed.save_checkpoint above saves model/optim/sched
-           torch.save(self.optimizer.state_dict(), os.path.join(output_dir, OPTIMIZER_NAME))
+           if self.fsdp:
+               torch.save(full_osd, os.path.join(output_dir, OPTIMIZER_NAME))
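For reference, a hedged sketch of how the `full_osd` saved in the diff above could be produced. The helper, the constant value, and the `should_save` flag mirror the Trainer names from the diff but are assumptions, not the verbatim implementation.

```python
import os

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

OPTIMIZER_NAME = "optimizer.pt"  # mirrors the Trainer constant; value assumed here


def save_fsdp_optimizer(model, optimizer, output_dir, should_save):
    # Consolidate the sharded optimizer state into a single full state dict.
    # With rank0_only=True, rank 0 receives the full dict and the other ranks
    # receive an empty one, keeping their CPU memory usage low.
    full_osd = FSDP.full_optim_state_dict(model, optimizer, rank0_only=True)
    if should_save:  # in the Trainer this corresponds to self.args.should_save
        torch.save(full_osd, os.path.join(output_dir, OPTIMIZER_NAME))
```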
saving on rank 0 should be efficient and enough, right?
I believe self.args.should_save in this case already handles saving on rank 0.
oh, okay, got it, thank you!
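For readers following the exchange, here is a paraphrase of the rank-0 gate being referenced. This is assumed behaviour written as a free function, not the exact Transformers source.

```python
# TrainingArguments.should_save is True only on the process responsible for
# writing checkpoints, so gating torch.save with it already restricts saving
# to a single rank.
def should_save(args) -> bool:
    if args.save_on_each_node:
        return args.local_process_index == 0  # main process on each node
    return args.process_index == 0  # global main process (rank 0)
```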
Thank you @qywu for iterating 🤗
cc @sgugger for a second look
Thanks for the fix!
* fix resume fsdp
* fix rank 0 loading
* fix style and quality
What does this PR do?
Fixes #23034
When training a model with FSDP, the checkpoint is not saved and loaded correctly: only rank 0's (sharded) optimizer state dict is saved instead of the full, consolidated one. This PR fixes this issue.
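As a hedged usage sketch of the path this PR fixes, resuming an FSDP run through the Trainer exercises exactly this save/load logic. Names and argument values below are illustrative placeholders.

```python
# Illustrative only: `model` and `train_dataset` stand in for your own objects.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(output_dir="out", fsdp="full_shard", save_steps=500)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train(resume_from_checkpoint=True)  # resumes from the last checkpoint in output_dir
```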
Who can review?
@pacman100