[deepspeed] saving checkpoint fallback when fp16 weights aren't saved #14948
Conversation
So the idea is that …
That's right. I just want to make sure that someone won't lose their work in case they misconfigured their DS setup. We could check the DS config on our side as well, but I think all these things should be DeepSpeed's business.
Looking at your DS pull request, I wonder what you think about using a new method to tell us whether a checkpoint actually exists on disk (perhaps using the tag argument and checking the contents of the checkpoint directory). That new method could give us more flexibility/control when calling save_checkpoint().
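For the sake of discussion, here is a rough sketch of what such a method might look like. `checkpoint_exists` is a hypothetical name, not an existing DeepSpeed API; the sketch assumes DeepSpeed's convention of recording the most recent tag in a `latest` file inside the save directory:

```python
import os
from typing import Optional

# Hypothetical helper (not an existing DeepSpeed API): report whether a
# checkpoint for `tag` is actually present under `checkpoint_dir`.
def checkpoint_exists(checkpoint_dir: str, tag: Optional[str] = None) -> bool:
    if tag is None:
        # DeepSpeed writes the most recent tag into a "latest" file.
        latest_path = os.path.join(checkpoint_dir, "latest")
        if not os.path.isfile(latest_path):
            return False
        with open(latest_path) as f:
            tag = f.read().strip()
    tag_dir = os.path.join(checkpoint_dir, str(tag))
    # Consider the checkpoint present only if the tag directory is non-empty.
    return os.path.isdir(tag_dir) and bool(os.listdir(tag_dir))
```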
That would be an ambiguous API, since the checkpoint on disk could pre-exist from an old run. Checking timestamps would then be required, and it quickly becomes complicated and uncertain. Given that the saving is super-fast, even for huge 100B models, I think it's no problem if on a rare occasion it gets saved twice.

I think we could have worked out something better in the HF Trainer normal path, but if someone uses only parts of the API, that's when things become potentially "unsafe". If a user uses parts of the HF Trainer API but builds their own training loop and doesn't care about the saved model, then they won't call it.

Am I missing some path that will do an inefficient double saving in the normal case?
It's just my OCD; I need to get that fixed at some point :-). As far as I'm concerned, the normal case works 100% correctly. Again, many thanks for your work!
IMHO, while OCD in life can be harmful at times, OCD in software should be the standard and not considered a handicap, especially in a library used by tens of thousands of users. In other words, if you see a path that is invalid, we should fix it.
I'm still interested in this. In the meantime, the associated DeepSpeed change has been included in release 0.5.9.
Bump required deepspeed version to match usage when saving checkpoints
Thank you, @MihaiBalint - the version update has to happen in …
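For illustration, a minimum-version pin matching the 0.5.9 release mentioned above would look something like this in a setup.py-style extras definition (a hypothetical snippet; the exact file the comment refers to is elided above):

```python
# Hypothetical snippet: require the deepspeed release that contains the
# save_checkpoint() change discussed in this PR.
extras_require = {
    "deepspeed": ["deepspeed>=0.5.9"],
}
```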
LGTM, thanks for fixing!
`_save_checkpoint` saves the DeepSpeed checkpoint, but for ZeRO stage 3, if `stage3_gather_fp16_weights_on_model_save=false`, this path will not save the model. So this PR adds a fallback: saving the full checkpoint instead, from which the weights can be recovered.

Blocking events: all resolved
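To make the fallback concrete, here is a minimal sketch of the logic under discussion (illustrative names and structure, not the actual Trainer code), assuming a `deepspeed.DeepSpeedEngine` instance and the DeepSpeed config as a plain dict:

```python
# Minimal sketch of the fallback this PR adds (illustrative, not the actual
# Trainer code). `engine` is assumed to be a deepspeed.DeepSpeedEngine and
# `ds_config` the DeepSpeed config as a plain dict.
def save_model(engine, ds_config: dict, output_dir: str) -> None:
    zero_cfg = ds_config.get("zero_optimization", {})
    stage = zero_cfg.get("stage", 0)
    gather = zero_cfg.get("stage3_gather_fp16_weights_on_model_save", False)

    if stage == 3 and not gather:
        # The fp16 weights are not consolidated on save, so a regular model
        # file cannot be written here. Fall back to saving the full ZeRO-3
        # checkpoint; fp32 weights can be recovered from it later (e.g. with
        # DeepSpeed's zero_to_fp32.py script).
        engine.save_checkpoint(output_dir)
    else:
        # Consolidates the fp16 weights and writes a regular model file.
        engine.save_fp16_model(output_dir)
```

The design choice mirrors the PR description: rather than silently writing nothing when the weights were not gathered, save the full checkpoint so the user's work can still be recovered.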
@sgugger