
[deepspeed] saving checkpoint fallback when fp16 weights aren't saved #14948

Merged · 6 commits · Jan 28, 2022

Conversation

@stas00 (Contributor, Author) commented Dec 27, 2021:

`_save_checkpoint` saves the DeepSpeed checkpoint, but the path:

push_to_hub => save_model => questionable outcome

has a problem under ZeRO stage 3 (z3): if `stage3_gather_fp16_weights_on_model_save=false`, the model will not be saved at all. This PR adds a fallback that saves the full checkpoint instead, from which the weights can later be recovered.
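The fallback logic can be sketched as follows. This is a minimal illustration, not the actual Trainer code: `StubEngine` is a hypothetical stand-in for the real DeepSpeed engine, and it assumes (per the companion DeepSpeed PR) that `save_fp16_model()` returns a boolean indicating whether the fp16 weights were actually written.

```python
class StubEngine:
    """Hypothetical stand-in for a DeepSpeed engine (illustration only)."""

    def __init__(self, gather_fp16_weights: bool):
        # Mirrors the stage3_gather_fp16_weights_on_model_save config flag.
        self.gather_fp16_weights = gather_fp16_weights
        self.saved_full_checkpoint = False

    def save_fp16_model(self, save_dir: str) -> bool:
        # Assumed behavior: returns True only when the consolidated fp16
        # weights were actually written to save_dir.
        return self.gather_fp16_weights

    def save_checkpoint(self, save_dir: str) -> None:
        # Always writes the full (sharded) checkpoint.
        self.saved_full_checkpoint = True


def save_model_with_fallback(engine: StubEngine, save_dir: str) -> bool:
    """Sketch of the PR's idea: if fp16 weights were not saved (misconfigured
    ZeRO-3 setup), fall back to the full checkpoint so no work is lost."""
    saved = engine.save_fp16_model(save_dir)
    if not saved:
        engine.save_checkpoint(save_dir)
    return saved
```

With `gather_fp16_weights=False` the fallback kicks in and the full checkpoint is written; with `True`, only the fp16 weights are saved and no double saving occurs.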

Blocking events:

all resolved

@sgugger

@MihaiBalint (Contributor) commented:

So the idea is that `deepspeed.save_fp16_model()` will return `True` if the weights have already been saved at the given path, in which case there is no need to call `deepspeed.save_checkpoint()`?

@stas00 (Contributor, Author) commented Dec 27, 2021:

That's right. I just want to make sure that someone won't lose their work in case they misconfigured their DS setup.

We could check the DS config on our side as well, but I think all these things should be DeepSpeed's business.

@MihaiBalint (Contributor) commented:

Looking at your DS pull request, I wonder what you think about adding a new method that tells us whether a checkpoint actually exists on disk (perhaps using the tag argument and checking the contents of the latest file).

That new method could give us more flexibility and control when calling `save_checkpoint()`.

@stas00 (Contributor, Author) commented Dec 27, 2021:

That would be an ambiguous API, since the checkpoint on disk could pre-exist from an old run. Checking timestamps would then be required, and it quickly becomes complicated and uncertain.

Given that the saving is super-fast, even for huge 100B models, I think it's no problem if on rare occasions the model gets saved twice.

I think we could have worked out something better in the HF Trainer normal path, but if someone uses parts of the API, that's when things become potentially "unsafe".

If a user uses parts of the HF Trainer API but builds their own training loop and doesn't care for the saved model then they won't call it.

Am I missing some path that will do an inefficient double saving in the normal case?

@MihaiBalint (Contributor) commented:

It's just my OCD, I need to get that fixed at some point :-). As far as I'm concerned, the normal case works 100% correctly. Again, many thanks for your work!

@stas00 (Contributor, Author) commented Dec 27, 2021:

IMHO, while OCD in life can be harmful at times, OCD in software should be the standard rather than a handicap, especially in a library used by tens of thousands of users.

In other words, if you see a path that is invalid, we should fix it.

@MihaiBalint (Contributor) commented:

I'm still interested in this. In the meantime, the associated DeepSpeed change has been included in release 0.5.9.

@huggingface huggingface deleted a comment from github-actions bot Jan 28, 2022
@stas00 (Contributor, Author) commented Jan 28, 2022:

Thank you, @MihaiBalint. The version update has to happen in setup.py; the dependency table is then updated automatically. I adjusted things and am testing now, since CI doesn't test DeepSpeed here.

@stas00 stas00 marked this pull request as ready for review January 28, 2022 18:49
@sgugger (Collaborator) left a comment:

LGTM, thanks for fixing!

@stas00 stas00 merged commit 297602c into huggingface:master Jan 28, 2022
@stas00 stas00 deleted the save_fp16_model-fallback branch January 28, 2022 19:05