
[deepspeed] saving checkpoint fallback when fp16 weights aren't saved #14948

Merged · 6 commits · Jan 28, 2022

Conversation

@stas00 (Contributor, Author) commented Dec 27, 2021:

`_save_checkpoint` saves the DeepSpeed checkpoint, but the path:

push_to_hub => save_model => questionable outcome

has a problem under ZeRO stage 3 (z3): if `stage3_gather_fp16_weights_on_model_save=false`, the model will not be saved at all. This PR adds a fallback that saves the full checkpoint instead, from which the weights can later be recovered.
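The fallback logic can be sketched as follows. This is a minimal illustration, not the actual Trainer code: `StubEngine` is a hypothetical stand-in for the real DeepSpeed engine, and it assumes (per the companion DeepSpeed PR) that `save_fp16_model()` returns a boolean indicating whether the fp16 weights were actually written.

```python
class StubEngine:
    """Hypothetical stand-in for a DeepSpeed engine (illustration only)."""

    def __init__(self, gather_fp16_weights: bool):
        # Mirrors the stage3_gather_fp16_weights_on_model_save config flag.
        self.gather_fp16_weights = gather_fp16_weights
        self.saved_full_checkpoint = False

    def save_fp16_model(self, save_dir: str) -> bool:
        # Assumed behavior: returns True only when the consolidated fp16
        # weights were actually written to save_dir.
        return self.gather_fp16_weights

    def save_checkpoint(self, save_dir: str) -> None:
        # Always writes the full (sharded) checkpoint.
        self.saved_full_checkpoint = True


def save_model_with_fallback(engine: StubEngine, save_dir: str) -> bool:
    """Sketch of the PR's idea: if fp16 weights were not saved (misconfigured
    ZeRO-3 setup), fall back to the full checkpoint so no work is lost."""
    saved = engine.save_fp16_model(save_dir)
    if not saved:
        engine.save_checkpoint(save_dir)
    return saved
```

With `gather_fp16_weights=False` the fallback kicks in and the full checkpoint is written; with `True`, only the fp16 weights are saved and no double saving occurs.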

Blocking events:

all resolved

@sgugger

@MihaiBalint (Contributor) commented:

So the idea is that `deepspeed.save_fp16_model()` will return `True` if the weights have already been saved at the given path, in which case there is no need to call `deepspeed.save_checkpoint()`?

@stas00 (Contributor, Author) commented Dec 27, 2021:

That's right. I just want to make sure that someone won't lose their work in case they misconfigured their DS setup.

We could check the DS config on our side as well, but I think all these things should be DeepSpeed's business.

@MihaiBalint (Contributor) commented:

Looking at your DS pull request, I wonder what you think about adding a new method that tells us whether a checkpoint actually exists on disk (perhaps using the tag argument and checking the contents of the latest file).

That new method could give us more flexibility and control when calling `save_checkpoint()`.

@stas00 (Contributor, Author) commented Dec 27, 2021:

That would be an ambiguous API, since the checkpoint on disk could pre-exist from an old run. Checking timestamps would then be required, and it quickly becomes complicated and uncertain.

Given that the saving is super-fast, even for huge 100B models, I think it's no problem if on rare occasions the model gets saved twice.

I think we could have worked out something better in the HF Trainer normal path, but if someone uses parts of the API, that's when things become potentially "unsafe".

If a user uses parts of the HF Trainer API but builds their own training loop and doesn't care for the saved model then they won't call it.

Am I missing some path that will do an inefficient double saving in the normal case?

@MihaiBalint (Contributor) commented:

It's just my OCD, I need to get that fixed at some point :-). As far as I'm concerned, the normal case works 100% correctly. Again, many thanks for your work!

@stas00 (Contributor, Author) commented Dec 27, 2021:

IMHO, while OCD in life can be harmful at times, OCD in software should be the standard rather than a handicap, especially in a library used by tens of thousands of users.

In other words, if you see a path that is invalid, we should fix it.

@MihaiBalint (Contributor) commented:

I'm still interested in this. In the meantime, the associated DeepSpeed change has been included in release 0.5.9.

@huggingface huggingface deleted a comment from github-actions bot Jan 28, 2022
@stas00 (Contributor, Author) commented Jan 28, 2022:

Thank you, @MihaiBalint. The version update has to happen in setup.py; the dependency table is then updated automatically. I adjusted things and am testing now, since CI doesn't test DeepSpeed here.

@stas00 stas00 marked this pull request as ready for review January 28, 2022 18:49
@sgugger (Collaborator) left a comment:

LGTM, thanks for fixing!

@stas00 stas00 merged commit 297602c into huggingface:master Jan 28, 2022
@stas00 stas00 deleted the save_fp16_model-fallback branch January 28, 2022 19:05