Conversation

@tjruwase (Contributor)

Save ZeRO3 (partitioned) fp16 weights. This is a first step to using ZeRO3 weights outside DeepSpeed, #872.
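For context, a minimal sketch of the idea, assuming param.ds_tensor holds each rank's local, flattened fp16 partition; the helper name and file layout are illustrative, not DeepSpeed's actual checkpoint format:

import torch
import torch.distributed as dist

def save_zero3_fp16_partitions(module, ckpt_dir):
    # Each rank saves only the shards it owns; nothing is gathered.
    rank = dist.get_rank()
    shard = {name: p.ds_tensor for name, p in module.named_parameters()}
    torch.save(shard, f"{ckpt_dir}/fp16_partition_rank_{rank}.pt")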

@stas00 (Collaborator) commented Mar 19, 2021

That still leaves the partitions separate, so this is great if a user wants to load each partition separately, but it doesn't work when a user needs the model weights consolidated.
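For illustration only (not something this PR provides), a rough sketch of consolidating the fp16 weights on rank 0, assuming deepspeed.zero.GatheredParameters can be used to temporarily all-gather each ZeRO-3 parameter:

import torch
import deepspeed

def consolidated_fp16_state_dict(module):
    # Gather each partitioned parameter in turn and copy its full data on rank 0.
    state_dict = {}
    for name, param in module.named_parameters():
        with deepspeed.zero.GatheredParameters(param):
            if torch.distributed.get_rank() == 0:
                state_dict[name] = param.data.cpu().clone()
    return state_dict  # only meaningful on rank 0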

Also, I don't think this PR should do this by default, as it adds overhead that most users won't need, so it should be configurable.

And, as suggested elsewhere, the model_states.pt file with fake weights probably shouldn't even be saved; it just confuses users who try to load it, and loading is guaranteed to fail.

def save_partitioned_weights(self, state_dict):
    # Replace each entry in state_dict with this rank's local (flattened)
    # ZeRO-3 partition of the parameter.
    for name, param in self.module.named_parameters():
        if name in state_dict.keys():
            state_dict[name] = param.ds_tensor
@stas00 (Collaborator) commented Mar 24, 2021

Found an issue here: param.ds_tensor in this place appears to be a flattened buffer, so the state_dict ends up being populated with 1-D vectors.

@stas00 (Collaborator)

But we can't shape it back to the original since we only have a part of the tensor, so doing something like narrow(0, 0, param.ds_numel).view(param.ds_shape) from _allgather_param() won't work, and the shape has no meaning here anyway.
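A toy example of why the reshape only works on the gathered buffer (sizes made up: a 4x4 parameter split across two ranks):

import torch

full = torch.arange(16, dtype=torch.float16)   # the parameter, flattened
ds_shape, ds_numel = (4, 4), 16

shard = full[:8]                               # roughly what param.ds_tensor holds on rank 0
# shard.view(ds_shape) would fail: 8 elements cannot be viewed as 4x4

gathered = torch.cat([full[:8], full[8:]])     # what _allgather_param() reconstructs
restored = gathered.narrow(0, 0, ds_numel).view(ds_shape)  # valid only on the full buffer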

So this line of logic is useful when each GPU loads param.ds_tensor directly, as coded in the rest of this PR.
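That per-rank loading path could look roughly like this (a hypothetical counterpart to save_partitioned_weights, not the PR's actual code):

def load_partitioned_weights(module, flattened_params_state_dict):
    for name, param in module.named_parameters():
        if name in flattened_params_state_dict:
            # Copy this rank's flattened shard back into the local partition.
            param.ds_tensor.copy_(flattened_params_state_dict[name])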

I just tried to use it to get the partitioned fp16 weights, but now I understand this is not possible using this approach.

Bottom line: there is no problem here; I just needed to understand that what's being saved is not a real state_dict but something like a flattened_params_state_dict.

All is good!

@tjruwase (Contributor, Author)

Made redundant by #892 and #893.

@tjruwase tjruwase closed this Mar 26, 2021