Under ZeRO-2, `self.model.state_dict()` returns an fp16 version of the model; under ZeRO-3 it returns placeholders, with every weight being just `tensor([1.])`. So how can we get the trained model out of DeepSpeed?
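For illustration, here is a minimal repro sketch (run under the `deepspeed` launcher); the config values are illustrative assumptions, not a prescribed setup:

```python
# Minimal sketch: inspect what state_dict() returns under ZeRO-3.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

model = torch.nn.Linear(8, 8)
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config_params=ds_config
)

# Under ZeRO-2 this prints real fp16 weights; under ZeRO-3 each entry is
# a tensor([1.]) placeholder, because the real parameters live partitioned
# across the ranks.
for name, tensor in engine.module.state_dict().items():
    print(name, tensor)
```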
This is related to #800, but there, under ZeRO-2, we at least had a save-able fp16 version; now there is no way at all. This is total lock-in, unless I'm missing some API that was added with ZeRO-3.
Ideally it would reconstruct the model directly on disk, since that would ensure there is enough memory to do so.
To summarize the different requests so far, users have three different needs:
1. Being able to leave DeepSpeed after training with it: users should be able to take a DeepSpeed checkpoint and recover the full consolidated fp32 model in a single file.
2. Same as (1) but fp16. Surely one can go from (1) to (2) easily, but I think it would be much faster to get to (2) directly; I could be wrong, and if I am, then (1) is enough.
3. Being able to call `deepspeed.consolidate_weights()` in the rank-0 process, which would give users the full non-partitioned weights back (perhaps with a bool arg for choosing the fp16 or fp32 version). They could then save the model as they do with any other PyTorch tooling. This would only be practical for small-ish models. The key here is that, while somewhat costly, it would let users keep their code almost unchanged whether they train with DeepSpeed or in other ways. I think this must happen on CPU, since it's unlikely GPUs will have the memory for it. It would probably have to return a copy of the model with the consolidated weights, so that the user can continue training the original model. So probably something along the lines of Save ZeRO3 (partitioned) fp16 weights #882, but with the partitioning removed in addition. The result of this call would give users the equivalent of what they get under ZeRO-2 today (if fp16). A usage sketch follows this list.
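To make (3) concrete, here is how such a call might be used; `deepspeed.consolidate_weights()` and its `fp16` argument are the proposed, not-yet-existing API, and `engine` is the engine returned by `deepspeed.initialize()`:

```python
# Hypothetical usage of the proposed deepspeed.consolidate_weights();
# nothing here exists in DeepSpeed yet, it only illustrates request (3).
import torch
import torch.distributed as dist
import deepspeed

state_dict = deepspeed.consolidate_weights(engine, fp16=True)  # proposed call

if dist.get_rank() == 0:
    # A plain, non-partitioned state_dict on CPU: save-able with stock
    # PyTorch and loadable later without DeepSpeed.
    torch.save(state_dict, "pytorch_model.bin")

# `engine` is untouched, so training can simply continue afterwards.
```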
I think all three would be more or less the same code, just used in different ways: (3) uses the existing DeepSpeed engine and doesn't need to access the filesystem, while (1) and (2) need no engine and rely exclusively on the filesystem. A rough sketch of the offline path follows.
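For (1) and (2), the un-partitioning itself is straightforward once the shards are on disk. Below is a rough sketch under an assumed layout; every file name and dict key is a hypothetical placeholder, since the real ZeRO-3 checkpoint format stores this information differently:

```python
# Rough sketch of offline consolidation for (1)/(2), assuming each rank
# saved a flat fp32 partition of the parameters plus the per-parameter
# shapes needed to rebuild the full tensors. All names are hypothetical.
import glob
import math
import torch

shards = [
    torch.load(path, map_location="cpu")  # consolidate on CPU, not GPU
    for path in sorted(glob.glob("checkpoint/zero_rank_*.pt"))
]
flat = torch.cat([s["flat_fp32_partition"] for s in shards])

# Carve the flat buffer back into named, full-shape tensors.
state_dict, offset = {}, 0
for name, shape in shards[0]["param_shapes"].items():
    numel = math.prod(shape)
    # clone() so each saved tensor owns its storage rather than a view
    # into the big flat buffer
    state_dict[name] = flat[offset:offset + numel].view(shape).clone()
    offset += numel

torch.save(state_dict, "consolidated_fp32.bin")  # need (1)
torch.save({k: v.half() for k, v in state_dict.items()},
           "consolidated_fp16.bin")              # need (2)
```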
Thank you!