Under ZeRO-2, `self.model.state_dict()` returns an fp16 version of the model; under ZeRO-3 it returns placeholders, with every weight being just `tensor([1.])`. So how can we get the trained model out of DeepSpeed?
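For illustration, here is a minimal repro sketch (run under the `deepspeed` launcher); the config values are illustrative assumptions, not a prescribed setup:

```python
# Minimal sketch: inspect what state_dict() returns under ZeRO-3.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

model = torch.nn.Linear(8, 8)
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config_params=ds_config
)

# Under ZeRO-2 this prints real fp16 weights; under ZeRO-3 each entry is
# a tensor([1.]) placeholder, because the real parameters live partitioned
# across the ranks.
for name, tensor in engine.module.state_dict().items():
    print(name, tensor)
```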
This is related to #800, but there, under ZeRO-2, we at least had a save-able fp16 version; now there is no way at all. This is total lock-in, unless I'm missing some API that was added with ZeRO-3.
Ideally it would reconstruct the model directly on disk, since that would ensure there is enough memory to do so.
To summarize the different requests so far, users have three different needs:
1. Being able to leave DeepSpeed after training with it: users should be able to take a DeepSpeed checkpoint and recover the full consolidated fp32 model in a single file.
2. Same as (1) but fp16. Surely one can go from (1) to (2) easily, but I think it would be much faster to get to (2) directly; I could be wrong, and if I am, then (1) is enough.
3. Being able to call `deepspeed.consolidate_weights()` in the rank-0 process, which would give users the full non-partitioned weights back (perhaps with a bool arg for choosing the fp16 or fp32 version). They could then save the model as they do with any other PyTorch tooling. This would only be practical for small-ish models. The key here is that, while somewhat costly, it would let users keep their code almost unchanged whether they train with DeepSpeed or in other ways. I think this must happen on CPU, since it's unlikely GPUs will have the memory for it. It would probably have to return a copy of the model with the consolidated weights, so that the user can continue training the original model. So probably something along the lines of Save ZeRO3 (partitioned) fp16 weights #882, but with the partitioning removed in addition. The result of this call would give users the equivalent of what they get under ZeRO-2 today (if fp16). A usage sketch follows this list.
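To make (3) concrete, here is how such a call might be used; `deepspeed.consolidate_weights()` and its `fp16` argument are the proposed, not-yet-existing API, and `engine` is the engine returned by `deepspeed.initialize()`:

```python
# Hypothetical usage of the proposed deepspeed.consolidate_weights();
# nothing here exists in DeepSpeed yet, it only illustrates request (3).
import torch
import torch.distributed as dist
import deepspeed

state_dict = deepspeed.consolidate_weights(engine, fp16=True)  # proposed call

if dist.get_rank() == 0:
    # A plain, non-partitioned state_dict on CPU: save-able with stock
    # PyTorch and loadable later without DeepSpeed.
    torch.save(state_dict, "pytorch_model.bin")

# `engine` is untouched, so training can simply continue afterwards.
```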
I think all three would be more or less the same code, just used in different ways: (3) uses the existing DeepSpeed engine and doesn't need to access the filesystem, while (1) and (2) need no engine and rely exclusively on the filesystem. A rough sketch of the offline path follows.
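For (1) and (2), the un-partitioning itself is straightforward once the shards are on disk. Below is a rough sketch under an assumed layout; every file name and dict key is a hypothetical placeholder, since the real ZeRO-3 checkpoint format stores this information differently:

```python
# Rough sketch of offline consolidation for (1)/(2), assuming each rank
# saved a flat fp32 partition of the parameters plus the per-parameter
# shapes needed to rebuild the full tensors. All names are hypothetical.
import glob
import math
import torch

shards = [
    torch.load(path, map_location="cpu")  # consolidate on CPU, not GPU
    for path in sorted(glob.glob("checkpoint/zero_rank_*.pt"))
]
flat = torch.cat([s["flat_fp32_partition"] for s in shards])

# Carve the flat buffer back into named, full-shape tensors.
state_dict, offset = {}, 0
for name, shape in shards[0]["param_shapes"].items():
    numel = math.prod(shape)
    # clone() so each saved tensor owns its storage rather than a view
    # into the big flat buffer
    state_dict[name] = flat[offset:offset + numel].view(shape).clone()
    offset += numel

torch.save(state_dict, "consolidated_fp32.bin")  # need (1)
torch.save({k: v.half() for k, v in state_dict.items()},
           "consolidated_fp16.bin")              # need (2)
```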
Thank you!