[zero] restore fp16 params if no zero ckpts available (#1322)
Conversation
I confirm that it solved the problem! Thank you, Jeff! Good to go to merge into the big-science branch!
BTW, the checkpoint still requires … Do we really need those? Or is this another legacy check?
I am pretty sure we still need these, since they’re associated with tensor parallelism model weights. @ShadenSmith to confirm w.r.t. PP checkpoints though?
I think it needs those at least for getting the saved …
So now this needs to be replayed to the big-science branch.
* restore fp16 params if no zero ckpts available
* formatting

pushed this commit to big-science now :)
In fine-tuning scenarios the user often wants to load only the model parameter checkpoints and may not have the ZeRO optimizer states on disk, since they are not needed. This fixes a bug where the weights were not properly restored from the checkpoint if the ZeRO optimizer states are not present on disk.
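For anyone hitting this in a fine-tuning run, here is a minimal usage sketch of the scenario the fix targets. The model and config below are placeholders; `load_optimizer_states=False` is the flag on DeepSpeed's `load_checkpoint` that skips loading optimizer states, and with this fix the module weights are restored correctly even when no ZeRO optimizer shards exist on disk (the exact `deepspeed.initialize` plumbing may vary by version):

```python
import torch
import deepspeed

# Placeholder model and DeepSpeed config, for illustration only.
model = torch.nn.Linear(8, 8)
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model_engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# Fine-tuning: load only the module weights. The ZeRO optimizer state files
# may be absent on disk; before this fix, the weights were not properly
# restored from the checkpoint in that case.
load_path, client_state = model_engine.load_checkpoint(
    "/path/to/ckpt",
    load_optimizer_states=False,
    load_lr_scheduler_states=False,
)
```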