Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported. #3810
Comments
@stas00 Can you help to solve this problem? Thanks
You're giving too little information to go on. How did you train it? I assume with ZeRO-3, and you're now trying to load the model using the DeepSpeed checkpoint? Unfortunately, changing the topology after training has started isn't yet supported by DeepSpeed; please see this feature request: #2921. So meanwhile the only thing you can do is extract the fp32 weights; you can read about the extraction in the DeepSpeed docs. You can add a comment in #2921 and request that this be implemented; the more users ask for it, the higher the chances it'll get implemented.
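As an illustration of that extraction path, here is a minimal sketch using DeepSpeed's consolidation helper; the checkpoint and output paths are placeholders, and the exact call may differ across DeepSpeed versions:

```python
# Sketch: consolidate ZeRO-partitioned shards into a single fp32 state dict.
# The checkpoint directory below is a placeholder; point it at the folder that
# contains the saved DeepSpeed checkpoint (e.g. the one holding the global_step* tag).
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "path/to/checkpoint_dir"  # placeholder

# Gathers the per-rank parameter/optimizer shards and returns a consolidated
# fp32 state dict on CPU, independent of the original DP world size.
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)

# Save it as a plain PyTorch checkpoint that can be loaded on any topology.
torch.save(state_dict, "pytorch_model_fp32.bin")
```

DeepSpeed also writes a standalone `zero_to_fp32.py` script into the checkpoint folder, which performs the same consolidation from the command line.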
How to solve this?
raise ZeRORuntimeException("The checkpoint being loaded used a DP " |
@ArtificialZeng @hahchenchen You can now resume training with a different DP size (a different number of nodes) via Universal Checkpointing. You can find more examples in the Megatron-DeepSpeed repo.
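To make that path concrete, here is a rough sketch of resuming with a different DP size via Universal Checkpointing. The converter script path, its flags, and the `load_universal` config key are assumptions based on the DeepSpeed and Megatron-DeepSpeed repos; check those repos for the exact interface:

```python
# Step 1 (run once, offline): convert the ZeRO checkpoint to the universal format.
# Assumed converter location and flags -- verify against the DeepSpeed repo:
#   python deepspeed/checkpoint/ds_to_universal.py \
#       --input_folder  path/to/checkpoint/global_step1000 \
#       --output_folder path/to/checkpoint/global_step1000_universal

# Step 2: point the new job (with a different world size) at the universal
# checkpoint and tell DeepSpeed to load it as such.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "checkpoint": {"load_universal": True},  # assumption: enables universal-checkpoint loading
}

model = torch.nn.Linear(8, 8)  # toy stand-in for the actual LLM

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# With a universal checkpoint, the optimizer state is re-partitioned for the
# current world size instead of raising ZeRORuntimeException.
engine.load_checkpoint("path/to/checkpoint", tag="global_step1000_universal")  # placeholder paths
```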
@tjruwase I believe this issue could be closed :).
Hi,
I have trained an LLM with 4 nodes (8 GPUs per node), but when I load the checkpoint on 16 nodes, I get the following error:

deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 32 but the current world size is 128. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
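For context, here is a minimal sketch of the kind of resume call that hits this exception; the model, config, and paths are placeholders, and the real job is launched across 128 ranks with the deepspeed launcher:

```python
# Sketch: resuming a ZeRO run on a larger cluster.
# The checkpoint was saved with DP world size 32 (4 nodes x 8 GPUs), so the
# optimizer state exists as 32 shards.  Loading the same checkpoint on 128 ranks
# means there is no shard layout matching the new world size, and DeepSpeed
# rejects the load rather than re-partitioning automatically.
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
}

model = torch.nn.Linear(8, 8)  # toy stand-in for the actual LLM

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Saved earlier on 32 ranks; loading here on 128 ranks raises:
#   ZeRORuntimeException: The checkpoint being loaded used a DP world size of 32
#   but the current world size is 128. ...
engine.load_checkpoint("path/to/checkpoint", tag="global_step1000")  # placeholder paths
```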