
Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported. #3810

Closed
hahchenchen opened this issue Jun 26, 2023 · 6 comments
Labels: enhancement (New feature or request)

Comments


hahchenchen commented Jun 26, 2023

Hi,
I have trained an LLM on 4 nodes (8 GPUs per node), but when I load the checkpoint on 16 nodes, I get the following error:

```
deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 32 but the current world size is 128. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
```

hahchenchen added the enhancement label on Jun 26, 2023

hahchenchen (Author) commented

@stas00 Can you help to solve this problem? Thanks.

stas00 (Collaborator) commented Jun 28, 2023

You're giving too little information to go on. How did you train it? I assume with ZeRO-3.

And you're now trying to load the model from the DeepSpeed checkpoint? Unfortunately, changing the topology after training has started isn't yet supported by DeepSpeed - please see feature request #2921.

So meanwhile the only thing you can do is extract the fp32 weights using the zero_to_fp32.py script that you will find in the checkpoint folder, and start a new training run (or inference) from the extracted checkpoint. That means you can't continue using the optimizer states.

You can read about the extraction here:
https://huggingface.co/docs/transformers/main/en/main_classes/deepspeed#getting-the-model-weights-out
Scroll down to the "Offline FP32 Weights Recovery" section.
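
For illustration, here is a minimal sketch of the offline recovery using DeepSpeed's Python helpers (the `checkpoint_dir` path is a placeholder for your own checkpoint folder; the standalone `zero_to_fp32.py` script does the same job from the command line):

```python
# Minimal sketch: consolidate ZeRO-partitioned weights into a single
# fp32 state_dict. Assumes "checkpoint_dir" (a placeholder) points at
# the folder containing the global_step* subfolders. Needs enough CPU
# RAM to hold the full fp32 model.
from deepspeed.utils.zero_to_fp32 import (
    get_fp32_state_dict_from_zero_checkpoint,
    load_state_dict_from_zero_checkpoint,
)

# Option 1: get a consolidated fp32 state_dict you can save or inspect
state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoint_dir")

# Option 2: load the consolidated weights straight into a model instance
# model = load_state_dict_from_zero_checkpoint(model, "checkpoint_dir")
```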

You can add a comment to #2921 requesting that this be implemented - the more users ask for it, the higher the chances it'll get implemented.

ArtificialZeng commented

How can this be solved?

ArtificialZeng commented

```
raise ZeRORuntimeException("The checkpoint being loaded used a DP "
[rank5]: deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 8 but the current world size is 16. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
```

xylian86 (Contributor) commented Oct 9, 2024

@ArtificialZeng @hahchenchen You can now resume training with a different DP size (i.e. a different number of nodes) via Universal Checkpointing.

You can find more examples in the Megatron-DeepSpeed repo.
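
Roughly, the workflow looks like the sketch below. The folder names are placeholders, and the exact flags may differ between DeepSpeed versions - check the Megatron-DeepSpeed examples for your setup:

```python
# Hedged sketch of the Universal Checkpointing workflow. The paths
# "zero_ckpt" and "univ_ckpt" are placeholders, not real defaults.
import subprocess

# Step 1: convert the ZeRO checkpoint into the topology-agnostic
# "universal" format using the converter shipped with DeepSpeed.
subprocess.run(
    [
        "python", "-m", "deepspeed.checkpoint.ds_to_universal",
        "--input_folder", "zero_ckpt/global_step1000",
        "--output_folder", "univ_ckpt/global_step1000",
    ],
    check=True,
)

# Step 2: relaunch training on the new world size with universal
# loading enabled, e.g. by adding to the DeepSpeed config JSON:
#   "checkpoint": { "load_universal": true }
# (Megatron-DeepSpeed exposes this as the --universal-checkpoint flag.)
```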

xylian86 (Contributor) commented Oct 9, 2024

@tjruwase I believe this issue could be closed :).

tjruwase closed this as completed on Oct 9, 2024