
Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported. #3810

Closed
hahchenchen opened this issue Jun 26, 2023 · 6 comments
Labels: enhancement (New feature or request)

Comments


hahchenchen commented Jun 26, 2023

Hi,
I have trained an LLM on 4 nodes (8 GPUs per node), but when I load the checkpoint on 16 nodes, I get the following error:

```
deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 32 but the current world size is 128. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
```

hahchenchen added the enhancement label on Jun 26, 2023

hahchenchen (Author) commented

@stas00 Can you help to solve this problem? Thanks.

stas00 (Collaborator) commented Jun 28, 2023

You're giving too little information to go on. How did you train it? I assume with ZeRO-3.

And you're now trying to load the model from the DeepSpeed checkpoint? Unfortunately, changing the topology after training has started isn't yet supported by DeepSpeed - please see feature request #2921.

So meanwhile the only thing you can do is extract the fp32 weights using the zero_to_fp32.py script that you will find in the checkpoint folder, and start a new training run (or inference) from the extracted checkpoint. That means you can't continue using the optimizer states.

You can read about the extraction here:
https://huggingface.co/docs/transformers/main/en/main_classes/deepspeed#getting-the-model-weights-out
Scroll down to the "Offline FP32 Weights Recovery" section.
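
For illustration, here is a minimal sketch of the offline recovery using DeepSpeed's Python helpers (the `checkpoint_dir` path is a placeholder for your own checkpoint folder; the standalone `zero_to_fp32.py` script does the same job from the command line):

```python
# Minimal sketch: consolidate ZeRO-partitioned weights into a single
# fp32 state_dict. Assumes "checkpoint_dir" (a placeholder) points at
# the folder containing the global_step* subfolders. Needs enough CPU
# RAM to hold the full fp32 model.
from deepspeed.utils.zero_to_fp32 import (
    get_fp32_state_dict_from_zero_checkpoint,
    load_state_dict_from_zero_checkpoint,
)

# Option 1: get a consolidated fp32 state_dict you can save or inspect
state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoint_dir")

# Option 2: load the consolidated weights straight into a model instance
# model = load_state_dict_from_zero_checkpoint(model, "checkpoint_dir")
```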

You can add a comment to #2921 requesting that this be implemented - the more users ask for it, the higher the chances it'll get implemented.

ArtificialZeng commented

How can this be solved?

ArtificialZeng commented

```
raise ZeRORuntimeException("The checkpoint being loaded used a DP "
[rank5]: deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 8 but the current world size is 16. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
```

xylian86 (Contributor) commented Oct 9, 2024

@ArtificialZeng @hahchenchen You can now resume training with a different DP size (i.e. a different number of nodes) via Universal Checkpointing.

You can find more examples in the Megatron-DeepSpeed repo.
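
Roughly, the workflow looks like the sketch below. The folder names are placeholders, and the exact flags may differ between DeepSpeed versions - check the Megatron-DeepSpeed examples for your setup:

```python
# Hedged sketch of the Universal Checkpointing workflow. The paths
# "zero_ckpt" and "univ_ckpt" are placeholders, not real defaults.
import subprocess

# Step 1: convert the ZeRO checkpoint into the topology-agnostic
# "universal" format using the converter shipped with DeepSpeed.
subprocess.run(
    [
        "python", "-m", "deepspeed.checkpoint.ds_to_universal",
        "--input_folder", "zero_ckpt/global_step1000",
        "--output_folder", "univ_ckpt/global_step1000",
    ],
    check=True,
)

# Step 2: relaunch training on the new world size with universal
# loading enabled, e.g. by adding to the DeepSpeed config JSON:
#   "checkpoint": { "load_universal": true }
# (Megatron-DeepSpeed exposes this as the --universal-checkpoint flag.)
```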

xylian86 (Contributor) commented Oct 9, 2024

@tjruwase I believe this issue could be closed :).

tjruwase closed this as completed on Oct 9, 2024