[BUG] Fail to Resume From Checkpoint with Different GPU Number(Huggingface Trainer + Deepspeed) #5405
Comments
@Orion-Zheng, thanks for reporting this issue. It is unclear to me that checkpoint consolidation is needed here, since it seems your checkpoint was saved with 4 GPUs and you are resuming with 4 GPUs. In other words, there is no change in the number of GPUs between saving and resumption. Is that correct?
Thank you for the timely reply! 😃 I am not very familiar with the components of a DeepSpeed checkpoint, but I think rank_0 and rank_1 mean partitions on different nodes 🤔 I guess that although the total number of GPUs is the same (4 GPUs), the checkpoint from 2 nodes * 2 GPUs is different from 1 node * 4 GPUs. Am I correct?
@Orion-Zheng, a DeepSpeed checkpoint is not aware of node-level information. What matters is the parallel dimensions, such as data parallel, pipeline parallel, and tensor parallel. So the checkpoints from 2 nodes * 2 GPUs should be the same as from 1 node * 4 GPUs.
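(For reference, the per-rank shard files in a ZeRO-3 checkpoint are indexed only by data-parallel rank, so the same file set should appear whether those ranks run on one node or two. A minimal sketch, assuming the usual zero_pp_rank_* naming; exact prefixes vary with dtype, e.g. bf16_.)

```python
# Illustrative only: ZeRO-3 shards are named by data-parallel rank, not node,
# so 2 nodes x 2 GPUs and 1 node x 4 GPUs should produce the same file names.
world_size = 4
step_dir = "global_step1000"  # hypothetical step folder name

for dp_rank in range(world_size):
    print(f"{step_dir}/zero_pp_rank_{dp_rank}_mp_rank_00_model_states.pt")
    print(f"{step_dir}/zero_pp_rank_{dp_rank}_mp_rank_00_optim_states.pt")
```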
@Orion-Zheng, also, can you share the log of the run that saved the checkpoint?
@tjruwase Thank you for the information!😃
Deepspeed Config
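(The attached config is not reproduced in this transcript; below is a representative ZeRO-3 setup for the Hugging Face Trainer, written as a Python dict that can be passed via TrainingArguments(deepspeed=...), purely to illustrate the kind of config involved.)

```python
# Illustrative ZeRO-3 config sketch, NOT the author's actual config.
# The HF Trainer accepts either a JSON file path or a dict like this.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}
```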
For the log you mentioned, I am not sure which one you are referring to, because the Hugging Face Trainer seems to only print the loss during training. Maybe I should set a verbosity level somewhere 🤔 I will look for it later.
By the way, I would also like to know whether I can resume from this ZeRO-3 checkpoint with a different number of GPUs, say 3 GPUs. Is that supported?
I find that the 1 node * 4 GPUs checkpoint structure looks like this, which is different from the 2 nodes * 2 GPUs one.
Hello, I tried to use
This confirms that the 2 * 2 run was saving to a local folder, not a global (e.g., NFS) folder. With ZeRO-3, we have distributed checkpoints where each rank saves its own pair of model-states and optimizer-states files. Can you inspect the checkpoint path on both nodes of your 2 * 2 run?
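(A small sketch of one way to do that inspection on each node; the checkpoint path below is an assumption, and it relies on the latest tag file DeepSpeed writes next to the step folders.)

```python
# Run on each node of the 2 x 2 job: list the ZeRO shard files actually present
# under the checkpoint's most recent global_step folder, to check whether all
# four data-parallel ranks saved into the same (shared) path.
from pathlib import Path

ckpt_dir = Path("output/checkpoint-1000")        # hypothetical path, adjust to yours
tag = (ckpt_dir / "latest").read_text().strip()  # e.g. "global_step1000"
shards = sorted(p.name for p in (ckpt_dir / tag).glob("*zero_pp_rank_*"))

print(f"{len(shards)} shard files under {ckpt_dir / tag}:")
for name in shards:
    print(" ", name)
```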
Oh, I understand. Yes, you are right! I just found another directory where the shards from rank 3 and 4 were stored.
@tjruwase Thank you! Now I understand what you mean.
Oh, I think I know the reason. It was my fault. I started distributed training by manually running the same script on two nodes. In my script, the checkpoint path uses the current timestamp as a suffix. Because the two scripts didn't launch at exactly the same time, they saved checkpoints to different directories.
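(A minimal sketch of how that can be avoided when launching the script on each node by hand: derive the suffix from a value every node agrees on, such as a RUN_ID environment variable exported with the same value on both nodes, instead of each node's own wall clock.)

```python
# Sketch of the fix for the mismatch described above. RUN_ID is a hypothetical
# environment variable that must be exported with the same value on all nodes.
import os
import time

run_id = os.environ.get("RUN_ID")
if run_id is None:
    # A per-node timestamp only works if a single launcher starts every node.
    run_id = time.strftime("%Y%m%d-%H%M%S")

output_dir = f"output/tinyllama-{run_id}"
print("All ranks will save checkpoints under:", output_dir)
```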
@Orion-Zheng, I am glad the original mystery is now resolved. For your second question, unfortunately ZeRO-3 is not yet supported. Is it possible for you to use ZeRO-2? I think ZeRO-2 should be adequate for your model size.
Thank you! Yes, the ZeRO-2 code does run successfully. But in my case, when using a large batch size, I can only use ZeRO-3. Probably I have to wait for ZeRO-3 support 😃
@Orion-Zheng, with ZeRO-2 checkpoints there is a single consolidated model-states file. If ZeRO-2 is failing on a large batch size, a more appropriate solution is gradient checkpointing.
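(For reference, a minimal sketch of the ZeRO-2 + gradient checkpointing combination with the Hugging Face Trainer; the paths, batch sizes, and config file name are assumptions.)

```python
# Sketch: ZeRO-2 plus gradient checkpointing via the HF Trainer. All values
# below (paths, batch sizes, config file name) are placeholders.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output/tinyllama-zero2",   # hypothetical output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,         # trade per-step memory for more accumulation
    gradient_checkpointing=True,           # recompute activations instead of storing them
    bf16=True,
    deepspeed="ds_zero2_config.json",      # hypothetical ZeRO-2 config file
)
```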
Oh yes, good idea! I will try ZeRO-2 + gradient checkpointing later :) Thank you for your help!
@tjruwase Hi 😳 sorry to bother you again. I tried ZeRO-2 and got a ZeRO-2 checkpoint, but it seems the Accelerate + DeepSpeed checkpoint structure is a bit different from the Universal Checkpoint examples in Megatron-DeepSpeed, so some errors occurred.
This is my ZeRO-2 checkpoint's structure and this is the Google Drive link.
Although I have no idea how to rectify it, I think for you it would only take one look to see how to make it compatible with the ZeRO-2 format in my case. 😃 Any help would be greatly appreciated!
Yes, I also found this, but I use the Transformers Trainer to call DeepSpeed and get the same files as you. Have you resolved this problem?
Since you are doing data parallel training, those should be duplicates and you need only one. Things will get interesting with model parallel training.
@Orion-Zheng, you are hitting this error because some work is needed to port Universal Checkpointing to Accelerate. Currently, we have only ported it to Megatron-DeepSpeed. However, in this case, can you confirm that you are planning to change the number of GPUs in your training?
@Orion-Zheng, are you still having this issue?
@tjruwase Is this issue solved? I tried to continue training the model after switching from 4 nodes to 2 nodes and encountered the same problem using the Hugging Face Trainer.
@xylian86, can you please help with this?
@tjruwase Yes, for sure.
Yes.
Describe the bug
Hello, I encountered a problem when trying to resume from a previous checkpoint while using the Transformers Trainer + ZeRO-3 strategy to train a TinyLlama.
My previous run was conducted on 2 nodes with 2 A100 40GB GPUs on each node. The structure of the previous checkpoint is shown below.
Now I want to resume from this checkpoint on 1 node with 4 A100 40GB GPUs, and the error below occurred.
I guess it may be related to the different checkpoint format (e.g., the RNG states and model/optim states). Is there any method to consolidate the checkpoint?
Any help would be really appreciated! For students using HPC clusters, it's sometimes hard to always get the same number or type of GPUs. Thanks in advance!
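(One partial workaround, not from this report itself: DeepSpeed bundles utilities that consolidate ZeRO shards into a single fp32 state dict, which can then be loaded with any GPU count. A sketch follows; note it recovers only the model weights, not the optimizer or scheduler state, so it is not a full training resume. The checkpoint path is an assumption.)

```python
# Consolidate a ZeRO checkpoint into a single fp32 state dict with DeepSpeed's
# bundled helper. This recovers model weights only; optimizer/scheduler state
# is lost, so it is not equivalent to a true resume.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "output/checkpoint-1000"  # hypothetical HF Trainer checkpoint folder
state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)  # reads the "latest" tag inside

# The consolidated weights can then be loaded on any number of GPUs, e.g.:
# model.load_state_dict(state_dict)
```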
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
ds_report output
Please run ds_report to give us details about your setup.
Screenshots
System info (please complete the following information):
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
Docker context
Are you using a specific docker image that you can share?
Additional context
Add any other context about the problem here.