deepspeed zero3 is not supported yet #48
Comments
Could you provide a screenshot to help us with this?
Hi @LinB203, thanks for your kind help. I just changed zero2.json to zero3.json in scripts/v1_5/pretrain.sh. The model seems to get stuck in the forward pass, with all GPUs at 100% utilization. When I use zero2.json, everything is fine. There is one log line: "Parameter Offload: Total persistent parameters: 1267712 in 725 params".
When I use only one GPU, the zero3 setting works fine. Do you have any suggestions?
It may be due to an NCCL communication error. I have encountered similar problems on other nodes, and I have recently been refactoring the code to try to solve it.
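To help confirm or rule out an NCCL communication problem, NCCL's own debug logging can be turned on before launching training. This is a minimal sketch; the exact launch script path comes from the earlier comment, and `TORCH_NCCL_ASYNC_ERROR_HANDLING` applies to recent PyTorch versions (older releases used `NCCL_ASYNC_ERROR_HANDLING`):

```shell
# Print NCCL initialization and collective-call logs to stderr,
# so a stuck all-gather/reduce shows up in the output.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL

# Ask PyTorch's NCCL backend to surface errors instead of hanging
# silently on a failed collective (recent PyTorch versions).
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

# Then launch training as usual, e.g.:
# bash scripts/v1_5/pretrain.sh
```

With these set, a rank that blocks inside a collective usually leaves a visible trace in the per-rank logs, which narrows the hang down to a specific communication call.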
Thank you for your help. I tried the suggested methods, but the training still hangs.
I will try to solve it. |
I will organize the code, support LoRA, support Zero3, and release more powerful models. |
Can you briefly share some guesses as to why zero3 is not working? LLaVA-1.5 supports zero3. Looking forward to an early resolution.
From what I can surmise, this is a communication anomaly caused by load imbalance across GPUs. For example, with a batch size of 16, if the batch on GPU 0 is all IMAGE while GPU 1 has both IMAGE and VIDEO samples, the per-GPU load is severely unbalanced at that point. I observed this phenomenon in the following issue: microsoft/DeepSpeed#2223. One workaround is to increase the batch size to 32, in which case the probability of GPU imbalance becomes negligible.
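The batch-size argument can be made concrete with a quick back-of-the-envelope calculation. This is only a sketch: it assumes samples are drawn independently and uses a 50/50 image/video split for illustration, neither of which is stated in the thread.

```python
def p_all_same_modality(batch_size: int, p_image: float = 0.5) -> float:
    """Probability that a single GPU's batch is all-IMAGE or all-VIDEO,
    assuming each sample is independently an image with probability p_image."""
    return p_image ** batch_size + (1 - p_image) ** batch_size

# Doubling the per-GPU batch size squares each per-modality term,
# so the chance of a fully single-modality (maximally unbalanced) batch
# drops from ~3e-5 at batch size 16 to ~5e-10 at batch size 32.
print(f"{p_all_same_modality(16):.2e}")  # 3.05e-05
print(f"{p_all_same_modality(32):.2e}")  # 4.66e-10
```

This is why increasing the batch size makes the pathological all-one-modality case vanishingly rare, even though it does not remove the underlying imbalance sensitivity.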
We reorganized the code and now support LoRA fine-tuning; see finetune_lora.sh. Moreover, we provide zero2_offload.json, which can be used to train on an A100-40G. Unfortunately, we still cannot use zero3, and we suspect that DeepSpeed does not handle load imbalance between GPUs very well. However, this is not a problem specific to Video-LLaVA, so we are closing this issue.
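For reference, a ZeRO stage-2 config with CPU optimizer offload typically looks like the sketch below. This is an illustrative example using standard DeepSpeed config keys, not the repository's actual zero2_offload.json, which may differ:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

Offloading optimizer state to CPU is what makes training fit on a 40 GB A100, at the cost of extra host-device transfer time per step.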
I tried to replace zero2.json with zero3.json in the pretraining stage, but the model hangs and cannot train the way it does with zero2. Is this normal?