
deepspeed zero3 is not supported yet #48

Closed
valencebond opened this issue Dec 15, 2023 · 10 comments

@valencebond

I tried replacing zero2.json with zero3.json in the pretraining stage, but the model hangs and cannot train the way it does with zero2. Is this normal?

@LinB203
Member

LinB203 commented Dec 15, 2023

Could you provide a screenshot to help us with this?

@valencebond
Author

valencebond commented Dec 15, 2023

Hi @LinB203, thanks for your kind help. I just changed zero2.json to zero3.json in scripts/v1_5/pretrain.sh. The model seems stuck in the forward pass, with all GPUs at 100% utilization. When I use zero2.json, everything is fine.

There is one log line: "Parameter Offload: Total persistent parameters: 1267712 in 725 params".

[screenshot]

@valencebond
Author

When I use only one GPU, the zero3 setting works fine. Do you have any suggestions?

@LinB203
Member

LinB203 commented Dec 16, 2023

It may be due to an NCCL communication error.
hiyouga/LLaMA-Factory#1135
hiyouga/LLaMA-Factory#1350

I have encountered similar problems on other nodes, and I have recently been refactoring the code to try to solve it.
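
As a starting point for debugging, the NCCL environment variables below are common knobs for diagnosing or working around multi-GPU hangs (an illustrative sketch, not advice taken from the linked issues); they need to be set before the distributed process group is created:

```python
# Illustrative only: common NCCL debugging / workaround environment variables.
# Set them before torch.distributed initializes (e.g., at the very top of the
# training entry point, or exported in the launch script).
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")     # verbose NCCL logs to see where the hang happens
os.environ.setdefault("NCCL_P2P_DISABLE", "1")  # disable peer-to-peer GPU transport
os.environ.setdefault("NCCL_IB_DISABLE", "1")   # disable InfiniBand transport
```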

@valencebond
Author

Thank you for your help. I tried the suggested methods, but the training still hangs.

@LinB203
Member

LinB203 commented Dec 16, 2023

I will try to solve it.

@LinB203
Member

LinB203 commented Dec 16, 2023

I will organize the code, support LoRA, support Zero3, and release more powerful models.

@valencebond
Author

Can you briefly share your guesses as to why zero3 is not working? LLaVA-1.5 does support zero3. Looking forward to an early resolution.

@LinB203
Member

LinB203 commented Dec 17, 2023

From what I can surmise, this is a communication anomaly caused by load imbalance across GPUs. For example, with a batch size of 16, if the batch on GPU 0 is all IMAGE while GPU 1 has both IMAGE and VIDEO, then the load across the GPUs is severely unbalanced at that point. I observed this phenomenon in the following issues.

microsoft/DeepSpeed#2223
Lightning-AI/pytorch-lightning#13498

One solution is to increase the batch size to 32, in which case the probability of GPU imbalance is negligible.
However, not all GPUs can fit such a large batch size. We are training a new version that compresses the video tokens so that training fits on the A100-40G.
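
To illustrate the workaround, here is a hypothetical sketch (not part of the Video-LLaVA code) of building per-rank batches with a similar image/video mix, so no single GPU gets a much heavier batch than the others:

```python
# Hypothetical sketch: balance image/video samples across ranks per global step.
import random
from collections import defaultdict

def modality_balanced_batches(modalities, batch_size, world_size, seed=0):
    """Yield, per global step, one index list per rank with a similar
    image/video mix. `modalities` is an assumed per-sample list of
    "image" / "video" labels."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for idx, modality in enumerate(modalities):
        buckets[modality].append(idx)
    for indices in buckets.values():
        rng.shuffle(indices)

    # Interleave modalities so consecutive samples alternate as evenly
    # as the data allows.
    interleaved = []
    while any(buckets.values()):
        for modality in list(buckets):
            if buckets[modality]:
                interleaved.append(buckets[modality].pop())

    # Each global step consumes world_size consecutive micro-batches,
    # so every rank sees roughly the same workload.
    step = batch_size * world_size
    for start in range(0, len(interleaved) - step + 1, step):
        chunk = interleaved[start:start + step]
        yield [chunk[r * batch_size:(r + 1) * batch_size] for r in range(world_size)]
```

This is only meant to show the balancing idea; in practice it would have to replace the default distributed batching logic.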

@LinB203
Member

LinB203 commented Jan 16, 2024

We have reorganized the code and added LoRA fine-tuning support; see finetune_lora.sh. Moreover, we provide zero2_offload.json, which can be used to train on the A100-40G. Unfortunately, we still can't use zero3, and we suspect that DeepSpeed doesn't handle load imbalance between GPUs very well. Since this is not a problem specific to Video-LLaVA, we are closing this issue.
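
For reference, a minimal sketch of what a ZeRO-2 config with optimizer CPU offload typically contains (standard DeepSpeed options written here as a Python dict; the actual zero2_offload.json in the repo may differ, and the "auto" values assume the HuggingFace Trainer integration fills them in):

```python
# Illustrative only: typical shape of a ZeRO-2 + optimizer-CPU-offload config.
# The repo's actual zero2_offload.json may differ; "auto" placeholders assume
# the HuggingFace Trainer integration resolves them from TrainingArguments.
zero2_offload_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}
```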
