
deepspeed zero3 is not supported yet #48

Closed
valencebond opened this issue Dec 15, 2023 · 10 comments

@valencebond

I tried replacing zero2.json with zero3.json in the pretraining stage, but the model hangs and cannot train the way it does with zero2. Is this normal?

@LinB203
Member

LinB203 commented Dec 15, 2023

Could you provide a screenshot to help us with this?

@valencebond
Author

valencebond commented Dec 15, 2023

Hi @LinB203, thanks for your kind help. I just changed zero2.json to zero3.json in scripts/v1_5/pretrain.sh. The model seems stuck in the forward pass, with all GPUs at 100% utilization. When I use zero2.json, everything is fine.

There is one log line: "Parameter Offload: Total persistent parameters: 1267712 in 725 params".

[screenshot]

@valencebond
Author

When I use only one GPU, the zero3 setting works fine. Do you have any suggestions?

@LinB203
Member

LinB203 commented Dec 16, 2023

It may be due to an NCCL communication error.
hiyouga/LLaMA-Factory#1135
hiyouga/LLaMA-Factory#1350

I have encountered similar problems on other nodes, and I have recently been refactoring the code to try to solve it.
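
As a starting point for debugging, the NCCL environment variables below are common knobs for diagnosing or working around multi-GPU hangs (an illustrative sketch, not advice taken from the linked issues); they need to be set before the distributed process group is created:

```python
# Illustrative only: common NCCL debugging / workaround environment variables.
# Set them before torch.distributed initializes (e.g., at the very top of the
# training entry point, or exported in the launch script).
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")     # verbose NCCL logs to see where the hang happens
os.environ.setdefault("NCCL_P2P_DISABLE", "1")  # disable peer-to-peer GPU transport
os.environ.setdefault("NCCL_IB_DISABLE", "1")   # disable InfiniBand transport
```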

@valencebond
Author

Thank you for your help. I tried the suggested methods, but the training still hangs.

@LinB203
Member

LinB203 commented Dec 16, 2023

I will try to solve it.

@LinB203
Member

LinB203 commented Dec 16, 2023

I will organize the code, support LoRA, support Zero3, and release more powerful models.

@valencebond
Author

Can you briefly share your guesses as to why zero3 is not working? LLaVA-1.5 does support zero3. Looking forward to an early resolution.

@LinB203
Member

LinB203 commented Dec 17, 2023

From what I can surmise, this is a communication anomaly caused by load imbalance across GPUs. For example, with a batch size of 16, if the batch on GPU 0 is all IMAGE while GPU 1 has both IMAGE and VIDEO, then the load across the GPUs is severely unbalanced at that point. I observed this phenomenon in the following issues.

microsoft/DeepSpeed#2223
Lightning-AI/pytorch-lightning#13498

One solution is to increase the batch size to 32, in which case the probability of GPU imbalance is negligible.
However, not all GPUs can fit such a large batch size. We are training a new version that compresses the video tokens so that training fits on the A100-40G.
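
To illustrate the workaround, here is a hypothetical sketch (not part of the Video-LLaVA code) of building per-rank batches with a similar image/video mix, so no single GPU gets a much heavier batch than the others:

```python
# Hypothetical sketch: balance image/video samples across ranks per global step.
import random
from collections import defaultdict

def modality_balanced_batches(modalities, batch_size, world_size, seed=0):
    """Yield, per global step, one index list per rank with a similar
    image/video mix. `modalities` is an assumed per-sample list of
    "image" / "video" labels."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for idx, modality in enumerate(modalities):
        buckets[modality].append(idx)
    for indices in buckets.values():
        rng.shuffle(indices)

    # Interleave modalities so consecutive samples alternate as evenly
    # as the data allows.
    interleaved = []
    while any(buckets.values()):
        for modality in list(buckets):
            if buckets[modality]:
                interleaved.append(buckets[modality].pop())

    # Each global step consumes world_size consecutive micro-batches,
    # so every rank sees roughly the same workload.
    step = batch_size * world_size
    for start in range(0, len(interleaved) - step + 1, step):
        chunk = interleaved[start:start + step]
        yield [chunk[r * batch_size:(r + 1) * batch_size] for r in range(world_size)]
```

This is only meant to show the balancing idea; in practice it would have to replace the default distributed batching logic.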

@LinB203
Member

LinB203 commented Jan 16, 2024

We have reorganized the code and added LoRA fine-tuning support; see finetune_lora.sh. Moreover, we provide zero2_offload.json, which can be used to train on the A100-40G. Unfortunately, we still can't use zero3, and we suspect that DeepSpeed doesn't handle load imbalance between GPUs very well. Since this is not a problem specific to Video-LLaVA, we are closing this issue.
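
For reference, a minimal sketch of what a ZeRO-2 config with optimizer CPU offload typically contains (standard DeepSpeed options written here as a Python dict; the actual zero2_offload.json in the repo may differ, and the "auto" values assume the HuggingFace Trainer integration fills them in):

```python
# Illustrative only: typical shape of a ZeRO-2 + optimizer-CPU-offload config.
# The repo's actual zero2_offload.json may differ; "auto" placeholders assume
# the HuggingFace Trainer integration resolves them from TrainingArguments.
zero2_offload_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}
```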
