4 x 4090 can not finetune 30B model #332

Closed
zzlgreat opened this issue Apr 13, 2023 · 10 comments

@zzlgreat

I have seen #8 (comment), and I use the newest NVIDIA driver (525.105.17), so this should not be an environment problem. I run it with the command

WORLD_SIZE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=1234 finetune.py --base_model /datas/alpaca_lora_4bit/text-generation-webui/models/llama-30b-hf --data_path /datas/GPT-4-LLM/data --output_dir ./lora-30B

and I found that VRAM usage on the different GPUs increases at the same rate and runs out when the checkpoint is about 70% loaded, so of course it ends with a CUDA out of memory error. Has anyone else met the same problem?

@Bazovsky

Bazovsky commented Apr 14, 2023

By running it with torchrun you will end up with Distributed Data Parallelism (DDP). From PyTorch docs: The model is replicated on all the devices; each replica calculates gradients and simultaneously synchronizes with the others using the ring all-reduce algorithm.

In this mode you need to fit the whole model in the memory of each GPU - that's not possible with 30b on a 4090 with 24GB of VRAM. The max is 13b in this mode.

The solution to this is to train the model with Single-Machine Model Parallel. Again, from PyTorch docs: model parallel, which, in contrast to DataParallel, splits a single model onto different GPUs, rather than replicating the entire model on each GPU (to be concrete, say a model m contains 10 layers: when using DataParallel, each GPU will have a replica of each of these 10 layers, whereas when using model parallel on two GPUs, each GPU could host 5 layers).

To achieve this with finetune.py, you need to make sure that WORLD_SIZE is set to 1 (best via export WORLD_SIZE=1), and then just run the Python script like this:

WORLD_SIZE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python finetune.py --base_model /datas/alpaca_lora_4bit/text-generation-webui/models/llama-30b-hf --data_path /datas/GPT-4-LLM/data --output_dir ./lora-30B

Let me know if this helps; this approach solved it for me ;) I am now finetuning 30b on 2x 3090 24GB VRAM (and was able to finetune 65b on 8x A6000 48GB VRAM).
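For context, here is a minimal sketch of the pattern a finetune.py of this kind typically follows to choose between the two modes (the variable names and exact logic here are assumptions, not the script's actual code): a real torchrun world size pins one full replica per rank, while a single process lets accelerate's "auto" device map spread the layers across all visible GPUs.

import os
import torch
from transformers import LlamaForCausalLM

# Sketch (assumed pattern): WORLD_SIZE > 1 means torchrun/DDP, so each rank
# loads the whole model onto its own GPU; WORLD_SIZE == 1 means a single
# process, so accelerate spreads the layers across all visible GPUs.
world_size = int(os.environ.get("WORLD_SIZE", 1))
if world_size != 1:
    # torchrun sets LOCAL_RANK for each worker process.
    device_map = {"": int(os.environ.get("LOCAL_RANK", 0))}
else:
    device_map = "auto"

model = LlamaForCausalLM.from_pretrained(
    "/datas/alpaca_lora_4bit/text-generation-webui/models/llama-30b-hf",
    torch_dtype=torch.float16,
    device_map=device_map,
)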

@zzlgreat
Author

It's so kind of you! Solved!

@Bazovsky

I am also thinking of building a 4x 4090 machine for further research. Would you be so kind as to share your build specs - how did you fit them in one case?

@zzlgreat
Author

I use an S8030 motherboard and converted each 4090 to a water block (water cooling).

@wac81

wac81 commented Apr 23, 2023

The 4090 has an issue with NCCL. Did you solve it yet?

@blucz

blucz commented Apr 23, 2023

If your 4090 issue is a hang at startup, try setting export NCCL_P2P_DISABLE=1.
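As a small sketch of the same fix applied from inside the script (an assumption about placement, not blucz's exact setup): NCCL only reads the variable when the process group is initialized, so it has to be set before that happens, e.g. at the very top of finetune.py.

import os

# Must run before torch.distributed initializes NCCL; NCCL reads
# NCCL_P2P_DISABLE when the communicator is created, so setting it
# any later has no effect.
os.environ["NCCL_P2P_DISABLE"] = "1"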

@zzlgreat
Author

zzlgreat commented Apr 23, 2023 via email

@kongbohu

By running it with torchrun you will end up with Distributed Data Parallelism (DDP). [...] To achieve this with finetune.py, you need to make sure that WORLD_SIZE is set to 1 [...]

Yes, DDP or MP - we can only have one.
By the way, for MP, I think the key is here (I have one 3090 24GB and one 3060 12GB, which split the 13b fp16 model without using CPU memory):

import torch
from transformers import LlamaForCausalLM

# base_model points at the 13b fp16 checkpoint directory.
# "sequential" fills GPU 0 (capped at 18GiB) first, then GPU 1 (8GiB).
model = LlamaForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="sequential",
    max_memory={0: "18GiB", 1: "8GiB"},
)
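
As a quick check (a sketch, assuming the model was loaded with a device_map as above), transformers records the placement in model.hf_device_map, which can be printed to verify that nothing spilled onto the CPU:

# Each entry maps a module name to the device it was placed on.
for module_name, device in model.hf_device_map.items():
    print(f"{module_name} -> {device}")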

@AegeanYan

@zzlgreat I wonder whether you trained it in 4-bit, and how did you set that up? Thanks.

@AegeanYan

@Bazovsky Hi, I'm a beginner in the LLM field and I wonder why we set WORLD_SIZE = 1 here. Why is WORLD_SIZE not 4 in this issue's circumstances? I would appreciate it if you could answer my question. Thanks.
