4 x 4090 cannot finetune 30B model #332
Comments
By running it with torchrun you end up with Distributed Data Parallelism (DDP). From the PyTorch docs: the model is replicated on all the devices; each replica calculates gradients and simultaneously synchronizes with the others using the ring all-reduce algorithm. In this mode the full model has to fit in the memory of each GPU, which is not possible with 30B on a 4090 with 24 GB of VRAM; 13B is the maximum in this mode.

The solution is to train the model with single-machine model parallelism. Again, from the PyTorch docs: model parallel, in contrast to DataParallel, splits a single model onto different GPUs rather than replicating the entire model on each GPU (to be concrete, say a model m contains 10 layers: when using DataParallel, each GPU will have a replica of each of these 10 layers, whereas when using model parallel on two GPUs, each GPU could host 5 layers). To achieve this with finetune.py, you need to make sure that WORLD_SIZE is set to 1 (best via `export WORLD_SIZE=1`).

Let me know if this helps; this approach solved it for me ;) I am now finetuning 30B on 2x 3090 24GB VRAM (and I was able to finetune 65B on 8x A6000 48GB VRAM). |
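A minimal launch sketch of the two modes, assuming finetune.py follows the usual alpaca-lora pattern of sharding layers across the visible GPUs (`device_map="auto"`) when it sees WORLD_SIZE=1; the paths and arguments below are placeholders. Note that `torchrun --nproc_per_node=4` exports WORLD_SIZE=4 to every worker it spawns, overriding any value set on the command line, which is why the script has to be launched as a single process for model parallelism:

```bash
# DDP launch: torchrun spawns 4 workers and sets WORLD_SIZE=4 for each one,
# so every GPU must hold a full copy of the model -- 30B will not fit in 24 GB.
# torchrun --nproc_per_node=4 finetune.py --base_model <path-to-llama-30b-hf> ...

# Model-parallel launch: one process, WORLD_SIZE=1, layers spread across the visible GPUs.
export WORLD_SIZE=1
CUDA_VISIBLE_DEVICES=0,1,2,3 python finetune.py \
    --base_model <path-to-llama-30b-hf> \
    --data_path <path-to-training-data> \
    --output_dir ./lora-30b
```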
That's so kind of you! Solved! |
I am also thinking of building a 4x 4090 machine for further research. Would you be so kind as to share your build specs? How did you fit it all in one case? |
I use an S8030 motherboard and fitted each 4090 with a water block. |
The 4090 has an issue with NCCL. Did you solve it yet? |
If your 4090 issue is a hang at startup, try setting `export NCCL_P2P_DISABLE=1`. |
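A minimal sketch of that workaround applied to a multi-GPU launch (arguments are placeholders); exporting the variable before the launch is enough, since the spawned workers inherit the environment:

```bash
# Disable NCCL peer-to-peer transfers, a commonly reported workaround for
# multi-GPU hangs on RTX 4090 cards, then launch as usual.
export NCCL_P2P_DISABLE=1
torchrun --nproc_per_node=4 --master_port=1234 finetune.py --base_model <path-to-model> ...
```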
The driver published on March 30 (version 520.105.xx) fixed it.
|
Yes, it's either DDP or model parallelism; we can only use one of the two. |
@zzlgreat I wonder whether you trained it in 4-bit, and if so, how did you set that up? Thanks. |
@Bazovsky Hi, I'm a beginner in the LLM field and I wonder why we set WORLD_SIZE=1 here. Why isn't WORLD_SIZE 4 in this issue's circumstances? I would appreciate it if you could answer my question. Thanks. |
I have seen #8 (comment), and I use the newest NVIDIA driver, 525.105.17, so the environment should not be the problem. I run it with the command

`WORLD_SIZE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=1234 finetune.py --base_model /datas/alpaca_lora_4bit/text-generation-webui/models/llama-30b-hf --data_path /datas/GPT-4-LLM/data --output_dir ./lora-30B`

and I found that VRAM usage on the different GPUs increases at the same rate and runs out when checkpoint loading is about 70% done, so of course it ends with a CUDA out-of-memory error. Does anyone else see the same problem?
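This launch still runs as DDP: `torchrun --nproc_per_node=4` exports WORLD_SIZE=4 (along with RANK and LOCAL_RANK) to every worker it spawns, overriding the WORLD_SIZE=1 set inline, so each GPU tries to load a full copy of the 30B model. A small sketch to verify this; the helper script name is made up for illustration:

```bash
# Throwaway helper that prints what each spawned worker actually sees.
cat > show_world_size.py <<'EOF'
import os
print("rank", os.environ.get("RANK"), "sees WORLD_SIZE =", os.environ.get("WORLD_SIZE"))
EOF

# The inline WORLD_SIZE=1 is overridden by torchrun for every worker.
WORLD_SIZE=1 torchrun --nproc_per_node=4 --master_port=1234 show_world_size.py
# expected output: four lines, each reporting WORLD_SIZE = 4
```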