4 x 4090 can not finetune 30B model #332

Closed
zzlgreat opened this issue Apr 13, 2023 · 10 comments

@zzlgreat

I have seen #8 (comment), and I use the newest NVIDIA driver (525.105.17), so this should not be an environment problem. I run it with the command

WORLD_SIZE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=1234 finetune.py --base_model /datas/alpaca_lora_4bit/text-generation-webui/models/llama-30b-hf --data_path /datas/GPT-4-LLM/data --output_dir ./lora-30B

and I found that VRAM usage on the different GPUs increases at the same rate and runs out when the checkpoint is about 70% loaded, so of course it ends with a CUDA out of memory error. Has anyone else met the same problem?

@Bazovsky

Bazovsky commented Apr 14, 2023

By running it with torchrun you will end up with Distributed Data Parallelism (DDP). From PyTorch docs: The model is replicated on all the devices; each replica calculates gradients and simultaneously synchronizes with the others using the ring all-reduce algorithm.

In this mode you need to fit the whole model in the memory of each GPU - that's not possible with 30b on a 4090 with 24GB of VRAM. The max is 13b in this mode.

The solution to this is to train the model with Single-Machine Model Parallel. Again, from PyTorch docs: model parallel, which, in contrast to DataParallel, splits a single model onto different GPUs, rather than replicating the entire model on each GPU (to be concrete, say a model m contains 10 layers: when using DataParallel, each GPU will have a replica of each of these 10 layers, whereas when using model parallel on two GPUs, each GPU could host 5 layers).

To achieve this with finetune.py, you need to make sure that WORLD_SIZE is set to 1 (best via export WORLD_SIZE=1), and then just run the Python script like this:

WORLD_SIZE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python finetune.py --base_model /datas/alpaca_lora_4bit/text-generation-webui/models/llama-30b-hf --data_path /datas/GPT-4-LLM/data --output_dir ./lora-30B

Let me know if this helps; this approach solved it for me ;) I am now finetuning 30b on 2x 3090 24GB VRAM (and was able to finetune 65b on 8x A6000 48GB VRAM).
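For context, here is a minimal sketch of the pattern a finetune.py of this kind typically follows to choose between the two modes (the variable names and exact logic here are assumptions, not the script's actual code): a real torchrun world size pins one full replica per rank, while a single process lets accelerate's "auto" device map spread the layers across all visible GPUs.

import os
import torch
from transformers import LlamaForCausalLM

# Sketch (assumed pattern): WORLD_SIZE > 1 means torchrun/DDP, so each rank
# loads the whole model onto its own GPU; WORLD_SIZE == 1 means a single
# process, so accelerate spreads the layers across all visible GPUs.
world_size = int(os.environ.get("WORLD_SIZE", 1))
if world_size != 1:
    # torchrun sets LOCAL_RANK for each worker process.
    device_map = {"": int(os.environ.get("LOCAL_RANK", 0))}
else:
    device_map = "auto"

model = LlamaForCausalLM.from_pretrained(
    "/datas/alpaca_lora_4bit/text-generation-webui/models/llama-30b-hf",
    torch_dtype=torch.float16,
    device_map=device_map,
)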

@zzlgreat
Author

It's so kind of you! Solved!

@Bazovsky

I am also thinking of building a 4x 4090 machine for further research. Would you be so kind as to share your build specs - how did you fit them in one case?

@zzlgreat
Author

I use an S8030 motherboard and converted each 4090 to a water block (water cooling).

@wac81

wac81 commented Apr 23, 2023

The 4090 has an issue with NCCL. Did you solve it yet?

@blucz

blucz commented Apr 23, 2023

If your 4090 issue is a hang at startup, try setting export NCCL_P2P_DISABLE=1.
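As a small sketch of the same fix applied from inside the script (an assumption about placement, not blucz's exact setup): NCCL only reads the variable when the process group is initialized, so it has to be set before that happens, e.g. at the very top of finetune.py.

import os

# Must run before torch.distributed initializes NCCL; NCCL reads
# NCCL_P2P_DISABLE when the communicator is created, so setting it
# any later has no effect.
os.environ["NCCL_P2P_DISABLE"] = "1"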

@zzlgreat
Author

zzlgreat commented Apr 23, 2023 via email

@kongbohu

By running it with torchrun you will end up with Distributed Data Parallelism (DDP). [...] To achieve this with finetune.py, you need to make sure that WORLD_SIZE is set to 1 [...]

Yes, DDP or MP - we can only have one.
By the way, for MP, I think the key is here (I have one 3090 24GB and one 3060 12GB, which split the 13b fp16 model without using CPU memory):

import torch
from transformers import LlamaForCausalLM

# base_model points at the 13b fp16 checkpoint directory.
# "sequential" fills GPU 0 (capped at 18GiB) first, then GPU 1 (8GiB).
model = LlamaForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="sequential",
    max_memory={0: "18GiB", 1: "8GiB"},
)
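
As a quick check (a sketch, assuming the model was loaded with a device_map as above), transformers records the placement in model.hf_device_map, which can be printed to verify that nothing spilled onto the CPU:

# Each entry maps a module name to the device it was placed on.
for module_name, device in model.hf_device_map.items():
    print(f"{module_name} -> {device}")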

@AegeanYan

@zzlgreat I wonder whether you trained it in 4-bit, and how did you set that up? Thanks.

@AegeanYan

@Bazovsky Hi, I'm a beginner in the LLM field and I wonder why we set WORLD_SIZE = 1 here. Why is WORLD_SIZE not 4 in this issue's circumstances? I would appreciate it if you could answer my question. Thanks.
