
DPO: Multi-GPU training does not start, but works on single GPU #1011

Closed
AndersGiovanni opened this issue Nov 20, 2023 · 5 comments
Labels
🏋 DPO Related to DPO


@AndersGiovanni

I have a DPO training script very similar to stack_llama_2/scripts/dpo_llama2.py. It works perfectly when I run on a single A100 GPU. However, when I use 2 A100s, the script gets stuck at the 0th training iteration and never progresses:

  0%|          | 0/11264 [00:00<?, ?it/s]

My model and reference model are defined like this:

model = AutoPeftModelForCausalLM.from_pretrained(
        script_args.model_name_or_path,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float32,
        load_in_4bit=True,
        is_trainable=True,
        device_map={"": Accelerator().local_process_index},
    )

model_ref = AutoPeftModelForCausalLM.from_pretrained(
        script_args.model_name_or_path,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float32,
        load_in_4bit=True,
        device_map={"": Accelerator().local_process_index},
    )

My accelerate config is like this:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
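For reference, with the config above saved to a file, the script would be launched along these lines (the config filename `multi_gpu.yaml` and script name `dpo_training.py` are placeholders, not taken from the report):

```shell
# Launch the DPO script on both GPUs using the accelerate config above.
# "multi_gpu.yaml" and "dpo_training.py" are placeholder names.
accelerate launch --config_file multi_gpu.yaml dpo_training.py
```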

The versions of trl, peft, transformers, and accelerate are:

transformers==4.34.0
peft==0.5.0
trl==0.7.2
accelerate==0.22.0

My issue is a mix of #151, #226 and #958.

Anyone who could help me out here?

Thanks!

@younesbelkada
Contributor

cc @kashif

@kashif
Collaborator

kashif commented Nov 20, 2023

Thanks @AGMoller, we have a multi-GPU setup we are testing in PR #885; I'll test there and update you shortly.

@AndersGiovanni
Author

Thanks a lot @younesbelkada and @kashif 🙌🏼


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@AndersGiovanni
Author

I guess with #885 merged this issue can be closed. Thanks @kashif @lvwerra @younesbelkada @lewtun 😊🙌🏼
