
DPO: Multi-GPU training does not start, but works on single GPU #1011

Closed
AndersGiovanni opened this issue Nov 20, 2023 · 5 comments
Labels
🏋 DPO Related to DPO


@AndersGiovanni

I have a DPO training script very similar to stack_llama_2/scripts/dpo_llama2.py. It works perfectly when I run on a single A100 GPU. However, when I use 2 A100s, the script gets stuck at the 0th training iteration and never progresses:

  0%|          | 0/11264 [00:00<?, ?it/s]

My model and reference model are defined like this:

model = AutoPeftModelForCausalLM.from_pretrained(
        script_args.model_name_or_path,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float32,
        load_in_4bit=True,
        is_trainable=True,
        device_map={"": Accelerator().local_process_index},
    )

model_ref = AutoPeftModelForCausalLM.from_pretrained(
        script_args.model_name_or_path,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float32,
        load_in_4bit=True,
        device_map={"": Accelerator().local_process_index},
    )

My accelerate config is like this:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
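For reference, with the config above saved to a file, the script would be launched along these lines (the config filename `multi_gpu.yaml` and script name `dpo_training.py` are placeholders, not taken from the report):

```shell
# Launch the DPO script on both GPUs using the accelerate config above.
# "multi_gpu.yaml" and "dpo_training.py" are placeholder names.
accelerate launch --config_file multi_gpu.yaml dpo_training.py
```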

The versions of trl, peft, transformers, and accelerate are:

transformers==4.34.0
peft==0.5.0
trl==0.7.2
accelerate==0.22.0

My issue is a mix of #151, #226 and #958.

Anyone who could help me out here?

Thanks!

@younesbelkada
Contributor

cc @kashif

@kashif
Collaborator

kashif commented Nov 20, 2023

Thanks @AGMoller, we have a multi-GPU setup we are testing in PR #885; I'll test there and update you shortly.

@AndersGiovanni
Author

Thanks a lot @younesbelkada and @kashif 🙌🏼


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@AndersGiovanni
Author

I guess with #885 merged this issue can be closed. Thanks @kashif @lvwerra @younesbelkada @lewtun 😊🙌🏼
