
accelerate multi gpu with gradient_checkpointing throws an error #972

Closed
benihime91 opened this issue Sep 14, 2024 · 6 comments

Comments

@benihime91
Contributor

benihime91 commented Sep 14, 2024

In the trainer class, resume_and_prepare currently calls init_prepare_models(lr_scheduler=lr_scheduler) before init_post_load_freeze(). This causes an issue with DDP: init_prepare_models calls accelerator.prepare, which wraps the unet/transformer, so init_post_load_freeze ends up calling enable_gradient_checkpointing on the wrapped class, which throws an error.

A very quick fix is to call init_post_load_freeze first and then call init_prepare_models, which is what I am currently doing to run multi-GPU trainings with accelerate. The sketch below shows the failure and the reordering.
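
Roughly, the problem and the reordering look like this. This is only an illustrative sketch for a multi-GPU accelerate run: TinyModel and its enable_gradient_checkpointing method are stand-ins for the unet/transformer, not the trainer's actual code.

```python
import torch
from accelerate import Accelerator


class TinyModel(torch.nn.Module):
    """Stand-in for the unet/transformer with a diffusers-style method."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def enable_gradient_checkpointing(self):
        # Stand-in for the real enable_gradient_checkpointing on the model.
        self.gradient_checkpointing = True

    def forward(self, x):
        return self.linear(x)


accelerator = Accelerator()

# Current order (init_prepare_models before init_post_load_freeze):
# on multi-GPU, accelerator.prepare wraps the model in DistributedDataParallel,
# and the DDP wrapper does not expose enable_gradient_checkpointing.
model = accelerator.prepare(TinyModel())
# model.enable_gradient_checkpointing()  # -> AttributeError under DDP

# Proposed order (init_post_load_freeze before init_prepare_models):
# enable gradient checkpointing on the bare model, then hand it to prepare().
model = TinyModel()
model.enable_gradient_checkpointing()
model = accelerator.prepare(model)
```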

@benihime91
Contributor Author

I can create a quick PR for the above, but I haven't tested it extensively. The proposed fix works for FLUX dev LoRA training on 8 GPUs.

@benihime91 benihime91 changed the title from DDP with gradient_checkpointing throws an error to accelerate multi gpu with gradient_checkpointing throws an error Sep 14, 2024
@bghira
Owner

bghira commented Sep 14, 2024

Are you talking about #686?

@benihime91
Contributor Author

Umm, no. It straight away throws an AttributeError, most likely because the model class is getting wrapped in DistributedDataParallel.

@bghira
Owner

bghira commented Sep 14, 2024

I think you can just put unwrap_model around it then, something like the sketch below.
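
For reference, a minimal sketch of that workaround, again using an illustrative stand-in model with an enable_gradient_checkpointing method rather than the trainer's real attributes:

```python
import torch
from accelerate import Accelerator


class TinyModel(torch.nn.Module):
    """Illustrative stand-in for the unet/transformer."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def enable_gradient_checkpointing(self):
        self.gradient_checkpointing = True


accelerator = Accelerator()
prepared = accelerator.prepare(TinyModel())  # DDP-wrapped when running on multiple GPUs

# unwrap_model strips the DistributedDataParallel wrapper so the
# underlying model's own methods are reachable again.
accelerator.unwrap_model(prepared).enable_gradient_checkpointing()
```

This keeps the existing call order and just reaches through the wrapper, rather than reordering the trainer's init steps.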

@benihime91
Contributor Author

Yeah, that works as well. It's a minor issue, not much pain; I just thought I would put it out here in case anyone else is facing the same issue.

@bghira
Owner

bghira commented Sep 14, 2024

Currently most multi-GPU training is done using quantisation, with PEFT being out of the options as a result of bug #686.

So I guess you might be the first one using PEFT with 8 GPUs; I've been using LyCORIS with 10x 3090s.

@bghira bghira closed this as completed in 4a633a3 Sep 14, 2024