
accelerate multi gpu with gradient_checkpointing throws an error #972

Closed
benihime91 opened this issue Sep 14, 2024 · 6 comments

Comments

@benihime91
Contributor

benihime91 commented Sep 14, 2024

In the trainer class, resume_and_prepare currently calls init_prepare_models(lr_scheduler=lr_scheduler) before init_post_load_freeze(). This causes an issue with DDP: init_prepare_models calls accelerator.prepare, which wraps the unet/transformer, so init_post_load_freeze ends up calling enable_gradient_checkpointing on the wrapped class, which throws an error.

A very quick fix is to call init_post_load_freeze first and then call init_prepare_models, which is what I am currently doing to run multi-GPU trainings with accelerate. The sketch below shows the failure and the reordering.
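
Roughly, the problem and the reordering look like this. This is only an illustrative sketch for a multi-GPU accelerate run: TinyModel and its enable_gradient_checkpointing method are stand-ins for the unet/transformer, not the trainer's actual code.

```python
import torch
from accelerate import Accelerator


class TinyModel(torch.nn.Module):
    """Stand-in for the unet/transformer with a diffusers-style method."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def enable_gradient_checkpointing(self):
        # Stand-in for the real enable_gradient_checkpointing on the model.
        self.gradient_checkpointing = True

    def forward(self, x):
        return self.linear(x)


accelerator = Accelerator()

# Current order (init_prepare_models before init_post_load_freeze):
# on multi-GPU, accelerator.prepare wraps the model in DistributedDataParallel,
# and the DDP wrapper does not expose enable_gradient_checkpointing.
model = accelerator.prepare(TinyModel())
# model.enable_gradient_checkpointing()  # -> AttributeError under DDP

# Proposed order (init_post_load_freeze before init_prepare_models):
# enable gradient checkpointing on the bare model, then hand it to prepare().
model = TinyModel()
model.enable_gradient_checkpointing()
model = accelerator.prepare(model)
```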

@benihime91
Contributor Author

I can create a quick PR for the above, but I haven't tested it extensively. The proposed fix works for FLUX dev LoRA training on 8 GPUs.

@benihime91 benihime91 changed the title from DDP with gradient_checkpointing throws an error to accelerate multi gpu with gradient_checkpointing throws an error Sep 14, 2024
@bghira
Owner

bghira commented Sep 14, 2024

Are you talking about #686?

@benihime91
Contributor Author

Umm, no. It straight away throws an AttributeError, most likely because the model class is getting wrapped in DistributedDataParallel.

@bghira
Owner

bghira commented Sep 14, 2024

I think you can just put unwrap_model around it then, something like the sketch below.
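
For reference, a minimal sketch of that workaround, again using an illustrative stand-in model with an enable_gradient_checkpointing method rather than the trainer's real attributes:

```python
import torch
from accelerate import Accelerator


class TinyModel(torch.nn.Module):
    """Illustrative stand-in for the unet/transformer."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def enable_gradient_checkpointing(self):
        self.gradient_checkpointing = True


accelerator = Accelerator()
prepared = accelerator.prepare(TinyModel())  # DDP-wrapped when running on multiple GPUs

# unwrap_model strips the DistributedDataParallel wrapper so the
# underlying model's own methods are reachable again.
accelerator.unwrap_model(prepared).enable_gradient_checkpointing()
```

This keeps the existing call order and just reaches through the wrapper, rather than reordering the trainer's init steps.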

@benihime91
Contributor Author

Yeah, that works as well. It's a minor issue, not much pain; I just thought I would put it out here in case anyone else is facing the same issue.

@bghira
Owner

bghira commented Sep 14, 2024

Currently most multi-GPU training is done using quantisation, with PEFT being out of the options as a result of bug #686.

So I guess you might be the first one using PEFT with 8 GPUs; I've been using LyCORIS with 10x 3090s.

@bghira bghira closed this as completed in 4a633a3 Sep 14, 2024