
Slower training convergence of StableDiffusionControlNetPipeline than original repo #2822

Closed
jingyangcarl opened this issue Mar 25, 2023 · 9 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

Comments

@jingyangcarl

Describe the bug

Awesome work on training ControlNet with diffusers from the tutorial.

I ran the code and compared it with the original training code here.

It turns out that convergence on Fill50k is slower with the diffusers script than with the original training code.

Reproduction

Training on Fill50k with batch size 5.
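
For reference, a minimal command for this setup might look like the sketch below. The flags are taken from the diffusers ControlNet training tutorial command quoted later in this thread; $MODEL_DIR and $OUTPUT_DIR are placeholders, and the exact values used in the original run are not confirmed here.

# Sketch of the batch-size-5 reproduction, assuming the stock train_controlnet.py example script
accelerate launch train_controlnet.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --dataset_name=fusing/fill50k \
  --resolution=512 \
  --learning_rate=1e-5 \
  --train_batch_size=5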

Logs

No response

System Info

  • diffusers version: 0.15.0.dev0
  • Platform: Linux-5.4.0-139-generic-x86_64-with-glibc2.17
  • Python version: 3.8.16
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Huggingface_hub version: 0.13.3
  • Transformers version: 4.27.3
  • Accelerate version: 0.18.0.dev0
  • xFormers version: 0.0.16
jingyangcarl added the bug label on Mar 25, 2023
@yiyixuxu
Collaborator

Hi @jingyangcarl,

Do you have a wandb report you can share?

YiYi

@patrickvonplaten
Contributor

Hmm, this is going to be difficult to compare. Can you give some more details on why/how convergence is slower, @jingyangcarl? Do you have loss curves, reproducible code snippets, etc.?

cc @williamberman

@anotherjesse

If this is helpful, here is a run with batch size 14 (the largest batch size I could fit on a single A100 40GB):

https://wandb.ai/anotherjesse/circles/runs/qz4ler5i?workspace=user-anotherjesse

I used the tutorial command:

accelerate launch train_controlnet.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --dataset_name=fusing/fill50k \
  --resolution=512 \
  --learning_rate=1e-5 \
  --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
  --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
  --train_batch_size=14 --report_to wandb --tracker_project_name circles2

It didn't seem to converge at all.

I'm re-running now with batch size 5 as another data point.

Let me know if there are any other tests I can run to help debug the code / tutorial.

@anotherjesse

The batch size 5 run should be finished in a few hours: https://wandb.ai/anotherjesse/circles2/runs/n7xhe31j/overview?workspace=user-anotherjesse

@yiyixuxu
Collaborator

@anotherjesse
Interesting! Thanks so much for the report, super helpful!

cc @williamberman here: do you think it's the learning rate scheduler that causes the difference? Happy to run some experiments to find out.

YiYi
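
If the scheduler is the suspect, one quick experiment would be to pin it explicitly so it matches the original repo's constant learning rate. A sketch, assuming the example script exposes the usual diffusers --lr_scheduler and --lr_warmup_steps options (not confirmed in this thread):

# Sketch: force a constant learning rate with no warmup
# (--lr_scheduler / --lr_warmup_steps are assumed to exist on this script; verify before running)
accelerate launch train_controlnet.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --dataset_name=fusing/fill50k \
  --resolution=512 \
  --learning_rate=1e-5 \
  --lr_scheduler=constant \
  --lr_warmup_steps=0 \
  --train_batch_size=5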

@jingyangcarl
Author

@yiyixuxu @patrickvonplaten

I'm using the default configuration for the diffusers training script and the default configuration from the original ControlNet tutorial, both on fill50k with the batch size set to 5. The original ControlNet starts to show some results after about 400 steps, whereas the diffusers version needs more than 3k steps.

I'm testing the two pipelines on the same devices. Hope this helps.

Jing

@anotherjesse

Anything I can do to help figure this out?

Test with older commits? Test with HEAD? Test the original code, similar to @jingyangcarl?

@williamberman
Contributor

@anotherjesse, at least by eyeballing the outputs, your training run didn't get results as good as quickly as some of mine did.

Here is one of my training runs that was successful: https://wandb.ai/williamberman/controlnet-model-3-11-mixed-precision/runs/b2mfgr68?workspace=user-williamberman

It might be worth trying a higher learning rate.

I agree that this would be pretty hard to look into and diagnose. I think we might have to sit tight and see if we get more reports of similar poor performance.
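
For reference, the higher-learning-rate experiment suggested above could look like the sketch below; the 1e-4 value is only an illustrative assumption, not a setting validated in this thread:

# Sketch: same tutorial command, with only the learning rate raised from 1e-5 to 1e-4 (illustrative value)
accelerate launch train_controlnet.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --dataset_name=fusing/fill50k \
  --resolution=512 \
  --learning_rate=1e-4 \
  --train_batch_size=5 \
  --report_to wandb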

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label on May 21, 2023