
Slower training convergence of StableDiffusionControlNetPipeline than original repo #2822

Closed
jingyangcarl opened this issue Mar 25, 2023 · 9 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

Comments

@jingyangcarl

Describe the bug

Awesome work on training ControlNet with diffusers from the tutorial.

I ran the code and compared it with the original training code here.

It turns out that convergence on Fill50k is slower with the diffusers script than with the original training code.

Reproduction

Training on Fill50k with batch size 5.
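
For reference, a minimal command for this setup might look like the sketch below. The flags are taken from the diffusers ControlNet training tutorial command quoted later in this thread; $MODEL_DIR and $OUTPUT_DIR are placeholders, and the exact values used in the original run are not confirmed here.

# Sketch of the batch-size-5 reproduction, assuming the stock train_controlnet.py example script
accelerate launch train_controlnet.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --dataset_name=fusing/fill50k \
  --resolution=512 \
  --learning_rate=1e-5 \
  --train_batch_size=5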

Logs

No response

System Info

  • diffusers version: 0.15.0.dev0
  • Platform: Linux-5.4.0-139-generic-x86_64-with-glibc2.17
  • Python version: 3.8.16
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Huggingface_hub version: 0.13.3
  • Transformers version: 4.27.3
  • Accelerate version: 0.18.0.dev0
  • xFormers version: 0.0.16
jingyangcarl added the bug label on Mar 25, 2023
@yiyixuxu
Collaborator

Hi @jingyangcarl,

Do you have a wandb report you can share?

YiYi

@patrickvonplaten
Contributor

Hmm, this is going to be difficult to compare. Can you give some more details on why/how convergence is slower, @jingyangcarl? Do you have loss curves, reproducible code snippets, etc.?

cc @williamberman

@anotherjesse

If this is helpful, here is a run with batch size 14 (the largest batch size I could fit on a single A100 40GB):

https://wandb.ai/anotherjesse/circles/runs/qz4ler5i?workspace=user-anotherjesse

I used the tutorial command:

accelerate launch train_controlnet.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --dataset_name=fusing/fill50k \
  --resolution=512 \
  --learning_rate=1e-5 \
  --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
  --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
  --train_batch_size=14 --report_to wandb --tracker_project_name circles2

It didn't seem to converge at all.

I'm re-running now with batch size 5 as another data point.

Let me know if there are any other tests I can run to help debug the code / tutorial.

@anotherjesse

The batch size 5 run should be finished in a few hours: https://wandb.ai/anotherjesse/circles2/runs/n7xhe31j/overview?workspace=user-anotherjesse

@yiyixuxu
Collaborator

@anotherjesse
Interesting! Thanks so much for the report, super helpful!

cc @williamberman here: do you think it's the learning rate scheduler that causes the difference? Happy to run some experiments to find out.

YiYi
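
If the scheduler is the suspect, one quick experiment would be to pin it explicitly so it matches the original repo's constant learning rate. A sketch, assuming the example script exposes the usual diffusers --lr_scheduler and --lr_warmup_steps options (not confirmed in this thread):

# Sketch: force a constant learning rate with no warmup
# (--lr_scheduler / --lr_warmup_steps are assumed to exist on this script; verify before running)
accelerate launch train_controlnet.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --dataset_name=fusing/fill50k \
  --resolution=512 \
  --learning_rate=1e-5 \
  --lr_scheduler=constant \
  --lr_warmup_steps=0 \
  --train_batch_size=5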

@jingyangcarl
Author

@yiyixuxu @patrickvonplaten

I'm using the default configuration for the diffusers training script and the default configuration from the original ControlNet tutorial, both on fill50k with the batch size set to 5. The original ControlNet starts to show some results after about 400 steps, whereas the diffusers version needs more than 3k steps.

I'm testing the two pipelines on the same devices. Hope this helps.

Jing

@anotherjesse

Anything I can do to help figure this out?

Test with older commits? Test with HEAD? Test the original code, similar to @jingyangcarl?

@williamberman
Contributor

@anotherjesse, at least by eyeballing the outputs, your training run didn't get results as good as quickly as some of mine did.

Here is one of my training runs that was successful: https://wandb.ai/williamberman/controlnet-model-3-11-mixed-precision/runs/b2mfgr68?workspace=user-williamberman

It might be worth trying a higher learning rate.

I agree that this would be pretty hard to look into and diagnose. I think we might have to sit tight and see if we get more reports of similar poor performance.
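
For reference, the higher-learning-rate experiment suggested above could look like the sketch below; the 1e-4 value is only an illustrative assumption, not a setting validated in this thread:

# Sketch: same tutorial command, with only the learning rate raised from 1e-5 to 1e-4 (illustrative value)
accelerate launch train_controlnet.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --dataset_name=fusing/fill50k \
  --resolution=512 \
  --learning_rate=1e-4 \
  --train_batch_size=5 \
  --report_to wandb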

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label on May 21, 2023