Slower training convergence of StableDiffusionControlNetPipeline than the original repo #2822
Comments
hi @jingyangcarl: do you have a wandb report you can share? YiYi |
Hmm, this is going to be difficult to compare. Can you give some more details on why/how convergence is slower, @jingyangcarl? Do you have loss curves, reproducible code snippets, etc., maybe? |
If this is helpful, here is a run with batch size 14 (the largest batch size I could fit on a single A100 40GB): https://wandb.ai/anotherjesse/circles/runs/qz4ler5i?workspace=user-anotherjesse I used the tutorial.
It didn't seem to converge at all. I'm re-running now with batch size 5 as another datapoint. Let me know if there are any other tests I can run to help debug the code / tutorial. |
https://wandb.ai/anotherjesse/circles2/runs/n7xhe31j/overview?workspace=user-anotherjesse should be finished in a few hours - batch size 5 |
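(For anyone wanting to compare the two runs above numerically, here is a minimal sketch using the public wandb API. It assumes the runs log a scalar under a key like `train_loss`; that key name is an assumption, not confirmed from the runs.)

```python
import wandb

# Sketch: pull the logged loss curves from the two runs linked above
# so the batch-size-14 and batch-size-5 runs can be compared directly.
# NOTE: "train_loss" is an assumed key name; inspect run.history() columns first.
api = wandb.Api()
for run_path in ("anotherjesse/circles/qz4ler5i", "anotherjesse/circles2/n7xhe31j"):
    run = api.run(run_path)
    history = run.history(keys=["train_loss"])  # returns a pandas DataFrame
    print(run_path, history["train_loss"].describe())
```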
@anotherjesse cc @williamberman here - do you think it's the learning rate scheduler that causes the difference? Happy to run some experiments to find out. YiYi |
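For reference, a minimal sketch of the scheduler setup in question, built with `diffusers.optimization.get_scheduler` the way the example training script does. The concrete values below are illustrative assumptions, not the settings from the runs above.

```python
import torch
from diffusers.optimization import get_scheduler

# Illustrative parameters only; the example script exposes these as
# --learning_rate, --lr_scheduler, and --lr_warmup_steps.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-5)

# "constant" is the example script's default; as far as I can tell the
# original ControlNet repo also trains at a fixed learning rate, so any
# difference would have to come from warmup or a non-default scheduler choice.
lr_scheduler = get_scheduler(
    "constant",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=10_000,
)
```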
I'm using the default configuration for training with diffusers and the default configuration from the original ControlNet tutorial, both on fill50k with the batch size set to 5. The original ControlNet starts to show some results after 400 steps; the diffusers version takes more than 3k steps. I'm testing the two pipelines on the same devices. Hope this helps. Jing |
Is there anything I can do to help figure this out? Test with older commits? Test with head? Test the original code as @jingyangcarl did? |
@anotherjesse your training run, at least by eyeballing the outputs, didn't get as good results as quickly as some of mine did. Here is one of my training runs that was successful: https://wandb.ai/williamberman/controlnet-model-3-11-mixed-precision/runs/b2mfgr68?workspace=user-williamberman It might be worth trying a higher learning rate. I agree that this would be pretty hard to look into and diagnose. I think we might have to sit tight and see if we get more reports of similar poor performance. |
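A minimal sketch of the higher-learning-rate experiment suggested above, assuming the usual setup from the training example (initializing the ControlNet from the base UNet); the 1e-4 value is just an illustrative bump over the 1e-5 default, not a recommendation from the linked run.

```python
import torch
from diffusers import ControlNetModel, UNet2DConditionModel

# Initialize the ControlNet weights from the base UNet, as the training
# example does, then train with a higher learning rate.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
controlnet = ControlNetModel.from_unet(unet)
optimizer = torch.optim.AdamW(controlnet.parameters(), lr=1e-4)  # vs. the 1e-5 default
```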
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Describe the bug
Awesome work on training ControlNet with diffusers from the tutorial.
I ran the code and compared it with the original training code here.
It turns out that convergence on fill50k is slower with diffusers than with the original training code.
Reproduction
Training on fill50k with batch size 5.
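Not part of the original report, but a minimal sketch of the data side of this reproduction, assuming the `fusing/fill50k` dataset used by the diffusers tutorial (the column names in the comment are assumptions):

```python
from datasets import load_dataset

# Sketch of the reported setup: the tutorial's fill50k dataset,
# trained with batch size 5 in both pipelines.
dataset = load_dataset("fusing/fill50k", split="train")
print(dataset)  # expect ~50k examples with image / conditioning image / caption columns
```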
Logs
No response
System Info
diffusers version: 0.15.0.dev0