Better implementation for te autocast #895
Conversation
Thank you for this! It seems very good. I will review and merge it soon. Perhaps, when we cache Text Encoder outputs, another way might be fine: changing the dtype of the Text Encoders to fp16/bf16 in advance... |
Yeah, I also added `te.to(weight_dtype)` too. |
@kohya-ss Oh I got what you said. |
I think so. In my understanding, it is the same as applying autocast. |
This is really nice! |
autocast and changing the dtype are actually different. |
Hmm, thank you for the clarification. When generating images, we call the model converted to float16 or bfloat16 directly, so I thought there would be no difference, but it is better to use autocast. |
When I apply this PR and train with sdxl_train.py including the Text Encoder, it seems that neither the U-Net nor the Text Encoder is trained. When training only the U-Net, there is no problem. I will investigate further. But I think that perhaps multiple models may not work when specified like: |
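A minimal sketch (plain PyTorch, not code from this PR) of why the two are different: converting the module changes the stored weights themselves, while autocast keeps the weights in fp32 and only casts selected ops during the forward pass.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8).cuda()           # weights stored in fp32
x = torch.randn(1, 8, device="cuda")

# Option A: convert the weights themselves to half precision.
model_fp16 = nn.Linear(8, 8).cuda().to(torch.float16)
y_a = model_fp16(x.half())               # everything runs in fp16

# Option B: keep fp32 weights and let autocast choose per-op dtypes.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y_b = model(x)                        # the matmul runs in fp16, weights stay fp32

print(model.weight.dtype, model_fp16.weight.dtype, y_a.dtype, y_b.dtype)
# torch.float32 torch.float16 torch.float16 torch.float16
```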
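The snippet referred to above is not included in this thread. Purely as a hypothetical illustration of the pattern under discussion (how several models are passed to `accelerator.prepare`), the difference might look like this; the stand-in modules below are placeholders, not the script's actual models:

```python
import torch.nn as nn
from torch.optim import AdamW
from accelerate import Accelerator

accelerator = Accelerator()

# Stand-ins for the U-Net and the two Text Encoders (illustration only).
unet, te1, te2 = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)
optimizer = AdamW(list(unet.parameters()) + list(te1.parameters()) + list(te2.parameters()))

# Passing the models as separate arguments: each one is prepared/wrapped.
unet, te1, te2, optimizer = accelerator.prepare(unet, te1, te2, optimizer)

# Passing a plain Python list as a single argument: `prepare` does not unpack
# it, so the contained models may come back without being wrapped at all.
training_models = [nn.Linear(4, 4), nn.Linear(4, 4)]
training_models = accelerator.prepare(training_models)
print(type(training_models[0]))  # still a bare nn.Linear, not a prepared module
```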
Ok this makes sense |
@kohya-ss It is weird, but it cannot run in fp16 (full fp16 does not work, full bf16 works). |
Hi, I think you may need to apply the fix to the Text Encoder learning rate code in the --block_lr section that I found here: |
But I still made a fix for it. |
@kohya-ss I also added a manual timeout setting for DDP, since with some large datasets and multi-GPU training it is very likely to exceed the default timeout (30 min) when caching the latents or the Text Encoder outputs. |
When can we expect this to be merged? Thank you so much. |
@KohakuBlueleaf I would like to add some features after the merge, such as specifying independent learning rates for Text Encoder 1 and 2, excluding models with a learning rate of 0 from the optimized parameters, etc. |
So this will support both SD 1.5 and SDXL? |
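A minimal sketch, assuming Hugging Face Accelerate is used for DDP, of how a longer process-group timeout can be passed so that ranks waiting on a slow caching pass are not killed by the default 30 minutes (the 180-minute value below is an arbitrary example, not necessarily what the PR uses):

```python
from datetime import timedelta
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the collective-ops timeout so ranks waiting on a long caching pass
# (latents / Text Encoder outputs) do not hit the default 30-minute limit.
ddp_timeout = InitProcessGroupKwargs(timeout=timedelta(minutes=180))
accelerator = Accelerator(kwargs_handlers=[ddp_timeout])
```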
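A small sketch of the second idea, under the assumption that optimizer parameter groups are built per model: any model whose learning rate is set to 0 is frozen and simply left out of the optimizer (the models and learning rates below are placeholders for illustration):

```python
import torch.nn as nn
from torch.optim import AdamW

# Hypothetical per-model learning rates; 0 means "do not optimize this model".
models_and_lrs = [
    (nn.Linear(4, 4), 1e-4),   # U-Net stand-in
    (nn.Linear(4, 4), 0.0),    # Text Encoder 1 stand-in, frozen
    (nn.Linear(4, 4), 5e-5),   # Text Encoder 2 stand-in
]

param_groups = []
for model, lr in models_and_lrs:
    if lr == 0:
        model.requires_grad_(False)   # keep the model frozen entirely
        continue                      # and exclude it from the optimizer
    param_groups.append({"params": model.parameters(), "lr": lr})

optimizer = AdamW(param_groups)
```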
Fix setup logic... again
When we disable training for the TE, we will not prepare it, and in that case we need to explicitly convert it to the target dtype (otherwise it will remain in fp32, which may not be the expected behavior).
So basically I do two things here (see the sketch below):
1. Explicitly convert the TE to the target dtype/device when it is not being trained.
2. Explicitly add autocast for the TE, since sometimes it may not be prepared (or definitely will not be prepared, e.g. with cached TE outputs).
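A minimal sketch of the two points above; `text_encoder`, `input_ids`, `weight_dtype`, and `train_text_encoder` are placeholder names, not the PR's actual variables:

```python
import torch

def setup_and_encode(text_encoder, input_ids, accelerator, weight_dtype, train_text_encoder):
    # 1. When the TE is not trained it never goes through accelerator.prepare,
    #    so move/cast it explicitly; otherwise it silently stays in fp32.
    if not train_text_encoder:
        text_encoder.to(accelerator.device, dtype=weight_dtype)
        text_encoder.requires_grad_(False)
        text_encoder.eval()

    # 2. Wrap the TE forward pass in autocast explicitly, since the module may
    #    not be prepared (and never is when its outputs are cached in advance).
    with torch.autocast(device_type=accelerator.device.type, dtype=weight_dtype):
        return text_encoder(input_ids)[0]
```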