--train_text_encoder in sdxl_train.py does literally NOTHING #890
Comments
The issue seems to be related to this line of code:
The comment says (according to Google Translate):
Deleting the [0] in the accelerator.accumulate line doesn't make it work, as the comment suggests it should. But if I remove the unet entirely from the training models with the code change below (so that one of the text encoders becomes training model index 0), then that text encoder does start training. So I can train one of the text encoders by itself by hacking the code this way, but not both the unet and the text encoder at the same time. I replaced (in sdxl_train.py):
with
Presumably I could also modify the code just below the bit I changed so that only the second text encoder is added as a training model, and then I could train that one individually too (roughly as in the sketch below). It would be better if I could get both text encoders and the unet all training at once, though.
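A rough, self-contained sketch of the kind of change being described (the names unet, text_encoder1, and training_models are taken from sdxl_train.py, but the modules here are toy stand-ins, not the real code):

```python
# Toy illustration of the hack: leave the U-Net out of training_models so
# that a text encoder becomes model index 0, i.e. the model that
# accelerator.accumulate(training_models[0]) actually trains.
from torch import nn

unet = nn.Linear(8, 8)            # stand-in for the real U-Net
text_encoder1 = nn.Linear(8, 8)   # stand-in for the first text encoder

training_models = []
# training_models.append(unet)    # removed: don't add the U-Net
training_models.append(text_encoder1)

assert training_models[0] is text_encoder1
```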
One other side note: the CUDA memory requirement seems to jump up (rather than down) when I disable unet training. It does the same with block_lr=0,0,0,0...0,0,0. I don't know why that might be.
If you use the code change above to not add the unet to the list of training models, I think you should also comment out one or the other of these two lines:
to choose which of the two text models you're training. If you leave them both in, training might not behave correctly. For example, the gradient norm clipping will probably act on both of them combined, because the parameters from every model in the training_models list get added together there (see the sketch below).
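For illustration, here is a minimal sketch (toy modules and assumed names, not the actual sdxl_train.py code) of why clipping acts on the combined parameter set when both text models stay in the list:

```python
import itertools

import torch
from torch import nn

# Toy stand-ins for the entries of training_models.
text_encoder1, text_encoder2 = nn.Linear(4, 4), nn.Linear(4, 4)
training_models = [text_encoder1, text_encoder2]

# Produce some gradients for both models.
loss = sum(m(torch.randn(2, 4)).pow(2).mean() for m in training_models)
loss.backward()

# The parameters of every model in the list are chained together, so the
# gradient norm is computed (and clipped) over the combined set rather
# than per model.
params_to_clip = itertools.chain.from_iterable(m.parameters() for m in training_models)
total_norm = torch.nn.utils.clip_grad_norm_(params_to_clip, max_norm=1.0)
print(total_norm)  # one norm covering both models' gradients
```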
This thread of mine seems to be closely related to a previous (and still open) issue:
Thank you for opening the issue. I updated the version of accelerate. I will update the script to handle multiple-model accumulation for the latest accelerate.
Thank you for the experiments.
For multi-model gradient accumulation you only need to change this one piece of code. I haven't tested this yet, but according to the accelerate source, that's all you need to do.
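If I understand the suggestion, the change amounts to passing every model being trained into accelerator.accumulate instead of only index 0. A minimal sketch under that assumption (toy modules rather than the real ones, and it assumes an accelerate version whose accumulate() accepts multiple models):

```python
import itertools

import torch
from torch import nn
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

# Stand-ins for the U-Net and the two text encoders.
unet_like, te1_like, te2_like = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)
optimizer = torch.optim.AdamW(
    itertools.chain(unet_like.parameters(), te1_like.parameters(), te2_like.parameters()),
    lr=1e-4,
)

unet_like, te1_like, te2_like, optimizer = accelerator.prepare(
    unet_like, te1_like, te2_like, optimizer
)
training_models = [unet_like, te1_like, te2_like]

for step in range(8):
    x = torch.randn(2, 8, device=accelerator.device)
    # Previously only training_models[0] (the U-Net) was passed here.
    with accelerator.accumulate(*training_models):
        loss = (unet_like(x) + te1_like(x) + te2_like(x)).pow(2).mean()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```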
I gave it a try now, thanks for finding that. But it doesn't seem to train the text encoders for me. :(
Yeah, that's what I was thinking; I don't think the underlying problem is the gradient accumulation.
Thank you for reporting. I tried it and got the same result. So far, I have no idea why the Text Encoder is not trained when using block_lr. Edit: Even if I set positive values for block_lr (not 0), the Text Encoder is still not trained. I will do further testing. Please let me know if you find anything out.
This is perhaps because gradient checkpointing of the U-Net is disabled when the model is in eval mode.
It's not a huge improvement, but I think that it's possible to train both text encoders at once. I changed:
to
and also deleted the [0] in this line:
And my resulting sample images show training occurring, which makes me suspect both text encoders might be learning at the same time.
I did some testing: I trained two models, one with block_lr and one without; both models were trained for around 10k steps with a batch size of 32. I only printed params whose max difference is higher than 0.01, so it doesn't crowd the output. This model was trained with
And this one was trained with only
Lastly, this one is just a sanity check; I compared the same model against itself:
What's intriguing to me is that the weights of the two models trained with different settings (one with block_lr, one without) end up exactly the same (or the difference is really, I mean REALLY, small). Only the biases showed any variation.
And this is the code I'm using to compare the parameters:

```python
from typing import List

from torch import nn


def compare_params(model1: nn.Module, model2: nn.Module) -> List:
    # Compare the PyTorch parameters of two models with the same architecture.
    params_diff = []
    for (k1, v1), (k2, v2) in zip(model1.named_parameters(), model2.named_parameters()):
        if k1 != k2:
            continue
        params_diff.append(
            {
                "name": k1,
                "shape": list(v1.shape),
                "max_diff": (v1 - v2).abs().max().item(),
                "mean_diff": (v1 - v2).abs().mean().item(),
                "std_diff": (v1 - v2).abs().std().item(),
            },
        )
    return params_diff

...
te1_params_diff = compare_params(orig_text_encoder_1, trained_text_encoder_1)
te2_params_diff = compare_params(orig_text_encoder_2, trained_text_encoder_2)
...
```

Edit: Made the model names more verbose.
That's good information, @fauzanardh. I did wonder if the text encoder might have some tiny changes (even though we've been saying that it doesn't train at all), because faces become slightly distorted when using --train_text_encoder. But like you say, it's clearly not learning anything real, because an LR of 2e-4 should be destroying the text model, and that isn't happening. I tried a new experiment too. We've been talking about whether
🥳🎉 I found the bug! 🥳🎉 It turns out there's a small error in the section of code that adds the LRs for the text encoders when --block_lr is used. If you look at the 25-element array of LRs that code produces, you can see that the last two entries are different (and broken) compared to the first 23, which hold the LRs for the unet blocks. The section of code above that (which is used if --block_lr is not passed as a parameter) sets up the LRs for the text encoders in a slightly different way, which does work. So to fix the bug, you just have to copy how that other piece of code adds the LRs, i.e. change:
to
in sdxl_train.py, and that's the bug fixed. Edit: And of course also remove the "[0]" from the accelerator.accumulate line (also in sdxl_train.py), so it's not just model 0 (the unet) that trains.
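To show the shape of what the fixed optimizer setup should end up looking like, here's a toy sketch (stand-in modules and assumed variable names such as block_lrs and learning_rate, not the actual sdxl_train.py code): 23 parameter groups for the U-Net blocks, each with its own block LR, plus one group per text encoder using the global learning rate.

```python
import torch
from torch import nn

# Toy stand-ins: 23 "blocks" plus two text encoders.
unet_blocks = [nn.Linear(4, 4) for _ in range(23)]
text_encoder1, text_encoder2 = nn.Linear(4, 4), nn.Linear(4, 4)

block_lrs = [1e-5] * 23          # --block_lr values for the U-Net blocks
learning_rate = 1e-8             # global LR, used for the text encoders

param_groups = [
    {"params": list(block.parameters()), "lr": lr}
    for block, lr in zip(unet_blocks, block_lrs)
]
# The idea of the fix: add the text encoder groups the same way the
# non-block_lr path does, so they end up with real (non-empty) parameter
# lists and the intended learning rate.
param_groups.append({"params": list(text_encoder1.parameters()), "lr": learning_rate})
param_groups.append({"params": list(text_encoder2.parameters()), "lr": learning_rate})

optimizer = torch.optim.AdamW(param_groups)
print(len(optimizer.param_groups))  # 25 groups: 23 U-Net blocks + 2 text encoders
```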
Thanks @FurkanGozukara. Now that it's working, I find the LR for the text encoder has to be very, very small. I have it set to 1e-8, and that's with batch_size=4 and a gradient accumulation size of 12. The text encoder learning rate might have to be reduced further if you're using a smaller batch size. But even with that low learning rate, my resulting output appears much improved. :)
Thanks so much for finding the bug! It is excellent. I found a more detailed cause: nn.Module.parameters() returns an iterator over the parameters, but the following part was reading (and therefore consuming) that iterator.
Therefore, the iterator was empty and the parameters passed to the optimizer were empty. This problem can be avoided by converting the return value of parameters() to a list, as shown below.
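A minimal, self-contained illustration of that iterator pitfall, using a toy module rather than the actual sdxl_train.py code:

```python
from torch import nn

model = nn.Linear(4, 4)

params = model.parameters()        # a generator, not a list
print(len(list(params)))           # 2 (weight and bias)
print(len(list(params)))           # 0 -- the iterator is already exhausted

params = list(model.parameters())  # materialize it once
print(len(params), len(params))    # 2 2 -- safe to read as often as needed
```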
I have done some simple testing and it seems to work well. I will fix it with #895 today. Sorry for the delay.
Wow, nice. So you are setting the general learning rate to 1e-8 and giving each U-Net block an LR of 1e-5 via the extra arguments, right? Can you share your command, please? I did some text encoder testing for SDXL in the past and didn't get good results; it was getting cooked very quickly, so that was probably due to this bug. Very nice, waiting for the fix so I can test again.
Yes, that's right. In that run, I had 117 training images, and probably around 6 concepts (depending on how you count) in them to be learned. My command line parameters were:
One thing I wanted to say clearly on this thread: this fix is not a fix for LoRA, as the LoRA script does not have this bug, as far as I know. This fix is only for pure Dreambooth (that is, the Dreambooth tab in kohya). Although LoRA is sometimes (confusingly) referred to as 'Dreambooth LoRA', the script that is being fixed is only for non-LoRA Dreambooth. I think this fix will also apply to the Finetuning tab of kohya, as that also uses sdxl_train.py, but I don't use that tab and I haven't checked this in detail.
@fauzanardh - I just realized the likely cause of those small changes to the text encoder that you detected. We usually train with --full_bf16 enabled, and that option doesn't just cast the unet to bf16, it casts the text encoders too. Given how sensitive the text encoders seem to be, it would be a useful feature to split the --full_bf16 option into two options, one for the unet and one for the text encoders. I've already changed my local code to leave the text encoders as 32-bit floats, and it works fine on my 24GB card. I mean to run some tests to see whether that improves quality when training the text encoders and the unet together.
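As a rough sketch of the idea (toy modules and assumed names, not an actual patch to sdxl_train.py), splitting the cast would look something like this:

```python
import torch
from torch import nn

# Toy stand-ins for the real models.
unet = nn.Linear(8, 8)
text_encoder1, text_encoder2 = nn.Linear(8, 8), nn.Linear(8, 8)

# Rough idea of splitting --full_bf16: the U-Net's weights go to bf16,
# while the (apparently more sensitive) text encoders stay in fp32.
unet.to(dtype=torch.bfloat16)
text_encoder1.to(dtype=torch.float32)
text_encoder2.to(dtype=torch.float32)

print(next(unet.parameters()).dtype)           # torch.bfloat16
print(next(text_encoder1.parameters()).dtype)  # torch.float32
```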
Wow, this is another important thing. But we are also saving the model as bf16; how does that affect things? In my experiments, full bf16 training performed better than fp16, but obviously I wasn't able to train the text encoder.
@araleza Thank you for your nice solution. I am new to this repo. The documentation says "sdxl_train.py is a script for SDXL fine-tuning", so I am wondering how to do DreamBooth training for SDXL: does it use the same script as SDXL fine-tuning (sdxl_train.py)? If so, what is the difference between DreamBooth training and fine-tuning of SDXL? For SD1.5 there are separate Python scripts for DreamBooth training (train_db.py) and fine-tuning (finetune.py).
I tried the two updates you mentioned (together with --learning_rate=1e-8 --train_text_encoder --block_lr 1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5) to fine-tune SDXL, but I still get 'Text encoder is same. Extract U-Net only.' when using networks/extract_lora_from_models.py to extract a LoRA model. Does anyone know what the problem is?
Have you tried specifying
Hi!
I've been trying to perform Dreambooth training of the SDXL text encoders without affecting the unet at all. There doesn't seem to be an option in sdxl_train.py to specifically target only the text encoder, so I've achieved that by using these options:
--learning_rate="0.001"
--train_text_encoder --block_lr 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Watching the Python code execution, all looks well: just the text encoder parameters get marked for training, the two text models are set to training mode, the learning rate of 0.001 is attached to the text encoders, and loss is generated and backpropagated.
The only problem is that nothing happens. No matter what learning rate I set the network to (even crazy high values like --learning_rate="0.001", such as in my example here to make it obvious if any training is occurring), the sample output images remain exactly the same quality as if no training is being performed. They neither get worse, nor better.
I can train only the text encoder without a problem when I'm making a LoRA, but it just seems non-functional for Dreambooth training. I can easily see the text encoder changing the sample output images as my LoRA trains. But it doesn't work for Dreambooth.
Here's my full command, in case there is anything interesting in there:
accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train.py" --enable_bucket --min_bucket_reso=64 --max_bucket_reso=1024 --pretrained_model_name_or_path="/home/tptpt/Documents/Dev/sdxl/sd_xl_base_1.0.safetensors" --train_data_dir="/home/tptpt/Documents/Dev/sdxl/training/trainimgs/kohya/img" --resolution="1024,1024" --output_dir="/home/tptpt/Documents/Dev/sdxl/training/trainimgs/kohya/dreambooth" --logging_dir="/home/tptpt/Documents/Dev/sdxl/training/trainimgs/kohya/log" --save_model_as=safetensors --full_bf16 --vae="/home/tptpt/Documents/Dev/sdxl/sdxl_vae.safetensors" --output_name="trainimgs" --lr_scheduler_num_cycles="20000" --max_token_length=150 --max_data_loader_n_workers="0" --learning_rate="0.001" --lr_scheduler="constant" --train_batch_size="4" --max_train_steps="12000" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --max_token_length=150 --bucket_reso_steps=32 --save_every_n_steps="300" --save_last_n_steps="900" --flip_aug --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0357 --train_text_encoder --block_lr 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 --sample_sampler=k_dpm_2 --sample_prompts="/home/tptpt/Documents/Dev/sdxl/training/trainimgs/kohya/dreambooth/sample/prompt.txt" --sample_every_n_steps="154"
Any idea what's wrong? Not being able to train the text encoders is a disaster for people trying to add a new unknown keyword.