--train_text_encoder in sdxl_train.py does literally NOTHING #890

Closed
araleza opened this issue Oct 20, 2023 · 26 comments

araleza commented Oct 20, 2023

Hi!

I've been trying to perform Dreambooth training of the SDXL text encoders without affecting the unet at all. There doesn't seem to be an option in sdxl_train.py to specifically target only the text encoder, so I've achieved that by using these options:

--learning_rate="0.001"
--train_text_encoder --block_lr 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Watching the Python code execute, all looks well: just the text encoder parameters get marked for training, the two text models are set to train=True, the learning rate of 0.001 is attached to the text encoders, and loss is generated and backpropagated.

The only problem is that nothing happens. No matter what learning rate I set (even crazy-high values like --learning_rate="0.001", as in my example here, to make it obvious if any training is occurring), the sample output images remain exactly the same quality as if no training were being performed. They get neither worse nor better.

I can train only the text encoder without a problem when I'm making a LoRA, but it seems non-functional for Dreambooth training. I can easily see the text encoder changing the sample output images as my LoRA trains, but it doesn't work for Dreambooth.

Here's my full command, in case there is anything interesting in there:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train.py" --enable_bucket --min_bucket_reso=64 --max_bucket_reso=1024 --pretrained_model_name_or_path="/home/tptpt/Documents/Dev/sdxl/sd_xl_base_1.0.safetensors" --train_data_dir="/home/tptpt/Documents/Dev/sdxl/training/trainimgs/kohya/img" --resolution="1024,1024" --output_dir="/home/tptpt/Documents/Dev/sdxl/training/trainimgs/kohya/dreambooth" --logging_dir="/home/tptpt/Documents/Dev/sdxl/training/trainimgs/kohya/log" --save_model_as=safetensors --full_bf16 --vae="/home/tptpt/Documents/Dev/sdxl/sdxl_vae.safetensors" --output_name="trainimgs" --lr_scheduler_num_cycles="20000" --max_token_length=150 --max_data_loader_n_workers="0" --learning_rate="0.001" --lr_scheduler="constant" --train_batch_size="4" --max_train_steps="12000" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --max_token_length=150 --bucket_reso_steps=32 --save_every_n_steps="300" --save_last_n_steps="900" --flip_aug --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0357 --train_text_encoder --block_lr 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 --sample_sampler=k_dpm_2 --sample_prompts="/home/tptpt/Documents/Dev/sdxl/training/trainimgs/kohya/dreambooth/sample/prompt.txt" --sample_every_n_steps="154"

Any idea what's wrong? Not being able to train the text encoders is a disaster for people trying to add a new unknown keyword.

araleza commented Oct 21, 2023

The issue seems to be related to this line of code:

with accelerator.accumulate(training_models[0]): # 複数モデルに対応していない模様だがとりあえずこうしておく

The comment says (according to Google Translate):

It doesn't seem to support multiple models, but let's do this for now

Deleting the [0] from the accelerator.accumulate line doesn't make it work, just as the comment suggests.

But if I remove the unet entirely from the training models with the code change below (so that one of the text encoders becomes training model index 0), then that text encoder does start training. So I can train one of the text encoders by itself by hacking the code this way, but not both the unet and a text encoder at the same time.

I replaced (in sdxl_train.py):

    # 学習を準備する:モデルを適切な状態にする
    training_models = []
    if args.gradient_checkpointing:
        unet.enable_gradient_checkpointing()
    training_models.append(unet)

with

    # 学習を準備する:モデルを適切な状態にする
    training_models = []

    # If training the text encoder, disable training the unet :-/
    if args.train_text_encoder:
        accelerator.print("training text encoder 1.  DISABLING UNET TRAINING.")
        unet.requires_grad_(False)
        unet.eval()
    else:
        accelerator.print("enabling unet training")
        if args.gradient_checkpointing:
            unet.enable_gradient_checkpointing()
        training_models.append(unet)

Presumably I could also modify the code below the part I changed so that only the second text encoder is added as a training model; then I could train that one individually too. It would be better if I could get both text encoders and the unet all training at once, though.

araleza commented Oct 21, 2023

One other side note: the CUDA memory requirements seem to jump up (rather than down) when I disable unet training. The same happens with block_lr=0,0,0,0...0,0,0. I don't know why that might be.

araleza commented Oct 21, 2023

If you use the code change above so that the unet is not added to the list of training models, I think you should also comment out one or the other of these two lines:

        training_models.append(text_encoder1)
        training_models.append(text_encoder2)

to choose which of the two text models you're training. If you leave them both in, something might go wrong with training. For example, the gradient norm clipping may act on both of them combined, because all the parameters from every model in the training_models list get pooled together there.

araleza commented Oct 21, 2023

This thread seems closely related to a previous (and still open) issue:
#855

kohya-ss (Owner) commented:

Thank you for opening the issue.

I updated the version of accelerate recently, and I noticed that the latest accelerate supports gradient accumulation for multiple models, which it didn't previously. But it seems to cause this issue.

I will update the script to handle multiple-model accumulation for the latest accelerate.

FurkanGozukara commented:

Thank you for the experiments.

fauzanardh commented:

For multi-model gradient accumulation, you only need to change this line from:
with accelerator.accumulate(training_models[0]):
to
with accelerator.accumulate(*training_models):

I haven't tested this yet, but according to the accelerate source, that's all you need to do.

araleza commented Oct 23, 2023

I haven't tested this yet, but according to the accelerate source, that's all you need to do.

I gave it a try just now, thanks for finding that. But it doesn't seem to train the text encoders for me. :(

fauzanardh commented Oct 23, 2023

But it doesn't seem to train the text encoders for me.

Yeah, that's what I was thinking; I don't think the underlying problem is the gradient accumulation.

kohya-ss commented Oct 24, 2023

I gave it a try just now, thanks for finding that. But it doesn't seem to train the text encoders for me. :(

Thank you for reporting. I tried it and got the same result. So far, I have no idea why the Text Encoder is not trained when block_lr is used.

Edit: Even if I set positive (non-zero) values for block_lr, the Text Encoder is still not trained.

I will do further testing. Please let me know if you find anything out.

kohya-ss (Owner) commented:

One other side note: the CUDA memory requirements seem to jump up (rather than down) when I disable unet training. The same happens with block_lr=0,0,0,0...0,0,0. I don't know why that might be.

This is perhaps because gradient checkpointing of the U-Net is disabled when the model is in eval mode (i.e., with unet.eval()).

araleza commented Oct 24, 2023

It's not a huge improvement, but I think that it's possible to train both text encoders at once. I changed:

    # 学習を準備する:モデルを適切な状態にする
    training_models = []
    if args.gradient_checkpointing:
        unet.enable_gradient_checkpointing()
    training_models.append(unet)

to

    # 学習を準備する:モデルを適切な状態にする
    training_models = []
    if args.gradient_checkpointing:
        unet.enable_gradient_checkpointing()

    # If training the text encoder, disable training the unet :-/
    if args.train_text_encoder:
        accelerator.print("training both text encoders.  DISABLING UNET TRAINING.")
        unet.requires_grad_(False)
        unet.eval()
    else:
        training_models.append(unet)

and also deleted the [0] in this line:

with accelerator.accumulate(training_models[0]): # 複数モデルに対応していない模様だがとりあえずこうしておく

And my resulting sample images show that training is occurring, which makes me suspect both text encoders are learning at the same time.

fauzanardh commented:

I did some testing: I trained two models, one with block_lr and one without; both were trained for around 10k steps with a batch size of 32. I only printed params with a max difference higher than 0.01, so the output isn't crowded.

This model was trained with learning_rate=2e-4 and block_lr="4e-7,4e-7,...,4e-7", compared against the original model:

Printing params that have diff more than 0.01
text_encoder_1
{'name': 'text_model.encoder.layers.1.layer_norm1.weight', 'shape': [768], 'max_diff': 0.01171875, 'mean_diff': 0.0017792383441701531, 'std_diff': 0.0012432975927367806}
{'name': 'text_model.encoder.layers.8.layer_norm1.bias', 'shape': [768], 'max_diff': 0.01171875, 'mean_diff': 0.0001159746534540318, 'std_diff': 0.0004530348815023899}

text_encoder_2
{'name': 'text_model.encoder.layers.0.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.03125, 'mean_diff': 0.0024458884727209806, 'std_diff': 0.002944410778582096}
{'name': 'text_model.encoder.layers.1.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.00202178955078125, 'std_diff': 0.0019075109157711267}
{'name': 'text_model.encoder.layers.1.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0023109912872314453, 'std_diff': 0.00207878602668643}
{'name': 'text_model.encoder.layers.3.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0014869689475744963, 'std_diff': 0.0015505485935136676}
{'name': 'text_model.encoder.layers.3.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0021964549086987972, 'std_diff': 0.001743764034472406}
{'name': 'text_model.encoder.layers.4.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0015705109108239412, 'std_diff': 0.0015068750362843275}
{'name': 'text_model.encoder.layers.7.layer_norm1.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.00011402685049688444, 'std_diff': 0.0003584640217013657}
{'name': 'text_model.encoder.layers.8.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0019271194469183683, 'std_diff': 0.001346334582194686}
{'name': 'text_model.encoder.layers.11.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.002540290355682373, 'std_diff': 0.0020399941131472588}
{'name': 'text_model.encoder.layers.12.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0026359320618212223, 'std_diff': 0.0020279372110962868}
{'name': 'text_model.encoder.layers.13.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0027931213844567537, 'std_diff': 0.0022182632237672806}
{'name': 'text_model.encoder.layers.14.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0020116805098950863, 'std_diff': 0.001513096154667437}
{'name': 'text_model.encoder.layers.16.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0015537261497229338, 'std_diff': 0.001214643707498908}
{'name': 'text_model.encoder.layers.17.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0026606558822095394, 'std_diff': 0.0021262539085000753}
{'name': 'text_model.encoder.layers.17.layer_norm2.bias', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0009027037885971367, 'std_diff': 0.0009552778210490942}
{'name': 'text_model.encoder.layers.18.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.002589607145637274, 'std_diff': 0.002045104745775461}
{'name': 'text_model.encoder.layers.19.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0024156570434570312, 'std_diff': 0.001925305463373661}
{'name': 'text_model.encoder.layers.22.self_attn.q_proj.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.00030964481993578374, 'std_diff': 0.0006552143604494631}
{'name': 'text_model.encoder.layers.22.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.002284479094669223, 'std_diff': 0.0017820867942646146}
{'name': 'text_model.encoder.layers.25.self_attn.q_proj.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.00034595682518556714, 'std_diff': 0.000717785966116935}
{'name': 'text_model.encoder.layers.27.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0019824982155114412, 'std_diff': 0.0015578048769384623}
{'name': 'text_model.encoder.layers.28.self_attn.out_proj.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.00015195719606708735, 'std_diff': 0.00036566369817592204}
{'name': 'text_model.encoder.layers.28.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0020486831199377775, 'std_diff': 0.0015739863738417625}
{'name': 'text_model.encoder.layers.29.mlp.fc2.bias', 'shape': [1280], 'max_diff': 0.0234375, 'mean_diff': 0.0003259133663959801, 'std_diff': 0.0007729948265478015}
{'name': 'text_model.encoder.layers.29.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.002061605453491211, 'std_diff': 0.0015250068390741944}
{'name': 'text_model.encoder.layers.31.mlp.fc2.bias', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0004916787147521973, 'std_diff': 0.0007746534538455307}
{'name': 'text_model.encoder.layers.31.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0015727996360510588, 'std_diff': 0.0013080338248983026}
{'name': 'text_model.final_layer_norm.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0016974449390545487, 'std_diff': 0.001521552330814302}

te1_max_diff: 0.03125, te2_max_diff: 0.03125
te1_mean_diff: 0.00040936409948128536, te2_mean_diff: 0.00043955951851882993
te1_std_diff: 0.00032798347192464444, te2_std_diff: 0.00038715505570771836

And this one was trained with only learning_rate=4e-7, compared against the original model:

Printing params that have diff more than 0.01
text_encoder_1
{'name': 'text_model.encoder.layers.1.layer_norm1.weight', 'shape': [768], 'max_diff': 0.01171875, 'mean_diff': 0.0017792383441701531, 'std_diff': 0.0012432975927367806}
{'name': 'text_model.encoder.layers.8.layer_norm1.bias', 'shape': [768], 'max_diff': 0.01171875, 'mean_diff': 0.00011711272964021191, 'std_diff': 0.000453238288173452}

text_encoder_2
{'name': 'text_model.encoder.layers.0.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.03125, 'mean_diff': 0.0024458884727209806, 'std_diff': 0.002944410778582096}
{'name': 'text_model.encoder.layers.1.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.00202178955078125, 'std_diff': 0.0019075109157711267}
{'name': 'text_model.encoder.layers.1.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0023109912872314453, 'std_diff': 0.00207878602668643}
{'name': 'text_model.encoder.layers.3.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0014869689475744963, 'std_diff': 0.0015505485935136676}
{'name': 'text_model.encoder.layers.3.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0021964549086987972, 'std_diff': 0.001743764034472406}
{'name': 'text_model.encoder.layers.4.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0015705109108239412, 'std_diff': 0.0015068750362843275}
{'name': 'text_model.encoder.layers.7.layer_norm1.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.00011414461914682761, 'std_diff': 0.00035843203659169376}
{'name': 'text_model.encoder.layers.8.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0019271194469183683, 'std_diff': 0.001346334582194686}
{'name': 'text_model.encoder.layers.11.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.002540290355682373, 'std_diff': 0.0020399941131472588}
{'name': 'text_model.encoder.layers.12.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0026359320618212223, 'std_diff': 0.0020279372110962868}
{'name': 'text_model.encoder.layers.13.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0027931213844567537, 'std_diff': 0.0022182632237672806}
{'name': 'text_model.encoder.layers.14.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0020116805098950863, 'std_diff': 0.001513096154667437}
{'name': 'text_model.encoder.layers.16.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0015537261497229338, 'std_diff': 0.001214643707498908}
{'name': 'text_model.encoder.layers.17.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0026606558822095394, 'std_diff': 0.0021262539085000753}
{'name': 'text_model.encoder.layers.17.layer_norm2.bias', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0009027037885971367, 'std_diff': 0.0009552778210490942}
{'name': 'text_model.encoder.layers.18.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.002589607145637274, 'std_diff': 0.002045104745775461}
{'name': 'text_model.encoder.layers.19.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0024156570434570312, 'std_diff': 0.001925305463373661}
{'name': 'text_model.encoder.layers.22.self_attn.q_proj.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0003100501489825547, 'std_diff': 0.0006551524274982512}
{'name': 'text_model.encoder.layers.22.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.002284479094669223, 'std_diff': 0.0017820867942646146}
{'name': 'text_model.encoder.layers.25.self_attn.q_proj.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0003461359883658588, 'std_diff': 0.0007177115185186267}
{'name': 'text_model.encoder.layers.27.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0019824982155114412, 'std_diff': 0.0015578048769384623}
{'name': 'text_model.encoder.layers.28.self_attn.out_proj.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0001525732659501955, 'std_diff': 0.00036568197538144886}
{'name': 'text_model.encoder.layers.28.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0020486831199377775, 'std_diff': 0.0015739863738417625}
{'name': 'text_model.encoder.layers.29.mlp.fc2.bias', 'shape': [1280], 'max_diff': 0.0234375, 'mean_diff': 0.00032617890974506736, 'std_diff': 0.0007729247445240617}
{'name': 'text_model.encoder.layers.29.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.002061605453491211, 'std_diff': 0.0015250068390741944}
{'name': 'text_model.encoder.layers.31.mlp.fc2.bias', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0004916787147521973, 'std_diff': 0.0007746534538455307}
{'name': 'text_model.encoder.layers.31.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0015727996360510588, 'std_diff': 0.0013080338248983026}
{'name': 'text_model.final_layer_norm.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0016974449390545487, 'std_diff': 0.001521552330814302}

te1_max_diff: 0.03125, te2_max_diff: 0.03125
te1_mean_diff: 0.00041061308598839283, te2_mean_diff: 0.00044157158263049396
te1_std_diff: 0.00033168707022923627, te2_std_diff: 0.00039243965048368104

Lastly, this one is just a sanity check: I compared the same model against itself:

Printing params that have diff more than 0.01
text_encoder_1

text_encoder_2

te1_max_diff: 0.0, te2_max_diff: 0.0
te1_mean_diff: 0.0, te2_mean_diff: 0.0
te1_std_diff: 0.0, te2_std_diff: 0.0

What's intriguing to me is that the weight differences between the two trained models with different settings (one with block_lr, one without) are exactly the same (unless the difference is really, I mean REALLY, small). And only the biases showed any variation.

fauzanardh commented Oct 25, 2023

And this is the code I'm using to compare the parameters:

from typing import List

import torch.nn as nn


def compare_params(model1: nn.Module, model2: nn.Module) -> List:
    # Compare corresponding PyTorch parameters between the two models
    params_diff = []
    for (k1, v1), (k2, v2) in zip(model1.named_parameters(), model2.named_parameters()):
        if k1 != k2:
            continue
        params_diff.append(
            {
                "name": k1,
                "shape": list(v1.shape),
                "max_diff": (v1 - v2).abs().max().item(),
                "mean_diff": (v1 - v2).abs().mean().item(),
                "std_diff": (v1 - v2).abs().std().item(),
            },
        )
    return params_diff
...
te1_params_diff = compare_params(orig_text_encoder_1, trained_text_encoder_1)
te2_params_diff = compare_params(orig_text_encoder_2, trained_text_encoder_2)
...

Edit: Make model name more verbose

araleza commented Oct 25, 2023

That's good information, @fauzanardh. I did wonder if the text encoder might have some tiny changes (even though we've been saying that it doesn't train at all), because faces become slightly distorted when using --train_text_encoder. But like you say, it's clearly not learning anything real, because an LR of 2e-4 should be destroying the text model, and that isn't happening.

I tried a new experiment too. We've been talking about whether accelerator.accumulate(training_models) is working correctly with multiple models, and it occurred to me that the only reason we use Accelerate is to support multiple GPUs (whether they're in one PC or spread across multiple PCs). Since I'm only using one GPU, I tried ripping the Accelerate usage out of sdxl_train.py. And without it being used at all, I get the same incorrect behavior as before, with the text encoder only training if the unet isn't. So whatever the issue is, it has nothing to do with any bugs in Accelerate's recently added support for multiple models, or with the way kohya's sd-scripts call it.

araleza commented Oct 25, 2023

🥳🎉 I found the bug! 🥳🎉

It turns out that there's a small error in the section of code that adds the LRs for the text encoders if --block_lr is used. If you look at the 25-element long array of LRs that's produced by that code, you can see that the last two are different (and broken) compared to the first 23, which hold the LRs for the unet blocks.

The section of code above that (which is used if --block_lrs is not passed in as a parameter) sets up the LRs for the text encoders in a slightly different way, which does work. So to fix the bug, you just have to copy how that other piece of code adds the LRs. i.e. change:

        params_to_optimize = get_block_params_to_optimize(training_models[0], block_lrs)  # U-Net
        for m in training_models[1:]:  # Text Encoders if exists
            params_to_optimize.append({"params": m.parameters(), "lr": args.learning_rate})

to

        params_to_optimize = get_block_params_to_optimize(training_models[0], block_lrs)  # U-Net
        for m in training_models[1:]:  # Text Encoders if exists
            params = []
            params.extend(m.parameters())
            params_to_optimize.append({"params": params, "lr": args.learning_rate})

in sdxl_train.py and that's the bug fixed.

Edit: And of course also remove the "[0]" from this line (also in sdxl_train.py), so it's not just model 0 (the unet) that trains:

with accelerator.accumulate(training_models[0]): # 複数モデルに対応していない模様だがとりあえずこうしておく

FurkanGozukara commented:

@araleza Amazing find, thank you so much.

I hope it gets fixed ASAP, @kohya-ss.

araleza commented Oct 25, 2023

@araleza Amazing find, thank you so much.

I hope it gets fixed ASAP, @kohya-ss.

Thanks @FurkanGozukara. Now that it's working, I find the LR for the text encoder has to be very, very small. I have it set to 1e-8, and that's with batch_size=4 and a gradient accumulation size of 12. The text encoder learning rate might have to be reduced further if you're using a smaller batch size. But even with that low learning rate, my resulting output appears much improved. :)

kohya-ss (Owner) commented:

Thanks so much for finding the bug! It is excellent.

I found a more detailed cause: nn.Module.parameters() returns an iterator over the parameters, but the following part was consuming that iterator:

    # calculate number of trainable parameters
    n_params = 0
    for params in params_to_optimize:
        for p in params["params"]:
            n_params += p.numel()

Therefore, by the time the optimizer was constructed, the iterator was already exhausted and the parameter group passed to the optimizer was empty.

This problem can be avoided by converting the return value of parameters() to a list, as shown below.
(This is in principle the same as araleza's fix.)

    if block_lrs is None:
        params_to_optimize = [
            {"params": list(training_models[0].parameters()), "lr": args.learning_rate},
        ]
    else:
        params_to_optimize = get_block_params_to_optimize(training_models[0], block_lrs)  # U-Net

    for m in training_models[1:]:  # Text Encoders if exists
        params_to_optimize.append({"params": list(m.parameters()), "lr": args.learning_rate_te or args.learning_rate})

I have done some simple testing and it seems to work well.

I will fix it with #895 today. Sorry for the delay.

FurkanGozukara commented Oct 25, 2023

@araleza Amazing find, thank you so much.
I hope it gets fixed ASAP, @kohya-ss.

Thanks @FurkanGozukara. Now that it's working, I find the LR for the text encoder has to be very, very small. I have it set to 1e-8, and that's with batch_size=4 and a gradient accumulation size of 12. The text encoder learning rate might have to be reduced further if you're using a smaller batch size. But even with that low learning rate, my resulting output appears much improved. :)

Wow, nice. So you are setting the general learning rate to 1e-8 and giving each U-Net block an LR of 1e-5 via the extra arguments, right? Can you share your command, please?

I did some text encoder testing for SDXL in the past. I didn't get good results; it was getting cooked very quickly.

So that was probably due to the bug. Very nice, waiting for the fix to test again.

araleza commented Oct 26, 2023

Wow, nice. So you are setting the general learning rate to 1e-8 and giving each U-Net block an LR of 1e-5 via the extra arguments, right? Can you share your command, please?

Yes, that's right. In that run, I had 117 training images, and probably around 6 concepts (depending on how you count) in them to be learned. My command line parameters were:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train.py" --enable_bucket --min_bucket_reso=64 --max_bucket_reso=1024 --pretrained_model_name_or_path="/home/tptpt/Documents/Dev/sdxl/sd_xl_base_1.0.safetensors" --train_data_dir="/home/tptpt/trainimgs/kohya/img" --resolution="1024,1024" --output_dir="/home/tptpt/trainimgs/kohya/dreambooth" --logging_dir="/home/tptpt/trainimgs/kohya/log" --save_model_as=safetensors --full_bf16 --vae="/home/tptpt/Documents/Dev/sdxl/sdxl_vae.safetensors" --output_name="trainimgs" --lr_scheduler_num_cycles="20000" --max_token_length=150 --max_data_loader_n_workers="0" --gradient_accumulation_steps=12 --learning_rate="1e-08" --lr_scheduler="constant" --train_batch_size="4" --max_train_steps="2400" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --max_token_length=150 --bucket_reso_steps=32 --v_pred_like_loss="0.5" --save_every_n_steps="250" --save_last_n_steps="750" --min_snr_gamma=5 --flip_aug --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0357 --adaptive_noise_scale=0.00357 --train_text_encoder --block_lr 1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5 --sample_sampler=k_dpm_2 --sample_prompts="/home/tptpt/trainimgs/kohya/dreambooth/sample/prompt.txt" --sample_every_n_steps="35"

One thing I wanted to say clearly on this thread: this fix is not a fix for LoRA as the LoRA script does not have this bug, as far as I know. This fix is only for pure Dreambooth (that is, the Dreambooth tab in kohya). Although LoRA is sometimes (confusingly) referred to as 'Dreambooth LoRA', the script that is being fixed is only for non-LoRA Dreambooth.

I think this fix will also be a fix for the Finetuning tab of kohya, as that also uses sdxl_train.py. But I don't use that tab, and I haven't checked this in detail.

araleza commented Oct 28, 2023

@fauzanardh - I just realized the likely cause of those small changes to the text encoder that you detected. We usually train with --full_bf16 enabled. That option doesn't just cast the unet to bf16; it also casts the text encoders.

Given how sensitive the text encoders seem to be, it would seem useful to split the --full_bf16 option into two options: one for the unet and one for the text encoders. I've already changed my local code to leave the text encoders as 32-bit floats, and it works fine on my 24GB card. I mean to run some tests to see whether that improves quality when training the text encoders and the unet together.

FurkanGozukara commented:

@fauzanardh - I just realized the likely cause of those small changes to the text encoder that you detected. We usually train with --full_bf16 enabled. That option doesn't just cast the unet to bf16; it also casts the text encoders.

Given how sensitive the text encoders seem to be, it would seem useful to split the --full_bf16 option into two options: one for the unet and one for the text encoders. I've already changed my local code to leave the text encoders as 32-bit floats, and it works fine on my 24GB card. I mean to run some tests to see whether that improves quality when training the text encoders and the unet together.

Wow, this is another important thing. But we are also saving the model as bf16; how does that affect things? In my experiments, full bf16 training performed better than fp16, but obviously I wasn't able to train the text encoder.

araleza closed this as completed on Nov 1, 2023.

jucic commented Nov 3, 2023

Wow, nice. So you are setting the general learning rate to 1e-8 and giving each U-Net block an LR of 1e-5 via the extra arguments, right? Can you share your command, please?

Yes, that's right. In that run, I had 117 training images, and probably around 6 concepts (depending on how you count) in them to be learned. My command line parameters were:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train.py" --enable_bucket --min_bucket_reso=64 --max_bucket_reso=1024 --pretrained_model_name_or_path="/home/tptpt/Documents/Dev/sdxl/sd_xl_base_1.0.safetensors" --train_data_dir="/home/tptpt/trainimgs/kohya/img" --resolution="1024,1024" --output_dir="/home/tptpt/trainimgs/kohya/dreambooth" --logging_dir="/home/tptpt/trainimgs/kohya/log" --save_model_as=safetensors --full_bf16 --vae="/home/tptpt/Documents/Dev/sdxl/sdxl_vae.safetensors" --output_name="trainimgs" --lr_scheduler_num_cycles="20000" --max_token_length=150 --max_data_loader_n_workers="0" --gradient_accumulation_steps=12 --learning_rate="1e-08" --lr_scheduler="constant" --train_batch_size="4" --max_train_steps="2400" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --max_token_length=150 --bucket_reso_steps=32 --v_pred_like_loss="0.5" --save_every_n_steps="250" --save_last_n_steps="750" --min_snr_gamma=5 --flip_aug --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0357 --adaptive_noise_scale=0.00357 --train_text_encoder --block_lr 1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5 --sample_sampler=k_dpm_2 --sample_prompts="/home/tptpt/trainimgs/kohya/dreambooth/sample/prompt.txt" --sample_every_n_steps="35"

One thing I wanted to say clearly on this thread: this fix is not a fix for LoRA as the LoRA script does not have this bug, as far as I know. This fix is only for pure Dreambooth (that is, the Dreambooth tab in kohya). Although LoRA is sometimes (confusingly) referred to as 'Dreambooth LoRA', the script that is being fixed is only for non-LoRA Dreambooth.

I think this fix will also be a fix for the Finetuning tab of kohya, as that also uses sdxl_train.py. But I don't use that tab, and I haven't checked this in detail.

@araleza Thank you for your nice solution. I am new to this repo. The documentation says "sdxl_train.py is a script for SDXL fine-tuning", so I am wondering how to do DreamBooth training for SDXL: does it use the same script as SDXL fine-tuning (sdxl_train.py)? If so, what is the difference between DreamBooth training and fine-tuning of SDXL? I see there are separate Python scripts for SD1.5's DreamBooth training (train_db.py) and fine-tuning (finetune.py).

jucic commented Jan 10, 2024

🥳🎉 I found the bug! 🥳🎉

It turns out that there's a small error in the section of code that adds the LRs for the text encoders if --block_lr is used. If you look at the 25-element long array of LRs that's produced by that code, you can see that the last two are different (and broken) compared to the first 23, which hold the LRs for the unet blocks.

The section of code above that (which is used if --block_lrs is not passed in as a parameter) sets up the LRs for the text encoders in a slightly different way, which does work. So to fix the bug, you just have to copy how that other piece of code adds the LRs. i.e. change:

        params_to_optimize = get_block_params_to_optimize(training_models[0], block_lrs)  # U-Net
        for m in training_models[1:]:  # Text Encoders if exists
            params_to_optimize.append({"params": m.parameters(), "lr": args.learning_rate})

to

        params_to_optimize = get_block_params_to_optimize(training_models[0], block_lrs)  # U-Net
        for m in training_models[1:]:  # Text Encoders if exists
            params = []
            params.extend(m.parameters())
            params_to_optimize.append({"params": params, "lr": args.learning_rate})

in sdxl_train.py and that's the bug fixed.

Edit: And of course also remove the "[0]" from this line (also in sdxl_train.py), so it's not just model 0 (the unet) that trains:

with accelerator.accumulate(training_models[0]): # 複数モデルに対応していない模様だがとりあえずこうしておく

I tried these two updates you mentioned (together with --learning_rate=1e-8 --train_text_encoder --block_lr 1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5) to fine-tune SDXL, but I still got 'Text encoder is same. Extract U-Net only.' when using networks/extract_lora_from_models.py to extract a LoRA model. Does anyone know what the problem is?

gshawn3 commented Apr 30, 2024

I tried these two updates you mentioned (together with --learning_rate=1e-8 --train_text_encoder --block_lr 1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5) to fine-tune SDXL, but I still got 'Text encoder is same. Extract U-Net only.' when using networks/extract_lora_from_models.py to extract a LoRA model. Does anyone know what the problem is?

Have you tried specifying min_diff when calling networks/extract_lora_from_models.py? The default setting might not pick up the differences in the text encoders, which would cause them to not be included in the LoRA. Try setting min_diff to 0.00001 or even 0 and see if that works.
