--train_text_encoder in sdxl_train.py does literally NOTHING #890

Closed
araleza opened this issue Oct 20, 2023 · 26 comments

araleza commented Oct 20, 2023

Hi!

I've been trying to perform Dreambooth training of the SDXL text encoders without affecting the unet at all. There doesn't seem to be an option in sdxl_train.py to specifically target only the text encoder, so I've achieved that by using these options:

--learning_rate="0.001"
--train_text_encoder --block_lr 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Watching the Python code execute, all looks well: just the text encoder parameters get marked for training, the two text models are set to train=True, the learning rate of 0.001 is attached to the text encoders, and loss is generated and backpropagated.

The only problem is that nothing happens. No matter what learning rate I set (even crazy-high values like --learning_rate="0.001", as in my example here, to make it obvious if any training is occurring), the sample output images remain exactly the same quality as if no training were being performed. They get neither worse nor better.

I can train only the text encoder without a problem when I'm making a LoRA, but it seems non-functional for Dreambooth training. I can easily see the text encoder changing the sample output images as my LoRA trains, but it doesn't work for Dreambooth.

Here's my full command, in case there is anything interesting in there:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train.py" --enable_bucket --min_bucket_reso=64 --max_bucket_reso=1024 --pretrained_model_name_or_path="/home/tptpt/Documents/Dev/sdxl/sd_xl_base_1.0.safetensors" --train_data_dir="/home/tptpt/Documents/Dev/sdxl/training/trainimgs/kohya/img" --resolution="1024,1024" --output_dir="/home/tptpt/Documents/Dev/sdxl/training/trainimgs/kohya/dreambooth" --logging_dir="/home/tptpt/Documents/Dev/sdxl/training/trainimgs/kohya/log" --save_model_as=safetensors --full_bf16 --vae="/home/tptpt/Documents/Dev/sdxl/sdxl_vae.safetensors" --output_name="trainimgs" --lr_scheduler_num_cycles="20000" --max_token_length=150 --max_data_loader_n_workers="0" --learning_rate="0.001" --lr_scheduler="constant" --train_batch_size="4" --max_train_steps="12000" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --max_token_length=150 --bucket_reso_steps=32 --save_every_n_steps="300" --save_last_n_steps="900" --flip_aug --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0357 --train_text_encoder --block_lr 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 --sample_sampler=k_dpm_2 --sample_prompts="/home/tptpt/Documents/Dev/sdxl/training/trainimgs/kohya/dreambooth/sample/prompt.txt" --sample_every_n_steps="154"

Any idea what's wrong? Not being able to train the text encoders is a disaster for people trying to add a new unknown keyword.

araleza commented Oct 21, 2023

The issue seems to be related to this line of code:

with accelerator.accumulate(training_models[0]): # 複数モデルに対応していない模様だがとりあえずこうしておく

The comment says (according to Google Translate):

It doesn't seem to support multiple models, but let's do this for now

Deleting the [0] from the accelerator.accumulate line doesn't make it work, just as the comment suggests.

But if I remove the unet entirely from the training models with the code change below (so that one of the text encoders becomes training model index 0), then that text encoder does start training. So I can train one of the text encoders by itself by hacking the code this way, but not both the unet and a text encoder at the same time.

I replaced (in sdxl_train.py):

    # 学習を準備する:モデルを適切な状態にする
    training_models = []
    if args.gradient_checkpointing:
        unet.enable_gradient_checkpointing()
    training_models.append(unet)

with

    # 学習を準備する:モデルを適切な状態にする
    training_models = []

    # If training the text encoder, disable training the unet :-/
    if args.train_text_encoder:
        accelerator.print("training text encoder 1.  DISABLING UNET TRAINING.")
        unet.requires_grad_(False)
        unet.eval()
    else:
        accelerator.print("enabling unet training")
        if args.gradient_checkpointing:
            unet.enable_gradient_checkpointing()
        training_models.append(unet)

Presumably I could also modify the code below the part I changed so that only the second text encoder is added as a training model; then I could train that one individually too. It would be better if I could get both text encoders and the unet all training at once, though.

araleza commented Oct 21, 2023

One other side note: the CUDA memory requirements seem to jump up (rather than down) when I disable unet training. The same happens with block_lr=0,0,0,0...0,0,0. I don't know why that might be.

araleza commented Oct 21, 2023

If you use the code change above so that the unet is not added to the list of training models, I think you should also comment out one or the other of these two lines:

        training_models.append(text_encoder1)
        training_models.append(text_encoder2)

to choose which of the two text models you're training. If you leave them both in, something might go wrong with training. For example, the gradient norm clipping may act on both of them combined, because all the parameters from every model in the training_models list get pooled together there.

araleza commented Oct 21, 2023

This thread seems closely related to a previous (and still open) issue:
#855

kohya-ss (Owner) commented:

Thank you for opening the issue.

I updated the version of accelerate recently, and I noticed that the latest accelerate supports gradient accumulation for multiple models, which it didn't previously. But it seems to cause this issue.

I will update the script to handle multiple-model accumulation for the latest accelerate.

FurkanGozukara commented:

Thank you for the experiments.

fauzanardh commented:

For multi-model gradient accumulation, you only need to change this line from:
with accelerator.accumulate(training_models[0]):
to
with accelerator.accumulate(*training_models):

I haven't tested this yet, but according to the accelerate source, that's all you need to do.

araleza commented Oct 23, 2023

I haven't tested this yet, but according to the accelerate source, that's all you need to do.

I gave it a try just now, thanks for finding that. But it doesn't seem to train the text encoders for me. :(

fauzanardh commented Oct 23, 2023

But it doesn't seem to train the text encoders for me.

Yeah, that's what I was thinking; I don't think the underlying problem is the gradient accumulation.

kohya-ss commented Oct 24, 2023

I gave it a try just now, thanks for finding that. But it doesn't seem to train the text encoders for me. :(

Thank you for reporting. I tried it and got the same result. So far, I have no idea why the Text Encoder is not trained when block_lr is used.

Edit: Even if I set positive (non-zero) values for block_lr, the Text Encoder is still not trained.

I will do further testing. Please let me know if you find anything out.

kohya-ss (Owner) commented:

One other side note: the CUDA memory requirements seem to jump up (rather than down) when I disable unet training. The same happens with block_lr=0,0,0,0...0,0,0. I don't know why that might be.

This is perhaps because gradient checkpointing of the U-Net is disabled when the model is in eval mode (i.e., with unet.eval()).

araleza commented Oct 24, 2023

It's not a huge improvement, but I think that it's possible to train both text encoders at once. I changed:

    # 学習を準備する:モデルを適切な状態にする
    training_models = []
    if args.gradient_checkpointing:
        unet.enable_gradient_checkpointing()
    training_models.append(unet)

to

    # 学習を準備する:モデルを適切な状態にする
    training_models = []
    if args.gradient_checkpointing:
        unet.enable_gradient_checkpointing()

    # If training the text encoder, disable training the unet :-/
    if args.train_text_encoder:
        accelerator.print("training both text encoders.  DISABLING UNET TRAINING.")
        unet.requires_grad_(False)
        unet.eval()
    else:
        training_models.append(unet)

and also deleted the [0] in this line:

with accelerator.accumulate(training_models[0]): # 複数モデルに対応していない模様だがとりあえずこうしておく

And my resulting sample images show that training is occurring, which makes me suspect both text encoders are learning at the same time.

fauzanardh commented:

I did some testing: I trained two models, one with block_lr and one without; both were trained for around 10k steps with a batch size of 32. I only printed params with a max difference higher than 0.01, so the output isn't crowded.

This model was trained with learning_rate=2e-4 and block_lr="4e-7,4e-7,...,4e-7", compared against the original model:

Printing params that have diff more than 0.01
text_encoder_1
{'name': 'text_model.encoder.layers.1.layer_norm1.weight', 'shape': [768], 'max_diff': 0.01171875, 'mean_diff': 0.0017792383441701531, 'std_diff': 0.0012432975927367806}
{'name': 'text_model.encoder.layers.8.layer_norm1.bias', 'shape': [768], 'max_diff': 0.01171875, 'mean_diff': 0.0001159746534540318, 'std_diff': 0.0004530348815023899}

text_encoder_2
{'name': 'text_model.encoder.layers.0.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.03125, 'mean_diff': 0.0024458884727209806, 'std_diff': 0.002944410778582096}
{'name': 'text_model.encoder.layers.1.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.00202178955078125, 'std_diff': 0.0019075109157711267}
{'name': 'text_model.encoder.layers.1.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0023109912872314453, 'std_diff': 0.00207878602668643}
{'name': 'text_model.encoder.layers.3.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0014869689475744963, 'std_diff': 0.0015505485935136676}
{'name': 'text_model.encoder.layers.3.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0021964549086987972, 'std_diff': 0.001743764034472406}
{'name': 'text_model.encoder.layers.4.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0015705109108239412, 'std_diff': 0.0015068750362843275}
{'name': 'text_model.encoder.layers.7.layer_norm1.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.00011402685049688444, 'std_diff': 0.0003584640217013657}
{'name': 'text_model.encoder.layers.8.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0019271194469183683, 'std_diff': 0.001346334582194686}
{'name': 'text_model.encoder.layers.11.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.002540290355682373, 'std_diff': 0.0020399941131472588}
{'name': 'text_model.encoder.layers.12.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0026359320618212223, 'std_diff': 0.0020279372110962868}
{'name': 'text_model.encoder.layers.13.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0027931213844567537, 'std_diff': 0.0022182632237672806}
{'name': 'text_model.encoder.layers.14.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0020116805098950863, 'std_diff': 0.001513096154667437}
{'name': 'text_model.encoder.layers.16.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0015537261497229338, 'std_diff': 0.001214643707498908}
{'name': 'text_model.encoder.layers.17.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0026606558822095394, 'std_diff': 0.0021262539085000753}
{'name': 'text_model.encoder.layers.17.layer_norm2.bias', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0009027037885971367, 'std_diff': 0.0009552778210490942}
{'name': 'text_model.encoder.layers.18.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.002589607145637274, 'std_diff': 0.002045104745775461}
{'name': 'text_model.encoder.layers.19.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0024156570434570312, 'std_diff': 0.001925305463373661}
{'name': 'text_model.encoder.layers.22.self_attn.q_proj.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.00030964481993578374, 'std_diff': 0.0006552143604494631}
{'name': 'text_model.encoder.layers.22.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.002284479094669223, 'std_diff': 0.0017820867942646146}
{'name': 'text_model.encoder.layers.25.self_attn.q_proj.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.00034595682518556714, 'std_diff': 0.000717785966116935}
{'name': 'text_model.encoder.layers.27.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0019824982155114412, 'std_diff': 0.0015578048769384623}
{'name': 'text_model.encoder.layers.28.self_attn.out_proj.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.00015195719606708735, 'std_diff': 0.00036566369817592204}
{'name': 'text_model.encoder.layers.28.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0020486831199377775, 'std_diff': 0.0015739863738417625}
{'name': 'text_model.encoder.layers.29.mlp.fc2.bias', 'shape': [1280], 'max_diff': 0.0234375, 'mean_diff': 0.0003259133663959801, 'std_diff': 0.0007729948265478015}
{'name': 'text_model.encoder.layers.29.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.002061605453491211, 'std_diff': 0.0015250068390741944}
{'name': 'text_model.encoder.layers.31.mlp.fc2.bias', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0004916787147521973, 'std_diff': 0.0007746534538455307}
{'name': 'text_model.encoder.layers.31.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0015727996360510588, 'std_diff': 0.0013080338248983026}
{'name': 'text_model.final_layer_norm.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0016974449390545487, 'std_diff': 0.001521552330814302}

te1_max_diff: 0.03125, te2_max_diff: 0.03125
te1_mean_diff: 0.00040936409948128536, te2_mean_diff: 0.00043955951851882993
te1_std_diff: 0.00032798347192464444, te2_std_diff: 0.00038715505570771836

And this one was trained with only learning_rate=4e-7, compared against the original model:

Printing params that have diff more than 0.01
text_encoder_1
{'name': 'text_model.encoder.layers.1.layer_norm1.weight', 'shape': [768], 'max_diff': 0.01171875, 'mean_diff': 0.0017792383441701531, 'std_diff': 0.0012432975927367806}
{'name': 'text_model.encoder.layers.8.layer_norm1.bias', 'shape': [768], 'max_diff': 0.01171875, 'mean_diff': 0.00011711272964021191, 'std_diff': 0.000453238288173452}

text_encoder_2
{'name': 'text_model.encoder.layers.0.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.03125, 'mean_diff': 0.0024458884727209806, 'std_diff': 0.002944410778582096}
{'name': 'text_model.encoder.layers.1.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.00202178955078125, 'std_diff': 0.0019075109157711267}
{'name': 'text_model.encoder.layers.1.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0023109912872314453, 'std_diff': 0.00207878602668643}
{'name': 'text_model.encoder.layers.3.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0014869689475744963, 'std_diff': 0.0015505485935136676}
{'name': 'text_model.encoder.layers.3.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0021964549086987972, 'std_diff': 0.001743764034472406}
{'name': 'text_model.encoder.layers.4.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0015705109108239412, 'std_diff': 0.0015068750362843275}
{'name': 'text_model.encoder.layers.7.layer_norm1.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.00011414461914682761, 'std_diff': 0.00035843203659169376}
{'name': 'text_model.encoder.layers.8.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0019271194469183683, 'std_diff': 0.001346334582194686}
{'name': 'text_model.encoder.layers.11.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.002540290355682373, 'std_diff': 0.0020399941131472588}
{'name': 'text_model.encoder.layers.12.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0026359320618212223, 'std_diff': 0.0020279372110962868}
{'name': 'text_model.encoder.layers.13.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0027931213844567537, 'std_diff': 0.0022182632237672806}
{'name': 'text_model.encoder.layers.14.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0020116805098950863, 'std_diff': 0.001513096154667437}
{'name': 'text_model.encoder.layers.16.layer_norm1.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0015537261497229338, 'std_diff': 0.001214643707498908}
{'name': 'text_model.encoder.layers.17.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0026606558822095394, 'std_diff': 0.0021262539085000753}
{'name': 'text_model.encoder.layers.17.layer_norm2.bias', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0009027037885971367, 'std_diff': 0.0009552778210490942}
{'name': 'text_model.encoder.layers.18.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.002589607145637274, 'std_diff': 0.002045104745775461}
{'name': 'text_model.encoder.layers.19.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0024156570434570312, 'std_diff': 0.001925305463373661}
{'name': 'text_model.encoder.layers.22.self_attn.q_proj.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0003100501489825547, 'std_diff': 0.0006551524274982512}
{'name': 'text_model.encoder.layers.22.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.002284479094669223, 'std_diff': 0.0017820867942646146}
{'name': 'text_model.encoder.layers.25.self_attn.q_proj.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0003461359883658588, 'std_diff': 0.0007177115185186267}
{'name': 'text_model.encoder.layers.27.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0019824982155114412, 'std_diff': 0.0015578048769384623}
{'name': 'text_model.encoder.layers.28.self_attn.out_proj.bias', 'shape': [1280], 'max_diff': 0.01171875, 'mean_diff': 0.0001525732659501955, 'std_diff': 0.00036568197538144886}
{'name': 'text_model.encoder.layers.28.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0020486831199377775, 'std_diff': 0.0015739863738417625}
{'name': 'text_model.encoder.layers.29.mlp.fc2.bias', 'shape': [1280], 'max_diff': 0.0234375, 'mean_diff': 0.00032617890974506736, 'std_diff': 0.0007729247445240617}
{'name': 'text_model.encoder.layers.29.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.002061605453491211, 'std_diff': 0.0015250068390741944}
{'name': 'text_model.encoder.layers.31.mlp.fc2.bias', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0004916787147521973, 'std_diff': 0.0007746534538455307}
{'name': 'text_model.encoder.layers.31.layer_norm2.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0015727996360510588, 'std_diff': 0.0013080338248983026}
{'name': 'text_model.final_layer_norm.weight', 'shape': [1280], 'max_diff': 0.015625, 'mean_diff': 0.0016974449390545487, 'std_diff': 0.001521552330814302}

te1_max_diff: 0.03125, te2_max_diff: 0.03125
te1_mean_diff: 0.00041061308598839283, te2_mean_diff: 0.00044157158263049396
te1_std_diff: 0.00033168707022923627, te2_std_diff: 0.00039243965048368104

Lastly, this one is just a sanity check: I compared the same model against itself:

Printing params that have diff more than 0.01
text_encoder_1

text_encoder_2

te1_max_diff: 0.0, te2_max_diff: 0.0
te1_mean_diff: 0.0, te2_mean_diff: 0.0
te1_std_diff: 0.0, te2_std_diff: 0.0

What's intriguing to me is that the weight differences between the two trained models with different settings (one with block_lr, one without) are exactly the same (unless the difference is really, I mean REALLY, small). And only the biases showed any variation.

fauzanardh commented Oct 25, 2023

And this is the code I'm using to compare the parameters:

from typing import List

import torch.nn as nn


def compare_params(model1: nn.Module, model2: nn.Module) -> List:
    # Compare corresponding PyTorch parameters between the two models
    params_diff = []
    for (k1, v1), (k2, v2) in zip(model1.named_parameters(), model2.named_parameters()):
        if k1 != k2:
            continue
        params_diff.append(
            {
                "name": k1,
                "shape": list(v1.shape),
                "max_diff": (v1 - v2).abs().max().item(),
                "mean_diff": (v1 - v2).abs().mean().item(),
                "std_diff": (v1 - v2).abs().std().item(),
            },
        )
    return params_diff
...
te1_params_diff = compare_params(orig_text_encoder_1, trained_text_encoder_1)
te2_params_diff = compare_params(orig_text_encoder_2, trained_text_encoder_2)
...

Edit: Make model name more verbose

araleza commented Oct 25, 2023

That's good information, @fauzanardh. I did wonder if the text encoder might have some tiny changes (even though we've been saying that it doesn't train at all), because faces become slightly distorted when using --train_text_encoder. But like you say, it's clearly not learning anything real, because an LR of 2e-4 should be destroying the text model, and that isn't happening.

I tried a new experiment too. We've been talking about whether accelerator.accumulate(training_models) is working correctly with multiple models, and it occurred to me that the only reason we use Accelerate is to support multiple GPUs (whether they're in one PC or spread across multiple PCs). Since I'm only using one GPU, I tried ripping the Accelerate usage out of sdxl_train.py. And without it being used at all, I get the same incorrect behavior as before, with the text encoder only training if the unet isn't. So whatever the issue is, it has nothing to do with any bugs in Accelerate's recently added support for multiple models, or with the way kohya's sd-scripts call it.

araleza commented Oct 25, 2023

🥳🎉 I found the bug! 🥳🎉

It turns out that there's a small error in the section of code that adds the LRs for the text encoders if --block_lr is used. If you look at the 25-element long array of LRs that's produced by that code, you can see that the last two are different (and broken) compared to the first 23, which hold the LRs for the unet blocks.

The section of code above that (which is used if --block_lrs is not passed in as a parameter) sets up the LRs for the text encoders in a slightly different way, which does work. So to fix the bug, you just have to copy how that other piece of code adds the LRs. i.e. change:

        params_to_optimize = get_block_params_to_optimize(training_models[0], block_lrs)  # U-Net
        for m in training_models[1:]:  # Text Encoders if exists
            params_to_optimize.append({"params": m.parameters(), "lr": args.learning_rate})

to

        params_to_optimize = get_block_params_to_optimize(training_models[0], block_lrs)  # U-Net
        for m in training_models[1:]:  # Text Encoders if exists
            params = []
            params.extend(m.parameters())
            params_to_optimize.append({"params": params, "lr": args.learning_rate})

in sdxl_train.py and that's the bug fixed.

Edit: And of course also remove the "[0]" from this line (also in sdxl_train.py), so it's not just model 0 (the unet) that trains:

with accelerator.accumulate(training_models[0]): # 複数モデルに対応していない模様だがとりあえずこうしておく

FurkanGozukara commented:

@araleza Amazing find, thank you so much.

I hope it gets fixed ASAP, @kohya-ss.

araleza commented Oct 25, 2023

@araleza Amazing find, thank you so much.

I hope it gets fixed ASAP, @kohya-ss.

Thanks @FurkanGozukara. Now that it's working, I find the LR for the text encoder has to be very, very small. I have it set to 1e-8, and that's with batch_size=4 and a gradient accumulation size of 12. The text encoder learning rate might have to be reduced further if you're using a smaller batch size. But even with that low learning rate, my resulting output appears much improved. :)

kohya-ss (Owner) commented:

Thanks so much for finding the bug! It is excellent.

I found a more detailed cause: nn.Module.parameters() returns an iterator over the parameters, but the following part was consuming that iterator:

    # calculate number of trainable parameters
    n_params = 0
    for params in params_to_optimize:
        for p in params["params"]:
            n_params += p.numel()

Therefore, by the time the optimizer was constructed, the iterator was already exhausted and the parameter group passed to the optimizer was empty.

This problem can be avoided by converting the return value of parameters() to a list, as shown below.
(This is in principle the same as araleza's fix.)

    if block_lrs is None:
        params_to_optimize = [
            {"params": list(training_models[0].parameters()), "lr": args.learning_rate},
        ]
    else:
        params_to_optimize = get_block_params_to_optimize(training_models[0], block_lrs)  # U-Net

    for m in training_models[1:]:  # Text Encoders if exists
        params_to_optimize.append({"params": list(m.parameters()), "lr": args.learning_rate_te or args.learning_rate})

I have done some simple testing and it seems to work well.

I will fix it with #895 today. Sorry for the delay.

FurkanGozukara commented Oct 25, 2023

@araleza Amazing find, thank you so much.
I hope it gets fixed ASAP, @kohya-ss.

Thanks @FurkanGozukara. Now that it's working, I find the LR for the text encoder has to be very, very small. I have it set to 1e-8, and that's with batch_size=4 and a gradient accumulation size of 12. The text encoder learning rate might have to be reduced further if you're using a smaller batch size. But even with that low learning rate, my resulting output appears much improved. :)

Wow, nice. So you are setting the general learning rate to 1e-8 and giving each U-Net block an LR of 1e-5 via the extra arguments, right? Can you share your command, please?

I did some text encoder testing for SDXL in the past. I didn't get good results; it was getting cooked very quickly.

So that was probably due to the bug. Very nice, waiting for the fix to test again.

araleza commented Oct 26, 2023

Wow, nice. So you are setting the general learning rate to 1e-8 and giving each U-Net block an LR of 1e-5 via the extra arguments, right? Can you share your command, please?

Yes, that's right. In that run, I had 117 training images, and probably around 6 concepts (depending on how you count) in them to be learned. My command line parameters were:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train.py" --enable_bucket --min_bucket_reso=64 --max_bucket_reso=1024 --pretrained_model_name_or_path="/home/tptpt/Documents/Dev/sdxl/sd_xl_base_1.0.safetensors" --train_data_dir="/home/tptpt/trainimgs/kohya/img" --resolution="1024,1024" --output_dir="/home/tptpt/trainimgs/kohya/dreambooth" --logging_dir="/home/tptpt/trainimgs/kohya/log" --save_model_as=safetensors --full_bf16 --vae="/home/tptpt/Documents/Dev/sdxl/sdxl_vae.safetensors" --output_name="trainimgs" --lr_scheduler_num_cycles="20000" --max_token_length=150 --max_data_loader_n_workers="0" --gradient_accumulation_steps=12 --learning_rate="1e-08" --lr_scheduler="constant" --train_batch_size="4" --max_train_steps="2400" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --max_token_length=150 --bucket_reso_steps=32 --v_pred_like_loss="0.5" --save_every_n_steps="250" --save_last_n_steps="750" --min_snr_gamma=5 --flip_aug --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0357 --adaptive_noise_scale=0.00357 --train_text_encoder --block_lr 1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5 --sample_sampler=k_dpm_2 --sample_prompts="/home/tptpt/trainimgs/kohya/dreambooth/sample/prompt.txt" --sample_every_n_steps="35"

One thing I wanted to say clearly on this thread: this fix is not a fix for LoRA as the LoRA script does not have this bug, as far as I know. This fix is only for pure Dreambooth (that is, the Dreambooth tab in kohya). Although LoRA is sometimes (confusingly) referred to as 'Dreambooth LoRA', the script that is being fixed is only for non-LoRA Dreambooth.

I think this fix will also be a fix for the Finetuning tab of kohya, as that also uses sdxl_train.py. But I don't use that tab, and I haven't checked this in detail.

araleza commented Oct 28, 2023

@fauzanardh - I just realized the likely cause of those small changes to the text encoder that you detected. We usually train with --full_bf16 enabled. That option doesn't just cast the unet to bf16; it also casts the text encoders.

Given how sensitive the text encoders seem to be, it would seem useful to split the --full_bf16 option into two options: one for the unet and one for the text encoders. I've already changed my local code to leave the text encoders as 32-bit floats, and it works fine on my 24GB card. I mean to run some tests to see whether that improves quality when training the text encoders and the unet together.

FurkanGozukara commented:

@fauzanardh - I just realized the likely cause of those small changes to the text encoder that you detected. We usually train with --full_bf16 enabled. That option doesn't just cast the unet to bf16; it also casts the text encoders.

Given how sensitive the text encoders seem to be, it would seem useful to split the --full_bf16 option into two options: one for the unet and one for the text encoders. I've already changed my local code to leave the text encoders as 32-bit floats, and it works fine on my 24GB card. I mean to run some tests to see whether that improves quality when training the text encoders and the unet together.

Wow, this is another important thing. But we are also saving the model as bf16; how does that affect things? In my experiments, full bf16 training performed better than fp16, but obviously I wasn't able to train the text encoder.

araleza closed this as completed on Nov 1, 2023.

jucic commented Nov 3, 2023

Wow, nice. So you are setting the general learning rate to 1e-8 and giving each U-Net block an LR of 1e-5 via the extra arguments, right? Can you share your command, please?

Yes, that's right. In that run, I had 117 training images, and probably around 6 concepts (depending on how you count) in them to be learned. My command line parameters were:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train.py" --enable_bucket --min_bucket_reso=64 --max_bucket_reso=1024 --pretrained_model_name_or_path="/home/tptpt/Documents/Dev/sdxl/sd_xl_base_1.0.safetensors" --train_data_dir="/home/tptpt/trainimgs/kohya/img" --resolution="1024,1024" --output_dir="/home/tptpt/trainimgs/kohya/dreambooth" --logging_dir="/home/tptpt/trainimgs/kohya/log" --save_model_as=safetensors --full_bf16 --vae="/home/tptpt/Documents/Dev/sdxl/sdxl_vae.safetensors" --output_name="trainimgs" --lr_scheduler_num_cycles="20000" --max_token_length=150 --max_data_loader_n_workers="0" --gradient_accumulation_steps=12 --learning_rate="1e-08" --lr_scheduler="constant" --train_batch_size="4" --max_train_steps="2400" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --max_token_length=150 --bucket_reso_steps=32 --v_pred_like_loss="0.5" --save_every_n_steps="250" --save_last_n_steps="750" --min_snr_gamma=5 --flip_aug --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0357 --adaptive_noise_scale=0.00357 --train_text_encoder --block_lr 1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5 --sample_sampler=k_dpm_2 --sample_prompts="/home/tptpt/trainimgs/kohya/dreambooth/sample/prompt.txt" --sample_every_n_steps="35"

One thing I wanted to say clearly on this thread: this fix is not a fix for LoRA as the LoRA script does not have this bug, as far as I know. This fix is only for pure Dreambooth (that is, the Dreambooth tab in kohya). Although LoRA is sometimes (confusingly) referred to as 'Dreambooth LoRA', the script that is being fixed is only for non-LoRA Dreambooth.

I think this fix will also be a fix for the Finetuning tab of kohya, as that also uses sdxl_train.py. But I don't use that tab, and I haven't checked this in detail.

@araleza Thank you for your nice solution. I am new to this repo. The documentation says "sdxl_train.py is a script for SDXL fine-tuning", so I am wondering how to do DreamBooth training for SDXL: does it use the same script as SDXL fine-tuning (sdxl_train.py)? If so, what is the difference between DreamBooth training and fine-tuning of SDXL? I see there are separate Python scripts for SD1.5's DreamBooth training (train_db.py) and fine-tuning (finetune.py).

jucic commented Jan 10, 2024

🥳🎉 I found the bug! 🥳🎉

It turns out that there's a small error in the section of code that adds the LRs for the text encoders if --block_lr is used. If you look at the 25-element long array of LRs that's produced by that code, you can see that the last two are different (and broken) compared to the first 23, which hold the LRs for the unet blocks.

The section of code above that (which is used if --block_lrs is not passed in as a parameter) sets up the LRs for the text encoders in a slightly different way, which does work. So to fix the bug, you just have to copy how that other piece of code adds the LRs. i.e. change:

        params_to_optimize = get_block_params_to_optimize(training_models[0], block_lrs)  # U-Net
        for m in training_models[1:]:  # Text Encoders if exists
            params_to_optimize.append({"params": m.parameters(), "lr": args.learning_rate})

to

        params_to_optimize = get_block_params_to_optimize(training_models[0], block_lrs)  # U-Net
        for m in training_models[1:]:  # Text Encoders if exists
            params = []
            params.extend(m.parameters())
            params_to_optimize.append({"params": params, "lr": args.learning_rate})

in sdxl_train.py and that's the bug fixed.

Edit: And of course also remove the "[0]" from this line (also in sdxl_train.py), so it's not just model 0 (the unet) that trains:

with accelerator.accumulate(training_models[0]): # 複数モデルに対応していない模様だがとりあえずこうしておく

I tried these two updates you mentioned (together with --learning_rate=1e-8 --train_text_encoder --block_lr 1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5) to fine-tune SDXL, but I still got 'Text encoder is same. Extract U-Net only.' when using networks/extract_lora_from_models.py to extract a LoRA model. Does anyone know what the problem is?

gshawn3 commented Apr 30, 2024

I tried these two updates you mentioned (together with --learning_rate=1e-8 --train_text_encoder --block_lr 1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5,1e-5) to fine-tune SDXL, but I still got 'Text encoder is same. Extract U-Net only.' when using networks/extract_lora_from_models.py to extract a LoRA model. Does anyone know what the problem is?

Have you tried specifying min_diff when calling networks/extract_lora_from_models.py? The default setting might not pick up the differences in the text encoders, which would cause them to not be included in the LoRA. Try setting min_diff to 0.00001 or even 0 and see if that works.
