
Training loss NaN #10

Open
NSun-S opened this issue Nov 15, 2024 · 2 comments

NSun-S commented Nov 15, 2024

Thanks for your awesome work.
I'm trying to reproduce your results for distilling SD-XL. I ran bash examples/train/distill_xl.sh on an 8-GPU machine. Training has run normally for 25 epochs (more than 260,000 steps), but the loss has been NaN the entire time, as shown below:

step_loss: nan, step_loss_noise: nan, step_loss_kd: nan, step_loss_feat: nan

The only modification I made was changing certain lines to ensure the script runs properly. The modified code is as follows:

# Convert images to latent space
with torch.no_grad():
    latents = vae.encode(
        batch["image"].to(accelerator.device, dtype=weight_dtype)
    ).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    latents = latents.to(accelerator.device, dtype=weight_dtype)

Are there any parameters that should be adjusted? Could you provide your training loss curve or training log?
Looking forward to your reply.

Huage001 (Owner) commented

Dear @NSun-S ,

Thanks for your interest in our work! We actually only ran 100,000 steps and have not trained for that long ourselves. If a problem shows up with longer training, I suggest trying the bfloat16 data type, or first training without loss_kd and loss_feat for about 50,000 steps and then adding these loss terms.
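
For the warm-up, something roughly like the following could work. This is only a sketch: loss_noise, loss_kd, loss_feat, global_step, and accelerator are placeholders for the corresponding objects in the training script, not necessarily the exact variable names used there.

# Sketch: add the distillation losses only after a warm-up phase.
# `loss_noise`, `loss_kd`, `loss_feat`, `global_step`, and `accelerator`
# are placeholder names for the corresponding objects in the training script.
kd_warmup_steps = 50_000

loss = loss_noise
if global_step >= kd_warmup_steps:
    loss = loss + loss_kd + loss_feat

accelerator.backward(loss)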

Please let us know if the problem persists.


NSun-S commented Nov 19, 2024


Thanks for your reply. I have tried bf16, and it seems that all three losses are now computed normally.

I will check the performance after training :)
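
For reference, switching to bf16 usually comes down to the standard diffusers/accelerate mixed-precision mapping, roughly as below (a sketch; the exact code behind examples/train/distill_xl.sh may be organized differently), e.g. by passing --mixed_precision bf16 to accelerate launch.

import torch

# Sketch of the usual diffusers/accelerate mixed-precision mapping;
# `accelerator` stands for the Accelerator instance created by the script.
weight_dtype = torch.float32
if accelerator.mixed_precision == "fp16":
    weight_dtype = torch.float16
elif accelerator.mixed_precision == "bf16":
    weight_dtype = torch.bfloat16

# Frozen modules (e.g. the VAE) are then moved to this dtype:
vae.to(accelerator.device, dtype=weight_dtype)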
