Thanks for your awesome work.
I'm trying to reproduce your results for distilling SD-XL. I ran bash examples/train/distill_xl.sh on an 8-GPU machine. It has been running for 25 epochs (more than 260,000 steps), but the loss has been NaN the whole time, as shown below:
step_loss: nan, step_loss_noise: nan, step_loss_kd: nan, step_loss_feat: nan
The only modification I made was changing certain lines to ensure the script runs properly. The modified code is as follows:
# Convert images to latent space
with torch.no_grad():
    latents = vae.encode(
        batch["image"].to(accelerator.device, dtype=weight_dtype)
    ).latent_dist.sample()
latents = latents * vae.config.scaling_factor
latents = latents.to(accelerator.device, dtype=weight_dtype)
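Side note: since the stock SD-XL VAE is known to overflow in float16 (which would make the latents, and therefore every loss, NaN), one variant of the snippet above I could also try is keeping the encode in float32 and casting down afterwards. This is just a sketch using the same names (vae, batch, accelerator, weight_dtype) as above, not the change I actually made:

```python
# Sketch only, not the modification above: encode in float32 to rule out
# fp16 overflow in the SD-XL VAE, a common source of NaN latents.
vae.to(accelerator.device, dtype=torch.float32)

with torch.no_grad():
    pixel_values = batch["image"].to(accelerator.device, dtype=torch.float32)
    latents = vae.encode(pixel_values).latent_dist.sample()
    latents = latents * vae.config.scaling_factor

# Cast down to the training dtype only after encoding.
latents = latents.to(dtype=weight_dtype)

# Optional fail-fast check so a bad batch is caught immediately.
if not torch.isfinite(latents).all():
    raise RuntimeError("VAE produced non-finite latents")
```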
Are there any parameters that should be adjusted? Could you provide your training loss curve or training log?
Looking forward to your reply.
Thanks for your interest in our work! We actually only ran 100,000 steps, so we have not trained for that long ourselves. If a problem appears with longer training, I suggest trying the bfloat16 data type, or first training without loss_kd and loss_feat for about 50,000 steps and then adding these loss terms.
Please let us know if the problem persists.
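Roughly something like the following for the staged schedule; the variable names here (loss_noise, loss_kd, loss_feat, global_step, accelerator) are only illustrative, taken from the logged metric names, so adapt them to the actual names in the script:

```python
# Hypothetical two-stage loss schedule; names are assumed from the logged
# metrics (step_loss_noise / step_loss_kd / step_loss_feat) and may differ
# from the real training script.
KD_WARMUP_STEPS = 50_000  # roughly the warmup length suggested above

if global_step < KD_WARMUP_STEPS:
    # Stage 1: train with the noise-prediction loss only.
    loss = loss_noise
else:
    # Stage 2: add the distillation terms back in.
    loss = loss_noise + loss_kd + loss_feat

accelerator.backward(loss)
```

For the bfloat16 suggestion: if the script is launched through Accelerate, passing --mixed_precision bf16 to accelerate launch (and setting weight_dtype accordingly) is the usual switch.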
Thanks for your reply. I tried bf16, and it looks like all three losses are now computed normally.