
Training loss NaN #10

Open
NSun-S opened this issue Nov 15, 2024 · 2 comments

NSun-S commented Nov 15, 2024

Thanks for your awesome work.
I'm trying to reproduce your results for distilling SD-XL. I ran bash examples/train/distill_xl.sh on an 8-GPU machine. Training has run normally for 25 epochs (more than 260,000 steps), but the loss has been NaN the entire time, as shown below:

step_loss: nan, step_loss_noise: nan, step_loss_kd: nan, step_loss_feat: nan

The only modification I made was changing certain lines to ensure the script runs properly. The modified code is as follows:

# Convert images to latent space
with torch.no_grad():
    latents = vae.encode(
        batch["image"].to(accelerator.device, dtype=weight_dtype)
    ).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    latents = latents.to(accelerator.device, dtype=weight_dtype)

Are there any parameters that should be adjusted? Could you provide your training loss curve or training log?
Looking forward to your reply.

Huage001 (Owner) commented

Dear @NSun-S ,

Thanks for your interest in our work! We actually only ran 100,000 steps and have not trained for that long ourselves. If a problem shows up with longer training, I suggest trying the bfloat16 data type, or first training without loss_kd and loss_feat for about 50,000 steps and then adding these loss terms.
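
For the warm-up, something roughly like the following could work. This is only a sketch: loss_noise, loss_kd, loss_feat, global_step, and accelerator are placeholders for the corresponding objects in the training script, not necessarily the exact variable names used there.

# Sketch: add the distillation losses only after a warm-up phase.
# `loss_noise`, `loss_kd`, `loss_feat`, `global_step`, and `accelerator`
# are placeholder names for the corresponding objects in the training script.
kd_warmup_steps = 50_000

loss = loss_noise
if global_step >= kd_warmup_steps:
    loss = loss + loss_kd + loss_feat

accelerator.backward(loss)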

Please let us know if the problem persists.


NSun-S commented Nov 19, 2024


Thanks for your reply. I have tried bf16, and it seems that all three losses are now computed normally.

I will check the performance after training :)
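
For reference, switching to bf16 usually comes down to the standard diffusers/accelerate mixed-precision mapping, roughly as below (a sketch; the exact code behind examples/train/distill_xl.sh may be organized differently), e.g. by passing --mixed_precision bf16 to accelerate launch.

import torch

# Sketch of the usual diffusers/accelerate mixed-precision mapping;
# `accelerator` stands for the Accelerator instance created by the script.
weight_dtype = torch.float32
if accelerator.mixed_precision == "fp16":
    weight_dtype = torch.float16
elif accelerator.mixed_precision == "bf16":
    weight_dtype = torch.bfloat16

# Frozen modules (e.g. the VAE) are then moved to this dtype:
vae.to(accelerator.device, dtype=weight_dtype)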
