
Determine VAE model convergence #18

Open
Anonnoname opened this issue Feb 7, 2023 · 5 comments
Comments

@Anonnoname

Hello! I'd like to ask how I can determine if my VAE model has converged. Which metrics or loss should I look at? When I'm training on the car dataset, as the KL weights increase, the latent points become more noisy, leading to a decrease in reconstruction quality. Is it possible that if I keep training the model, the reconstruction quality will continue to get worse? If so, how can I know when to stop training?

I used the default config, with trainer.epochs set to 800.
[Screenshot: latent points and reconstruction at step 25480]

@Anonnoname
Author

Additionally, how can I determine whether the diffusion model has converged? I noticed that the loss stopped decreasing in the early epochs, but the overall quality of the samples has continued to improve over time.

@fradino

fradino commented Feb 10, 2023

Hello, have you ever encountered a situation where the loss becomes NaN when training the VAE?

@ZENGXH
Collaborator

ZENGXH commented Feb 20, 2023

@Anonnoname For the VAE training, it usually converges after the KL annealing stops. The criterion for a good VAE is that it achieves reasonably good reconstruction performance while the latent points look (slightly) smoother than the input points. In my experiment, the latent points look like this at iteration 144400:
[Screenshot: latent points at iteration 144400]

I feel like your reconstruction is worse than expected, and the latent points are over-smoothed. This is usually caused by a high KL loss weight. Are you using the default config?
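
In case it helps, here is a rough sketch of what I mean by waiting for the annealing to finish and the reconstruction to plateau. The names and values below are illustrative only, not the actual LION training code:

```python
# Illustrative sketch only -- not the actual LION training loop.
import torch

def kl_weight(step, anneal_steps=30000, max_weight=0.5):
    """Linear KL annealing: ramp the weight up, then hold it at max_weight."""
    return max_weight * min(step / anneal_steps, 1.0)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a, b of shape (B, N, 3)."""
    d = torch.cdist(a, b)                                   # (B, N, N) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

def recon_plateaued(history, window=10, tol=1e-3):
    """True once the held-out reconstruction metric stops improving,
    comparing the mean of the last `window` evals to the window before it."""
    if len(history) < 2 * window:
        return False
    recent = sum(history[-window:]) / window
    earlier = sum(history[-2 * window:-window]) / window
    return earlier - recent < tol
```

Chamfer distance is just one choice of reconstruction metric here; whatever metric your eval script already logs works the same way, as long as you only start reading it after the KL weight has reached its final value.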

@ZENGXH
Collaborator

ZENGXH commented Feb 20, 2023

For the diffusion model, the loss tends to have high variance, so it's hard to judge convergence from the loss alone. I usually 1) evaluate the checkpoint every 1000 epochs and decide from the evaluation metric, and 2) visualize the results. In my experience, LION usually converges at around 10k iterations.
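
As a concrete (hypothetical) example of deciding from the evaluation metric rather than the training loss: score the saved checkpoints offline and stop once the metric has stopped improving for a few evaluations in a row. The numbers below are made up for illustration:

```python
# Illustrative sketch only -- not the actual LION evaluation code.
def pick_converged_checkpoint(scores, patience=3, tol=0.01):
    """scores: list of (iteration, metric) pairs for a metric where lower is better,
    computed offline on generated samples. Returns the iteration at which the metric
    has stopped improving by more than `tol` for `patience` consecutive evaluations."""
    best = float("inf")
    stale = 0
    for it, metric in scores:
        if metric < best - tol:
            best, stale = metric, 0
        else:
            stale += 1
        if stale >= patience:
            return it
    return scores[-1][0] if scores else None

# Hypothetical metric values, evaluated every 1000 iterations:
scores = [(1000, 0.92), (2000, 0.81), (3000, 0.74), (4000, 0.73),
          (5000, 0.72), (6000, 0.72), (7000, 0.72), (8000, 0.72)]
print(pick_converged_checkpoint(scores))  # -> 8000
```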

@ZENGXH
Collaborator

ZENGXH commented Feb 20, 2023

@fradino for the NaN issue, could you start another issue and post your log & config so that I can help with that?
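
In the meantime, a generic PyTorch-style debugging sketch (not LION-specific; the loss names below are placeholders) that often helps locate where the NaN first appears:

```python
# Generic NaN-debugging sketch, not the actual LION code.
import torch

torch.autograd.set_detect_anomaly(True)  # report the op that produced NaN/Inf in backward

def check_finite(name, value):
    """Fail fast as soon as a loss term stops being finite."""
    if not torch.isfinite(value).all():
        raise RuntimeError(f"{name} became non-finite: {value}")

# Inside the training step (recon_loss / kl_loss / kl_weight are placeholders):
#   check_finite("recon_loss", recon_loss)
#   check_finite("kl_loss", kl_loss)
#   loss = recon_loss + kl_weight * kl_loss
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```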
