
A question on convergence of DSM #9

Open
cheind opened this issue Dec 18, 2021 · 5 comments

Comments

@cheind

cheind commented Dec 18, 2021

Hey,

I'm currently tracing the development of diffusion generative models, and right now I'm studying the denoising score matching (DSM) objective. I've noticed that your multi-scale approach relies heavily on it (and the original paper is quite old), so I decided to ask my question here.

I've gone through the theory of DSM and have a good grip on how and why it works. However, in practice I observe slow convergence (much slower than with ISM) on toy examples. In particular, I believe this might be due to the type of noise distribution selected. While DSM doesn't restrict the choice, it seems everyone goes with a normal distribution since it yields a simple derivative: the score of the perturbation kernel is `1/sigma**2 * (orig - perturbed)`. In practice, I've observed that the scale term in front causes this target to take values on the order of 1e4 for sigma=1e-2, and the loss jumps around quite heavily. The smaller sigma, the slower the convergence. The loss never really decreases, but the resulting gradient field looks comparable to what ISM gives.
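For concreteness, here is a minimal sketch of the single-scale DSM loss I'm describing (assuming a hypothetical PyTorch model `score_net(x)` that returns the estimated score; names are illustrative):

```python
import torch

def dsm_loss(score_net, x, sigma=1e-2):
    """Single-scale denoising score matching with Gaussian noise.

    The regression target is the score of the perturbation kernel,
    (x - x_tilde) / sigma**2, whose 1/sigma**2 factor blows up for
    small sigma and makes the loss very noisy.
    """
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    target = (x - x_tilde) / sigma**2      # = -noise / sigma**2, large for small sigma
    pred = score_net(x_tilde)
    return 0.5 * ((pred - target) ** 2).sum(dim=-1).mean()
```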

Did you observe this in your experiments as well?

@yang-song
Member

Very good observation! The convergence of DSM is plagued by large variance and will be very slow for small sigma. This is a known issue, but it can be alleviated by control variates (see https://arxiv.org/abs/2101.03288 as an example). In our experiments we do DSM across multiple noise scales and didn't observe slowed convergence, since there are many large sigmas among the noise scales.
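Roughly, the per-scale losses are also weighted (e.g. by sigma**2, as in the NCSN paper) so that all scales contribute on a comparable order of magnitude; a sketch, assuming a noise-conditional model `score_net(x, sigma)` (names illustrative):

```python
import torch

def multiscale_dsm_loss(score_net, x, sigmas):
    """Sketch of DSM averaged over several noise scales.

    Weighting each scale by sigma**2 rescales the target to roughly
    unit order, so large-sigma terms keep training well-behaved even
    when small sigmas alone would give a high-variance loss.
    """
    losses = []
    for sigma in sigmas:
        noise = torch.randn_like(x) * sigma
        x_tilde = x + noise
        target = -noise / sigma**2
        pred = score_net(x_tilde, sigma)
        # lambda(sigma) = sigma**2: equivalent to regressing sigma * score onto -noise / sigma
        losses.append(0.5 * sigma**2 * ((pred - target) ** 2).sum(dim=-1).mean())
    return torch.stack(losses).mean()
```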

@cheind
Author

cheind commented Dec 18, 2021

Ah ok, I was already planning for variance reduction methods :) For larger sigmas everything seems much smoother - I observed that as well. I wonder whether the runtime advantage of DSM over ISM isn't eaten up again by slower convergence? After all, for ISM we only need the trace of the Jacobian, which should be faster to compute than the entire Jacobian (if frameworks like PyTorch supported such an operation). I already have a quite fast version (limited to specific NN architectures) here:

https://github.com/cheind/diffusion-models/blob/189fbf545f07be0f8f9c42bc803016b846602f3c/diffusion/jacobians.py#L5
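For comparison, the generic (architecture-agnostic) way to get the exact trace needs one backward pass per dimension, which is exactly what makes ISM expensive; a sketch, assuming a model `score_net(x)` with output shape (batch, D):

```python
import torch

def score_jacobian_trace(score_net, x):
    """Exact trace of the Jacobian of the score, via D backward passes.

    This is the term appearing in implicit score matching; the loop over
    dimensions is what makes ISM costly for high-dimensional data.
    """
    x = x.detach().requires_grad_(True)
    trace = torch.zeros(x.shape[0], device=x.device)
    for d in range(x.shape[1]):
        score = score_net(x)                                   # (batch, D)
        grad = torch.autograd.grad(score[:, d].sum(), x, create_graph=True)[0]
        trace = trace + grad[:, d]                             # accumulate d s_d / d x_d
    return trace
```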

@yang-song
Member

The trace of the Jacobian is still very expensive to compute. That said, there are methods like sliced score matching that do not add noise and are not affected by variance issues. I tried them for training score-based models before. They gave decent performance, but didn't seem to outperform DSM.
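For completeness, a rough sketch of the sliced score matching objective, with `score_net` again a placeholder model of output shape (batch, D):

```python
import torch

def ssm_loss(score_net, x, n_projections=1):
    """Sketch of sliced score matching (Song et al., 2019).

    Random projection vectors v turn the Jacobian trace into a cheap
    vector-Jacobian product, so no noise has to be added to the data.
    """
    x = x.detach().requires_grad_(True)
    total = 0.0
    for _ in range(n_projections):
        v = torch.randn_like(x)
        score = score_net(x)                                     # (batch, D)
        sv = (score * v).sum()                                   # scalar for autograd
        grad_sv = torch.autograd.grad(sv, x, create_graph=True)[0]  # rows v^T J
        loss = (v * grad_sv).sum(dim=-1) + 0.5 * (score * v).sum(dim=-1) ** 2
        total = total + loss.mean()
    return total / n_projections
```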

cheind added a commit to cheind/score-matching that referenced this issue Dec 18, 2021
@cheind
Author

cheind commented Dec 18, 2021

Yes, very true when the data dimension becomes large. I was thinking about (low-rank) approximations to the Jacobian and came across this paper:

Abdel-Khalik, Hany S., et al. "A low rank approach to automatic differentiation." Advances in Automatic Differentiation. Springer, Berlin, Heidelberg, 2008. 55-65.

which is also quite dated. But after skimming it, the idea seems connected to your sliced SM approach: as if sliced score matching were computing a low-rank approximation of the Jacobian.
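To illustrate what I mean, here is a rough, hypothetical rank-k randomized estimate of a Jacobian built from k JVPs (single sample, 1-D `x`):

```python
import torch

def randomized_jacobian_estimate(f, x, k=4):
    """Rank-k randomized estimate of the Jacobian of f at a 1-D point x.

    Since E[(J v) v^T] = J when E[v v^T] = I, averaging k such outer
    products gives an unbiased, rank-k estimate of J -- roughly the
    flavour of the connection to sliced score matching noted above.
    """
    estimate = 0.0
    for _ in range(k):
        v = torch.randn_like(x)
        _, jv = torch.autograd.functional.jvp(f, x, v)   # J @ v without forming J
        estimate = estimate + torch.outer(jv, v)
    return estimate / k
```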

Ok, thanks for your valuable time and have a nice Saturday.

@cheind
Author

cheind commented Dec 23, 2021

I've recreated your toy example to compare Langevin and annealed Langevin sampling. In particular, I did not use exact scores but trained a toy model to predict them. The results are below. In the first figure, the right plot shows default Langevin sampling (model trained unconditionally) with the expected issues. The next figure (again the right plot) shows annealed Langevin sampling as proposed in your paper (model trained conditioned on the noise level). The results are as expected, but I had to change one particular thing to make it work:

  • The noise levels range over [2..0.01] instead of [20..1] as mentioned in the paper. I tried the original settings, but a sigma of 20 basically gives a flat space, which led to particles flying off in all kinds of directions.

I believe the difference is due to the inexactness of the model's predictions and, of course, potential hidden errors in my code. Would you agree?
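For reference, a minimal sketch of the annealed Langevin loop as described in your paper (hypothetical `score_net(x, sigma)`; step-size base `eps` and iteration counts are illustrative):

```python
import torch

def annealed_langevin(score_net, x, sigmas, steps_per_sigma=100, eps=2e-5):
    """Sketch of annealed Langevin dynamics (Song & Ermon, 2019).

    sigmas is a decreasing sequence of noise levels, e.g. geometric
    from 2.0 down to 0.01 as used above; the step size shrinks with
    sigma**2 so the update stays stable at small noise levels.
    """
    for sigma in sigmas:
        alpha = eps * (sigma / sigmas[-1]) ** 2          # step size for this noise level
        for _ in range(steps_per_sigma):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_net(x, sigma) + (alpha ** 0.5) * z
    return x
```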

[Figure: default_langevin - default Langevin sampling]
[Figure: annealed_langevin - annealed Langevin sampling]
