Using natural gradient for the distribution of inducing variables in the inner layers #32
Comments
Yes, you can certainly do this, but you might end up taking overly large steps. I've played around with this myself a bit, and it's not totally straightforward to make it work without tuning. If you only use natural gradients for the final layer with a Gaussian likelihood, things are easy (e.g. gamma=0.05 will likely work). For non-Gaussian likelihoods or inner layers the method still works, but care is needed not to use too large a step size. If you see a Cholesky failed error, it's probably for this reason. A simple thing to try is reducing the step size. Alternatively, some adaptive method can work well.
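To make this concrete, here is a minimal sketch of the final-layer-only scheme. It assumes GPflow 2's `NaturalGradient` optimizer and a hypothetical DGP object `dgp` whose layers expose `q_mu`/`q_sqrt` (the layer attributes, the training data `X, Y`, and the loss closure are stand-ins, not necessarily this repo's exact API):

```python
# Sketch: natural gradients on the final layer's q(u) only, Adam elsewhere.
# Assumes GPflow 2 and a hypothetical DGP `dgp` whose layers expose q_mu / q_sqrt.
import gpflow
import tensorflow as tf

gamma = 0.05  # step size that "will likely work" for a Gaussian likelihood

final = dgp.layers[-1]
# Keep the nat-grad variables out of Adam's variable list so the two
# optimizers don't take conflicting steps on the same parameters.
gpflow.set_trainable(final.q_mu, False)
gpflow.set_trainable(final.q_sqrt, False)

natgrad = gpflow.optimizers.NaturalGradient(gamma=gamma)
adam = tf.optimizers.Adam(0.01)
loss = dgp.training_loss_closure((X, Y))  # X, Y: training data

@tf.function
def step():
    natgrad.minimize(loss, var_list=[(final.q_mu, final.q_sqrt)])
    adam.minimize(loss, var_list=dgp.trainable_variables)

for _ in range(1000):
    step()
```

If the Cholesky inside the nat-grad step fails, reducing `gamma` is the first thing to try, per the comment above.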
An update: I've done some more experiments with this and have found that nat grads can work well if tuned correctly, but can sometimes converge to poor local optima. This is to be expected, I think, since the optimization of the inner layers is potentially highly multimodal, so momentum-based optimization might find a better optimum in practice.
I've run some more experiments, and the interesting observation is that when it is possible to use nat grads for the inner layers with a relatively high gamma (0.1), the results are better than with nat grads only in the last layer, in terms of ELBO, uncertainty estimation, and prediction quality of the final model. (My experiments were restricted to the two-layer case.)
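Under the same assumptions as the sketch above, extending natural gradients to every layer's q(u) would look roughly like this (gamma=0.1 follows the two-layer experiments just described; lower it if the Cholesky fails):

```python
# Sketch: natural gradients on every layer's q(u), assuming the same
# hypothetical GPflow 2 DGP `dgp` as in the previous sketch.
gamma = 0.1  # worked in the two-layer experiments above; may need lowering

variational_params = []
for layer in dgp.layers:
    gpflow.set_trainable(layer.q_mu, False)
    gpflow.set_trainable(layer.q_sqrt, False)
    variational_params.append((layer.q_mu, layer.q_sqrt))

natgrad = gpflow.optimizers.NaturalGradient(gamma=gamma)
adam = tf.optimizers.Adam(0.01)
loss = dgp.training_loss_closure((X, Y))

@tf.function
def step():
    natgrad.minimize(loss, var_list=variational_params)
    adam.minimize(loss, var_list=dgp.trainable_variables)
```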
The nat grad code allows the natural gradient to be taken with respect to different parameterizations. The default is the natural parameters, as this works best in this paper, but this wasn't assessed for a deep GP. I've never tried different parameterizations for the deep GP, but I'd be interested to know how well it works!
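For reference, GPflow 2 exposes the parameterization as an optional third element of each `var_list` entry. A hedged sketch, reusing `loss` and `final` from the first sketch (`XiSqrtMeanVar` takes the step in the (q_mu, q_sqrt) coordinates instead of the natural parameters, which is one of the alternatives one could try):

```python
# Sketch: taking the natural-gradient step in a different parameterization.
# Omitting the third element defaults to the natural parameters, as above.
from gpflow.optimizers.natgrad import XiSqrtMeanVar

natgrad = gpflow.optimizers.NaturalGradient(gamma=0.05)
natgrad.minimize(
    loss,
    var_list=[(final.q_mu, final.q_sqrt, XiSqrtMeanVar())],
)
```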
Dear Salimbeni,
In your demo you used the natural gradient to optimize the distribution of the inducing variables at the final layer. I thought it might be interesting to use the natural gradient also for the distribution of the inducing variables in the inner layers. However, I always obtain an error in the Cholesky decomposition: "Cholesky decomposition was not successful. The input might not be valid." I never obtain this error when using the natural gradient only for the final layer.
Did you encounter this problem?
Thank you in advance.