
Grad in SI #14

Open
ssydasheng opened this issue Mar 22, 2021 · 4 comments

@ssydasheng

Hi,

I have recently been reading your excellent continual-learning implementation, in particular the SI part. In the following line of code you use p.grad, which is the gradient of the regularized loss. However, based on my understanding of SI, the gradient should be computed on the data loss alone, so that it measures how much each weight contributes to the fitting error of the present task. Am I wrong about this, or have I missed something important in your implementation? Thanks in advance for your clarification.

W[n].add_(-p.grad*(p.detach()-p_old[n]))
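For concreteness, a minimal sketch (not the repository's actual training loop) of how the path integral could be accumulated from the data-loss gradient only; model, data_loss, surrogate_loss, c, optimizer, W, and p_old are hypothetical names used purely for illustration:

import torch

# Hypothetical sketch: accumulate SI's running path integral W from the
# data-loss gradient only, taken before the surrogate (regularization)
# term is added to the objective.
def train_step(model, data_loss, surrogate_loss, c, optimizer, W, p_old):
    optimizer.zero_grad()

    # Gradient of the data loss alone, kept aside for the path-integral update.
    data_grads = torch.autograd.grad(data_loss, model.parameters(),
                                     retain_graph=True)

    # Gradient of the full (regularized) loss drives the parameter update.
    total_loss = data_loss + c * surrogate_loss
    total_loss.backward()
    optimizer.step()

    # W += -g_data * (theta_new - theta_old), using the unregularized gradient.
    with torch.no_grad():
        for (n, p), g in zip(model.named_parameters(), data_grads):
            W[n].add_(-g * (p.detach() - p_old[n]))
            p_old[n] = p.detach().clone()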

@GMvandeVen
Owner

Hi, thanks for your interest in the code! You're correct that in the line you refer to, p.grad is the gradient of the regularized loss. But my understanding of SI is that it is this gradient that should be used for computing the parameter regularisation strengths. I think it's outside the scope of this issue to go into too much depth, but a paper I found useful in this regard (even though it is about EWC rather than SI) is this one: https://arxiv.org/abs/1712.03847. Hope this helps!

@GMvandeVen
Owner

Actually, thinking a bit more about it, I'm not sure anymore how relevant the above paper is to this specific issue. Sorry. I quickly checked the original paper on SI, and my impression is that its description is ambiguous as to whether they used the gradient of the regularized or the unregularized loss. (But please correct me if you can point to somewhere in that paper where this is specified!)
Anyway, for this code I intended to follow the approach described in Eqs. 13-15 of this paper (https://www.nature.com/articles/s41467-020-17866-2). Whether or not that is the most appropriate approach is an interesting question; I'll need to think more about it. My guess is that in practice it will not matter too much, although that remains to be demonstrated.
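For reference, a minimal sketch of the consolidation step at the end of a task, in the spirit of Eqs. 13-15 of that paper (this is not the code from this repository; W, omega, p_prev_task and epsilon are assumed names):

import torch

# Hypothetical sketch of the consolidation step at the end of a task:
# omega_i += W_i / (delta_i^2 + epsilon), then reset W and the anchor point.
def update_omega(model, W, omega, p_prev_task, epsilon=0.1):
    with torch.no_grad():
        for n, p in model.named_parameters():
            delta = p.detach() - p_prev_task[n]      # total parameter change over the task
            omega[n] = omega.get(n, torch.zeros_like(p)) + W[n] / (delta ** 2 + epsilon)
            W[n] = torch.zeros_like(p)               # reset the running path integral
            p_prev_task[n] = p.detach().clone()      # new anchor for the quadratic penalty
    return omega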

@ssydasheng
Author

Thank you very much for your detailed response. It seems that the Nature paper you referred to does indeed use the regularized loss, and I haven't found anywhere in the SI paper that specifies whether they used the regularized loss or the original loss. I tried to read the released code of the SI paper, but frankly I didn't manage to understand their code well enough either. That's why I came to read your implementation.

That said, I noticed that their released code uses the original loss in the following line:
https://github.com/ganguli-lab/pathint/blob/master/pathint/optimizers.py#L123
Furthermore, unreg_grads is used in the following line: https://github.com/ganguli-lab/pathint/blob/master/pathint/protocols.py#L34. That's why I thought they used the original loss, though I am not sure about this.

I agree with you that in practice it might not matter too much. But I think using the regularized loss is similar to using larger coefficients for earlier tasks: the Hessian of the quadratic regularizer captures the contributions of the previous tasks, and their contribution is also already present in the omegas, so previous tasks end up being counted multiple times.
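To make the double-counting concrete, here is a sketch of the decomposition under the standard quadratic SI surrogate (notation mine; neither paper writes it exactly this way):

% g^reg decomposes into the data-loss gradient plus the penalty gradient,
% so the running path integral picks up an extra, previous-task term:
\[
g^{\text{reg}}_i = g^{\text{data}}_i + c\,\Omega_i\,(\theta_i - \tilde{\theta}_i),
\qquad
W_i \mathrel{+}= -\,g^{\text{reg}}_i\,\Delta\theta_i
= -\,g^{\text{data}}_i\,\Delta\theta_i \;-\; c\,\Omega_i\,(\theta_i - \tilde{\theta}_i)\,\Delta\theta_i .
\]

The second term scales with the omegas accumulated from previous tasks, so their influence re-enters the new importances on top of the penalty that already protects them.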

@GMvandeVen
Owner

Interesting, thanks! When I have some time I might look more into this.
