Optimisers make me sad #234
Comments
We were just talking in Slack about how in-place gradient updates are actually two separate optimizations: avoiding the temporary allocation of an update array, and fusing the loops of the optimizer expression and the update operation. Few if any deep learning frameworks do the latter (what in Julia would simply be a fused broadcast).
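To make that distinction concrete, here is a minimal sketch, assuming the point about loop fusion refers to Julia's dot-broadcast fusion. The arrays and learning rate below are placeholders, not Flux code:

```julia
η = 0.1            # learning rate
x = randn(1000)    # parameters
Δ = randn(1000)    # gradient

# 1. Out-of-place: `η * Δ` and `x - η * Δ` each allocate a fresh array.
x = x - η * Δ

# 2. In-place update of x, but written as two steps the scaled gradient
#    still materialises a temporary array.
tmp = η .* Δ
x .-= tmp

# 3. Fused: a single dotted expression compiles to one loop over x and Δ,
#    with no temporary allocation at all.
x .-= η .* Δ
```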
Optimisers no longer make me sad! They are not perfect though, so see #637.
I've been meaning to redesign our optimisers for a while, but I figured I'd write down some thoughts in case anyone else wants to take a crack at the design.
The main challenge here is keeping optimisers composable. This means that concepts like SGD, momentum and "SGD with momentum" should all be the same kind of object, and we should be able to construct the latter easily from the two former, like `optimiser(SGD(η), Momentum(ρ))` (or perhaps `SGD(η, Momentum(ρ))`).

What makes this slightly difficult is that the update step `x -= Δx` should be applied once, after all the other optimisers. One easy way to solve this is to make that update explicit (`optimiser(SGD(η), Momentum(ρ), Update())`), but this is fairly bad for convenience. Another way is to make the update itself part of the training loop, but then it can't be adjusted by an optimiser (maybe that's something we don't need to support anyway?).

Some other concerns that I think are less hard to address: implementing new optimisers should be easy (~5 lines), parameters should be adjustable (#233), and the update should be fast. Also, we can't assume in future that gradients will be updated in place, as we will likely have a more functional AD (#86). See also #228, #106.
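For concreteness, here is one rough sketch of how such a composable interface could look. The names (`AbstractOptimiser`, `apply!`, `Optimiser`, `update!`) are hypothetical placeholders for this discussion, not an existing Flux API: each rule rewrites the gradient in place, composition just applies the rules in order, and the final subtraction is owned by `update!` (or by the training loop itself).

```julia
abstract type AbstractOptimiser end

struct SGD <: AbstractOptimiser
    η::Float64
end
apply!(o::SGD, x, Δ) = Δ .*= o.η          # scale the gradient by the learning rate

struct Momentum <: AbstractOptimiser
    ρ::Float64
    velocity::IdDict{Any,Any}             # state keyed by the param object itself
end
Momentum(ρ) = Momentum(ρ, IdDict())

function apply!(o::Momentum, x, Δ)
    v = get!(() -> zero(x), o.velocity, x)
    @. v = o.ρ * v + Δ
    Δ .= v
end

# Composition: a chain of rules is itself a rule.
struct Optimiser <: AbstractOptimiser
    rules::Vector{AbstractOptimiser}
end
Optimiser(rules::AbstractOptimiser...) = Optimiser(AbstractOptimiser[rules...])

function apply!(o::Optimiser, x, Δ)
    for rule in o.rules
        apply!(rule, x, Δ)                # each rule mutates Δ in place
    end
    return Δ
end

# The final update is applied exactly once, outside the rules:
update!(opt, x, Δ) = x .-= apply!(opt, x, Δ)
```

With this shape a new optimiser is a struct plus an `apply!` method (a handful of lines), `Optimiser(SGD(η), Momentum(ρ))` builds "SGD with momentum" out of the two simpler rules, and no explicit `Update()` step is needed because `update!` owns the subtraction.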
Edit: People with knowledge have informed me that making the update part of the train loop would be OK. What I'm thinking in that case is that an optimiser takes a `(param, grad)` pair and updates `grad` in place. If it needs to store state, it can use the param itself as an ID. I think that addresses everything we need.