Optimisers make me sad #234
Comments
We were just talking in Slack about how in-place gradient updates are actually two separate optimizations: avoiding the temporary allocation of an update array, and fusing the loops of the optimizer expression and the update operation. Few if any deep learning frameworks do the latter (what in Julia would simply be a fused broadcast).
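To make that distinction concrete, here is a minimal sketch, assuming the point about loop fusion refers to Julia's dot-broadcast fusion. The arrays and learning rate below are placeholders, not Flux code:

```julia
η = 0.1            # learning rate
x = randn(1000)    # parameters
Δ = randn(1000)    # gradient

# 1. Out-of-place: `η * Δ` and `x - η * Δ` each allocate a fresh array.
x = x - η * Δ

# 2. In-place update of x, but written as two steps the scaled gradient
#    still materialises a temporary array.
tmp = η .* Δ
x .-= tmp

# 3. Fused: a single dotted expression compiles to one loop over x and Δ,
#    with no temporary allocation at all.
x .-= η .* Δ
```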
Optimisers no longer make me sad! They are not perfect though, so see #637.
I've been meaning to redesign our optimisers for a while, but I figured I'd write down some thoughts in case anyone else wants to take a crack at the design.
The main challenge here is keeping optimisers composable. This means that concepts like SGD, momentum and "SGD with momentum" should all be the same kind of object, and we should be able to construct the latter easily from the two former, like `optimiser(SGD(η), Momentum(ρ))` (or perhaps `SGD(η, Momentum(ρ))`).

What makes this slightly difficult is that the update step `x -= Δx` should be applied once, after all the other optimisers. One easy way to solve this is to make that update explicit (`optimiser(SGD(η), Momentum(ρ), Update())`), but this is fairly bad for convenience. Another way is to make the update itself part of the training loop, but then it can't be adjusted by an optimiser (maybe that's something we don't need to support anyway?).

Some other concerns that I think are less hard to address: implementing new optimisers should be easy (~5 lines), parameters should be adjustable (#233), and the update should be fast. Also, we can't assume in future that gradients will be updated in place, as we will likely have a more functional AD (#86). See also #228, #106.
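For concreteness, here is one rough sketch of how such a composable interface could look. The names (`AbstractOptimiser`, `apply!`, `Optimiser`, `update!`) are hypothetical placeholders for this discussion, not an existing Flux API: each rule rewrites the gradient in place, composition just applies the rules in order, and the final subtraction is owned by `update!` (or by the training loop itself).

```julia
abstract type AbstractOptimiser end

struct SGD <: AbstractOptimiser
    η::Float64
end
apply!(o::SGD, x, Δ) = Δ .*= o.η          # scale the gradient by the learning rate

struct Momentum <: AbstractOptimiser
    ρ::Float64
    velocity::IdDict{Any,Any}             # state keyed by the param object itself
end
Momentum(ρ) = Momentum(ρ, IdDict())

function apply!(o::Momentum, x, Δ)
    v = get!(() -> zero(x), o.velocity, x)
    @. v = o.ρ * v + Δ
    Δ .= v
end

# Composition: a chain of rules is itself a rule.
struct Optimiser <: AbstractOptimiser
    rules::Vector{AbstractOptimiser}
end
Optimiser(rules::AbstractOptimiser...) = Optimiser(AbstractOptimiser[rules...])

function apply!(o::Optimiser, x, Δ)
    for rule in o.rules
        apply!(rule, x, Δ)                # each rule mutates Δ in place
    end
    return Δ
end

# The final update is applied exactly once, outside the rules:
update!(opt, x, Δ) = x .-= apply!(opt, x, Δ)
```

With this shape a new optimiser is a struct plus an `apply!` method (a handful of lines), `Optimiser(SGD(η), Momentum(ρ))` builds "SGD with momentum" out of the two simpler rules, and no explicit `Update()` step is needed because `update!` owns the subtraction.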
Edit: People with knowledge have informed me that making the update part of the train loop would be OK. What I'm thinking in that case is that an optimiser takes a `(param, grad)` pair and updates `grad` in place. If it needs to store state, it can use the param itself as an ID. I think that addresses everything we need.