Move all optimizers to Optimisers.jl #9
Conversation
src/rules.jl (outdated)
```diff
-function (o::Descent)(m, m̄)
-    update(o, m, m̄, state(o, m))[1]
-end
+(o::Descent)(m, dm, st) = update!(o, m, dm, st)
```
Following from FluxML/FluxML-Community-Call-Minutes#22 (comment), I'm not sure we should make these structs callable at all. Given that none of the methods do anything more than forward to `update(!)` with the optimizer object itself, why not get rid of them altogether?
I agree, but I left them in the PR so we could discuss and make a decision.
It also (wrongly) makes it look like the optimizer has completely internal state, even though this will still return an updated state.
Sorry for the mass cc's, but I figured this was an important enough change that we should have a full discussion.
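To illustrate the point about state, a minimal sketch using the `state`/callable names from the diff above (the threading shown is an assumption about typical usage):

```julia
st = state(o, m)     # explicit state, per this PR's interface
m, st = o(m, m̄, st)  # no different from update!(o, m, m̄, st):
                     # the new state comes back and must be kept by the caller
```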
Thanks so much for looking into this!
I haven't gone through the entire PR yet, but with Optimisers.jl we can't do in-place updates, and we have a slightly different design to allow things like XLA (and even the GPU compilation work in Julia) to do more aggressive optimisations.
Okay, but I don't want to halt other progress until XLA/GPUCompiler is ready. We could always support both in-place and immutable updates for now.
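As a rough sketch of what supporting both could look like, assuming the `apply`/`update` names used elsewhere in this thread (`update!` here is my invention, not code from the PR):

```julia
# Out-of-place: allocates fresh arrays, friendly to XLA-style compilers.
function update(o, x, dx, st)
    dx, st = apply(o, x, dx, st)  # rule adjusts the gradient, returns new state
    return x .- dx, st
end

# In-place: reuses the parameter memory, avoiding the copies discussed above.
function update!(o, x, dx, st)
    dx, st = apply(o, x, dx, st)
    x .-= dx
    return x, st
end
```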
I'm happy to switch this PR to be completely immutable if we already have something ready to address the allocation issue.
Supporting both doesn't seem worth it, and the work will immediately be useful in 1.6, so it's not halting progress there. We should be alright.
Do you mean optimization passes through AbstractInterpreter?
No, I mean the GPU codegen work that has already been merged in Base.
Sorry, do you mind linking PRs? So this will make all the out-of-place updates mutating, as an optimization for code involving …?
There isn't one PR directly linked with it, unfortunately. I guess I could link to the recent GPU work in the language. It's not something that will enable universally better code, I don't think, but it fits in line with the programming model of GPUs, where you do end up wanting to free memory as soon as possible. Allocations can be improved by improving the function calls just as much, so it's not inherently something to hold this up on.
This is breaking the current convention of passing the learning rate as the first positional argument; it is going to break basically all existing code. Did we already have some discussion on that? Also, having a Unicode keyword arg that is used so often is not ideal. I'm not opposing the change, but we have to be careful about it; maybe do it somewhere else.
I would leave out the changes in interface.jl for the purposes of this PR. It may make things faster to model structure as in #3.
No, and sorry about that! TBH, in all my time using Flux, I never internalized that the first argument was always the learning rate. I always end up looking it up, because it is easier to double-check than to have my training not work because I set the momentum instead of the learning rate. Hence the change to keyword args, where it is easier to remember. That being said, I have no strong feelings on this change and am happy to go back to avoid breaking things (see the small illustration after this comment).

The changes in interface.jl are mostly the mutating stuff. I feel like we should discuss this, because the key feature of functional optimizers is explicit state, not immutability. If there is something ready to roll out in a few weeks that makes immutable updates mutating as an optimization, then I'll just drop the changes to interface.jl and make everything immutable. But if it will be months before we get such features ready for use, then I would strongly recommend we consider keeping mutability right now. At the end of the day, there is no world in which we can have optimizers that make copies of all the model parameters/gradients (even on Flux#master). It seems silly not to provide what we think is a better optimizer and scheduling design just because we also want feature X. Why not make feature X part of a later PR if it will be ready later?

Re: ParameterSchedulers.jl: @DhairyaLGandhi, you mentioned you wanted to discuss here. What are the design assumptions that need to be addressed? We can fix ParameterSchedulers.jl to account for them.
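For what it's worth, the lr-vs-momentum ambiguity in miniature; this `Momentum` is a stand-in definition, not Flux's actual struct:

```julia
# Base.@kwdef generates both a positional and a keyword constructor.
Base.@kwdef struct Momentum
    η::Float64 = 0.01  # learning rate
    ρ::Float64 = 0.9   # momentum
end

Momentum(0.9, 0.01)          # positional: silently swaps lr and momentum
Momentum(η = 0.01, ρ = 0.9)  # keyword: unambiguous at the call site
```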
Both are …
Not sure which features you're talking about here? But I'm pretty sure immutable updates here are an explicit design choice, and we would maintain that. Re: ParameterSchedulers.jl: apologies, I meant that we should move the discussion and implementation from FluxML/Flux.jl#1487 to FluxML/Flux.jl#1481, and (as a follow-up) the scheduling interface should be an implementation of the optimisers+training without needing additional handling on either side. In other words, we need to generalise the interface around these methods to allow for scheduling.
Either way, the optimisation around mutability is orthogonal, and would not need weeks if it's required urgently.
I'm talking about the feature where …
The explicit state is that generalization; that was the purpose of this PR. Based on this discussion, I think we have two options: either make a PR to Flux.jl itself that makes all the optimizers use explicit state and leave Optimisers.jl alone, or merge some version of this PR into Optimisers.jl with the work around mutability left for the future.
Also, I just pushed some code I had implemented some time back for the remaining optimisers in …
Agreed, but it's incomplete in and of itself. As an example, making some information around training available to the scheduler would be very helpful.
Maybe I'm not following this properly, apologies if so, but I'm pretty sure that Optimisers.jl is that …
What information would that be? Scheduling policies require the current iteration, which is easily specified through the explicit state.
That's what I'm trying to say! As a requirement for doing scheduling right, it became clear that the optimizers needed to move to explicit state. So I opened the discussion topic, then prepared this PR in light of that discussion. This PR follows the design principles of Optimisers.jl in using explicit state. What isn't required (right now) is immutability. So I am suggesting we save the immutability design goal for later, so that we can have explicit-state optimizers in Flux now.
The epoch of training, for example. Immutability is part of the design and interface, and I'd keep that. I would much prefer a follow-up issue rather than blocking the current work.
The epoch of training wouldn't necessarily be information in …
That's why I suggested that if we can't budge on immutability in Optimisers.jl, then I can submit a PR to Flux.jl that implements explicit state but not immutability.
This looks fine, except for the fact that the composite optimizer should be renamed.
All good concerns, thanks! I think the naming is not the end of the world, but I would definitely prefer a better name. I don't think we want to add a state field though; this is a different design, and we shouldn't be too worried about the breakages, which can be managed elsewhere (see the corresponding PR to Flux where I have updated the commented-out …). We should definitely come up with a smoother breakage path, though, although I'm not sure we can get around everything.
AFAICT, even with the commented …, I don't think there is really any clean way around it. The only way to avoid passing around the state would be to store it in some structure. The only structures available to … We could do something like

```julia
struct OptAndState{O, S}
    opt::O
    state::S
end
```

Then use …
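A hedged sketch of how such a wrapper might be threaded through a step, assuming the `update` function from this PR (the `step` helper is hypothetical); since the struct is immutable, each step rebuilds it:

```julia
function step(os::OptAndState, m, dm)
    m, st = update(os.opt, m, dm, os.state)
    return OptAndState(os.opt, st), m  # rebuild the wrapper with the new state
end
```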
Will …
@darsnack this is a basic script that I'd like to see still working after the change:

```julia
using Flux
using Flux.Optimise

opt = ADAM(0.1)
x = rand(2)
for t = 1:10
    g = gradient(x -> sum(x.^2), x)
    update!(opt, x, g[1])
    opt.eta /= 2
end
```

If we keep this PR as it is, then in order to provide a deprecation path in Flux we would have to create a parallel family of optimizers, ADAM, Nesterov, etc., where

```julia
struct ADAM
    opt::Optimisers.ADAM
    state
end
```

then …
In that case I'd rather just have a breaking change.
Okay, renamed to …

I disagree with a breaking change. This is such a fundamental piece of the framework that we shouldn't play fast and loose with people's code. @CarloLucibello I think we could have deprecation-path code in this repo in a … If there is a concern about code complexity, then I think something like the following should resolve it:

```julia
for Opt in (:Descent, :ADAM) # the real code would include all the optimisers
    @eval begin
        mutable struct $Opt{T}
            opt::Optimisers.$Opt
            state::T
        end

        # Forward property access to the wrapped rule; getfield avoids
        # recursing through this getproperty definition.
        Base.getproperty(o::$Opt, p::Symbol) = getproperty(getfield(o, :opt), p)
        Base.setproperty!(o::$Opt, p::Symbol, v) = setproperty!(getfield(o, :opt), p, v)

        function apply!(o::$Opt, x, dx)
            dx, s = apply(getfield(o, :opt), x, dx, getfield(o, :state))
            setfield!(o, :state, s)
            return dx
        end
    end
end
```
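A hedged example of what using such a wrapper might look like (the `state` call follows the interface shown in the diff at the top; the exact constructor is an assumption):

```julia
rule = Optimisers.Descent(0.1)
x    = rand(2)
o    = Descent(rule, state(rule, x))  # wrapper: old name, explicit state inside
dx   = apply!(o, x, 2 .* x)           # old mutating API; state updated internally
```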
Ok, we'll cook something like that in Flux, handling … Two more things (feel free to ignore both): …
I don't think we would play fast and loose, but instead deprecate the current ones as a whole, pointing to the new explicit-state approach, and handle both of them inside Flux for a technically breaking change. That way we can preserve custom loops for one more release cycle.
What about …?

There are probably more reasons than I list here for explicit state, so @ToucheSir and @DhairyaLGandhi can chime in. With …, someone might write:

```julia
struct Scheduler{S, O}
    opt::O
    schedule::S
    iter::Int
end

function apply!(o::Scheduler, x, dx)
    o.opt.eta = o.schedule[o.iter]  # mutates the wrapped optimizer
    o.iter += 1                     # mutates a field of an immutable struct
    return apply!(o.opt, x, dx)
end
```

This seems like a perfectly reasonable and logical thing to write, but it won't work unless …

So, state is a requirement for almost every optimizer. Aside from a complicated paragraph in the docs, the best way to make this clear to users is by making state part of the interface, since it is a requirement and not an implementation detail. It also allows anyone writing an optimizer to write the natural code (w.r.t. a single …).

I think, largely, the benefits are that this is a clearer, more extensible design. It should make for safer code with easier-to-detect errors. There's also the XLA side of this, where explicit state and immutability can allow compiler optimizations, but those are further down the line.
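For contrast, a hedged sketch of the same scheduler written against explicit state, where the iteration counter lives in the state tuple (I hardcode `Descent` as the wrapped rule just to keep the sketch concrete, and assume the `state`/`apply` names from this PR):

```julia
struct Scheduler{S}
    schedule::S
end

state(o::Scheduler, x) = (1, state(Descent(o.schedule[1]), x))

function apply(o::Scheduler, x, dx, st)
    iter, opt_st = st
    opt = Descent(o.schedule[iter])         # rebuild the rule with the scheduled lr
    dx, opt_st = apply(opt, x, dx, opt_st)
    return dx, (iter + 1, opt_st)           # the counter advances inside the state
end
```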
It also means that the code for an optimizer is stable. When we want to special-case how we handle the multiple states associated with a data structure, it only needs to be done once, and none of the optimizer rules need to change their code.
I think everything should be resolved in the latest commit. I was convinced that … Docstrings are in place, but we should make setting up the docs a separate PR. Tests are completely ported from Flux.jl.
@DhairyaLGandhi can I have write access to this repo? (Actually, if it could be made org-wide it would be much more convenient.)
Bump: can we get write access for this repo to match the org? I think this can be merged.
Thanks!
This PR finishes the already good work in this repo to move all the optimizers from Flux to Optimisers.jl. The main advantage of the implementation in Optimisers.jl is the use of functors and explicit state. This eliminates the need for any `IdDict`s and simplifies how compositions of basic optimizers work.

One notable exception: this implementation is mutable. The simple reason for this is that immutable updates will copy entire models and result in high memory usage. Until we have a compiler optimization to address this, this PR opts to have mutable updates for now. You can read more about the reasoning in this design note.

Remaining tasks: …
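For reference, a hedged sketch of the training step this design enables, following the `state`/`update` names from the diff at the top (`model`, `loss`, and `data` are assumed to exist):

```julia
using Optimisers, Zygote

opt = Optimisers.Descent(0.1)
st  = state(opt, model)  # explicit state, held by the caller rather than an IdDict
for (x, y) in data
    dm = gradient(m -> loss(m(x), y), model)[1]
    model, st = update(opt, model, dm, st)  # returns the new model and new state
end
```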
cc @DhairyaLGandhi @CarloLucibello @mcabbott @ToucheSir