Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move all optimizers to Optimisers.jl #9

Merged
merged 15 commits into from
Feb 8, 2021

Conversation

darsnack
Copy link
Member

@darsnack darsnack commented Jan 29, 2021

This PR finishes the already good work in this repo to move all the optimizers from Flux to Optimisers.jl. The main advantage of the implementation in Optimisers.jl is the use of functors and explicit state. This eliminates the need for any IdDicts and simplifies how compositions of basic optimizers work.

One notable exception: this implementation is mutable. The simple reason for this is that immutable updates will copy entire models and result in high memory usage. Until we have a compiler optimization to address this, this PR opts to have mutable updates for now. You can read more about the reasoning in this design note.

Remaining tasks:

  • docstrings
  • clean up tests

cc @DhairyaLGandhi @CarloLucibello @mcabbott @ToucheSir

src/rules.jl Outdated
end

function (o::Descent)(m, m̄)
update(o, m, m̄, state(o, m))[1]
(o::Descent)(m, dm, st) = update!(o, m, dm, st)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following from FluxML/FluxML-Community-Call-Minutes#22 (comment), I'm not sure we should make these structs callable at all. Given none of methods do anything more than forward to update(!) with the optimizer object itself, why not get rid of them altogether?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but left them in the PR so we could discuss and make a decision

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also (wrongly) makes it look like the optimizer has completely internal state even though this will still return an updated state.

@darsnack
Copy link
Member Author

Sorry for the mass cc's but I figured this was an important enough change that we should have a full discussion

src/rules.jl Outdated Show resolved Hide resolved
src/interface.jl Outdated Show resolved Hide resolved
src/rules.jl Outdated Show resolved Hide resolved
Copy link
Member

@DhairyaLGandhi DhairyaLGandhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for looking into this!

I haven't gone through the entire pr yet but with optimisers.jl, we can't do inplace updates and have a slightly different design to allow for things like xla (and even the GPU compilation work in Julia) to do more aggressive optimisations.

src/Optimisers.jl Outdated Show resolved Hide resolved
src/interface.jl Outdated Show resolved Hide resolved
src/rules.jl Show resolved Hide resolved
@darsnack
Copy link
Member Author

Okay but I don't want to halt other progress until XLA/GPUCompiler is ready. We could always support both in-place and immutable updates for now.

@darsnack
Copy link
Member Author

I'm happy to switch this PR to be completely immutable if we already have something ready to address the allocation issue.

@DhairyaLGandhi
Copy link
Member

Supporting both doesn't seem worth it, and the work will immediately be useful in 1.6, so it's not halting progress there. We should be alright

@darsnack
Copy link
Member Author

Do you mean optimization passes through AbstractInterpreter?

@DhairyaLGandhi
Copy link
Member

DhairyaLGandhi commented Jan 29, 2021

No, I mean the gpu codegen work that has already been merged in base.

@darsnack
Copy link
Member Author

Sorry, do you mind linking PRs? So this will make all the out-of-place updates mutating as an optimization for code involving CuArrays?

@DhairyaLGandhi
Copy link
Member

There isn't one pr directly linked with it, unfortunately. I guess I could link to the recent work in GPU work in the language. It's not something that will enable universally better code I don't think, but it fits in line with the programming model of GPUs where you do end up wanting to free up memory as soon as possible.

Allocations can be improved by improving the function calls just as much, so not inherently something to hold this up on.

src/rules.jl Outdated Show resolved Hide resolved
src/rules.jl Outdated Show resolved Hide resolved
@CarloLucibello
Copy link
Member

CarloLucibello commented Jan 29, 2021

This is breaking the current convention of passing the lr as the first positional argument, it is going to break basically all existing code. Did we already have some discussion on that? Also, having a unicode keyword arg so often used is not ideal. I'm not opposing the change, but we have to be careful about it, maybe doit somewhere else

@DhairyaLGandhi
Copy link
Member

I would leave out the changes in interface.jl for the purposes of this PR.

It may make things faster to model structure as in #3

@darsnack
Copy link
Member Author

This is breaking the current convention of passing the lr as the first positional argument, it is going to break basically all existing code. Did we already have some discussion on that?

No, and sorry about that! TBH in all my time using Flux, I never realized that the 1st arg was always LR. I always end up looking it up, because it is easier to double check than two have my training not work cause I set the momentum instead of the LR. Thus, the change to keyword args where it is easier to remember. That being said, I have no strong feelings on this change and happy to go back to avoid breaking things.


The changes in interface.jl are mostly the mutating stuff. I feel like we should discuss this, because the key feature of functional optimizers is explicit state not immutability. If there is something ready to roll out in a few weeks that make immutable updates mutating as an optimization, then I'll just drop the changes to interface.jl and make everything immutable.

But if it will be months before we get such features ready for use, then I would strongly recommend we consider keeping mutability right now. At the end of the day, there is no world in which we can have optimizers that make copies of all the model parameters/gradients (even on Flux#master). It seems silly to not provide what we think is a better optimizer and scheduling just because we also want feature X. Why not make feature X part of a later PR if it will be ready later?


Re: ParameterSchedulers.jl: @DhairyaLGandhi you mentioned you wanted to discuss here. What are the design assumptions that need to be addressed? We can fix ParameterSchedulers.jl to account for them.

@DhairyaLGandhi
Copy link
Member

explicit state not immutability

Both are

be months before we get such features ready for use, then I would strongly recommend we consider keeping mutability right now

Not sure which features you're talking about here? But pretty sure immutable updates here is an explicit design, and we would maintain that.

Re: ParameterSchedulers.jl: apologies, I meant that we should move the discussion and implementation from FluxML/Flux.jl#1487 to FluxML/Flux.jl#1481, and (as a follow up) the scheduling interface should be an implementation of the optimisers+training without needing additional handling on either side. I'm other words, we need to generalise the interface around these methods to allow for scheduling.

@DhairyaLGandhi
Copy link
Member

Either way, the optimisation around mutability is orthogonal, and would not need weeks if it's required urgently.

@darsnack
Copy link
Member Author

Not sure which features you're talking about here?

I'm talking about the feature where x, st = update(o, x, dx, st) does not make copies of x or dx even though update is immutable.

the scheduling interface should be an implementation of the optimisers+training without needing additional handling on either side. I'm other words, we need to generalise the interface around these methods to allow for scheduling.

The explicit state is that generalization. That was the purpose of this PR.

Based on this discussion, I think we have two options. Either make a PR to Flux.jl itself that makes all the optimizers use explicit state and leave Optimisers.jl alone. Or merge some version of this PR into Optimisers.jl with the work around mutability for the future.

@DhairyaLGandhi
Copy link
Member

Also I just pushed some code I had implemented some time back for the remaining optimisers in dg/adams which we can merge, it's tested and everything.

@DhairyaLGandhi
Copy link
Member

The explicit state is that generalization. That was the purpose of this PR.

Agreed, but it's incomplete in and of itself. As an example, making some information around training available to the scheduler would be very helpful.

Either make a PR to Flux.jl itself that makes all the optimizers use explicit state

Maybe I'm not following this properly, apologies if I am, but I'm pretty sure that optimisers.jl is that

@darsnack
Copy link
Member Author

As an example, making some information around training available to the scheduler would be very helpful.

What information would that be? Scheduling policies require the current iteration which is easily specified through the explicit state.

Maybe I'm not following this properly, apologies if I am, but I'm pretty sure that optimisers.jl is that

That's what I'm trying to say!

As a requirement for doing scheduling right, it became clear that the optimizers needed to move to explicit state. So, I have opened the discussion topic, then I prepared this PR in light of that discussion. This PR follows the design principles of Optimisers.jl to use explicit state. What isn't required (right now) is immutability. So, I am suggesting we save the immutability design goal for later so that we can have explicit state optimizers in Flux now.

@DhairyaLGandhi
Copy link
Member

What information would that be?

The Epoch of training for example.

Immutability is part of the design and interface, and I'd keep that. I would much prefer an issue to follow up rather than block the current work.

@darsnack
Copy link
Member Author

The Epoch of training for example.

The epoch of training wouldn't necessarily be information in Flux.train!. In any case, ParameterSchedulers.jl already has usage patterns in place when you want to schedule by epoch, iteration, or some mixture. We can save the discussion on scheduling for later maybe.

Immutability is part of the design and interface, and I'd keep that. I would much prefer an issue to follow up rather than block the current work.

That's why I suggested that if we can't budge on immutability in Optimisers.jl, then I can submit a PR to Flux.jl that implements explicit state but not immutability.

src/rules.jl Outdated Show resolved Hide resolved
@CarloLucibello
Copy link
Member

This looks fine, except for the fact the composite optimizer should be renamed.
I'm worried about the breakage though. Everyone's code with custom training loops (i.e. everything using update!, like all my flux code) is going to be broken. Of minor importance, custom optimizers overloading apply! will fail.
Do we have a deprecation path?
Maybe we should temporarily add a state field to each optimizer here so that we can repurpose update! in Flux while deprecating it at the same time.

@DhairyaLGandhi
Copy link
Member

All good concerns, thanks!

I think the naming is not the end of the world, but I would definitely prefer a better name. I dont think we want to add a state field though, this is a different design and we shouldn't be too worried about the breakages that can be managed elsewhere (see the corresponding PR to Flux where I have updated the commented out update!)

We should definitely come up with a smoother breakage path though, although I'm not sure we can get around everything.

@darsnack
Copy link
Member Author

darsnack commented Feb 3, 2021

AFAICT, even with the commented update! in FluxML/Flux.jl#1481, there is still a breaking change since the explicit state must be passed into update!.

I don't think there is really any clean way around it. The only way to avoid passing around the state would be to store it in some structure. The only structures available to update! are the optimizer, Params, and Grads. Params/Grads are clearly poor candidates leaving only the optimizer like @CarloLucibello suggested.

We could do something like

struct OptAndState{O, S}
  opt::O
  state::S
end

Then use @forward to push things onto the optimizer. All this code would be in Flux.jl not Optimisers.jl, so we can save the discussion for FluxML/Flux.jl#1481. Can we agree that the soft-breakage code doesn't belong in this repo?

@ToucheSir
Copy link
Member

Will update! and Optimiser get deprecation warnings as part of this bringup?

@CarloLucibello
Copy link
Member

@darsnack this is a basic script that I'd like to see still working after the change

using Flux
using Flux.Optimise

opt = ADAM(0.1)
x = rand(2)

for t=1:10
    g = gradient(x -> sum(x.^2), x)
    update!(opt, x, g[1])
    opt.eta /= 2
end

If we keep this PR as it is, in order to provide a deprecation path in Flux we would have to create a parallel family of optimizers, ADAM, Nesterov, etc... where

struct ADAM
   opt::Optimisers.ADAM
   state   
end

then @forward a few methods and also overload setfield and getfield. This could be done, but wouldn't be more practical to just add a state to the optimizers here?

@DhairyaLGandhi
Copy link
Member

In that case I'd rather just have a breaking change.

@darsnack
Copy link
Member Author

darsnack commented Feb 4, 2021

Okay renamed to ChainOptimiser from SequenceOptimiser as was agreed on the last ML community call.

I disagree with a breaking change. This is such a fundamental piece of the framework, that we shouldn't play fast and loose with people's code.

@CarloLucibello I think we could have deprecation path code in this repo in a flux.jl file, but I still feel that the code belongs in Flux.jl itself. Materially, I don't think it matters too much, since Flux.jl will probably be the only downstream user of Optimiser.jl. But I don't think any users of Optimisers.jl should be allowed to use implicit state. It was never a part of the package, so I think we shouldn't introduce it only to deprecate it. The only users who can expect implicit state are Flux.jl users, so that's where I think the deprecation path belongs.

If there is a concern about code complexity, then I think something like the following should resolve it:

for $Opt in (:Descent, :ADAM) # we would have all in the real code
  @eval begin
    mutable struct $(Opt){T}
      opt::Optimisers.$Opt
      state::T
    end

    Base.getproperty(o::$Opt, p) = getproperty(o.opt, p)
    Base.setproperty!(o::$Opt, p, v) = setproperty!(o.opt, p, v)

    function apply!(o::$Opt, x, dx)
      dx, s = apply(o.opt, x, dx, o.state)
      o.state = s
      
      return dx
    end
  end
end

@CarloLucibello
Copy link
Member

ok, we'll cook something like that in flux, handling init as well.

Two more things (feel free to ignore both)

  • the chain in ChainOptimizer feels like a verb, maybe better OptimizersChain or OptChain?
  • can anyone do or point me to a very short summarization of the advantages of the external state approach compared to the IdDict one? One disadvantage I see is that having to init the state and pass it around is slightly annoying

@DhairyaLGandhi
Copy link
Member

I don't think we would play fast and lose but instead deprecate the current ones as a whole, pointing to the new explicit state approach, and handle both of them inside Flux for a technically breaking change. That way we can preserve custom loops for one more release cycle.

@darsnack
Copy link
Member Author

darsnack commented Feb 4, 2021

What about ChainedOptimiser? Happy to do OptimiserChain too.

There are probably more reasons than I list here for explicit state, so @ToucheSir and @DhairyaLGandhi can chime in:

With IdDicts (or any form of implicit state), the state is an implementation detail instead of part of the interface. This is an issue for people trying to extend the interface, since using state is a requirement. For example, if you are writing a scheduler, you might do:

struct Scheduler{S, O}
  opt::O
  schedule::S
  iter::Int
end

function apply!(o, x, dx)
  o.opt.eta = o.schedule[o.iter]
  o.iter += 1
  return apply!(o.opt, x, dx)
end

This seems like a perfectly reasonable and logical thing to write, but it won't work unless o is optimizing exactly one array x. This is because the state which is associated with each x that o is applied to is not accounted for anywhere. This can be rectified by replacing iter with an IdDict and doing o.iter[x], but it wouldn't be obvious to anyone extending the interface at first glance.

So, state is a requirement for almost every optimizer. Aside from a complicated paragraph in the docs, the best way to make this clear to users is by making state part of the interface if it is a requirement and not an implementation detail. It also allows anyone writing an optimizer to write the natural code (w.r.t. a single x), then we manage passing around the state for complicated data structures.

I think largely, the benefits are that this is a clearer, more extensible design. It should make for safer code with easier to detect errors. There's also the XLA-side of this where explicit state and immutability can allow compiler optimizations, but those are further down the line.

@darsnack
Copy link
Member Author

darsnack commented Feb 4, 2021

It also means that the code for an optimizer is stable. When we want to special case how we handle the multiple states associated with a data structure, it only needs to be done once, and none of the optimizer rules need to change their code.

@darsnack
Copy link
Member Author

darsnack commented Feb 5, 2021

I think everything should be resolved in the latest commit. I was convinced that OptimiserChain was the best name.

Docstrings are in place, but we should make setting up the docs a separate PR. Tests are completely ported from Flux.jl.

@CarloLucibello
Copy link
Member

CarloLucibello commented Feb 7, 2021

@DhairyaLGandhi can I have write access to this repo? (actually, if it could be made org-wide it would be much more convenient)

@darsnack
Copy link
Member Author

darsnack commented Feb 8, 2021

Bump can we get the write-access for this repo to match the org? I think this can be merged.

@DhairyaLGandhi DhairyaLGandhi merged commit 638ace1 into FluxML:master Feb 8, 2021
@DhairyaLGandhi
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants