
Adjoints for regularizers? #1575

Closed
caseykneale opened this issue Apr 13, 2021 · 32 comments

@caseykneale

caseykneale commented Apr 13, 2021

Not sure if this is the right way to go about it, so I'd like to ask what you all think: would it make sense to add some adjoints for regularizers, and/or attach them to specific layers?
E.g.:

```julia
using Zygote

# backward-pass L1 "hook": inflate the gradient by λ (in the direction of
# its own sign) and zero it wherever |x| ≤ λ
L1(Δ, x, λ) = (Δ .+ λ .* sign.(Δ)) .* (abs.(x) .> λ)

# identity on the forward pass; the adjoint swaps in the hook on the way back
L1hook(x, λ) = x
@adjoint L1hook(x, λ) = x, Δ -> (L1(Δ, x, λ), nothing)
```

note: this isn't a perfect lasso representation (I think it's missing a term based on the optimizer learning rate), but it's a quick demonstrative hack that works if one is mindful of the magnitude of their independent variables.
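For reference, a quick sanity check of what the hook does to a gradient (illustrative values, using the definitions above):

```julia
using Zygote

w = Float32[-2.0, 0.005, 3.0]
g = gradient(w -> sum(L1hook(w, 0.01f0)), w)[1]
# g ≈ [1.01, 0.0, 1.01]: the entry with |w| ≤ λ has its gradient masked
# to zero; the rest are inflated by λ in the direction of the gradient
```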

```julia
# sketch: a Dense layer whose weight gradient passes through the L1 hook
function (l::L1Dense)(x)
    l.σ.(L1hook(l.W, l.λ) * x .+ l.b)
end
```

For L2 it's not as big of a deal, but maybe there are other cases where baking this kind of capability into some layers is worthwhile? Open to feedback, and willing to make a PR if it's deemed a reasonable suggestion.

@darsnack
Member

darsnack commented Apr 13, 2021

Thanks for the suggestion! I think there are a couple alternatives worth considering first.

The simplest (preferred?) mechanism is to just apply an L1 regularizer as part of your objective function for training. Of course, if there is a desire to achieve the same result using gradient hooks and avoid the forward pass of the regularizer term, then something like Zygote.jl's hook is better than adding manual hooks to each layer. Edit: just realized hook won't help with implicit parameters.
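For reference, a minimal sketch of the penalty-in-the-objective approach (2021-era implicit-params style; the model, sizes, and λ are illustrative):

```julia
using Flux

m = Dense(10, 1)
ps = Flux.params(m)
λ = 1f-3

# data loss plus an L1 penalty over every trainable array
loss(x, y) = Flux.Losses.mse(m(x), y) + λ * sum(p -> sum(abs, p), ps)

x, y = rand(Float32, 10, 8), rand(Float32, 1, 8)
gs = Flux.gradient(() -> loss(x, y), ps)
Flux.Optimise.update!(Descent(0.1), ps, gs)
```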

If you really did want to do something like the above, then you could perhaps consider a custom "L1 array" wrapper type. I haven't really thought about the exact mechanism, or how complicated it would become, but it would be more flexible than changing Dense or adding L1Dense (or even a generic HookedDense).
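For what it's worth, the bare bones of such a wrapper might look like this (hypothetical; all of the real complexity would be in teaching Flux/Zygote to treat it specially):

```julia
# hypothetical "L1 array": tags an array with its regularization strength
struct L1Array{T, N, A <: AbstractArray{T, N}} <: AbstractArray{T, N}
    data::A
    lambda::T
end

Base.size(x::L1Array) = size(x.data)
Base.getindex(x::L1Array, i...) = getindex(x.data, i...)
Base.setindex!(x::L1Array, v, i...) = setindex!(x.data, v, i...)
```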

So, overall, I think there are several alternatives here. Hopefully, one of them addresses your concern.

@DhairyaLGandhi
Member

DhairyaLGandhi commented Apr 13, 2021

An L1 array seems like a finicky road to go down. It doesn't cleanly address the issue of making regularizers well supported, and wrapper types can break the assumptions other code makes about different array types.

Adding some regularizer functions with adjoints, on the other hand, would be a nicer way to handle this. I wouldn't think many would object to having it as part of the function call to be differentiated.

@darsnack
Member

> Adding some regularizer functions with adjoints, on the other hand, would be a nicer way to handle this. I wouldn't think many would object to having it as part of the function call to be differentiated.

This is what I meant by my first suggestion. Though I don't think you need to even write the adjoints (at least for simple regularizers like L1 or L2).

@caseykneale
Author

caseykneale commented Apr 13, 2021

The reason I shy away from just adding penalties in an objective function is that it makes your objective incredibly complicated (if you have a loss with 10 layers, each with weird mixed-and-matched penalties, your loss function is now 11 lines long in the best case). Now imagine fancier loss functions; it gets messy quickly. Also, L1 requires gradient-level information to do properly (projected gradient descent). Whereas seeing an L1Dense in a chain, or in the params, is pleasant to unpack.

I personally like the idea of a hooked Dense for a number of reasons. In some cases reversing the gradient is handy, or regularizing the gradient by a Lipschitz constant, clipping, or, as shown in the example above, an L1 penalty.

But yea, there are still different roads to go down: either a hooked layer, or some other more explicit parameterization of how the gradient is handled.

@caseykneale
Author

I'll probably cook up a PR sometime soon; if you all don't accept it, that's fine :). It'll make my code cleaner, and maybe there'll be a nugget or two that's useful.

@darsnack
Member

darsnack commented Apr 13, 2021

The reason that I am pushing back against hooked dense is two-fold. First, it muddles the forward pass (i.e. `(l::Dense)(x) = l.σ(l.hookW(l.W) * x + l.hookb(l.b))`). One of the nice things about Flux+Zygote is that concepts that should be semantically separate are separate.

Second, it is not generic enough to handle arbitrary layers and functions. The mechanism for L1 regularization should scale to any parameter we take a gradient with respect to. Hooked versions of each layer require manual intervention to maintain, and a user can't easily buy into the functionality when writing a custom layer. And if I load a random model from disk, I need to traverse the model and replace each Dense with HookedDense just to apply some regularization. Fundamentally, this is because hooked dense ties the model architecture to the optimization process. That's something to avoid if possible.

Putting aside gradient hooks for regularizers: the notion of gradient hooks is a byproduct of PyTorch's approach to AD. In our case, `gradient(f, x)` directly returns the gradients. With this more functional approach, you don't need hooks to modify them; you can just modify them directly before passing them to `update!(opt, w, Δw)`. Using `hookW(W)` with an adjoint for `hookW` is an opaque way of implementing the process:

1. `gs = gradient(f, x)`
2. `gs = apply_modifiers(gs)`
3. `update!(opt, ps, gs)`

With this in mind, perhaps a collection of common gradient modifiers like the one above for L1 would be a good addition to Flux.
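Concretely, that three-step loop might look like this (a sketch in the implicit-params style; the in-place modification mirrors the L1 hook from the top of the thread):

```julia
using Flux

m = Dense(10, 1)
ps = Flux.params(m)
opt = Descent(0.1)
λ = 1f-3

x, y = rand(Float32, 10, 8), rand(Float32, 1, 8)

# 1. take gradients
gs = Flux.gradient(() -> Flux.Losses.mse(m(x), y), ps)

# 2. modify them before the optimizer sees them
for p in ps
    gs[p] === nothing && continue
    gs[p] .= (gs[p] .+ λ .* sign.(gs[p])) .* (abs.(p) .> λ)
end

# 3. apply the update
Flux.Optimise.update!(opt, ps, gs)
```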


> The reason I shy away from just adding penalties in an objective function is that it makes your objective incredibly complicated (if you have a loss with 10 layers, each with weird mixed-and-matched penalties, your loss function is now 11 lines long in the best case). Now imagine fancier loss functions; it gets messy quickly. Also, L1 requires gradient-level information to do properly (projected gradient descent). Whereas seeing an L1Dense in a chain, or in the params, is pleasant to unpack.

In the worst case, where you are manually specifying a unique regularizer for every layer, I can see how this is easier on the eyes. But if you are trying to apply one regularizer to each Dense and a different one to each Conv, then you can use multiple dispatch to make this easy, as sketched below. Something like modules might make this even easier.
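For example (a sketch; `penalty` is a made-up name, and the `W`/`weight` field names follow the 2021-era Flux conventions used elsewhere in this thread):

```julia
using Flux

penalty(l) = 0f0                           # default: no penalty
penalty(l::Dense) = sum(abs, l.W)          # L1 on Dense weights
penalty(l::Conv) = sum(abs2, l.weight)     # L2 on Conv kernels
penalty(m::Chain) = sum(penalty, m.layers) # recurse over a model

m = Chain(Conv((3, 3), 1 => 4), Flux.flatten, Dense(4 * 26 * 26, 10))
loss(x, y) = Flux.Losses.logitcrossentropy(m(x), y) + 1f-3 * penalty(m)
```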

Of course, that only addresses writing the penalty function in a simpler manner. If a gradient hook is clearly necessary/better, then I guess we don't have a great mechanism for reasoning about the different pieces of what gradient returns. The correct way to handle this would be to think more carefully about how we represent implicit parameters and their gradients as a structure.


If we still want to take the "hook" approach, a wrapped array type seems to me to be the only way to make "hooked dense" and friends scale to arbitrary layers. But I agree with @DhairyaLGandhi that this is a fragile route, and it would require lots of testing and a compelling "win" over other methods to be merged. That being said, if you were to implement an experimental PR, I think it is always cool to explore these possibilities.

@darsnack
Member

> I'll probably cook up a PR sometime soon; if you all don't accept it, that's fine :). It'll make my code cleaner, and maybe there'll be a nugget or two that's useful.

Always welcome 😄! Having some solid code to compare against will only improve the discussion. I'd also be curious to compare the verbosity in the "many different penalties" corner case.

@caseykneale
Author

Understood. And yea, it's not fair to expect people to put regularizers on each layer of a NN; that would be unusual, and I'm in unusual territory right now. Maybe what I'll do is take a stab at forking master and see if anything generic falls out. If not, well, I'll still put up a tentative PR. If it doesn't stick that's okay; it's something I gotta do anyway, so I might as well make it "nice".

@darsnack
Member

darsnack commented Apr 13, 2021

Yeah, and just to be clear, even if something isn't generic enough for Flux.jl, it can still be super useful in its own package! There's nothing stopping us from adding a forward pointer to such a package in the docs, too. I can see how a package that provides RegularizedX or L1X versions of every X would be really useful to the ecosystem.

Looking forward to the PR!

@caseykneale
Author

@darsnack
Just noticed this:

> The reason that I am pushing back against hooked dense is two-fold. First, it muddles the forward pass (i.e. `(l::Dense)(x) = l.σ(l.hookW(l.W) * x + l.hookb(l.b))`).

You pretty much never regularize biases. So

```julia
l.σ(l.hook(l.W) * x + l.b)
```

is the worst case here.

@darsnack
Member

That's right, but once you add the ability to hook, it can be used for anything. So we'd want to hook every parameter, not just the subset that makes sense for regularization.

This is looking at it as something we add to Flux, where we try to be as generic as possible.

@caseykneale
Author

tl;dr: I think you all are right and this belongs in a sugar package.

read on anyway:
I understand what you're saying from the perspective of writing software, but in some instances you just need something to stand on its own, I guess. I.e.: technically you can turn a dense layer into a conv layer (it's just matmul with a banded matrix, after all), but that's a bad call :D. So I'm trying to understand where the good dividing line is. I've spent some time grokking the codebase, and the design decisions all make sense. But this L1 thing is a bit of an outlier, and I definitely respect why it's been avoided.

I did a little thinking, and maybe the "right" idea is something along the lines of:

```julia
L1(Dense(...), param)
```

i.e. some function that eats a dense/conv layer and modifies it in an expected manner. Alternatively, L1, L2, RevGrad, etc. could be part of a separate package, so Flux can maintain its generic intent, but anyone who really wants these rarer convenience functions can import them and fire away. This would likely require abstract Dense and Conv types to support, if someone wants `Dense(...) |> L1 |> RevGrad |> ...` to compose nicely and not be a one-off (a rough sketch of that composition is below). The utility will be limited by any major changes to the Flux API, I guess...
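Hypothetically, curried constructors would let those wrappers compose with `|>` (a sketch only: the forward passes just delegate, the gradient-hook adjoints are omitted, and all names here are illustrative):

```julia
using Flux

# hypothetical wrappers whose forward passes delegate to the wrapped layer;
# the actual gradient hooks would live in their adjoints
struct L1{S, T}
    lambda::S
    layer::T
end
(l::L1)(x) = l.layer(x)
L1(λ::Number) = layer -> L1(λ, layer)   # curried form for |> chaining

struct RevGrad{T}
    layer::T
end
(r::RevGrad)(x) = r.layer(x)
RevGrad() = layer -> RevGrad(layer)     # curried form for |> chaining

m = Dense(10, 5) |> L1(1f-2) |> RevGrad()
```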

There are some other layers and aliases I'd like to add to Flux as well, but they also aren't generic and definitely are "user preference". So factoring things out into a sugar package sounds like the right way to go. I think I'll close this issue out, but I'll ping you on Slack or something once the package is standing.

@DhairyaLGandhi
Member

Actually, an L1(Dense(), ...) sounds kind of interesting. In that case we aren't special-casing too much, I don't think. What kinds of aliases are you thinking of?

@caseykneale
Author

caseykneale commented Apr 14, 2021

The issue with L1(...) is that it'd need a param, either by closure or otherwise (might incur a cost, might not); in some sense it's a chain within a chain for gradient manipulations, I guess.

Lemme tinker a bit, then I'll show you all what I'm thinking. I'll do it from a repo unattached to Flux, just because I don't see the exact mechanism for doing this yet. It will probably take a few hacks.

The aliases I'm mostly interested in pertain to training, but lemme show you what I'm thinking with effort instead of babbling. If any of it sounds good for Flux we can copypasta it in; if not, it can be its own little thingy. Basically I'm working on a paper, and I want the code for it to represent what I can do now, and be highly readable, when I actually try (most of my hobby repos are subpar).

@caseykneale caseykneale reopened this Apr 14, 2021
@darsnack
Member

Yeah, the L1(...) approach is nice enough that we can consider changes to Flux to support it too. Some kind of parameter sharing could work here. The only hiccup I see is that the "adjoint" for the hook needs to accumulate on top of the original adjoint (i.e. order matters)?

@caseykneale
Author

So my first attempt at this is a little... bad. It is what it is. I made a type to carry gradient hooks on the bias and weight items of dense layers. Super naive, but I wanted to get a handle on where this might go. This absolutely doesn't belong in Flux as it currently stands... but I'll keep poking around with some ideas.

Code is here:
https://github.com/caseykneale/DeltaSugar.jl

@caseykneale
Author

caseykneale commented Apr 17, 2021

> ... The only hiccup I see is that the "adjoint" for the hook needs to accumulate on top of the original adjoint (i.e. order matters)?

Sounds like I might be misunderstanding how adjoints work? I figured they let you transform a gradient inline, and once that was done it was business as usual when the next operation came along.

Is what I've done here a mistake: https://github.com/caseykneale/DeltaSugar.jl/blob/bf492abddbf3d3e484d5c0a635762375a3bff05f/src/GradientModifiers.jl#L1
Effectively I am just taking the incoming gradient at some specific op in a layer and mutating it via a composition of functors.

@darsnack
Member

When I said that, I was specifically talking about the L1(Dense(...), ...) syntax and the order of function application in the chain rule. But trying to answer your question made me think of a generic solution:

```julia
using Flux, Zygote

l1hook(Δ, x, λ) = (Δ .+ λ .* sign.(Δ)) .* (abs.(x) .> λ)

struct L1{S, T}
    lambda::S
    layer::T
end

@functor L1

# forward pass just delegates to the wrapped layer
(l::L1)(x) = l.layer(x)

# backward pass: run the wrapped layer's pullback, then push the input
# gradient through the L1 hook (gradient for the wrapper itself elided)
@adjoint function (l::L1)(x)
    y, back = Zygote.pullback(l.layer, x)

    return y, Δ -> (nothing, l1hook(first(back(Δ)), x, l.lambda))
end
```

I think this approach could be generic enough to be in Flux directly.
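Usage would presumably look like wrapping individual layers in a Chain (hypothetical sizes, given the definitions above):

```julia
m = Chain(
    L1(1f-2, Dense(10, 5, relu)),   # hook this layer's gradients
    Dense(5, 1),
)
ps = Flux.params(m)   # @functor L1 lets params find the wrapped layer's arrays
```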

@caseykneale
Author

This is one of those... "why didn't I think of that?" moments, yea that looks incredibly reasonable to me.

@caseykneale
Author

Well, wait. I think this would try to regularize the bias terms, which may or may not be expected behaviour... I personally think I'm fine with that, but it might upset some purists or something.

@darsnack
Member

Yeah, true. It would make sense to add a params field to L1 and only apply the hook to the arrays in that field.
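Something like the following, say (hypothetical sketch; `L1Scoped` and `ishooked` are made-up names, and `W`/`b` are the 2021-era Dense field names used elsewhere in this thread):

```julia
using Flux

# hypothetical: carry the arrays the hook should touch, so the bias is skipped
struct L1Scoped{S, T, P}
    lambda::S
    layer::T
    params::P                  # e.g. (layer.W,) for a 2021-era Dense
end

L1Scoped(λ, layer::Dense) = L1Scoped(λ, layer, (layer.W,))

# the hook would consult this when deciding whether to transform a gradient
ishooked(l::L1Scoped, p) = any(q -> q === p, l.params)

d = Dense(3, 2)
r = L1Scoped(1f-2, d)
ishooked(r, d.W), ishooked(r, d.b)   # (true, false)
```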

@DhairyaLGandhi
Member

We might want to make it so that lambda could potentially contain learnable params as well; currently it wouldn't learn. This would require differentiating through the "hook" as well.

@darsnack
Member

darsnack commented Apr 17, 2021

Wouldn't that be nested differentiation? Asking because: would Zygote handle that well? Keno's updates on the bi-weekly calls make me think it wouldn't.

@darsnack
Member

Another thought is that our "optimizers" are already more generic than they look: they are composable gradient transformations. For example, the clamp hook is already implemented as an "optimizer" (see the sketch below). So, the most generic version of this idea would be scoping any gradient transformation to a specific layer's parameters.

This is related to an idea of a "group optimizer" that's come up before.
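For instance, gradient clipping already composes this way in 2021-era Flux (sizes and values illustrative):

```julia
using Flux

m = Dense(10, 1)
ps = Flux.params(m)

# optimizers compose as gradient transformations:
# clamp each gradient entry to ±0.01, then take an SGD step
opt = Optimiser(ClipValue(1f-2), Descent(0.1))

x, y = rand(Float32, 10, 8), rand(Float32, 1, 8)
gs = Flux.gradient(() -> Flux.Losses.mse(m(x), y), ps)
Flux.Optimise.update!(opt, ps, gs)
```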

@caseykneale
Author

Ah, the dreaded learning-parameters snafu... Most people don't do that, but the way things are trending in the literature, it could become a "thing"... It would be a nested optimization. The gradient flow can't affect that parameter sanely unless something else is operating on the topology of the network. If you view it from the perspective of LASSO regression it becomes more apparent why that is the case, i.e.:

||ax - y||² + λ|a|

λ is a function of what?

Technically, if you made lambda a 1-vector you could still operate on it as Kyle has written it? Just the way you optimize it is different, right?

@darsnack
Member

It's the other way around: the objective is a function of lambda. For example, you could do train!(m, opt, lambda), where train! trains m with opt using lambda as the regularization parameter. You're right that it would be nested optimization, where opt is used in the inner optimization to train m. But gradient(lambda -> train!(m, opt, lambda), ...) requires differentiating w.r.t. lambda through all of train!. Since train! itself calls gradient, it is going to be nested AD.

But this is all fine. I think Dhairya is just saying that instead of l1hook we should have a struct L1Hook marked with `@functor L1Hook (lambda,)`. This is the right thing to do to make everything differentiable. Something like:

```julia
using Flux, Zygote

struct L1Hook{T}
    lambda::T
end

@functor L1Hook

# written as a Flux.Optimise-style gradient transformation
function Flux.Optimise.apply!(o::L1Hook, x, dx)
    @. dx = (dx + o.lambda * sign(dx)) * (abs(x) > o.lambda)

    return dx
end

struct GradientHook{S, T}
    hook::S
    layer::T
end

@functor GradientHook

(l::GradientHook)(x) = l.layer(x)

@adjoint function (l::GradientHook)(x)
    y, back = Zygote.pullback(l.layer, x)

    return y, Δ -> (nothing, Flux.Optimise.apply!(l.hook, x, first(back(Δ))))
end
```

Here you should be able to call `params` on the L1Hook and take gradients.
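For instance (a hypothetical usage sketch; storing lambda as a 1-element array so it is picked up as a trainable parameter):

```julia
h = GradientHook(L1Hook([1f-3]), Dense(10, 1))
ps = Flux.params(h)   # the Dense weight and bias, plus the lambda array
```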

@caseykneale
Author

Ah yea, I'm sorry, I misunderstood; thanks for explaining. Now I see what they are saying. That's kind of how I did it too:
https://github.com/caseykneale/DeltaSugar.jl/blob/bf492abddbf3d3e484d5c0a635762375a3bff05f/src/GradientModifiers.jl#L15
But I didn't take it that additional step. This looks pretty nice to me; I especially like how using @functor on a functor feels like the right design decision :D. If you all do this, I could make a PR for a few other similar things and trashbin most of the repo I made.

@darsnack
Member

darsnack commented Apr 17, 2021

At least the first part of my snippet (L1Hook) is something we can accept a PR for right now. There'll probably be some back and forth in review, but I think it is close enough to stuff we already have in the repo that a PR is mergeable. I think it would be a small port of what you have linked. Maybe we don't @functor them just yet, since the other optimizers are not functor'd.

The second part (GradientHook) is something that would need some consideration. It's a decent generic solution, but my personal preference is not to mix the model architecture with the optimization process. So it would require a little design work on our end to figure out how to make it as syntactically convenient as the GradientHook(Dense(...), ...) approach. Of course, GradientHook is a small bit of code; the bulk is writing the gradient transformations in the first part. So, if we were to merge the first part, you could potentially copy-paste GradientHook into your code until we make a decision?

@caseykneale
Author

Yea, that's fine. I can make a commit (to my sandbox) when I wrap up this other chunk of code I'm working on. It might not be until Monday; maybe I'll get lucky and it'll be sooner.

@caseykneale
Author

I took a new job, and I'm not sure I can continue working on this. My assumption is my NDA says "nope".

@darsnack
Member

darsnack commented May 3, 2021

No worries!

@caseykneale
Author

> I'm just not convinced that using the layer convention for regularization is the API we want
> Because it's difficult to generalize
> Parameter level is a little easier, because that logic is generally useful

Closing this out.
