Adjoints for regularizers? #1575
Thanks for the suggestion! I think there are a couple of alternatives worth considering first. The simplest (preferred?) mechanism is to just apply an L1 regularizer as part of your objective function for training. If you really did want to do something like the above, then you could perhaps consider a custom "L1 array" wrapper type. I haven't really thought about the exact mechanisms to do this, or how complicated it would become, but it would be more flexible than changing the layers themselves. So, overall, I think there are several alternatives here. Hopefully, one of them addresses your concern.
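For concreteness, a minimal sketch of that first alternative (the model, λ, and loss here are illustrative, not from this thread):

```julia
using Flux

# Hypothetical model and penalty weight, just for illustration.
m = Chain(Dense(10, 32, relu), Dense(32, 1))
λ = 1f-3

# L1 penalty over all trainable parameters; no custom adjoint is needed,
# Zygote differentiates sum(abs, ...) on its own.
penalty() = sum(p -> sum(abs, p), Flux.params(m))
loss(x, y) = Flux.Losses.mse(m(x), y) + λ * penalty()
```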
An L1 array seems to be a finicky road to go down. It doesn't cleanly address the issue of regularizers being well supported, and it can cause issues with wrapper types breaking assumptions for different array types. Adding some regularizer functions with adjoints, on the other hand, would be a nicer way to handle this. I wouldn't think many would object to having it as part of the function call to be differentiated.
This is what I meant by my first suggestion. Though I don't think you even need to write the adjoints (at least for simple regularizers like L1 or L2).
The reason why I run from the idea of just adding penalties in an objective function is that it makes your objective incredibly complicated (if you need a loss over 10 layers, each with weird mixed-and-matched penalties, your loss function is now 11 lines long in the best case). Now imagine fancier loss functions; it gets messy quick. Also, L1 requires gradient-level information to do properly (projected gradient descent). Whereas seeing an L1Dense in a chain, or in the params, is pleasant to unpack. I personally like the idea of a hooked dense for a number of reasons. In some cases reversing the gradient is handy, or regularizing the gradient by a Lipschitz constant, clipping, or, as shown in the example above, an L1 penalty. But yeah, there are still different roads to go down: either a hooked layer, or some other more explicit parameterization of how the gradient is handled.
I'll probably cook up a PR sometime soon; if you all don't accept it that's fine :), it'll make my code cleaner and maybe there'll be a nugget or two that's useful.
The reason that I am pushing back against a hooked dense is two-fold. First, it muddles the forward pass (i.e. the layer's forward computation is no longer just the layer itself). Second, putting aside gradient hooks for regularizers, the notion of gradient hooks is a byproduct of PyTorch's approach to AD. In our case, the gradient is just another value returned by Zygote, so a "hook" can be an ordinary function applied to it (or a custom adjoint).
With this in mind, perhaps a collection of common gradient modifiers like the one above for L1 would be a good addition to Flux.
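As a hedged illustration of that point (the hook here is my own helper, not an existing Flux API): since Zygote hands back gradients as ordinary arrays, a "hook" can just be a function applied to them before the update.

```julia
using Flux

# Illustrative "hook": clip each gradient entry to [-1, 1] before the
# optimiser sees it.
cliphook(g) = clamp.(g, -1f0, 1f0)

opt = ADAM(1f-3)
ps  = Flux.params(m)                      # `m`, `loss`, `x`, `y` as in the sketch above
gs  = Flux.gradient(() -> loss(x, y), ps)
for p in ps
    gs[p] === nothing && continue
    gs[p] .= cliphook(gs[p])              # transform the gradient like any other value
end
Flux.Optimise.update!(opt, ps, gs)
```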
In the worst case, where you are manually specifying a unique regularizer for every layer, I could see how this is easier on the eyes. But if you are trying to apply a certain regularizer to each layer of a given type, then a small helper over the model's layers is arguably just as clean (see the sketch below). Of course, that only addresses the case of writing the penalty function in a simpler manner. If a gradient hook is clearly necessary/better, then I guess we don't have a great mechanism for reasoning about the different pieces of the model that each hook should apply to. If we still want to take the "hook" approach, a wrapped array type seems to me to be the only way to make "hooked dense" and friends scale to arbitrary layers. But I agree with @DhairyaLGandhi that this is a fragile route, and it would require lots of testing and a compelling "win" over other methods to be merged. That being said, if you were to implement an experimental PR, I think it is always cool to explore these possibilities.
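A hedged sketch of that "same penalty on every layer of one type" case (the helper names and the `weight` field are my assumptions; the field name varies across Flux versions):

```julia
# Collect the Dense weight matrices once, then penalise them all with one term.
dense_weights(c::Chain) = [l.weight for l in c.layers if l isa Dense]

l1_dense(ws, λ) = λ * sum(w -> sum(abs, w), ws)

ws = dense_weights(m)                     # `m` as in the earlier sketch
loss(x, y) = Flux.Losses.mse(m(x), y) + l1_dense(ws, 1f-3)
```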
Always welcome 😄! Having some solid code to compare against will only improve the discussion. I'd also be curious to compare the verbosity in the "many different penalties" corner case.
Understood, and yeah, it's not fair to expect people to put regularizers on each layer of a NN; that would be unusual. I'm in unusual territory right now. Maybe what I'll do is take a stab at forking master and see if anything generic falls out. If not... well, I'll still put up a tentative PR. If it doesn't stick that's okay; it's something I gotta do anyway, might as well make it "nice".
Yeah, and just to be clear, even if something isn't generic enough for Flux.jl, it can still be super useful in its own package! There's nothing stopping us from adding a forward pointer to such a package in the docs too. I can see how a package that provides regularized/hooked layers out of the box could be handy. Looking forward to the PR!
@darsnack
That's right, but once you add the ability to hook, it can be used for anything. So we'd want to hook every parameter, not just a subset that makes sense for regularization. This is looking at it as something we add to Flux, where we try to be as generic as possible.
tl;dr: I think you all are right and this belongs in a sugar package. Read on anyways: I did a little thinking, and maybe the "right" idea is something along the lines of a wrapper layer that carries a gradient hook. There are some other layers and aliases I'd like to add to Flux as well, but they also aren't generic, and definitely are "user preference". So factoring things out into a separate sugar package makes sense.
Actually an
The issue with that is... lemme tinker a bit and then I'll show you all what I'm thinking. I'll do it from a repo unattached to Flux, just because I don't see the exact mechanism for doing this yet; it will probably take a few hacks. The aliases I'm mostly interested in pertain to training, but lemme show you what I'm thinking with effort instead of babbling. If any of it sounds good for Flux we can copypasta it in; if not, it can be its own little thingy. Basically I'm working on a paper and I want the code for it to represent what I can do now, and be highly readable, when I actually try (most of my hobby repos are subpar).
Yeah
So my first attempt at this is a little... bad. I mean, it is what it is. I made a type to carry gradient hooks on the bias and weight items for dense layers. Super naive, but I wanted to get a handle on where this might go. This absolutely doesn't belong in Flux as it currently stands... but I'll keep poking around with some ideas. Code is here:
Sounds like I might be misunderstanding how adjoints work? I figured they let you inline-transform a gradient, and once that was done it was business as usual when the next operation came along. Is what I've done here a mistake: https://github.com/caseykneale/DeltaSugar.jl/blob/bf492abddbf3d3e484d5c0a635762375a3bff05f/src/GradientModifiers.jl#L1
When I said that, I was specifically talking about the simple penalty-in-the-loss case, where no custom adjoint is needed. For a hooked layer, something like the following should work:

```julia
l1hook(Δ, x, λ) = (Δ .+ (λ .* sign.(Δ))) .* (abs.(x) .> λ)

struct L1{S, T}
    lambda::S
    layer::T
end

@functor L1

(l::L1)(x) = l.layer(x)

@adjoint function (l::L1)(x)
    y, back = Zygote.pullback(l.layer, x)
    return y, Δ -> l1hook(back(Δ), x, l.lambda)
end
```

I think this approach could be generic enough to be in Flux directly.
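If I'm reading the snippet right, usage would look something like this (layer sizes and λ are arbitrary, and `Flux`, `Zygote`, and the definitions above are assumed to be in scope):

```julia
# The wrapped layer computes exactly what the inner Dense does; only its
# pullback is routed through l1hook.
model = Chain(L1(1f-2, Dense(10, 32, relu)), Dense(32, 1))
```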
This is one of those... "why didn't I think of that?" moments. Yeah, that looks incredibly reasonable to me.
Well, wait, I think this would try to regularize the bias terms, which may or may not be expected behaviour... I personally think I am fine with that, but it might upset some purists or something.
Yeah, true. It would make sense to add an option for that.
We might want to make it so that lambda could potentially contain learnable params as well. Currently it wouldn't learn; this would require differentiating through the "hook" as well.
Wouldn't that be nested differentiation? Asking because: would Zygote handle that well? Keno's updates on the bi-weekly calls make me think it wouldn't.
Another thought is that our "optimizers" are more generic too: they are composable gradient transformations. For example, the clamp hook is already implemented as an "optimizer". So the most generic version of this idea would be applying an arbitrary gradient transformation to a specific layer's parameters. This is related to the idea of a "group optimizer" that's come up before.
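For reference, a sketch of that composability with the existing interface (hyperparameters arbitrary; `Optimiser` and `ClipValue` are assumed to be available from `Flux.Optimise`, as in the Flux versions around the time of this discussion):

```julia
using Flux
using Flux.Optimise: Optimiser, ClipValue

# Clipping expressed as an "optimizer", composed with ADAM; each wrapped
# transformation is applied to the gradient in sequence.
opt = Optimiser(ClipValue(1f-2), ADAM(1f-3))

# Then used like any other optimiser, e.g.
# Flux.train!(loss, Flux.params(m), data, opt)
```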
Ah, the dreaded learning-parameters snafu... Most people don't do that, but the way things are trending in the literature, it could be a "thing"... It would be a nested optimization. The gradient flow can't affect that parameter sanely unless something else is operating on the topology of the network. If you view it from the case of LASSO regression it becomes more apparent why that is. Technically... if you made lambda a 1-vector you could still operate on it as Kyle has written it, though? Just the way you optimize it is different, right?
It's the other way around: the objective is a function of the parameters, and lambda only enters through the hook. But this is all fine. I think Dhairya is just saying that, instead of hard-coding the L1 logic into the wrapper, you could write:

```julia
struct L1Hook{T}
    lambda::T
end

@functor L1Hook

function Optimisers.apply!(o::L1Hook, x, dx)
    @. dx = (dx + (o.lambda * sign(dx))) * (abs(x) > o.lambda)
    return dx
end

struct GradientHook{S, T}
    hook::S
    layer::T
end

@functor GradientHook

(l::GradientHook)(x) = l.layer(x)

@adjoint function (l::GradientHook)(x)
    y, back = Zygote.pullback(l.layer, x)
    return y, Δ -> apply!(l.hook, x, back(Δ))
end
```

Here you should be able to wrap any layer with any gradient hook.
Ah yeah, I'm sorry I misunderstood; thanks for explaining. Now I see what they are saying. That's kind of how I did it too.
At least the first part of my snippet (the L1Hook optimiser)... The second part (the GradientHook wrapper)...
Yeah, that's fine. I can make a commit (to my sandbox) when I wrap up this other chunk of code I'm working on. It might not be until Monday; maybe I'll get lucky and it'll be sooner.
I took a new job, so I'm not sure if I can continue working on this. My assumption is my NDA says "nope".
No worries!
Not sure if this is the right way to go about it, so I'd like to ask what you all think... Would it make sense to make some adjoints for regularizers, and/or attach them to specific layers? For example, something along the lines of the sketch below.

Note: this isn't a perfect lasso representation; it's missing a term based on the optimizer learning rate, I think, but it's a quick demonstrative hack and works if one is mindful of the magnitude of their independent variables. For L2 it's not as big of a deal, but maybe there are other cases where baking this type of capability into some layers is worthwhile? Open to feedback and willing to make a PR if it's deemed a reasonable suggestion.
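A minimal sketch of the kind of adjoint-backed penalty being described, assuming Zygote's `@adjoint`; the function name and exact form are illustrative, not the original example:

```julia
using Zygote
using Zygote: @adjoint

# Identity in the forward pass; the pullback adds an (approximate) L1 term to
# the incoming gradient. As noted above, this isn't an exact lasso step, since
# the proper scaling would involve the optimizer's learning rate.
l1reg(x, λ) = x

@adjoint l1reg(x, λ) = x, Δ -> (Δ .+ λ .* sign.(x), nothing)

# e.g. applied to a weight matrix inside a forward pass:
# ŷ = l1reg(W, 1f-2) * x .+ b
```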