
Functional AD #86

Closed

MikeInnes opened this issue Oct 17, 2017 · 7 comments

Comments

@MikeInnes
Member

MikeInnes commented Oct 17, 2017

The eventual plan is to build a new compiler-level AD that better exploits Julia's compilation, provides a more functional interface, and supports nested differentiation. A question here is how to support the grad(f, x) style interface while still allowing abstraction and modularity in layers and their weights.

I see this looking something like:

W = randn(5,5)
b = randn(5)
loss(x, y) = mse(W*x .+ b, y)
dW, db = grad(loss, (W, b), x, y)

W and b are treated as implicit arguments to the function; this is nice in that it's essentially the ideal functional interface but without the mess of hundreds of explicit arguments.

Models will implement params, as they do now, and whatever arrays they return will be treated as trainable parameters (dparams = grad(model, params(model), args...)). We'll also have a Freeze layer to treat things as constant, e.g. m = Freeze(Dense(10, 5)); params(m) == []. Freezing parameters is a little more coarse-grained than it is now, but that's a small loss compared to the gains.
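
For concreteness, a minimal runnable sketch of these semantics, with central finite differences standing in for the eventual compiler-level AD (so this grad and mse are toy stand-ins, not the real API):

mse(ŷ, y) = sum(abs2, ŷ .- y) / length(y)

# Toy grad(f, ps, args...): one gradient per implicit parameter,
# approximated by central finite differences (not the real AD).
function grad(f, ps::Tuple, args...; ϵ = 1e-6)
    map(ps) do p
        g = zero(p)
        for i in eachindex(p)
            old = p[i]
            p[i] = old + ϵ; fplus  = f(args...)
            p[i] = old - ϵ; fminus = f(args...)
            p[i] = old
            g[i] = (fplus - fminus) / 2ϵ
        end
        g
    end
end

W = randn(5, 5)
b = randn(5)
loss(x, y) = mse(W*x .+ b, y)

x, y = randn(5), randn(5)
dW, db = grad(loss, (W, b), x, y)   # W and b are picked up as implicit arguments

The point of the interface is that loss is called exactly as the user wrote it; grad only knows about W and b because they were passed alongside it.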

@dfdx

dfdx commented Oct 18, 2017

> W and b are treated as implicit arguments to the function; this is nice in that it's essentially the ideal functional interface but without the mess of hundreds of explicit arguments.

Does this mean that a user will need to make W and b global and define loss in the same context? This doesn't sound very flexible, to be honest. Recently I've been playing around (e.g. in VariationalAE.jl) with models as mutable structs. For your case it would look something like:

m = Linear(W, b)
loss(m::Linear, x, y) = mse(m.W * x .+ m.b, y)
dm = grad(loss, m, x, y)

where dm is another instance of Linear holding derivatives, i.e.:

dW = dm.W
db = dm.b
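
To make the mechanics concrete, here is a self-contained sketch of how a grad that hands back dm could work, again with finite differences standing in for real AD; this Linear type and grad method are purely illustrative.

mutable struct Linear
    W::Matrix{Float64}
    b::Vector{Float64}
end

mse(ŷ, y) = sum(abs2, ŷ .- y) / length(y)
loss(m::Linear, x, y) = mse(m.W * x .+ m.b, y)

# Perturb each array field of m in turn and collect the results into a
# second Linear, so dm mirrors the structure of m.
function grad(f, m::Linear, args...; ϵ = 1e-6)
    dm = Linear(zero(m.W), zero(m.b))
    for field in (:W, :b)
        p, g = getfield(m, field), getfield(dm, field)
        for i in eachindex(p)
            old = p[i]
            p[i] = old + ϵ; fplus  = f(m, args...)
            p[i] = old - ϵ; fminus = f(m, args...)
            p[i] = old
            g[i] = (fplus - fminus) / 2ϵ
        end
    end
    dm
end

m = Linear(randn(3, 3), randn(3))
x, y = randn(3), randn(3)
dm = grad(loss, m, x, y)
dW, db = dm.W, dm.b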

@MikeInnes
Member Author

> Does this mean that a user will need to make W and b global and define loss in the same context?

Er, no? For the most part I'm not expecting anything else to look different, so the MNIST example would stay exactly the same. It's really no different to the current TrackedArray approach in that sense, just without the hacky overloading.

Your API is something we discussed, as it's closer to what Knet currently has. At a minimum it only scales up well if you allow the structure to define the forward pass (e.g. via call overloading). Even then it imposes a bigger burden on user-defined types and small models, and it's harder to figure out how it plays out when you get to really complex models (as one example, higher-order models that take another model as input).
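
(For readers unfamiliar with the pattern, "letting the structure define the forward pass via call overloading" looks roughly like this; the Affine type below is illustrative, not Flux's API.)

# A callable struct: the model object carries its parameters and is
# itself the forward pass.
struct Affine
    W::Matrix{Float64}
    b::Vector{Float64}
end

(m::Affine)(x) = m.W * x .+ m.b    # call overloading

m = Affine(randn(5, 5), randn(5))
ŷ = m(randn(5))                     # forward pass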

@dfdx

dfdx commented Oct 18, 2017

Ah, I re-read the MNIST example. Do I understand correctly that loss() is actually a closure bound to an instance m::Chain? In that case my comment is indeed irrelevant.

@MikeInnes
Member Author

Essentially yes; it's not actually a closure in this case because m is global, but it could be. In the docs there are some examples of closing over parameters, and I expect those to work with Cassette as well.
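
One way to close over parameters, in the spirit of the docs examples mentioned here (an illustrative sketch; the linear constructor below is hypothetical, not the docs' exact code):

# The parameters live in the enclosing scope; the returned function
# captures them, so the model is just a closure.
function linear(in, out)
    W, b = randn(out, in), randn(out)
    x -> W*x .+ b
end

model = linear(10, 5)
ŷ = model(randn(10))    # forward pass through the captured W and b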

@baggepinnen
Contributor

How would the user go about taking the gradient of a model output with respect to a non-parameter, like the input? This is common in creating adversarial examples, linearizing dynamical models, etc.

@MikeInnes
Member Author

I guess that would have to be grad(loss, (W, b, x), x, y). Although that really makes it clear that loss should just be a zero-arg function, like grad(() -> loss(x, y), [W, b, x]).
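
With the toy finite-difference grad from the first sketch above, the zero-argument form reads as follows; dx is the gradient with respect to the input, which is what the adversarial-example and linearization use cases need. (Illustrative only; note the toy grad takes a Tuple rather than the vector literal [W, b, x].)

# x is treated as just another array being differentiated.
dW, db, dx = grad(() -> loss(x, y), (W, b, x))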

@MikeInnes
Member Author

A year on, we can do much cooler things here. Closing in favour of #628.
