-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AbstractArray or Composite? #85
Comments
Both are acceptable. Though we do run into the expression problem (#53), since the code most likely expects the thing to be |
As pointed out by @MikeInnes here and on slack, allowing for a choice of representation of the differential of a x = Fill(1.0, (2, 2))
y = Composite{typeof(x)}(; value=2.0)
x + y Whether the behaviour of sum(x) + y or Fill(getindex_value(x) + y.value) depends on whether you're treating I think there are a couple of solutions here:
I could be convinced of either approach I think, but I instinctively prefer the latter as the former feels like a patch -- better to disambiguate between fundamentally different objects than to make |
I'm really sceptical that a sensible design can be had in which we have both composite and abstractarray gradients. Will's concern in FluxML/Zygote.jl#445 was that using a FillArray as the gradient was hacky (I'm taking the core issue to be that gradients are inconsistent with the equivalent dense array). But what I've tried to illustrate is that this is inevitable with any version of this, because if you follow through on If on the other hand that tradeoff is unnacceptable, you have to use dense (or perhaps one-hot) arrays, which again leads to throwing out This will be true for any array type. If you mix composite and array gradients you're going to get a somewhat arbitrary mix of different kinds derivatives back out. |
Also, while I appreciate that this is meant to be more general, exactly the same O(N^2) behaviour affects regular arrays. It might be that the true solution is simply to fix this and not treat Fills as special at all (i.e. making arrays gradients mutable and updating them in place, which fixes the complexity). |
The point here is more that in the dense case this is a necessary consequence of the immutability of differentials (assuming you've not got some fancy compiler optimisation going on). Conversely, this really shouldn't be the case for
I think I probably agree. That said, we can definitely have
I'm sorry but I'm still not the inevitability of this. Assuming that you do indeed always represent the differential of a |
It depends on what the exact proposal is; the discussion on the Zygote issue indicated a mixing of array and composite adjoints (i.e. using composites for efficiency where possible but keeping array semantics broadly), so my comments are mainly aimed at attempts to preserve current semantics with better efficiency. I'm not certain but it seems like this issue is more about throwing out array semantics entirely, and defining the derivative only wrt the fill value, for which the earlier examples indeed don't apply. This case is a bit different because gradients of FillArrays themselves are now unambiguous, but there are still ways in which their ambiguity can leak out. Contrived example (since you mentioned upstream rrules in the original issue): @adjoint function (a * b)
y = a * b
y = reduce(==, y) ? Fill(mean(y), size(y)) : y
y, c̄ -> (c̄'b, a'c̄)
end
gradient(x -> (W*x)[1], x) If Even if it were easy to buy |
Firstly, apologies for the slow response.
I think this is what this issue is more about now -- your comments on the ambiguities inherent in mixing composite and array types for differentials seem pretty compelling.
I don't entirely follow - specifically it's unclear to me why its valid to choose any array with the same sum if you receive a composite. On the topic of using a |
Assuming you buy the We should be clear, though, that the So the fact that |
Okay, I think I follow. I think my view on this is that we should never need an element-wise gradient for a Taking your example, the rule-writer has the information required to know unambiguously what the gradient should be by virtue of the fact that |
The core point is really that switching to More broadly, I'm not imagining that people will actually write code like this, but |
Can we close this now that we have ProjectTo going on? |
Background
This issue does stuff with
FillArrays
, and got be thinking about a problem we've known about for a while but not successfully resolved.Say that you have a
Fill
x
of lengthN
and you callgetindex
on it:x[1]
How should the differential
dx
w.r.tx
associated with this operation be represented? Some options:Composite{Fill{eltype(x)}}
.Option 1 is bad because it's obviously
O(N)
Option 2 appears to be performant, but only until you consider adding it to another differential. Say that you also got
x[2]
at some point in your programme. At some point you'll have to accumulate the differential forx[1]
andx[2]
, whose sum is clearly not also a 1-hot array. Consequently, we've lost performance, and essentially regressed toO(N)
if you start adding lots of 1-hot arrays together.Option 3 is great, because when adding two
Composite{Fill{eltype(x)}}
you get another one inO(1)
-time and memory.Upshot
As presented above, what needs to happen for
Fill
s is quite obvious imho. What is less obvious is how this generalises. There's been a discussion going on for a while about the correct way to represent the differential of anAbstractArray
.It seems that the correct thing to do with an
Array
is represent it as anotherArray
, and this seems to be because anArray
is already completely general in that its elements can take any value you like.What is less clear it what to do for structured arrays. Take, for example,
Diagonal
. Should we allow its adjoint to be represented by any otherAbstractArray
, or should be we using aComposite
? My feeling is that this observation regardingFill
s provides evidence in favour of the latter, and that we need to figure out ways to handle resulting problems that arise (e.g. say that aComposite
representing aDiagonal
was propagated into therrule
for matrix-matrix multiply. We would need to know what to do with it)The text was updated successfully, but these errors were encountered: