Replies: 1 comment 1 reply
-
@aseyboldt, you've just repeated all of the opinions and points I've already responded to in #1089. If you are confused about the responses you got in #1089 or disagree with them, then use a Discussion to ask specific questions about the content of those responses. In general, we are not going to waste time duplicating conversations throughout this repository. You can attempt to reformulate this Discussion into something constructive—e.g. in the way I mentioned above—or make an honest attempt to characterize the entire discussion in #1089, and not just your side of it; otherwise, it will be deleted. |
Beta Was this translation helpful? Give feedback.
1 reply
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Broadcasting in aesara
I'm doing my best here to try and summarize what's going on in #1089 from my
point of view, and trying to figure out what broadcasting semantics we should
use.
How theano handled broadcasting
I think there has been some confusion about how exactly broadcasting used to
work in theano, not least of all because this wasn't all that well documented.
I'll explain my understanding of this here.
Theano broadcasting was designed so that in a given graph it is always
possible to know statically where broadcasting is happening and what is being
broadcasted. This was explicitly stated in the
docs
I'll call this property of a graph (independent on how it is achieved)
"static broadcasting". If we do not know from the graph alone where
broadcasting is happening, I'll call that "dynamic broadcasting".
So how did theano achieve static broadcasting? It deviated from numpy
semantics, and introduced as part of the type of a TensorVariable an attribute
broadcastable
. This specifies for each axis of the Variable if broadcastingis allowed to occur in that axis.
If the flag is
True
for an axis, this means that the only valid length ofthat axis was 1, and that this axis could be broadcasted in the graph if
necessary.
If the flag is
False
, the length of that axis could be arbitrary (including1), but theano would never broadcast that axis.
This means however that theano did not always have the same behavior as NumPy.
If we compute the sum of two TensorVariables that are marked as non-broadcastable,
theano would ensure that they are never broadcasted during graph evaluation.
In cases where NumPy would in fact broadcast, it would raise an exception
saying that their shapes are not compatible:
There was an explicit check in the elemwise
code
to make sure that the elemwise op does not in fact broadcast those two arrays,
so that the rewrites and gradients can rely on the fact that no unexpected
broadcasting is happening.
This was reported as a bug
here, and the check
then removed here,
but I think originally this was a conscious design decision by the theano
developers, not a bug in the first place. (And a bad error message as well, but
that's a different topic...)
I think at that time the reasons for that check and the implications of their
removal were not commonly understood and not discussed at all. (Not trying to
blame anyone here, this stuff happens)
Why anyone might want static broadcasting
So the theano developers clearly understood that they are deviating from NumPy,
why would they (or anyone) have wanted static broadcasting in the first place?
We stumbled into one reason for this soon after we removed the broadcasting
check from elemwise: Gradients depend on whether broadcasting happened or not.
Let's say we have
Then dz will be
ones_like(x)
ifx
was not broadcasted. But if it was,each value of
x
is used inz
several times, one time for each value iny
,so the gradient should be
ones_like(x).sum(keepdims=True)
.The associated issue
contains long discussions about different options of how to represent this
additional complexity in the graph. There clearly are ways to do this,
and I don't want to go through all the options here right now, but I think (?)
everyone agrees that they introduce significant additional complexity into the
graph and make rewrites more difficult.
Just to illustrate that additional complexit, here is the graph of computing
the gradient of the sum of 10 matrices with the current work-in-progress
solution:
There is also I think however a deeper performance problem that is introduced
by not knowing the broadcasting pattern statically. This happens both in the
derivative code and in the forward code, I'll focus here on the forward code
however, because I think there it is easier to explain.
Let's use as an example a simple elemwise operation
exp(x * y)
, wherex
and
y
are vectors. If we are using static broadcasting we know exactly ifthere is any broadcasting when we compile the function, so we can generate
code under that assumption. In a dynamic broadcasting environment we need
to produce machine code that is capable of dealing with the broadcasting,
and the non-broadcasting case. Let's see how my best attempts for generic
and non-generic code compare (I'm using numba here, but this is really about
the code that llvm is able to produce here, if we did this in C I think the
results would look very similar.):
So here we have specialized implementations for
elemwise(exp(x * y))
where weassume we already know if we are broadcasting, and we also have two different
generic versions that compute the same thing, but work regardless of the
broadcasting that's actually happening. We also have a third generic
implementation (
foo_dispatch
) that just calls the specialized versions in thedifferent broadcasting scenarios.
So how does their performance compare?
The two truly generic versions
foo_generic
andfoo_generic2
are alwaysslower by a factor of ~3.5 for at least one of the scenarios.
Only$2^n - 1$ different cases. $n=10$ would I think
foo_dispatch
matches the performance of the static implementations. Sobased on this,
foo_dispatch
looks pretty nice, it matches the performance ofboth the specific cases. We do run into issues however if the number of inputs
grows: For two inputs there are 3 different options: x is broadcasted, y is
broadcasted and nothing is broadcasted. But if we have
n
vector inputs for amore general elemwise, we have
not be all that unusual for elemwise ops in practice, but we just can't compile
1023 loops for each of those cases.
So to summarize: If we know where broadcasting happens statically, we can produce
much simpler gradient graphs, and we can generate faster code.
My personal (tentative) conclusion
I think the downsides to continuing to use something similar to theano
broadcasting are mainly in three areas:
theano/aesara and NumPy. I think better error messages and better
documentation for those differences should help a lot with this though.
broadcasting is happening.
base.
I tend to think however that the advantages outway the problems:
cc @ricardoV94 @brandonwillard @rlouf
Beta Was this translation helpful? Give feedback.
All reactions