C Elemwise implementation doesn't broadcast variables #335

Closed
brandonwillard opened this issue Mar 13, 2021 · 6 comments · Fixed by #928
Labels
bug, important

Comments

@brandonwillard
Member

brandonwillard commented Mar 13, 2021

I'm seeing a very weird error in Elemwise:

First, here's a basic broadcasting operation in NumPy:

import numpy as np

x = np.array([[-1.32720483],
              [ 0.23442016]])

m = np.array([0., 0.])

z = x - m
>>> z
array([[-1.32720483, -1.32720483],
       [ 0.23442016,  0.23442016]])

In Aesara, here's the equivalent operation using TensorConstants:

import aesara
import aesara.tensor as at


x_at = at.as_tensor(x)
m_at = at.as_tensor(m)

z_at = x_at - m_at
>>> aesara.dprint(z_at)
Elemwise{sub,no_inplace} [id A] ''
 |TensorConstant{[[-1.32720..23442016]]} [id B]
 |InplaceDimShuffle{x,0} [id C] ''
   |TensorConstant{(2,) of 0.0} [id D]

The resulting graph is a simple Elemwise for the subtraction Op, as expected. There's also an InplaceDimShuffle that adds a broadcastable dimension to the second argument, so that both inputs have the same number of dimensions. This InplaceDimShuffle is equivalent to np.expand_dims(m, 0), which, when subtracted from x, yields the same value as z.
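For reference, here is the NumPy analogue of that InplaceDimShuffle step, using the m array from the first example (a quick illustration, not Aesara code):

>>> np.expand_dims(m, 0)
array([[0., 0.]])
>>> np.expand_dims(m, 0).shape
(1, 2)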

So far, everything is good, because

>>> np.array_equal(aesara.function([], z_at)(), z)
True

Now, when we replace the TensorConstants with generic TensorVariables, we get a strange error:

x_v = at.matrix("x")
m_v = at.vector("m")

z_v = x_v - m_v
>>> aesara.function([x_v, m_v], z_v)(x, m)
<ipython-input-23-66cc28afa70f> in <module>
----> 1 aesara.function([x_v, m_v], z_v)(x, m)

~/projects/code/python/Aesara/aesara/compile/function/types.py in __call__(self, *args, **kwargs)
    989                     node=self.fn.nodes[self.fn.position_of_error],
    990                     thunk=thunk,
--> 991                     storage_map=getattr(self.fn, "storage_map", None),
    992                 )
    993             else:

~/projects/code/python/Aesara/aesara/link/utils.py in raise_with_op(fgraph, node, thunk, exc_info, storage_map)
    506         # Some exception need extra parameter in inputs. So forget the
    507         # extra long error message in that case.
--> 508     raise exc_value.with_traceback(exc_trace)
    509
    510

~/projects/code/python/Aesara/aesara/compile/function/types.py in __call__(self, *args, **kwargs)
    973             outputs = (
    974                 self.fn()
--> 975                 if output_subset is None
    976                 else self.fn(output_subset=output_subset)
    977             )

ValueError: Input dimension mis-match. (input[0].shape[1] = 1, input[1].shape[1] = 2)
Apply node that caused the error: Elemwise{sub,no_inplace}(x, InplaceDimShuffle{x,0}.0)
Toposort index: 1
Inputs types: [TensorType(float64, matrix), TensorType(float64, row)]
Inputs shapes: [(2, 1), (1, 2)]
Inputs strides: [(8, 8), (16, 8)]
Inputs values: [array([[-1.32720483],
       [ 0.23442016]]), array([[0., 0.]])]
Outputs clients: [['output']]

We can emulate this issue using the Python implementation of Elemwise, as well:

>>> aesara.function([x_v, m_v], z_v, mode="FAST_COMPILE")(x, m)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/projects/code/python/Aesara/aesara/link/vm.py in __call__(self)
    312                 ):
--> 313                     thunk()
    314                     for old_s in old_storage:

~/projects/code/python/Aesara/aesara/graph/op.py in rval(p, i, o, n)
    472             def rval(p=p, i=node_input_storage, o=node_output_storage, n=node):
--> 473                 r = p(n, [x[0] for x in i], o)
    474                 for o in node.outputs:

~/projects/code/python/Aesara/aesara/tensor/elemwise.py in perform(self, node, inputs, output_storage)
    760                 base_exc_str = f"Dimension mismatch; shapes are {', '.join(msg)}"
--> 761                 raise ValueError(base_exc_str)
    762

ValueError: Dimension mismatch; shapes are (2, 1), (*, 2)

From this output, we can see that this spurious error is apparently the result of some bad input validation code in Elemwise.

The same is true for the C implementation, although that's a little less apparent from the output. In this case, the C code for this validation step is here.

@brandonwillard brandonwillard added bug Something isn't working important labels Mar 13, 2021
@brandonwillard brandonwillard pinned this issue Mar 13, 2021
@brandonwillard
Member Author

brandonwillard commented Mar 13, 2021

The Python validation code (i.e. starting here) checks the inputs' shapes against the broadcastable properties of their corresponding symbolic variables, node.inputs. It looks like this problem arises because x_v.broadcastable == (False, False), and False is being interpreted incorrectly as "cannot be broadcastable" instead of "isn't certainly broadcastable".

If the former is the correct interpretation, then this whole thing should've failed much earlier when the underlying TensorTypes validated the numeric inputs. In other words, if broadcastable == False means "cannot be broadcastable", then we have two problems: the Type validation isn't working properly and the Elemwise Op shouldn't be performing the Type validation for the inputs (again).

More specifically, Elemwise's validation check fails on the second dimension, where the first input has a shape value of 1 with node.inputs[0].broadcastable[1] == False, and the second input has a shape value of 2 with node.inputs[1].broadcastable[1] == False. The failure condition asks whether all the shapes in the given dimension are broadcastable (i.e. 1); if they aren't, it fails when one of the inputs has a broadcastable shape (again, a shape value of 1) but is not "symbolically" broadcastable (i.e. its broadcastable entry for that dimension is False). Honestly, that last condition makes no sense to me.

Regardless, in this situation, we have concrete (i.e. ground) shape values for the inputs (because we have concrete NumPy inputs), so I do not understand the point of using the symbolic broadcast information in node.inputs[i].broadcastable. The only check that makes sense is one that asks whether or not the shape values are equal or one of them is broadcastable (i.e. 1).
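To be concrete, a run-time check along those lines could look like the following (a minimal sketch; check_runtime_shapes is a hypothetical helper, not existing Aesara code, and the inputs are assumed to have already been DimShuffled to the same number of dimensions):

def check_runtime_shapes(input_shapes):
    # For each dimension, all non-unit shape values must agree; a shape
    # value of 1 is always broadcastable, regardless of any symbolic
    # broadcast information.
    for dim_shapes in zip(*input_shapes):
        if len({s for s in dim_shapes if s != 1}) > 1:
            raise ValueError(f"Dimension mismatch; shapes are {input_shapes}")

# The shapes from the failing example above pass this check, just as they
# do in NumPy (which broadcasts them to (2, 2)):
check_runtime_shapes([(2, 1), (1, 2)])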

@brandonwillard
Member Author

brandonwillard commented Mar 14, 2021

OK, so this can easily be fixed for the Python implementation of Elemwise, but the C implementation is essentially a ground-up custom implementation of what NumPy already does (in C, as well), so changing that is much less straightforward.

This whole situation brings up some important aspects of Aesara/Theano's design. The problem we're seeing is really caused by our use of a (2, 1)-shaped input for a TensorVariable with the TensorType at.matrix.

at.matrix has the broadcast pattern (False, False), while the input we used for that variable has the broadcast pattern (False, True) (i.e. (2 == 1, 1 == 1)). In Aesara/Theano, the broadcast pattern of the input doesn't need to exactly match the corresponding TensorVariable's TensorType.broadcastable when those values are False (i.e. not broadcastable), but they do when they're True (i.e. broadcastable).

Here's an illustration:

import numpy as np

import aesara
import aesara.tensor as at

# The input is broadcastable in both dimensions, but the symbolic type isn't
>>> at.TensorType(aesara.config.floatX, [False, False]).filter(np.array([[1]]))
array([[1.]])

# The input is broadcastable in the first dimension, but the symbolic type isn't
>>> at.TensorType(aesara.config.floatX, [False, False]).filter(np.array([[1, 2]]))
array([[1., 2.]])

# The input is broadcastable in both dimensions, but the symbolic type is only
# broadcastable in the second
>>> at.TensorType(aesara.config.floatX, [False, True]).filter(np.array([[1]]))
array([[1.]])

# The input is not broadcastable in the second dimension, but the symbolic
# type is
>>> at.TensorType(aesara.config.floatX, [False, True]).filter(np.array([[1, 2]]))
...
TypeError: ('Non-unit value on shape on a broadcastable dimension.', (1, 2), (False, True))

In other words, when a symbolic type is broadcastable in a given dimension, the inputs must be as well, but, if the symbolic type isn't, then it doesn't matter if the inputs are broadcastable in that dimension or not.
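Stated as a predicate, the rule illustrated above amounts to something like this (a sketch of the filtering logic, not Aesara's actual implementation):

def shape_matches_type(shape, broadcastable):
    # A True (broadcastable) entry in the type requires a shape value of 1;
    # a False entry accepts any shape value, including 1.
    return all(s == 1 for s, b in zip(shape, broadcastable) if b)

assert shape_matches_type((1, 1), (False, False))     # first example above
assert shape_matches_type((1, 2), (False, False))     # second example
assert shape_matches_type((1, 1), (False, True))      # third example
assert not shape_matches_type((1, 2), (False, True))  # the TypeError case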

The problem with Elemwise's C implementation is that it doesn't honor this "weak" broadcasting equality requirement; instead, it imposes a "strong" equality requirement.

This is a pretty big design and/or implementation problem, and, apparently, it's been around for a while (at least since the last release of the original Theano).

I'm going to say that this is yet another good reason to push for the changes in #312 sooner rather than later. Using Cython, we could fix this problem much more quickly. Plus, we could use the NumPy C API with minimal effort and more rapidly iterate on comparisons between the vanilla CPython API and the NumPy API (or anything else, really).

@brandonwillard brandonwillard changed the title Bad input validation in Elemwise C Elemwise implementation doesn't broadcast variables Mar 14, 2021
@brandonwillard
Member Author

brandonwillard commented Mar 14, 2021

N.B.: While working on #336, I removed the input validation within the C implementation so that it would allow "unexpected" broadcastable inputs (i.e. the symbolic type says an input isn't broadcastable, but the input value is). The result was an unbroadcasted version of the correct output.

There may be a way to pre/post-broadcast the inputs/outputs of an Elemwise Op, as it currently does for inputs with an unequal number of dimensions. More specifically, Elemwise uses DimShuffle to add enough broadcastable dimensions to each input so that they all have the same number of dimensions.
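For example, the InplaceDimShuffle{x,0} nodes in the graphs above come from exactly this padding step; the same operation can be requested directly through the public dimshuffle method:

import aesara.tensor as at

m_v = at.vector("m")

# Prepend a broadcastable ("x") dimension so that a 1-dimensional input
# can be combined with a 2-dimensional one
>>> m_v.dimshuffle("x", 0).broadcastable
(True, False)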

We could use aesara.tensor.basic_opt.broadcast_like to make sure that either the inputs or the output are broadcast correctly; however, using broadcast_like is very wasteful, because it doesn't use views (see #159). Our current view-based broadcast Op is a Python-only implementation, so we would be adding that overhead to each Elemwise call, if we used it.
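In NumPy terms, the difference is between a zero-stride view and a full copy; using the m array from the first example (an illustration of the two approaches, not the Aesara Ops themselves):

>>> m_view = np.broadcast_to(m, (2, 2))  # view-based: no data is copied
>>> m_view.strides
(0, 8)
>>> m_copy = np.ascontiguousarray(m_view)  # copy-based: a new (2, 2) array
>>> m_copy.strides
(16, 8)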

@twiecki
Contributor

twiecki commented Mar 14, 2021

Or we add a COp for the view-based broadcast, with Cython that should be pretty easy I think.

@brandonwillard
Member Author

brandonwillard commented Mar 14, 2021

Or we add a COp for the view-based broadcast, with Cython that should be pretty easy I think.

I haven't looked into it, but my impression is that it might actually be difficult, because—as far as I've seen—none of the current C framework uses the NumPy C API in a clear-cut, high-level way. For instance, there are some aspects of the symbolic typing and input handling at the C-level that make me question whether or not the existing Python implementation is even being used correctly (e.g. I'm not sure whether or not views are actually being copied at some point).

I (partially) take that back; there is an instance or two: e.g. here where the NumPy C API is used to broadcast in the relevant way.

@brandonwillard brandonwillard unpinned this issue May 23, 2021
@brandonwillard
Member Author

brandonwillard commented Jan 12, 2022

This appears to be another example of the same C implementation limitations:

import numpy as np

import aesara
import aesara.tensor as at


x = at.scalar()
y = at.isclose(at.as_tensor([0, 0.5, 1, -1]), x)

y_fn = aesara.function([x], y)

aesara.dprint(y_fn, print_type=True)
# Elemwise{Composite{AND(LE(Abs((i0 - i1)), i2), Invert(OR(IsNan(i3), IsInf(i3))))}} [id A] <TensorType(bool, vector)> ''   3
#  |TensorConstant{[ 0.   0.5.. 1.  -1. ]} [id B] <TensorType(float64, vector)>
#  |InplaceDimShuffle{x} [id C] <TensorType(float64, (True,))> ''   0
#  | |<TensorType(float64, scalar)> [id D] <TensorType(float64, scalar)>
#  |Elemwise{Composite{(i0 + (i1 * Abs(i2)))}} [id E] <TensorType(float64, (True,))> ''   2
#  | |TensorConstant{(1,) of 1e-08} [id F] <TensorType(float64, (True,))>
#  | |TensorConstant{(1,) of 1e-05} [id G] <TensorType(float64, (True,))>
#  | |InplaceDimShuffle{x} [id C] <TensorType(float64, (True,))> ''   0
#  |Rebroadcast{(0, False)} [id H] <TensorType(float64, vector)> ''   1
#    |InplaceDimShuffle{x} [id C] <TensorType(float64, (True,))> ''   0

y_fn(0)
# ValueError: Input dimension mismatch. (input[0].shape[0] = 4, input[3].shape[0] = 1)
# Apply node that caused the error: Elemwise{Composite{AND(LE(Abs((i0 - i1)), i2), Invert(OR(IsNan(i3), IsInf(i3))))}}(TensorConstant{[ 0.   0.5.. 1.  -1. ]}, InplaceDimShuffle{x}.0, Elemwise{Composite{(i0 + (i1 * Abs(i2)))}}.0, Rebroadcast{(0, False)}.0)
# Toposort index: 3
# Inputs types: [TensorType(float64, vector), TensorType(float64, (True,)), TensorType(float64, (True,)), TensorType(float64, vector)]
# Inputs shapes: [(4,), (1,), (1,), (1,)]
# Inputs strides: [(8,), (8,), (8,), (8,)]
# Inputs values: [array([ 0. ,  0.5,  1. , -1. ]), array([0.]), array([1.e-08]), array([0.])]
# Outputs clients: [['output']]

Manually broadcasting the arguments (e.g. using at.broadcast_arrays) to at.isclose avoids the issue, but it shouldn't be necessary.
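For reference, the workaround looks like this (a sketch that simply broadcasts the arguments before calling at.isclose):

x = at.scalar()
vals = at.as_tensor([0, 0.5, 1, -1])

# Broadcast explicitly so that Elemwise never sees a (1,)-shaped value
# for an input typed as a non-broadcastable vector
vals_bc, x_bc = at.broadcast_arrays(vals, x)
y_fn = aesara.function([x], at.isclose(vals_bc, x_bc))

>>> y_fn(0)
array([ True, False, False, False])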

This arose as an issue during the implementation of aesara-devs/aeppl#110.

Also, running this same example on the #711 branch produces the following graph (and no errors):

Elemwise{Composite{AND(LE(Abs((i0 - i1)), i2), Invert(OR(i3, i4)))}} [id A] <TensorType(bool, (None,))> ''   4
 |TensorConstant{[ 0.   0.5.. 1.  -1. ]} [id B] <TensorType(float64, (4,))>
 |InplaceDimShuffle{x} [id C] <TensorType(float64, (1,))> ''   0
 | |<TensorType(float64, ())> [id D] <TensorType(float64, ())>
 |Elemwise{Composite{(i0 + (i1 * Abs(i2)))}} [id E] <TensorType(float64, (1,))> ''   3
 | |TensorConstant{(1,) of 1e-08} [id F] <TensorType(float64, (1,))>
 | |TensorConstant{(1,) of 1e-05} [id G] <TensorType(float64, (1,))>
 | |InplaceDimShuffle{x} [id C] <TensorType(float64, (1,))> ''   0
 |Elemwise{isnan,no_inplace} [id H] <TensorType(bool, (1,))> ''   2
 | |InplaceDimShuffle{x} [id C] <TensorType(float64, (1,))> ''   0
 |Elemwise{isinf,no_inplace} [id I] <TensorType(bool, (1,))> ''   1
   |InplaceDimShuffle{x} [id C] <TensorType(float64, (1,))> ''   0

I'm a little curious as to why the new isnan nodes pass the same Elemwise dimension checks that previously raised an error, though. I'm assuming it has to do with the removal/replacement of the Rebroadcast: a Rebroadcast implies that something broadcastable is being represented by an unbroadcastable type, and, in this situation, that is exactly what would produce the observed error. The Elemwise C implementation is probably adding this check conditional on that faulty type information, which would explain why #711 makes it work.

Update: Yeah, shape/broadcastable inference is the problem in this case; the check that produces the error is only added when the input type is not broadcastable (see here), so the Rebroadcast is causing the issue.
