Performance regression with `at.grad` and broadcasting #1119
It looks like that […]
The extra nodes you're seeing are introduced by […]. As I mentioned, a C implementation for […]. We need that compile-time information in order to accurately determine which symbolic shape value to use for the shape resulting from […].

Regardless, we should look more closely at these new inferred shape graphs and see if we're missing any other simplifications, because it's still quite likely that we are.

N.B. Your example is a little strange, because […]:

```
Alloc [id A] 1
 |TensorConstant{(1,) of 1.0} [id B]
 |Shape_i{0} [id C] 0
   |<TensorType(float64, (None,))> [id D]
```
Per the above, #1122 is needed in order to "universally" reduce broadcasted shape graphs. Users will need to manually specify when a tensor's shape values are greater than one, though.
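For illustration, one existing way to pin down a shape value at graph-construction time is `specify_shape`; a minimal sketch (the exact interface proposed in #1122 may well differ from this):

```python
import numpy as np
import aesara
import aesara.tensor as at

x = at.vector("x")

# Tell Aesara that `x` has a fixed length of 5 (so the dimension cannot be 1
# and cannot broadcast), letting rewrites use static shape information instead
# of emitting run-time shape-inference nodes.
x_known = at.specify_shape(x, (5,))

f = aesara.function([x], (x_known + 1.0).sum())
print(f(np.zeros(5)))  # 5.0
```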
Actually, the idea was that:

```python
>>> np.zeros(0).ndim
1
```

I see your point though about needing to handle the case where one of the dims is 1, hence the need for the extra nodes. I think #1123 should be enough to fix this for us, unless there's a way of hinting that the shape shouldn't be broadcast.
Yeah, I just split this into two issues that could each address the problem independently, so we can close this. #1122 would allow us to return exactly the same graphs as before in more-or-less exactly the same situation, albeit with the additional specification of a dimension not being one (e.g. on either […]). #1123 should simply improve the performance of that one […].
Now that I think about it, we should only close this when/if we find that #1124 does actually help.
Confirmed that #1124 does indeed speed things up for us: #1124 (comment)
Much appreciated! I've created a separate issue for the […]
@mattearllongshot, if you can, try using the Numba backend, and report the performance issues like you have here. We're trying to replace the C backend with Numba, and this kind of input would really help that effort. N.B. The C backend doesn't compile all the […]
Hello, I've just tried numba, and it seems to work well. The first iteration takes a long time (presumably it is compiling), but afterwards it is about 5% faster than C. I'm switching to numba by passing:

```python
opts = OptimizationQuery(include=[], exclude=[])
numba_mode = Mode(NumbaLinker(), opts)
aesara.function(..., mode=numba_mode)
```

Is this correct? Also, do you have any tips for profiling? With […]
To compile a function to numba you can simply call […]. About profiling, we don't have anything specific implemented yet, but there's a discussion here that might give some ideas: #1086
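The elided call is presumably the string-mode shortcut; a minimal sketch, assuming the registered `"NUMBA"` mode name:

```python
import numpy as np
import aesara
import aesara.tensor as at

x = at.vector("x")

# Passing the registered mode name avoids constructing a `Mode` with a
# `NumbaLinker` and an `OptimizationQuery` by hand.
f = aesara.function([x], (x ** 2).sum(), mode="NUMBA")
print(f(np.arange(3.0)))  # 5.0
```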
Ultimately, the profiling capabilities boil down to whatever is possible within Numba, so an issue like numba/numba#5028 might be worth tracking; otherwise, check the Numba communities for other approaches.
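In the meantime, a plain wall-clock measurement at least separates the one-time Numba compilation cost of the first call from the steady-state per-call cost; a rough sketch (the compiled expression here is only an illustrative stand-in):

```python
import time

import numpy as np
import aesara
import aesara.tensor as at

x = at.vector("x")
f = aesara.function([x], at.exp(x).sum(), mode="NUMBA")  # illustrative graph

x_val = np.random.default_rng(0).normal(size=100_000)

t0 = time.perf_counter()
f(x_val)  # the first call pays the Numba compilation cost
t1 = time.perf_counter()
print(f"first call (incl. compilation): {t1 - t0:.4f} s")

t0 = time.perf_counter()
for _ in range(1_000):
    f(x_val)
t1 = time.perf_counter()
print(f"steady state: {(t1 - t0) / 1_000 * 1e6:.1f} us per call")
```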
Many thanks! Running things like this is of great help to us, so, whenever you can, try compiling to Numba and report any bugs and performance differences you observe.
Yes, we can also consider "eagerly" compiling those graphs during the construction of the […]
When Numba is used, the entire graph is run as a single thunk that simply calls the Numba-compiled function. In this way, we avoid needing to use Aesara's old "virtual machines" and most of its manual memory management. It ends up being a really good thing for us, because Numba has implemented all that stuff much better and at a lower level than we can. This is one of the primary reasons we plan to use Numba as the default backend/transpilation target going forward.
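A conceptual sketch of that arrangement (not Aesara's actual internals; all names here are illustrative):

```python
import numba
import numpy as np

@numba.njit
def compiled_graph(x):
    # Stand-in for the single Numba function the whole Aesara graph lowers to.
    return (x + 1.0).sum()

# The "thunk" is just a zero-argument callable over shared storage; no
# per-node virtual machine or manual memory management is involved.
storage = {"x": np.arange(3.0), "out": None}

def thunk():
    storage["out"] = compiled_graph(storage["x"])

thunk()
print(storage["out"])  # 6.0
```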
Now that #1124 and #1128 are in place, how is this performing with the C backend? Also, we should probably be "fusing" the scalar-only subgraphs with […]
This has been added in 69b80f7.
Here's a run-down of the current results:

```python
import numpy as np

import aesara
import aesara.tensor as at
from aesara.compile.mode import get_default_mode

shared_val = aesara.shared(np.zeros(0), name="shared_val")
params = at.vector("params")

# We don't know how `params` and `shared_val` are going to broadcast until
# run-time, so there isn't going to be any "static" information with which to
# work/optimize.
val = params + shared_val
val.name = "val"

val_sum = at.sum(val)
val_sum.name = "val_sum"

output = at.grad(val_sum, params)

aesara.dprint(output, print_type=True)
# Elemwise{second} [id A] <TensorType(float64, (None,))> '(dval_sum/dparams)'
#  |Elemwise{add,no_inplace} [id B] <TensorType(float64, (None,))> 'val'
#  | |params [id C] <TensorType(float64, (None,))>
#  | |shared_val [id D] <TensorType(float64, (None,))>
#  |InplaceDimShuffle{x} [id E] <TensorType(float64, (1,))>
#    |Elemwise{second,no_inplace} [id F] <TensorType(float64, ())>
#      |Sum{acc_dtype=float64} [id G] <TensorType(float64, ())> 'val_sum'
#      | |Elemwise{add,no_inplace} [id B] <TensorType(float64, (None,))> 'val'
#      |TensorConstant{1.0} [id H] <TensorType(float64, ())>

# The un-"optimized" gradient graph is basically saying that the result is
# just a bunch of `1`s that conform to the shape of the subgraph `val`.

mode = get_default_mode().including("local_remove_all_assert")
f = aesara.function([params], output, mode=mode)

aesara.dprint(f, print_type=True)
# Alloc [id A] <TensorType(float64, (None,))> '(dval_sum/dparams)' 6
#  |TensorConstant{(1,) of 1.0} [id B] <TensorType(float64, (1,))>
#  |TensorFromScalar [id C] <TensorType(int64, ())> 5
#    |Composite{Abs(maximum(Switch(EQ(i0, 1), (-1), i0), Switch(EQ(i1, 1), (-1), i1)))} [id D] <int64> 4
#      |ScalarFromTensor [id E] <int64> 3
#      | |Shape_i{0} [id F] <TensorType(int64, ())> 2
#      |   |params [id G] <TensorType(float64, (None,))>
#      |ScalarFromTensor [id H] <int64> 1
#        |Shape_i{0} [id I] <TensorType(int64, ())> 0
#          |shared_val [id J] <TensorType(float64, (None,))>
```

Unfortunately, in 2.7.1, a rewrite basically just picked one of the inputs and hoped that it was a valid choice. That can sometimes work and, as a result, produce a simpler graph (i.e. the one 2.7.1 produces here).

We've been in the process of fixing these kinds of "static" broadcasting issues from Theano, and one of the most relevant changes is #1122. With that in place, one would be able to specify that […]
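For reference, the `Composite` node above encodes the usual pairwise broadcast rule for a single dimension; a plain-Python sketch of the same logic (illustrative only, not Aesara's implementation):

```python
def broadcast_dim(d0: int, d1: int) -> int:
    # Mirrors Abs(maximum(Switch(EQ(i0, 1), -1, i0), Switch(EQ(i1, 1), -1, i1))):
    # a size-1 dim is demoted to -1 so `max` prefers the other size, and `abs`
    # maps the (1, 1) case back to 1.
    a = -1 if d0 == 1 else d0
    b = -1 if d1 == 1 else d1
    return abs(max(a, b))

assert broadcast_dim(1, 7) == 7  # size-1 dims broadcast to the other size
assert broadcast_dim(7, 1) == 7
assert broadcast_dim(1, 1) == 1
assert broadcast_dim(0, 1) == 0  # the zero-length case from this issue
```

Note that, like the graph itself (with the `Assert`s removed), this doesn't validate mismatched sizes; that check is what the removed `Assert` nodes performed.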
Description of your problem or feature request

With our production graph we're seeing `TensorFromScalar` taking up a non-trivial amount of time in 2.7.9. With 2.7.1 this doesn't happen.
Here's a minimal repro that includes the extra nodes on 2.7.9, but not on 2.7.1: […]

Here is the graph that is produced by 2.7.9: […]

And here is the graph that is produced by 2.7.1: […]
(Note both are run with `optimizer_including=local_remove_all_assert`.)

Versions and main components
Aesara config (output of `python -c "import aesara; print(aesara.config)"`): aesara_config.txt