
Performance regression with at.grad and broadcasting #1119

Open
mattearllongshot opened this issue Aug 17, 2022 · 17 comments

@mattearllongshot

mattearllongshot commented Aug 17, 2022

Description of your problem or feature request

With our production graph we're seeing TensorFromScalar taking up a non-trivial amount of time in 2.7.9.

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  25.4%    25.4%       0.575s       9.35e-06s     C     61500        3   CGemv{inplace}
  11.1%    36.5%       0.251s       1.22e-05s     C     20500        1   Elemwise{...}}
   8.6%    45.2%       0.196s       9.54e-06s     C     20500        1   Elemwise{...}[(0, 3)]
   8.2%    53.4%       0.186s       1.81e-06s     C     102500        5   IncSubtensor{InplaceInc;int64}
   4.8%    58.2%       0.109s       2.66e-06s     Py    41000        2   TensorFromScalar
   3.6%    61.8%       0.081s       3.96e-06s     C     20500        1   Softmax{axis=-1}
   2.6%    64.4%       0.058s       1.42e-06s     C     41000        2   IncSubtensor{InplaceInc;int64::}

With 2.7.1 this doesn't happen:

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  28.1%    28.1%       0.591s       9.60e-06s     C     61500        3   CGemv{inplace}
  13.3%    41.3%       0.279s       1.36e-05s     C     20500        1   Elemwise{...}}
  10.6%    51.9%       0.223s       1.09e-05s     C     20500        1   Elemwise{...}}[(0, 3)]
   8.9%    60.8%       0.187s       1.82e-06s     C     102500        5   IncSubtensor{InplaceInc;int64}
   3.6%    64.4%       0.076s       3.72e-06s     C     20500        1   Softmax{axis=-1}
   2.9%    67.3%       0.060s       1.46e-06s     C     41000        2   IncSubtensor{InplaceInc;int64::}
   2.2%    69.5%       0.046s       2.27e-06s     C     20500        1   SoftmaxGrad{axis=-1}

Here's a minimal repro that includes the extra nodes on 2.7.9, but not on 2.7.1:

import aesara
import aesara.tensor as at
import numpy as np

shared_val = aesara.shared(np.zeros(0))

params = at.vector()
val = params + shared_val

f = aesara.function([params], [at.grad(at.sum(val), params)])
aesara.printing.pydotprint(f, outfile="graph.png", var_with_name_simple=True)

Here is the graph that is produced by 2.7.9:
[graph image]

And here is the graph that is produced by 2.7.1:
[graph image]

(note both are run with optimizer_including=local_remove_all_assert)

Versions and main components

  • Aesara version: 2.7.9
  • Aesara config (python -c "import aesara; print(aesara.config)") aesara_config.txt
  • Python version: 3.8.1
  • Operating system: Ubuntu 18.04.6 LTS
  • How did you install Aesara: (conda/pip) conda
@mattearllongshot mattearllongshot changed the title Performance regression with broadcasting at.grad and broadcasting Performance regression with at.grad and broadcasting Aug 17, 2022
@brandonwillard
Member

It looks like that Op doesn't have a C implementation.

@brandonwillard
Member

The extra nodes you're seeing are introduced by Elemwise.infer_shape (via #981), which should now compute its symbolic output shapes correctly in all cases.

As I mentioned, a C implementation for TensorFromScalar would probably help a lot, and it should be very straightforward to implement; otherwise, graphs like that can't be completely removed until we add the ability to specify static/compile-time shape constraints (e.g. params.shape[0] > 1).
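For context, TensorFromScalar just wraps a scalar value in a 0-d tensor so tensor-level Ops can consume it; a rough NumPy analogue (a sketch for illustration, not the actual Op code) is:

```python
import numpy as np

# Rough NumPy analogue of TensorFromScalar: wrap a plain scalar
# in a 0-d array so array-level operations can consume it.
x = np.asarray(5)
print(x.ndim, x.shape)  # 0 ()
```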

We need that compile-time information in order to accurately determine which symbolic shape value to use for the shape resulting from val = params + shared_val. For example, one could set the value of params to an array with shape (1,), in which case the broadcasted output shape of val would be determined by shared_val—assuming it's not also (1,). However, if we knew which of the two addends had a shape greater than one, we could simply use that shape (i.e. the Shape_i in 2.7.1).
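The ambiguity can be demonstrated with plain NumPy (a sketch, not part of the original report): the runtime output shape depends on which operand happens to have length 1, so neither input's shape can be used unconditionally at compile time.

```python
import numpy as np

# If `params` arrives with shape (1,), the output shape of the addition
# is dictated by the *other* operand, so the compiler can't just reuse
# params' symbolic shape for the broadcasted result.
params_val = np.ones(1)   # shape (1,)
shared_val = np.zeros(5)  # shape (5,)
out = params_val + shared_val
print(out.shape)  # (5,)
```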

Regardless, we should look more closely at these new inferred shape graphs and see if we're missing any other simplifications, because it's still quite likely that we are.

N.B. Your example is a little strange, because shared_val's value is actually zero-dimensional, but its TensorType is a vector. Nevertheless, if you use a different value and set shared_val = aesara.shared(np.zeros((1,)), shape=(1,)), you'll see that the shape inference is able to produce a much simpler graph:

Alloc [id A] 1
 |TensorConstant{(1,) of 1.0} [id B]
 |Shape_i{0} [id C] 0
   |<TensorType(float64, (None,))> [id D]

@brandonwillard
Member

Per the above, #1122 is needed in order to "universally" reduce broadcasted shape graphs. Users will need to manually specify when a tensor's shape values are greater than one, though.

@mattearllongshot
Author

N.B. Your example is a little strange, because shared_val's value is actually zero-dimensional, but its TensorType is a vector. Nevertheless, if you use a different value and set shared_val = aesara.shared(np.zeros((1,)), shape=(1,)), you'll see that the shape inference is able to produce a much simpler graph:

Actually, the idea was that shared_val is one-dimensional, but I don't know the size of that dimension at compile time, hence I (arbitrarily) set it to zero:

>>> np.zeros(0).ndim
1

I see your point though about needing to handle the case where one of the dims is 1, hence the need for the extra nodes. I think #1123 should be enough to fix this for us, unless there's a way of hinting that the shape shouldn't be broadcast.

@mattearllongshot
Author

Never mind, I see that #1122 is tracking this, thanks. Should this issue (#1119) be closed since the two particular issues that it refers to are being tracked independently?

@brandonwillard
Member

brandonwillard commented Aug 18, 2022

Never mind, I see that #1122 is tracking this, thanks. Should this issue (#1119) be closed since the two particular issues that it refers to are being tracked independently?

Yeah, I just split this into two issues that could each address the problem independently, so we can close this.

#1122 would allow us to return exactly the same graphs as before in more or less the same situations, albeit with the additional specification of a dimension not being one (e.g. on shared_val and/or params).

#1123 should simply improve the performance of that one Op, and that should make the new graph elements negligible. Since there's already a fix for that one, try profiling with that branch and tell us if it helps.

@brandonwillard
Member

Now that I think about it, we should only close this when/if we find that #1124 does actually help.

@mattearllongshot
Author

Confirmed that #1124 does indeed speed things up for us: #1124 (comment)

@brandonwillard
Member

Confirmed that #1124 does indeed speed things up for us: #1124 (comment)

Much appreciated! I've created a separate issue for the BroadcastTo: #1128. That should be similarly straightforward to add.

@brandonwillard
Member

brandonwillard commented Aug 19, 2022

@mattearllongshot, if you can, try using the Numba backend and report any performance issues like you have here. We're trying to replace the C backend with Numba, and this kind of input would really help that effort.

N.B. The C backend compiles each Op's C code separately (the C code for a whole graph isn't compiled together), but the Numba backend compiles the whole graph as one unit, which gives it significantly more potential for useful low-level optimizations.

@mattearllongshot
Author

Hello, I've just tried numba, and it seems to work well. The first iteration takes a long time (presumably it is compiling) but afterwards it is about 5% faster than C. I'm switching to numba by passing mode to aesara.function as follows:

from aesara.compile.mode import Mode
from aesara.graph.optdb import OptimizationQuery
from aesara.link.numba.linker import NumbaLinker

opts = OptimizationQuery(include=[], exclude=[])
numba_mode = Mode(NumbaLinker(), opts)
aesara.function(..., mode=numba_mode)

Is this correct? Also, do you have any tips for profiling? With profile=True I just get headline figures that indicate 99% of the time is spent in Function.vm.__call__. Is there a way to get a breakdown by node?

@ricardoV94
Contributor

ricardoV94 commented Aug 22, 2022

To compile a function with Numba you can simply call aesara.function(..., mode="NUMBA")

About profiling, we don't have anything specific implemented yet, but there's a discussion here that might give some ideas: #1086

@brandonwillard
Member

About profiling, we don't have anything specific implemented yet, but there's a discussion here that might give some ideas:

Ultimately, the profiling capabilities boil down to whatever is possible within Numba, so an issue like numba/numba#5028 might be worth tracking; otherwise, check the Numba communities for other approaches.

@brandonwillard
Member

Hello, I've just tried numba, and it seems to work well.

Many thanks! Running things like this is of great help to us, so, whenever you can, try compiling to Numba and report any bugs and performance differences you observe.

The first iteration takes a long time (presumably it is compiling) but afterwards it is about 5% faster than C.

Yes, we could also consider "eagerly" compiling those graphs during the construction of the Function objects returned by aesara.function. As for speed, we can't guarantee that everything will be faster, but we do expect our C and Numba results to be essentially at parity. If you find that Numba is significantly slower than an equivalent C implementation, create an issue, because there's likely something we need to fix.

With profile=True I just get headline figures that indicate 99% of the time is spent in Function.vm.__call__. Is there a way to get a breakdown by node?

When Numba is used, the entire graph is run as a single thunk that simply calls the Numba-compiled function. In this way, we avoid needing to use Aesara's old "virtual machines" and most of its manual memory management. It ends up being a really good thing for us, because Numba has implemented all that stuff much better and at a lower level than we can. This is one of the primary reasons we plan to use Numba as the default backend/transpilation target going forward.

@brandonwillard
Member

brandonwillard commented Oct 8, 2022

Now that #1124 and #1128 are in place, how is this performing with the C backend?

Also, we should probably be "fusing" the scalar-only subgraphs with Composite. That should measurably reduce the overhead by turning all those nodes into a single one. A more general approach would involve #1225, but we can quickly do this one by hand in broadcast_shape_iter if need be.

@brandonwillard
Member

Also, we should probably be "fusing" the scalar-only subgraphs with Composite. That should measurably reduce the overhead by turning all those nodes into a single one.

This has been added in 69b80f7.

@brandonwillard
Member

brandonwillard commented Oct 21, 2022

Here's a run-down of the current results:

import numpy as np

import aesara
import aesara.tensor as at
from aesara.compile.mode import get_default_mode


shared_val = aesara.shared(np.zeros(0), name="shared_val")

params = at.vector("params")

# We don't know how `params` and `shared_val` are going to broadcast until
# run-time, so there isn't going to be any "static" information with which to
# work/optimize.
val = params + shared_val
val.name = "val"

val_sum = at.sum(val)
val_sum.name = "val_sum"

output = at.grad(val_sum, params)

aesara.dprint(output, print_type=True)
# Elemwise{second} [id A] <TensorType(float64, (None,))> '(dval_sum/dparams)'
#  |Elemwise{add,no_inplace} [id B] <TensorType(float64, (None,))> 'val'
#  | |params [id C] <TensorType(float64, (None,))>
#  | |shared_val [id D] <TensorType(float64, (None,))>
#  |InplaceDimShuffle{x} [id E] <TensorType(float64, (1,))>
#    |Elemwise{second,no_inplace} [id F] <TensorType(float64, ())>
#      |Sum{acc_dtype=float64} [id G] <TensorType(float64, ())> 'val_sum'
#      | |Elemwise{add,no_inplace} [id B] <TensorType(float64, (None,))> 'val'
#      |TensorConstant{1.0} [id H] <TensorType(float64, ())>

# The un-"optimized" gradient graph is basically saying that the result is just
# a bunch of `1`s that conform to the shape of the subgraph `val`.

mode = get_default_mode().including("local_remove_all_assert")
f = aesara.function([params], output, mode=mode)


aesara.dprint(f, print_type=True)
# Alloc [id A] <TensorType(float64, (None,))> '(dval_sum/dparams)' 6
#  |TensorConstant{(1,) of 1.0} [id B] <TensorType(float64, (1,))>
#  |TensorFromScalar [id C] <TensorType(int64, ())> 5
#    |Composite{Abs(maximum(Switch(EQ(i0, 1), (-1), i0), Switch(EQ(i1, 1), (-1), i1)))} [id D] <int64> 4
#      |ScalarFromTensor [id E] <int64> 3
#      | |Shape_i{0} [id F] <TensorType(int64, ())> 2
#      |   |params [id G] <TensorType(float64, (None,))>
#      |ScalarFromTensor [id H] <int64> 1
#        |Shape_i{0} [id I] <TensorType(int64, ())> 0
#          |shared_val [id J] <TensorType(float64, (None,))>

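The Composite node above encodes the standard broadcast rule for a single dimension; in plain Python (a paraphrase of the scalar graph, not the actual implementation) it reads:

```python
def broadcast_dim(d0: int, d1: int) -> int:
    """Mirror of Composite{Abs(maximum(Switch(EQ(i0, 1), (-1), i0),
    Switch(EQ(i1, 1), (-1), i1)))}: a length-1 dimension defers to the
    other operand's length.  Invalid combinations like (3, 5) are
    normally guarded by Assert nodes, which were stripped here via
    local_remove_all_assert."""
    a = -1 if d0 == 1 else d0
    b = -1 if d1 == 1 else d1
    return abs(max(a, b))

print(broadcast_dim(1, 5))  # 5: the length-1 dim broadcasts
print(broadcast_dim(5, 1))  # 5
print(broadcast_dim(1, 1))  # 1
print(broadcast_dim(7, 7))  # 7
```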
Unfortunately, in 2.7.1, a rewrite basically just picked one of the inputs and hoped that it was a valid choice. That can sometimes work and, as a result, produce a simpler graph (i.e. id C above replaced with id F or id I), but sometimes it can't. The shared variable case is especially problematic: anything that changes the value of a shared variable would need to be intentionally constrained to non-broadcastable values (in this particular case), because Theano/Aesara won't, and effectively can't, enforce that constraint itself; it can only make that assumption in very specific cases.

We've been in the process of fixing these kinds of "static" broadcasting issues from Theano, and one of the most relevant changes is #1122. With that in place, one would be able to specify that params and shared_val are vectors with shapes that are not equal to 1, and this would allow us to produce a graph like 2.7.1 in a sound way. (N.B. That will still only ever be possible when the relevant sub-graphs consist of Ops that adequately propagate static shape information.)
