
Performance regression with at.grad and broadcasting #1119

Open
mattearllongshot opened this issue Aug 17, 2022 · 17 comments

@mattearllongshot

mattearllongshot commented Aug 17, 2022

Description of your problem or feature request

With our production graph we're seeing TensorFromScalar taking up a non-trivial amount of time in 2.7.9.

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  25.4%    25.4%       0.575s       9.35e-06s     C     61500        3   CGemv{inplace}
  11.1%    36.5%       0.251s       1.22e-05s     C     20500        1   Elemwise{...}}
   8.6%    45.2%       0.196s       9.54e-06s     C     20500        1   Elemwise{...}[(0, 3)]
   8.2%    53.4%       0.186s       1.81e-06s     C     102500        5   IncSubtensor{InplaceInc;int64}
   4.8%    58.2%       0.109s       2.66e-06s     Py    41000        2   TensorFromScalar
   3.6%    61.8%       0.081s       3.96e-06s     C     20500        1   Softmax{axis=-1}
   2.6%    64.4%       0.058s       1.42e-06s     C     41000        2   IncSubtensor{InplaceInc;int64::}

With 2.7.1 this doesn't happen:

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  28.1%    28.1%       0.591s       9.60e-06s     C     61500        3   CGemv{inplace}
  13.3%    41.3%       0.279s       1.36e-05s     C     20500        1   Elemwise{...}}
  10.6%    51.9%       0.223s       1.09e-05s     C     20500        1   Elemwise{...}}[(0, 3)]
   8.9%    60.8%       0.187s       1.82e-06s     C     102500        5   IncSubtensor{InplaceInc;int64}
   3.6%    64.4%       0.076s       3.72e-06s     C     20500        1   Softmax{axis=-1}
   2.9%    67.3%       0.060s       1.46e-06s     C     41000        2   IncSubtensor{InplaceInc;int64::}
   2.2%    69.5%       0.046s       2.27e-06s     C     20500        1   SoftmaxGrad{axis=-1}

Here's a minimal repro that includes the extra nodes on 2.7.9, but not on 2.7.1:

import aesara
import aesara.tensor as at
import numpy as np

shared_val = aesara.shared(np.zeros(0))

params = at.vector()
val = params + shared_val

f = aesara.function([params], [at.grad(at.sum(val), params)])
aesara.printing.pydotprint(f, outfile="graph.png", var_with_name_simple=True)

Here is the graph that is produced by 2.7.9:
[graph image]

And here is the graph that is produced by 2.7.1:
[graph image]

(note both are run with optimizer_including=local_remove_all_assert)

Versions and main components

  • Aesara version: 2.7.9
  • Aesara config (python -c "import aesara; print(aesara.config)") aesara_config.txt
  • Python version: 3.8.1
  • Operating system: Ubuntu 18.04.6 LTS
  • How did you install Aesara: (conda/pip) conda
@mattearllongshot mattearllongshot changed the title Performance regression with broadcasting at.grad and broadcasting Performance regression with at.grad and broadcasting Aug 17, 2022
@brandonwillard
Member

It looks like that Op doesn't have a C implementation.

@brandonwillard
Member

The extra nodes you're seeing are introduced by Elemwise.infer_shape (via #981), which should now compute its symbolic output shapes correctly in all cases.

As I mentioned, a C implementation for TensorFromScalar would probably help a lot, and it should be very straightforward to implement; otherwise, graphs like that can't be completely removed until we add the ability to specify static/compile-time shape constraints (e.g. params.shape[0] > 1).
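For context, TensorFromScalar just wraps a scalar value in a 0-d tensor so tensor-level Ops can consume it; a rough NumPy analogue (a sketch for illustration, not the actual Op code) is:

```python
import numpy as np

# Rough NumPy analogue of TensorFromScalar: wrap a plain scalar
# in a 0-d array so array-level operations can consume it.
x = np.asarray(5)
print(x.ndim, x.shape)  # 0 ()
```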

We need that compile-time information in order to accurately determine which symbolic shape value to use for the shape resulting from val = params + shared_val. For example, one could set the value of params to an array with shape (1,), in which case the broadcasted output shape of val would be determined by shared_val—assuming it's not also (1,). However, if we knew which of the two addends had a shape greater than one, we could simply use that shape (i.e. the Shape_i in 2.7.1).
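The ambiguity can be demonstrated with plain NumPy (a sketch, not part of the original report): the runtime output shape depends on which operand happens to have length 1, so neither input's shape can be used unconditionally at compile time.

```python
import numpy as np

# If `params` arrives with shape (1,), the output shape of the addition
# is dictated by the *other* operand, so the compiler can't just reuse
# params' symbolic shape for the broadcasted result.
params_val = np.ones(1)   # shape (1,)
shared_val = np.zeros(5)  # shape (5,)
out = params_val + shared_val
print(out.shape)  # (5,)
```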

Regardless, we should look more closely at these new inferred shape graphs and see if we're missing any other simplifications, because it's still quite likely that we are.

N.B. Your example is a little strange, because shared_val's value is actually zero-dimensional, but its TensorType is a vector. Nevertheless, if you use a different value and set shared_val = aesara.shared(np.zeros((1,)), shape=(1,)), you'll see that the shape inference is able to produce a much simpler graph:

Alloc [id A] 1
 |TensorConstant{(1,) of 1.0} [id B]
 |Shape_i{0} [id C] 0
   |<TensorType(float64, (None,))> [id D]

@brandonwillard
Member

Per the above, #1122 is needed in order to "universally" reduce broadcasted shape graphs. Users will need to manually specify when a tensor's shape values are greater than one, though.

@mattearllongshot
Author

N.B. Your example is a little strange, because shared_val's value is actually zero-dimensional, but its TensorType is a vector. Nevertheless, if you use a different value and set shared_val = aesara.shared(np.zeros((1,)), shape=(1,)), you'll see that the shape inference is able to produce a much simpler graph:

Actually, the idea was that shared_val is one-dimensional, but I don't know the size of that dimension at compile time, hence I (arbitrarily) set it to zero:

>>> np.zeros(0).ndim
1

I see your point though about needing to handle the case where one of the dims is 1, hence the need for the extra nodes. I think #1123 should be enough to fix this for us, unless there's a way of hinting that the shape shouldn't be broadcast.

@mattearllongshot
Author

Never mind, I see that #1122 is tracking this, thanks. Should this issue (#1119) be closed since the two particular issues that it refers to are being tracked independently?

@brandonwillard
Member

brandonwillard commented Aug 18, 2022

Never mind, I see that #1122 is tracking this, thanks. Should this issue (#1119) be closed since the two particular issues that it refers to are being tracked independently?

Yeah, I just split this into two issues that could each address the problem independently, so we can close this.

#1122 would allow us to return exactly the same graphs as before in more or less the same situations, albeit with the additional specification of a dimension not being one (e.g. on shared_val and/or params).

#1123 should simply improve the performance of that one Op, and that should make the new graph elements negligible. Since there's already a fix for that one, try profiling with that branch and tell us if it helps.

@brandonwillard
Member

Now that I think about it, we should only close this when/if we find that #1124 does actually help.

@mattearllongshot
Author

Confirmed that #1124 does indeed speed things up for us: #1124 (comment)

@brandonwillard
Member

Confirmed that #1124 does indeed speed things up for us: #1124 (comment)

Much appreciated! I've created a separate issue for the BroadcastTo: #1128. That should be similarly straightforward to add.

@brandonwillard
Member

brandonwillard commented Aug 19, 2022

@mattearllongshot, if you can, try using the Numba backend and report any performance issues like you have here. We're trying to replace the C backend with Numba, and this kind of input would really help that effort.

N.B. The C backend compiles each Op's C code separately (the C code for a whole graph isn't compiled together), but the Numba backend compiles the whole graph as one unit, which gives it significantly more potential for useful low-level optimizations.

@mattearllongshot
Author

Hello, I've just tried numba, and it seems to work well. The first iteration takes a long time (presumably it is compiling) but afterwards it is about 5% faster than C. I'm switching to numba by passing mode to aesara.function as follows:

from aesara.compile.mode import Mode
from aesara.graph.optdb import OptimizationQuery
from aesara.link.numba.linker import NumbaLinker

opts = OptimizationQuery(include=[], exclude=[])
numba_mode = Mode(NumbaLinker(), opts)
aesara.function(..., mode=numba_mode)

Is this correct? Also, do you have any tips for profiling? With profile=True I just get headline figures that indicate 99% of the time is spent in Function.vm.__call__. Is there a way to get a breakdown by node?

@ricardoV94
Contributor

ricardoV94 commented Aug 22, 2022

To compile a function with Numba you can simply call aesara.function(..., mode="NUMBA")

About profiling, we don't have anything specific implemented yet, but there's a discussion here that might give some ideas: #1086

@brandonwillard
Member

About profiling, we don't have anything specific implemented yet, but there's a discussion here that might give some ideas:

Ultimately, the profiling capabilities boil down to whatever is possible within Numba, so an issue like numba/numba#5028 might be worth tracking; otherwise, check the Numba communities for other approaches.

@brandonwillard
Member

Hello, I've just tried numba, and it seems to work well.

Many thanks! Running things like this is of great help to us, so, whenever you can, try compiling to Numba and report any bugs and performance differences you observe.

The first iteration takes a long time (presumably it is compiling) but afterwards it is about 5% faster than C.

Yes, we could also consider "eagerly" compiling those graphs during the construction of the Function objects returned by aesara.function. As for speed, we can't guarantee that everything will be faster, but we do expect our C and Numba results to be essentially at parity. If you find that Numba is significantly slower than an equivalent C implementation, create an issue, because there's likely something we need to fix.

With profile=True I just get headline figures that indicate 99% of the time is spent in Function.vm.__call__. Is there a way to get a breakdown by node?

When Numba is used, the entire graph is run as a single thunk that simply calls the Numba-compiled function. In this way, we avoid needing to use Aesara's old "virtual machines" and most of its manual memory management. It ends up being a really good thing for us, because Numba has implemented all that stuff much better and at a lower level than we can. This is one of the primary reasons we plan to use Numba as the default backend/transpilation target going forward.

@brandonwillard
Member

brandonwillard commented Oct 8, 2022

Now that #1124 and #1128 are in place, how is this performing with the C backend?

Also, we should probably be "fusing" the scalar-only subgraphs with Composite. That should measurably reduce the overhead by turning all those nodes into a single one. A more general approach would involve #1225, but we can quickly do this one by hand in broadcast_shape_iter if need be.

@brandonwillard
Member

Also, we should probably be "fusing" the scalar-only subgraphs with Composite. That should measurably reduce the overhead by turning all those nodes into a single one.

This has been added in 69b80f7.

@brandonwillard
Member

brandonwillard commented Oct 21, 2022

Here's a run-down of the current results:

import numpy as np

import aesara
import aesara.tensor as at
from aesara.compile.mode import get_default_mode


shared_val = aesara.shared(np.zeros(0), name="shared_val")

params = at.vector("params")

# We don't know how `params` and `shared_val` are going to broadcast until
# run-time, so there isn't going to be any "static" information with which to
# work/optimize.
val = params + shared_val
val.name = "val"

val_sum = at.sum(val)
val_sum.name = "val_sum"

output = at.grad(val_sum, params)

aesara.dprint(output, print_type=True)
# Elemwise{second} [id A] <TensorType(float64, (None,))> '(dval_sum/dparams)'
#  |Elemwise{add,no_inplace} [id B] <TensorType(float64, (None,))> 'val'
#  | |params [id C] <TensorType(float64, (None,))>
#  | |shared_val [id D] <TensorType(float64, (None,))>
#  |InplaceDimShuffle{x} [id E] <TensorType(float64, (1,))>
#    |Elemwise{second,no_inplace} [id F] <TensorType(float64, ())>
#      |Sum{acc_dtype=float64} [id G] <TensorType(float64, ())> 'val_sum'
#      | |Elemwise{add,no_inplace} [id B] <TensorType(float64, (None,))> 'val'
#      |TensorConstant{1.0} [id H] <TensorType(float64, ())>

# The un-"optimized" gradient graph is basically saying that the result is just
# a bunch of `1`s that conform to the shape of the subgraph `val`.

mode = get_default_mode().including("local_remove_all_assert")
f = aesara.function([params], output, mode=mode)


aesara.dprint(f, print_type=True)
# Alloc [id A] <TensorType(float64, (None,))> '(dval_sum/dparams)' 6
#  |TensorConstant{(1,) of 1.0} [id B] <TensorType(float64, (1,))>
#  |TensorFromScalar [id C] <TensorType(int64, ())> 5
#    |Composite{Abs(maximum(Switch(EQ(i0, 1), (-1), i0), Switch(EQ(i1, 1), (-1), i1)))} [id D] <int64> 4
#      |ScalarFromTensor [id E] <int64> 3
#      | |Shape_i{0} [id F] <TensorType(int64, ())> 2
#      |   |params [id G] <TensorType(float64, (None,))>
#      |ScalarFromTensor [id H] <int64> 1
#        |Shape_i{0} [id I] <TensorType(int64, ())> 0
#          |shared_val [id J] <TensorType(float64, (None,))>

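The Composite node above encodes the standard broadcast rule for a single dimension; in plain Python (a paraphrase of the scalar graph, not the actual implementation) it reads:

```python
def broadcast_dim(d0: int, d1: int) -> int:
    """Mirror of Composite{Abs(maximum(Switch(EQ(i0, 1), (-1), i0),
    Switch(EQ(i1, 1), (-1), i1)))}: a length-1 dimension defers to the
    other operand's length.  Invalid combinations like (3, 5) are
    normally guarded by Assert nodes, which were stripped here via
    local_remove_all_assert."""
    a = -1 if d0 == 1 else d0
    b = -1 if d1 == 1 else d1
    return abs(max(a, b))

print(broadcast_dim(1, 5))  # 5: the length-1 dim broadcasts
print(broadcast_dim(5, 1))  # 5
print(broadcast_dim(1, 1))  # 1
print(broadcast_dim(7, 7))  # 7
```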
Unfortunately, in 2.7.1, a rewrite basically just picked one of the inputs and hoped that it was a valid choice. That can sometimes work and, as a result, produce a simpler graph (i.e. id C above replaced with id F or id I), but sometimes it can't. The shared variable case is especially problematic: anything that changes the value of a shared variable would need to be intentionally constrained to non-broadcastable values (in this particular case), because Theano/Aesara won't, and effectively can't, enforce that constraint itself; it can only make that assumption in very specific cases.

We've been in the process of fixing these kinds of "static" broadcasting issues from Theano, and one of the most relevant changes is #1122. With that in place, one would be able to specify that params and shared_val are vectors with shapes that are not equal to 1, and this would allow us to produce a graph like 2.7.1 in a sound way. (N.B. That will still only ever be possible when the relevant sub-graphs consist of Ops that adequately propagate static shape information.)
