Improve the performance of Numba-compiled array reduction `Op`s #599
Conversation
Force-pushed: d0486bc → 2e2ccc9
Codecov Report

```
@@           Coverage Diff           @@
##             main     #599   +/-  ##
======================================
  Coverage   78.36%   78.36%
======================================
  Files         152      152
  Lines       47693    47742    +49
  Branches    10880    10879     -1
======================================
+ Hits        37373    37415    +42
- Misses       7772     7778     +6
- Partials     2548     2549     +1
```
We need to combine this with the custom `vectorize` implementation described in #595. In other words, instead of using `numba.vectorize`, we need to generate our own `for`-loop, like the example here.

That, combined with a custom reduce loop (e.g. like this function in the above example), should get us near-NumPy performance via LLVM.
As the example shows, the timings aren't good for `axis=1`, because the simple all-purpose loop doesn't work well after transposing the array. The transpose-plus-loop approach produces a bad memory access pattern, but that can be solved by generating different loops based on the (literal) `axis` value.
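To illustrate the point about axis-specialized loops, here is a minimal sketch (the function names are illustrative, not from the PR): for a C-contiguous 2-D array, each `axis` value gets its own loop ordering so that the innermost loop always walks contiguous memory.

```python
import numpy as np

# Hypothetical sketch: one specialized loop per (literal) axis value,
# ordered so the innermost loop is contiguous for a C-ordered array.

def sum_axis0(x):
    n0, n1 = x.shape
    out = np.zeros(n1, dtype=x.dtype)
    for i in range(n0):        # outer loop over rows
        for j in range(n1):    # inner loop is contiguous in memory
            out[j] += x[i, j]
    return out

def sum_axis1(x):
    n0, n1 = x.shape
    out = np.zeros(n0, dtype=x.dtype)
    for i in range(n0):
        acc = x.dtype.type(0)  # scalar accumulator avoids repeated writes
        for j in range(n1):    # inner loop is again contiguous
            acc += x[i, j]
        out[i] = acc
    return out
```

Neither version needs a transpose, which is the whole point: the axis-specific loop order replaces the bad memory access pattern.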
Anyway, the broadcasting logic can surely be added, but the main issue here (I think @fanshi118 and I discussed this a bit at the time) is that even with custom loops built entirely in Numba, we weren't approaching NumPy performance once we scaled up the arrays. For reference, this is the Gist that we benchmarked.
I'm not sure I understand that Gist, but the example Gist in my earlier comment appears to demonstrate NumPy-comparable results for at least `axis=0`. We should start by extending that example with a simple handwritten Numba implementation for `axis=1`. After that, we can consider how to generalize the two handwritten implementations.
I assume that by "handwritten Numba implementation" you mean a for-loop in which the indices of the inner loop are added manually; for a two-dimensional array with `axis=1`, that'd be an outer loop over rows with an inner loop over columns. If so, the Gist I mentioned above does exactly that: it generalizes the outer loops of the vectorization using `ndindex` and the innermost loop using a simple for-loop. In fact, it's more or less how NumPy itself does the generalization, except that we use a string-based function generator. Anyway, we even have a handwritten Gist for fixed two- and three-dimensional arrays, on which we tested all this before we generalized it. It won't make much of a difference, though, since the functions generated in both cases are exactly the same. Link to the fixed-dimension loop Gist: https://gist.github.com/kc611/7cc0451c6b5b53c9ce46fdc57c6a0da1
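The string-based generation described above can be sketched as follows (the names `make_careduce_sum`, `careduce`, etc. are illustrative, not the Gist's actual API): the outer loops come from `np.ndindex` over the output shape, and the reduced axis becomes a plain inner for-loop.

```python
import numpy as np

# Hypothetical sketch of a string-based reduction-loop generator:
# a sum reduction over a literal `axis` for arrays of a known `ndim`.

def make_careduce_sum(axis, ndim):
    out_idx = ", ".join(f"i{k}" for k in range(ndim) if k != axis)
    in_idx = ", ".join("r" if k == axis else f"i{k}" for k in range(ndim))
    src = f"""
def careduce(x):
    out_shape = tuple(s for a, s in enumerate(x.shape) if a != {axis})
    out = np.zeros(out_shape, dtype=x.dtype)
    for ({out_idx},) in np.ndindex(out_shape):
        acc = x.dtype.type(0)
        for r in range(x.shape[{axis}]):
            acc += x[{in_idx}]
        out[{out_idx},] = acc
    return out
"""
    env = {"np": np}
    exec(src, env)  # compile the generated source into a real function
    return env["careduce"]

sum_axis1 = make_careduce_sum(axis=1, ndim=2)
```

Because `axis` is baked in as a literal at generation time, the generated function contains no runtime dispatch on the axis, which is exactly what a compiler like Numba wants to see.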
I've updated the Gist.
Your Gist includes some generalizations that are only important to us; I was talking about a very simple Gist that focuses only on the narrow case in question.
To move forward, we might need to start implementing custom conversions for specific `Op`s.

By the way, I've updated the Gist. Those plots seem to imply that the slow version performs unnecessary writes to the output array (i.e. a write on every iteration of the inner loop). Anyway, we can get past this pretty easily with a custom implementation.
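The "unnecessary writes" concern can be illustrated with a small sketch (illustrative names, not the PR's code): one version stores to the output array on every iteration, while the other keeps a scalar accumulator that a compiler can hold in a register and writes the result once.

```python
import numpy as np

# Illustrative comparison only. Under a JIT like Numba/LLVM, the
# accumulator form is typically much easier to optimize.

def all_slow(x):
    res = np.full(1, True, dtype=np.bool_)
    for i in range(x.shape[0]):
        res[0] = res[0] & x[i]    # a memory store per element
    return res[0]

def all_fast(x):
    acc = True
    for i in range(x.shape[0]):
        acc = acc and bool(x[i])  # value can stay in a register
    return acc
```

Both compute the same logical-AND reduction; only the write pattern differs.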
Force-pushed: 2e2ccc9 → 3f098d3
The current failing test seems to be some issue with boolean typecasting in Numba:

```python
import numpy as np
import numba


def careduce_axis(x):
    res_shape = 1
    res = np.full(res_shape, True, dtype=np.bool)
    x = x.astype(np.bool)
    for idx_arr in np.ndindex(res_shape):
        for i in range(x.shape[0]):
            res[0] = res[0] & x[i]
    return res


numba_careduce_axis = numba.njit(careduce_axis)

x = np.random.random(10)
careduce_axis(x)  # Works
numba_careduce_axis(x)  # Doesn't work
```
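A hedged workaround sketch (not verified against the Numba versions discussed here): avoid the deprecated `np.bool` alias entirely and use `np.bool_`, which is also what modern NumPy requires; the `astype` call can likewise be replaced with a comparison that yields a boolean array.

```python
import numpy as np

# Workaround sketch: `np.bool_` instead of the removed `np.bool` alias,
# and `x != 0` instead of `x.astype(np.bool)`.

def careduce_axis_bool(x):
    res = np.ones(1, dtype=np.bool_)  # logical-AND identity
    xb = x != 0                       # boolean array without astype
    for i in range(xb.shape[0]):
        res[0] = res[0] & xb[i]
    return res
```

This runs under plain NumPy; whether it also fixes the Numba typing failure above would need to be confirmed against the Numba version in CI.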
Force-pushed: 3f098d3 → c010237
Is there anything further that needs to be done in this PR?
Looks good!

Let's run some quick benchmarks on those new `CAReduce` implementations and make sure they're better. Actually, we might want to add unit tests for this (i.e. tests that make sure we're within an acceptable margin of NumPy-like performance).
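A performance-margin test in that spirit could look roughly like the following sketch; `assert_within_factor`, `performance_factor`, and the compared callables are placeholders, not the actual test-suite API.

```python
import timeit
import numpy as np

# Sketch: assert that `fn` is within `performance_factor` of `reference`.
# Taking the min over several repeats reduces timing noise.

def assert_within_factor(fn, reference, args, performance_factor=4, repeats=5):
    t_fn = min(timeit.repeat(lambda: fn(*args), number=10, repeat=repeats))
    t_ref = min(timeit.repeat(lambda: reference(*args), number=10, repeat=repeats))
    assert t_fn <= performance_factor * t_ref, (t_fn, t_ref)

x = np.random.random((100, 100))
# e.g. comparing a compiled reduction against NumPy itself:
assert_within_factor(lambda a: a.sum(axis=1), lambda a: a.sum(axis=1), (x,))
```

Such tests are inherently noisy on shared CI machines, so the factor needs to be generous.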
tests/link/test_numba_performance.py
Outdated

```python
        if not isinstance(i, (SharedVariable, Constant))
    ],
    numpy_inputs,
    performace_factor=4,
```
Looks like the performance isn't that good with this implementation either.

(I suspect it might have something to do with the `dtype` casting we're doing at the end of `CAReduce` functions in Numba.)
We might need to pre-compile those `CAReduce` loops using `use_optimized_cheap_pass`.
I just pushed changes adding that, but it doesn't look like it's making any difference in performance. Can you confirm whether it's correctly implemented (with respect to the `exec` interface)?
The context manager needs to be used when Numba JIT compiles a function, but it looks like it's been added to the Python source string compilation utility.
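A pure-Python analogue of this point (no Numba required, all names hypothetical): the pass-swapping context manager has to be active while the compiler actually runs, not merely while the source string is being built.

```python
from contextlib import contextmanager

# Stand-in for Numba's compiler state; purely illustrative.
COMPILER_STATE = {"pass": "cheap"}

@contextmanager
def use_optimized_cheap_pass():
    old = COMPILER_STATE["pass"]
    COMPILER_STATE["pass"] = "optimized"
    try:
        yield
    finally:
        COMPILER_STATE["pass"] = old

def fake_jit_compile(src):
    # records which pass was active at *compile* time
    return (src, COMPILER_STATE["pass"])

def build_source():
    return "def f(x):\n    return x + 1"

# Wrong: only source construction is wrapped; compilation sees "cheap".
with use_optimized_cheap_pass():
    src = build_source()
compiled_wrong = fake_jit_compile(src)

# Right: compilation itself happens inside the context manager.
with use_optimized_cheap_pass():
    compiled_right = fake_jit_compile(src)
```

In the real code, `fake_jit_compile` corresponds to the point where Numba JIT-compiles the function, which is why wrapping the source-string utility has no effect.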
Force-pushed: df073d3 → 0eb943d
aesara/link/utils.py
Outdated

```python
@contextmanager
def use_optimized_cheap_pass(*args, **kwargs):
    """Temporarily replace the cheap optimization pass with a better one."""
    from numba.core.registry import cpu_target

    context = cpu_target.target_context._internal_codegen
    old_pm = context._mpm_cheap
    new_pm = context._module_pass_manager(
        loop_vectorize=True, slp_vectorize=True, opt=3, cost="cheap"
    )
    context._mpm_cheap = new_pm
    try:
        yield
    finally:
        context._mpm_cheap = old_pm
```
This is a general utility module for linking, so we can't put this Numba-specific code here.
aesara/link/utils.py
Outdated

```python
global_env["use_optimized_cheap_pass"] = use_optimized_cheap_pass
src = f"""
with use_optimized_cheap_pass():
{indent(src, " " * 4)}
"""
```
Same here; this function is supposed to be applicable to any string-to-Python-code conversion, so we can't add Numba-specific code to it.
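For context, a generic string-to-Python-code helper of the kind this comment refers to looks roughly like the sketch below (names are illustrative); keeping it backend-agnostic means any Numba-specific wrapping belongs in the caller, not here.

```python
# Sketch of a backend-agnostic source-compilation utility: it only
# compiles and execs a source string, returning the named function.

def compile_function_src(src, fn_name, global_env=None):
    mod_code = compile(src, "<string>", "exec")
    env = dict(global_env or {})
    exec(mod_code, env)
    return env[fn_name]

add_one = compile_function_src("def add_one(x):\n    return x + 1", "add_one")
```

Because nothing in it knows about Numba, a caller that needs `use_optimized_cheap_pass` must apply it around the subsequent JIT compilation instead.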
Force-pushed: 0eb943d → 80b9176
Force-pushed: 629f638 → 73fe17c
Force-pushed: 2185084 → 7574fb2
I've made some changes that should produce reduction functions similar to the best-performing (and contextually suitable) implementation in the case study.

The timing comparison between the Aesara and NumPy versions isn't appropriate, though, so it's been set to `xfail` for the moment. The reason is that `at.max` produces `MaxAndArgmax` `Op`s, which, as the name implies, compute both the max and the arg-max, while NumPy computes only one at a time.

More importantly, our current Numba translation of `MaxAndArgmax` literally computes both operations separately, so it will necessarily be much more expensive than a single NumPy max or arg-max.

We need to address this discrepancy as soon as possible, and definitely before merging this PR.
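By contrast, a fused max-and-argmax needs only a single traversal; a minimal sketch (illustrative, not the PR's translation) for the 1-D case:

```python
import numpy as np

# Single-pass fused reduction: one comparison per element updates
# both the running max and its index.

def max_and_argmax_1d(x):
    best = x[0]
    best_idx = 0
    for i in range(1, x.shape[0]):
        if x[i] > best:
            best = x[i]
            best_idx = i
    return best, best_idx
```

Computing the two reductions separately, as the current translation does, roughly doubles the traversal cost that this fused loop avoids.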
Force-pushed: 3f08067 → d204db5
I changed the benchmark test so that it necessarily uses only the reduction `Op` itself (rather than `MaxAndArgmax`), and updated it with the results of benchmarking only that `Op`.
This only applies to reduction `Op`s (e.g. `CAReduce`).
Force-pushed: d204db5 → b4c8fb0
OK, I think we're good on this. There are some uncovered dispatch functions, but they don't have `CAReduce` implementations, so they won't actually be used. They'll be covered when/if there's ever a need to implement `CAReduce` `Op`s for those `Scalar` `Op`s.
Linked issue: #529
This PR replaces our use of `vectorize` in `numba_funcify_CAReduce`, while also removing the harmful `np.transpose` (more details: #404 (comment)) from the JIT-ed function.

A performance comparison for this implementation: https://gist.github.com/kc611/7cc0451c6b5b53c9ce46fdc57c6a0da1