
Improve the performance of Numba-compiled array reduction Ops #599

Merged
merged 4 commits into from
Jan 20, 2022

Conversation

Contributor

@kc611 kc611 commented Sep 28, 2021

Linked issue: #529

This PR replaces our use of vectorize in numba_funcify_CAReduce and removes the harmful np.transpose (more details: #404 (comment)) from the JIT-ed function.

A performance comparison for this implementation: https://gist.github.com/kc611/7cc0451c6b5b53c9ce46fdc57c6a0da1

@kc611 kc611 marked this pull request as draft September 28, 2021 14:52

codecov bot commented Sep 29, 2021

Codecov Report

Merging #599 (7574fb2) into main (6fe9f83) will increase coverage by 0.00%.
The diff coverage is 94.02%.

❗ Current head 7574fb2 differs from pull request most recent head b4c8fb0. Consider uploading reports for the commit b4c8fb0 to get more accurate results

@@           Coverage Diff           @@
##             main     #599   +/-   ##
=======================================
  Coverage   78.36%   78.36%           
=======================================
  Files         152      152           
  Lines       47693    47742   +49     
  Branches    10880    10879    -1     
=======================================
+ Hits        37373    37415   +42     
- Misses       7772     7778    +6     
- Partials     2548     2549    +1     
Impacted Files Coverage Δ
aesara/link/numba/dispatch/elemwise.py 96.00% <93.27%> (-1.64%) ⬇️
aesara/link/numba/dispatch/basic.py 92.01% <100.00%> (+0.22%) ⬆️
aesara/tensor/basic_opt.py 85.14% <0.00%> (-0.08%) ⬇️
aesara/link/numba/dispatch/extra_ops.py 98.09% <0.00%> (-0.06%) ⬇️

@brandonwillard brandonwillard linked an issue Nov 8, 2021 that may be closed by this pull request
@brandonwillard brandonwillard added the Numba Involves Numba transpilation label Nov 23, 2021
Member

@brandonwillard brandonwillard left a comment


We need to combine this with the custom vectorize implementation described in #595. In other words, instead of using numba.vectorize, we need to generate our own for-loop, like the example here.

That, combined with a custom reduce loop (e.g. this function in the above example), should get us near-NumPy performance via LLVM.

As the example shows, the timings aren't good for axis=1 because the simple all-purpose loop doesn't work well after transposing the array. The transpose + loop produces a bad memory access pattern, but that can be solved by generating different loops based on the (literal) axis value.
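
For illustration, here is a minimal sketch of what axis-specialized loops could look like for a 2D max reduction in Numba. This is not the PR's code, and the function names are made up; it only demonstrates the idea of keeping the inner loop on contiguous memory per (literal) axis value:

import numba
import numpy as np

# Hypothetical axis=0 loop: the inner loop sweeps each row, so reads
# and writes both stay contiguous for C-ordered arrays.
@numba.njit
def max_axis_0(x):
    res = x[0].copy()
    for i in range(1, x.shape[0]):
        for j in range(x.shape[1]):
            if x[i, j] > res[j]:
                res[j] = x[i, j]
    return res

# Hypothetical axis=1 loop: accumulate each row's maximum in a scalar
# instead of transposing the array first.
@numba.njit
def max_axis_1(x):
    res = np.empty(x.shape[0], dtype=x.dtype)
    for i in range(x.shape[0]):
        acc = x[i, 0]
        for j in range(1, x.shape[1]):
            if x[i, j] > acc:
                acc = x[i, j]
        res[i] = acc
    return res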

Contributor Author

kc611 commented Nov 23, 2021

Yeah, so if you have a look at create_elemwise_reducer, we actually do build custom for loops for the kind of input array we're given. Back when I built this, I made it specifically for reduce- or at-style functions, so it's basically a custom vectorize implementation minus all the broadcasting logic for multiple inputs.

Anyway, the broadcasting logic can surely be added, but the main issue here (I think @fanshi118 and I discussed this a bit at the time) is that even with custom loops built entirely in Numba, we weren't approaching NumPy performance as we scaled up the arrays.

For reference, this is the gist that we benchmarked:
https://gist.github.com/kc611/22760dca36cc9062a401da89ee60ced8
(Updated it a bit to remove the parallelization stuff that we added for extra performance)

@brandonwillard
Member

> Anyway, the broadcasting logic can surely be added, but the main issue here (I think @fanshi118 and I discussed this a bit at the time) is that even with custom loops built entirely in Numba, we weren't approaching NumPy performance as we scaled up the arrays.
>
> For reference, this is the gist that we benchmarked: https://gist.github.com/kc611/22760dca36cc9062a401da89ee60ced8 (updated a bit to remove the parallelization stuff that we added for extra performance)

I'm not sure I understand that Gist, but the example Gist in my earlier comment appears to demonstrate NumPy-comparable results for at least axis=0.

We should start by extending that example with a simple handwritten Numba implementation for axis=1, one that performs at least as well as NumPy, of course. If that's not possible, we should confirm as much and report it to the Numba folks.

After that, we can consider how to generalize the two handwritten implementations.

Contributor Author

kc611 commented Nov 23, 2021

I assume that by a handwritten Numba implementation you mean a for loop in which the indices of the inner loop are manually added. For a two-dimensional array with axis=1, that'd be a[i][j], where j is the inner loop variable and i is the outer loop variable.

If so, then the gist I mentioned above does exactly that: it generalizes the outer loops of the vectorization using np.ndindex and the inner-most loop using a simple for loop. In fact, it's more or less how NumPy itself does the generalization, except that we use a string-based function generator.

We even have a handwritten gist for fixed two- and three-dimensional arrays, on which we tested all of this before generalizing it. It won't make much of a difference, though, since the functions generated in both cases are exactly the same.

Link to fixed dimension loop gist: https://gist.github.com/kc611/7cc0451c6b5b53c9ce46fdc57c6a0da1
(Updated to remove the indexing logic in favor of np.ndindex)
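
To make the described structure concrete, here is a minimal hand-written sketch, assuming a float array and a max reduction over the last axis (the function name is illustrative, not from the gist): np.ndindex drives the outer loops over the result shape, and a plain for loop performs the inner reduction.

import numba
import numpy as np

@numba.njit
def max_over_last_axis(x):
    res = np.full(x.shape[:-1], -np.inf)
    for idx in np.ndindex(x.shape[:-1]):
        # Inner-most loop: a simple reduction over the last axis.
        for i in range(x.shape[-1]):
            if x[idx + (i,)] > res[idx]:
                res[idx] = x[idx + (i,)]
    return res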

Member

brandonwillard commented Nov 23, 2021

I've updated the np.maximum.reduce example Gist to include all the approaches we've considered.

> If so, then the gist I mentioned above does exactly that: it generalizes the outer loops of the vectorization using np.ndindex and the inner-most loop using a simple for loop.

Your Gist includes some generalizations that are only important to us; I was talking about a very simple Gist that focuses only on the very narrow np.maximum.reduce with axis=1 case and includes all the timings/relative benchmarks, so that it's easier for others (e.g. the Numba folks) to follow.

@brandonwillard
Member

To move forward, we might need to start implementing custom conversions for specific CAReduce Ops and—possibly—axis values. We already have an implementation for max, so we'll need to time the others like we did for that, and find acceptable implementations for those, as well.

By the way, I've updated the max example Gist; it now contains some helpful debug print-outs of the CFG plots for the generated LLVM IR.

Those plots seem to imply that the slow version performs unnecessary writes to the output array (i.e. res[i] = res[i] when res[i] >= x[i, j]). Apparently the unnecessary else clause in custom_max isn't being removed during optimization.

Anyway, we can get past this pretty easily with a custom implementation.
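
To make the write pattern concrete, here is a hedged sketch (illustrative names, not the Gist's exact code) contrasting the two store patterns:

import numba

# Mirrors the redundant-store pattern visible in the CFG plots: the
# else branch re-stores res[i] on every iteration.
@numba.njit
def row_max_with_else(x, res):
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if x[i, j] > res[i]:
                res[i] = x[i, j]
            else:
                res[i] = res[i]  # redundant write that LLVM isn't removing
    return res

# A custom implementation simply drops the else clause, writing only
# when the running maximum actually changes.
@numba.njit
def row_max_guarded(x, res):
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if x[i, j] > res[i]:
                res[i] = x[i, j]
    return res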

Contributor Author

kc611 commented Dec 22, 2021

The currently failing test seems to be caused by an issue with boolean typecasting in Numba:

import numpy as np
import numba

def careduce_axis(x):
    res_shape = 1
    res = np.full(res_shape, True, dtype=np.bool)
    x = x.astype(np.bool)
    for idx_arr in np.ndindex(res_shape):
        for i in range(x.shape[0]):
            res[0] = res[0] & x[i]
    return res

numba_careduce_axis = numba.njit(careduce_axis)

x = np.random.random(10)

careduce_axis(x) # Works
numba_careduce_axis(x) # Doesn't work 
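
A possible workaround (an assumption on my part, not verified against this exact Numba version): Numba tends to reject the deprecated Python-level np.bool alias in astype/np.full, while the NumPy scalar type np.bool_ is supported, and the cast can be done via a comparison instead of astype:

import numba
import numpy as np

@numba.njit
def careduce_axis_bool(x):
    res = np.full(1, True, dtype=np.bool_)  # np.bool_ instead of np.bool
    for i in range(x.shape[0]):
        res[0] = res[0] & (x[i] != 0)  # the comparison performs the boolean cast
    return res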

@kc611 kc611 marked this pull request as ready for review December 22, 2021 14:07
Contributor Author

kc611 commented Jan 1, 2022

Is there anything further that needs to be done in this PR?

Member

@brandonwillard brandonwillard left a comment


Looks good!

Let's run some quick benchmarks on those new CAReduce implementations and make sure they're better. Actually, we might want to add unit tests for this (i.e. tests that make sure we're within an acceptable margin of NumPy-like performance).

if not isinstance(i, (SharedVariable, Constant))
],
numpy_inputs,
performace_factor=4,
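
For context, a rough sketch of the kind of performance-margin check being discussed; the helper name and signature here are illustrative, not Aesara's actual test utility (note the snippet above spells the parameter performace_factor):

import timeit

def assert_within_performance_factor(numba_fn, numpy_fn, args, factor=4.0):
    """Assert the compiled function is within `factor` of NumPy's runtime."""
    numba_fn(*args)  # trigger JIT compilation before timing
    numba_t = min(timeit.repeat(lambda: numba_fn(*args), number=100, repeat=5))
    numpy_t = min(timeit.repeat(lambda: numpy_fn(*args), number=100, repeat=5))
    assert numba_t <= factor * numpy_t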
Contributor Author

@kc611 kc611 Jan 3, 2022


Looks like the performance isn't that good with this implementation either.

(I suspect it might have something to do with the dtype casting we're doing at the end of CAReduce functions in Numba)

Member


We might need to pre-compile those CAReduce loops using use_optimized_cheap_pass.

Contributor Author


I just pushed changes adding that, but it doesn't look like it's making any difference in performance. Can you confirm whether it's correctly implemented (with respect to the exec interface)?

Member


The context manager needs to be used when Numba JIT compiles a function, but it looks like it's been added to the Python source string compilation utility.
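
A sketch of the intended placement (py_fn and the input here are illustrative placeholders; use_optimized_cheap_pass is the context manager shown in the snippet below):

import numba
import numpy as np

def py_fn(x):
    # stand-in for a generated CAReduce loop
    return x.sum()

with use_optimized_cheap_pass():
    compiled_fn = numba.njit(py_fn)
    compiled_fn(np.ones(10))  # first call triggers JIT compilation under the swapped pass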

Comment on lines 580 to 596
@contextmanager
def use_optimized_cheap_pass(*args, **kwargs):
    """Temporarily replace the cheap optimization pass with a better one."""
    from numba.core.registry import cpu_target

    context = cpu_target.target_context._internal_codegen
    old_pm = context._mpm_cheap
    new_pm = context._module_pass_manager(
        loop_vectorize=True, slp_vectorize=True, opt=3, cost="cheap"
    )
    context._mpm_cheap = new_pm
    try:
        yield
    finally:
        context._mpm_cheap = old_pm


Member


This is a general utility module for linking, so we can't put this Numba-specific code here.

Comment on lines 609 to 613
global_env["use_optimized_cheap_pass"] = use_optimized_cheap_pass
src = f"""
with use_optimized_cheap_pass():
{indent(src, " " * 4)}
"""
Member


Same here; this function is supposed to be applicable to any string-to-Python-code conversion, so we can't add Numba-specific code to it.

@brandonwillard brandonwillard force-pushed the faster_careduce branch 6 times, most recently from 629f638 to 73fe17c Compare January 18, 2022 20:29
@brandonwillard brandonwillard changed the title Initialized a new careduce_axis implementation Improve the performance of Numba-compiled array reduction Ops Jan 18, 2022
@brandonwillard brandonwillard force-pushed the faster_careduce branch 7 times, most recently from 2185084 to 7574fb2 Compare January 18, 2022 22:14
Member

@brandonwillard brandonwillard left a comment


I've made some changes that should produce reduction functions similar to the best performing (and contextually suitable) implementation in the case study.

The timing comparison between the Aesara and NumPy versions isn't appropriate, though, so it's been set to xfail for the moment. The reason is that at.max produces MaxAndArgmax Ops, which, as the name implies, compute both the max and the arg-max, while NumPy computes only one at a time.

More importantly, our current Numba translation for MaxAndArgmax literally computes both operations separately, so it will necessarily be much more expensive than a single NumPy max or arg-max.

We need to address this discrepancy as soon as possible, and definitely before merging this PR.
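
To see why, compare a fused single pass with two separate reductions; this 1-D sketch is illustrative only, not the PR's code:

import numba
import numpy as np

# One pass over the data yields both the max and the arg-max, whereas
# the current translation effectively runs np.max(x) and np.argmax(x)
# as two separate loops.
@numba.njit
def fused_max_and_argmax(x):
    best = x[0]
    best_idx = 0
    for i in range(1, x.shape[0]):
        if x[i] > best:
            best = x[i]
            best_idx = i
    return best, best_idx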

@brandonwillard brandonwillard force-pushed the faster_careduce branch 2 times, most recently from 3f08067 to d204db5 Compare January 19, 2022 00:37
Member

brandonwillard commented Jan 19, 2022

I changed the benchmark test so that it necessarily uses only the Max Op. I also created a follow-up issue about cleaning up the MaxAndArgmax Op situation (i.e. #765).

The results of benchmarking only the Max Op are within the same range of Numba vs. NumPy timing values as the case study Gist, so the results are acceptable.

Member

@brandonwillard brandonwillard left a comment


OK, I think we're good on this. There are some uncovered dispatch functions, but they don't have CAReduce implementations, so they won't actually be used. They'll be covered when/if there's ever a need to implement CAReduce Ops for those Scalar Ops.

Labels
enhancement New feature or request important Numba Involves Numba transpilation

Successfully merging this pull request may close these issues.

Fix Numba performance issues involving CAReduce Ops