
Improve the performance of Numba-compiled array reduction Ops #599

Merged
merged 4 commits into from
Jan 20, 2022

Conversation

Contributor

@kc611 kc611 commented Sep 28, 2021

Linked issue: #529

This PR replaces our use of vectorize in numba_funcify_CAReduce and removes the harmful np.transpose (more details: #404 (comment)) from the JIT-ed function.

A performance comparison for this implementation: https://gist.github.com/kc611/7cc0451c6b5b53c9ce46fdc57c6a0da1

@kc611 kc611 marked this pull request as draft September 28, 2021 14:52

codecov bot commented Sep 29, 2021

Codecov Report

Merging #599 (7574fb2) into main (6fe9f83) will increase coverage by 0.00%.
The diff coverage is 94.02%.

❗ Current head 7574fb2 differs from pull request most recent head b4c8fb0. Consider uploading reports for the commit b4c8fb0 to get more accurate results

@@           Coverage Diff           @@
##             main     #599   +/-   ##
=======================================
  Coverage   78.36%   78.36%           
=======================================
  Files         152      152           
  Lines       47693    47742   +49     
  Branches    10880    10879    -1     
=======================================
+ Hits        37373    37415   +42     
- Misses       7772     7778    +6     
- Partials     2548     2549    +1     
Impacted Files Coverage Δ
aesara/link/numba/dispatch/elemwise.py 96.00% <93.27%> (-1.64%) ⬇️
aesara/link/numba/dispatch/basic.py 92.01% <100.00%> (+0.22%) ⬆️
aesara/tensor/basic_opt.py 85.14% <0.00%> (-0.08%) ⬇️
aesara/link/numba/dispatch/extra_ops.py 98.09% <0.00%> (-0.06%) ⬇️

@brandonwillard brandonwillard linked an issue Nov 8, 2021 that may be closed by this pull request
@brandonwillard brandonwillard added the Numba Involves Numba transpilation label Nov 23, 2021
Member

@brandonwillard brandonwillard left a comment


We need to combine this with the custom vectorize implementation described in #595. In other words, instead of using numba.vectorize, we need to generate our own for-loop, like the example here.

That, combined with a custom reduce loop (e.g. this function in the above example), should get us near-NumPy performance via LLVM.

As the example shows, the timings aren't good for axis=1 because the simple all-purpose loop doesn't work well after transposing the array. The transpose + loop produces a bad memory access pattern, but that can be solved by generating different loops based on the (literal) axis value.
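
For illustration, here is a minimal sketch of what axis-specialized loops could look like for a 2D max reduction in Numba. This is not the PR's code, and the function names are made up; it only demonstrates the idea of keeping the inner loop on contiguous memory per (literal) axis value:

import numba
import numpy as np

# Hypothetical axis=0 loop: the inner loop sweeps each row, so reads
# and writes both stay contiguous for C-ordered arrays.
@numba.njit
def max_axis_0(x):
    res = x[0].copy()
    for i in range(1, x.shape[0]):
        for j in range(x.shape[1]):
            if x[i, j] > res[j]:
                res[j] = x[i, j]
    return res

# Hypothetical axis=1 loop: accumulate each row's maximum in a scalar
# instead of transposing the array first.
@numba.njit
def max_axis_1(x):
    res = np.empty(x.shape[0], dtype=x.dtype)
    for i in range(x.shape[0]):
        acc = x[i, 0]
        for j in range(1, x.shape[1]):
            if x[i, j] > acc:
                acc = x[i, j]
        res[i] = acc
    return res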

Contributor Author

kc611 commented Nov 23, 2021

Yeah, so if you have a look at create_elemwise_reducer, we actually do build custom for loops for the kind of input array we're given. Back when I built this, I made it specifically for reduce- or at-style functions, so it's basically a custom vectorize implementation minus all the broadcasting logic for multiple inputs.

Anyway, the broadcasting logic can surely be added, but the main issue here (I think @fanshi118 and I discussed this a bit at the time) is that even with custom loops built entirely in Numba, we weren't approaching NumPy performance as we scaled up the arrays.

For reference, this is the gist that we benchmarked:
https://gist.github.com/kc611/22760dca36cc9062a401da89ee60ced8
(Updated it a bit to remove the parallelization stuff that we added for extra performance)

@brandonwillard
Member

> Anyway, the broadcasting logic can surely be added, but the main issue here (I think @fanshi118 and I discussed this a bit at the time) is that even with custom loops built entirely in Numba, we weren't approaching NumPy performance as we scaled up the arrays.
>
> For reference, this is the gist that we benchmarked: https://gist.github.com/kc611/22760dca36cc9062a401da89ee60ced8 (updated a bit to remove the parallelization stuff that we added for extra performance)

I'm not sure I understand that Gist, but the example Gist in my earlier comment appears to demonstrate NumPy-comparable results for at least axis=0.

We should start by extending that example with a simple handwritten Numba implementation for axis=1, one that performs at least as well as NumPy, of course. If that's not possible, we should confirm as much and report it to the Numba folks.

After that, we can consider how to generalize the two handwritten implementations.

Contributor Author

kc611 commented Nov 23, 2021

I assume that by a handwritten Numba implementation you mean a for loop in which the indices of the inner loop are manually added. For a two-dimensional array with axis=1, that'd be a[i][j], where j is the inner loop variable and i is the outer loop variable.

If so, then the gist I mentioned above does exactly that: it generalizes the outer loops of the vectorization using np.ndindex and the inner-most loop using a simple for loop. In fact, it's more or less how NumPy itself does the generalization, except that we use a string-based function generator.

We even have a handwritten gist for fixed two- and three-dimensional arrays, on which we tested all of this before generalizing it. It won't make much of a difference, though, since the functions generated in both cases are exactly the same.

Link to fixed dimension loop gist: https://gist.github.com/kc611/7cc0451c6b5b53c9ce46fdc57c6a0da1
(Updated to remove the indexing logic in favor of np.ndindex)
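
To make the described structure concrete, here is a minimal hand-written sketch, assuming a float array and a max reduction over the last axis (the function name is illustrative, not from the gist): np.ndindex drives the outer loops over the result shape, and a plain for loop performs the inner reduction.

import numba
import numpy as np

@numba.njit
def max_over_last_axis(x):
    res = np.full(x.shape[:-1], -np.inf)
    for idx in np.ndindex(x.shape[:-1]):
        # Inner-most loop: a simple reduction over the last axis.
        for i in range(x.shape[-1]):
            if x[idx + (i,)] > res[idx]:
                res[idx] = x[idx + (i,)]
    return res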

Member

brandonwillard commented Nov 23, 2021

I've updated the np.maximum.reduce example Gist to include all the approaches we've considered.

> If so, then the gist I mentioned above does exactly that: it generalizes the outer loops of the vectorization using np.ndindex and the inner-most loop using a simple for loop.

Your Gist includes some generalizations that are only important to us; I was talking about a very simple Gist that focuses only on the very narrow np.maximum.reduce with axis=1 case and includes all the timings/relative benchmarks, so that it's easier for others (e.g. the Numba folks) to follow.

@brandonwillard
Member

To move forward, we might need to start implementing custom conversions for specific CAReduce Ops and—possibly—axis values. We already have an implementation for max, so we'll need to time the others like we did for that, and find acceptable implementations for those, as well.

By the way, I've updated the max example Gist; it now contains some helpful debug print-outs of the CFG plots for the generated LLVM IR.

Those plots seem to imply that the slow version performs unnecessary writes to the output array (i.e. res[i] = res[i] when res[i] >= x[i, j]). Apparently the unnecessary else clause in custom_max isn't being removed during optimization.

Anyway, we can get past this pretty easily with a custom implementation.
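
To make the write pattern concrete, here is a hedged sketch (illustrative names, not the Gist's exact code) contrasting the two store patterns:

import numba

# Mirrors the redundant-store pattern visible in the CFG plots: the
# else branch re-stores res[i] on every iteration.
@numba.njit
def row_max_with_else(x, res):
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if x[i, j] > res[i]:
                res[i] = x[i, j]
            else:
                res[i] = res[i]  # redundant write that LLVM isn't removing
    return res

# A custom implementation simply drops the else clause, writing only
# when the running maximum actually changes.
@numba.njit
def row_max_guarded(x, res):
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if x[i, j] > res[i]:
                res[i] = x[i, j]
    return res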

Contributor Author

kc611 commented Dec 22, 2021

The currently failing test seems to be caused by an issue with boolean typecasting in Numba:

import numpy as np
import numba

def careduce_axis(x):
    res_shape = 1
    res = np.full(res_shape, True, dtype=np.bool)
    x = x.astype(np.bool)
    for idx_arr in np.ndindex(res_shape):
        for i in range(x.shape[0]):
            res[0] = res[0] & x[i]
    return res

numba_careduce_axis = numba.njit(careduce_axis)

x = np.random.random(10)

careduce_axis(x) # Works
numba_careduce_axis(x) # Doesn't work 
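
A possible workaround (an assumption on my part, not verified against this exact Numba version): Numba tends to reject the deprecated Python-level np.bool alias in astype/np.full, while the NumPy scalar type np.bool_ is supported, and the cast can be done via a comparison instead of astype:

import numba
import numpy as np

@numba.njit
def careduce_axis_bool(x):
    res = np.full(1, True, dtype=np.bool_)  # np.bool_ instead of np.bool
    for i in range(x.shape[0]):
        res[0] = res[0] & (x[i] != 0)  # the comparison performs the boolean cast
    return res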

@kc611 kc611 marked this pull request as ready for review December 22, 2021 14:07
Contributor Author

kc611 commented Jan 1, 2022

Is there anything further that needs to be done in this PR?

Member

@brandonwillard brandonwillard left a comment


Looks good!

Let's run some quick benchmarks on those new CAReduce implementations and make sure they're better. Actually, we might want to add unit tests for this (i.e. tests that make sure we're within an acceptable margin of NumPy-like performance).

if not isinstance(i, (SharedVariable, Constant))
],
numpy_inputs,
performace_factor=4,
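
For context, a rough sketch of the kind of performance-margin check being discussed; the helper name and signature here are illustrative, not Aesara's actual test utility (note the snippet above spells the parameter performace_factor):

import timeit

def assert_within_performance_factor(numba_fn, numpy_fn, args, factor=4.0):
    """Assert the compiled function is within `factor` of NumPy's runtime."""
    numba_fn(*args)  # trigger JIT compilation before timing
    numba_t = min(timeit.repeat(lambda: numba_fn(*args), number=100, repeat=5))
    numpy_t = min(timeit.repeat(lambda: numpy_fn(*args), number=100, repeat=5))
    assert numba_t <= factor * numpy_t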
Contributor Author

@kc611 kc611 Jan 3, 2022


Looks like the performance isn't that good with this implementation either.

(I suspect it might have something to do with the dtype casting we're doing at the end of CAReduce functions in Numba)

Member


We might need to pre-compile those CAReduce loops using use_optimized_cheap_pass.

Contributor Author


I just pushed changes adding that, but it doesn't look like it's making any difference in performance. Can you confirm whether it's correctly implemented (with respect to the exec interface)?

Member


The context manager needs to be used when Numba JIT compiles a function, but it looks like it's been added to the Python source string compilation utility.
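
A sketch of the intended placement (py_fn and the input here are illustrative placeholders; use_optimized_cheap_pass is the context manager shown in the snippet below):

import numba
import numpy as np

def py_fn(x):
    # stand-in for a generated CAReduce loop
    return x.sum()

with use_optimized_cheap_pass():
    compiled_fn = numba.njit(py_fn)
    compiled_fn(np.ones(10))  # first call triggers JIT compilation under the swapped pass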

Comment on lines 580 to 596
@contextmanager
def use_optimized_cheap_pass(*args, **kwargs):
    """Temporarily replace the cheap optimization pass with a better one."""
    from numba.core.registry import cpu_target

    context = cpu_target.target_context._internal_codegen
    old_pm = context._mpm_cheap
    new_pm = context._module_pass_manager(
        loop_vectorize=True, slp_vectorize=True, opt=3, cost="cheap"
    )
    context._mpm_cheap = new_pm
    try:
        yield
    finally:
        context._mpm_cheap = old_pm


Member


This is a general utility module for linking, so we can't put this Numba-specific code here.

Comment on lines 609 to 613
global_env["use_optimized_cheap_pass"] = use_optimized_cheap_pass
src = f"""
with use_optimized_cheap_pass():
{indent(src, " " * 4)}
"""
Member


Same here; this function is supposed to be applicable to any string-to-Python-code conversion, so we can't add Numba-specific code to it.

@brandonwillard brandonwillard force-pushed the faster_careduce branch 6 times, most recently from 629f638 to 73fe17c Compare January 18, 2022 20:29
@brandonwillard brandonwillard changed the title Initialized a new careduce_axis implementation Improve the performance of Numba-compiled array reduction Ops Jan 18, 2022
@brandonwillard brandonwillard force-pushed the faster_careduce branch 7 times, most recently from 2185084 to 7574fb2 Compare January 18, 2022 22:14
Member

@brandonwillard brandonwillard left a comment


I've made some changes that should produce reduction functions similar to the best performing (and contextually suitable) implementation in the case study.

The timing comparison between the Aesara and NumPy versions isn't appropriate, though, so it's been set to xfail for the moment. The reason is that at.max produces MaxAndArgmax Ops, which, as the name implies, compute both the max and the arg-max, while NumPy computes only one at a time.

More importantly, our current Numba translation for MaxAndArgmax literally computes both operations separately, so it will necessarily be much more expensive than a single NumPy max or arg-max.

We need to address this discrepancy as soon as possible, and definitely before merging this PR.
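
To see why, compare a fused single pass with two separate reductions; this 1-D sketch is illustrative only, not the PR's code:

import numba
import numpy as np

# One pass over the data yields both the max and the arg-max, whereas
# the current translation effectively runs np.max(x) and np.argmax(x)
# as two separate loops.
@numba.njit
def fused_max_and_argmax(x):
    best = x[0]
    best_idx = 0
    for i in range(1, x.shape[0]):
        if x[i] > best:
            best = x[i]
            best_idx = i
    return best, best_idx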

@brandonwillard brandonwillard force-pushed the faster_careduce branch 2 times, most recently from 3f08067 to d204db5 Compare January 19, 2022 00:37
Member

brandonwillard commented Jan 19, 2022

I changed the benchmark test so that it necessarily uses only the Max Op. I also created a follow-up issue about cleaning up the MaxAndArgmax Op situation (i.e. #765).

The results of benchmarking only the Max Op are within the same range of Numba vs. NumPy timing values as the case study Gist, so the results are acceptable.

Member

@brandonwillard brandonwillard left a comment


OK, I think we're good on this. There are some uncovered dispatch functions, but they don't have CAReduce implementations, so they won't actually be used. They'll be covered when/if there's ever a need to implement CAReduce Ops for those Scalar Ops.

Labels
enhancement New feature or request important Numba Involves Numba transpilation

Successfully merging this pull request may close these issues.

Fix Numba performance issues involving CAReduce Ops