ENH/PERF: enable column-wise reductions for EA-backed columns #32867
Conversation
Come to think of it, the place where this dispatch belongs may be in the relevant nanops functions
Personally, I prefer the nanops to be about ops on numpy arrays, and not deal with extension arrays
pandas/core/frame.py (Outdated)

```
@@ -7898,6 +7915,19 @@ def _get_data(axis_matters):
                raise NotImplementedError(msg)
            return data

        def blk_func(values):
            if isinstance(values, ExtensionArray):
```
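The hunk is cut off above. As a rough sketch of the dispatch it introduces (the reduction name and skipna arguments are assumed here to be passed in from the surrounding machinery, which is not shown):

```python
import numpy as np
from pandas.api.extensions import ExtensionArray

def blk_func(values, name="sum", skipna=True):
    # Sketch only: EA-backed block values are reduced through the array's own
    # _reduce; plain numpy values would go through the usual nanops path
    # (simplified here to a bare numpy call).
    if isinstance(values, ExtensionArray):
        return values._reduce(name, skipna=skipna)
    return np.nansum(values) if skipna else values.sum()
```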
with this inside blk_func, shouldn't the block-wise operation have the same performance bump as the column-wise?
> with this inside blk_func, shouldn't the block-wise operation have the same performance bump as the column-wise?
It didn't actually change anything performance-wise (it's the same function being called as before).
The reason that both paths have different performance is that re-assembling the results into a Series is more expensive for the block-wise path than for the column-wise one.
(It's possible that the block-wise way could be optimized to get rid of this difference, though. The main thing is that the block results are not in the original column order.)
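To make that ordering issue concrete, a small illustration (the `_mgr` and block attributes are pandas internals and may change between versions; they are only used here to peek at the block layout):

```python
import pandas as pd

# Columns are grouped into blocks by dtype, so block-wise results come back
# grouped (here: the float columns a and c together, then the int column b)
# rather than in the original column order a, b, c.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [1, 2], "c": [3.0, 4.0]})
for blk in df._mgr.blocks:
    print(blk.dtype, list(df.columns[blk.mgr_locs.as_array]))
```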
> (It's possible that the block-wise way could be optimized to get rid of this difference, though. The main thing is that the block results are not in the original column order.)
This would be really nice.
So the bigger question is: how do we get to use this by default (at least for EAs)? Some ideas:
I've got a branch that identifies the cases where frame_apply is used and does that before the .values call, trying to 1) avoid a .values call and 2) de-nest _reduce. I'll move that branch up the priority list.
That depends on how public those functions are. ATM their docstrings have examples with Series; not sure if we have other docs or tests with those.
Note that my suggestion was to eliminate this part entirely (by using block-wise for all). So let's ensure we don't do duplicate / conflicting work. Does the branch already have something? (can you maybe push it to your fork?)
nanops are not public (regardless of their docstrings). What I mainly meant is that (in my head) they are meant to work on numpy arrays (whether extracted from a Series first or not).
https://github.com/jbrockmendel/pandas/tree/cln-reduce
This would require making
Thanks!
Yes, but as long as we keep the
Hmm, block-wise and column-wise with ignore_failures won't necessarily be equivalent for object dtype.
Ah, that's a good point. So for ObjectBlock, we would still need to do it column-wise if we want to get rid of the fallback in general. Will take a further look one of the coming days to see if that looks feasible.
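A toy illustration of the difference (hypothetical data, not from this PR): with object dtype, a column-wise loop can drop just the failing column, whereas a consolidated object block succeeds or fails as a whole.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3],                       # object ints, sum works
        "b": [{"x": 1}, {"x": 2}, {"x": 3}],  # dicts, sum raises TypeError
    },
    dtype=object,
)

# Column-wise with failures ignored: only 'b' is dropped.
results = {}
for col in df.columns:
    try:
        results[col] = df[col].sum()
    except TypeError:
        pass
print(pd.Series(results))  # a    6

# Block-wise, 'a' and 'b' live in the same object block, so ignoring a
# failure there would drop the whole block, including 'a'.
```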
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
Do you have something more concrete as feedback? I don't find this particularly complex. It adds some more code, for sure, to ensure we perform the reductions properly column-wise with the correct operation (which fixes bugs). Note that there was still code that could be removed (so I was actually replacing something, not only adding). I removed that now to make this clearer (there was a comment about it).
A suggestion for the short term (i.e. to address #35112) is to change the inline-defined … in a dedicated PR. I think that would address a subset of what this PR is doing and is orthogonal to the rest of it.
this needs to wait for 1.2
moving to 1.2
I am fine with that, if we then include #35254 instead
@jorisvandenbossche I think #36076 has a bearing on this, thoughts?
But in current behavior, a similar thing happens when the dtype is object:

```python
>>> ddf = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'], dtype=object)
>>> ddf
   a  b  c
0  1  2  3
>>> ddf.sum()
a    1.0
b    2.0
c    3.0
dtype: float64
>>> ddf.dtypes
a    object
b    object
c    object
dtype: object
>>> ddf['a'].sum()
1
>>> type(_)
<class 'int'>
```
@Dr-Irv ah, thanks, that's something we should then also test. And another thing to look into ;)
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
@jorisvandenbossche I think this is closable, as it really isn't actionable until we get the #36076-and-similar inconsistencies fixed, and at that point we'll be able to go all block-wise (which for ArrayManager will be column-wise anyway).
looks ok, but not for 1.2
@jorisvandenbossche does this still have a perf impact, given that we don't go through .values anymore for axis=0? Or is the perf impact all in re-assembling block-wise results? We now always use EA._reduce, I think, so that part of the motivation should no longer be relevant. As @jreback referred to in his previous comment, we've gone to a lot of trouble to simplify DataFrame._reduce, and I'm wary of re-complexifying it.
@jorisvandenbossche pls update or close
is this still relevant?
@jorisvandenbossche closing as stale. reopen when ready.
Currently, for reductions on a DataFrame, we convert the full DataFrame to a single "interleaved" array and then perform the operation. That's the default, but when `numeric_only=True` is specified, it is done block-wise.
Enabling column-wise reductions (or block-wise for EAs):

For illustration purposes, I added a `column_wise` keyword in this PR (not meant to keep this, just for testing), so we can compare a few cases.

So I experimented with two approaches:

1. Doing the reduction column-wise, calling `_reduce` of the underlying EA (this path gets taken by using the temporary keyword `column_wise=True`)
2. Fixing the existing block-wise path (the one taken with `numeric_only=True`) by changing that to also use `_reduce` of the EA (currently this was failing by calling nanops functions on the EA)

The first gives better performance (it is simpler in implementation by not involving the blocks), but requires some more new code (it makes less use of the existing machinery).
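As a rough sketch of what the first approach boils down to at the frame level (the helper below is illustrative only, not the PR's actual code):

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype

def reduce_columnwise(df: pd.DataFrame, name: str = "sum", skipna: bool = True) -> pd.Series:
    # Reduce each column separately: EA-backed columns use their own _reduce,
    # numpy-backed columns go through the regular Series reduction.
    results = {}
    for col in df.columns:
        if isinstance(df[col].dtype, ExtensionDtype):
            results[col] = df[col].array._reduce(name, skipna=skipna)
        else:
            results[col] = getattr(df[col], name)(skipna=skipna)
    return pd.Series(results)

print(reduce_columnwise(pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int64"),
                                      "b": [1.5, 2.5, 3.5]})))
```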
Ideally, for EA columns, we should always use their own reduction implementation (thus call `EA._reduce`), I think. So for both approaches, the question will be how to trigger this behaviour.

Closes #32651, closes #34520