BUG: Use correct ExtensionArray reductions in DataFrame reductions #35254

jbrockmendel · 2020-07-12T22:48:34Z

closes BUG: Mixed DataFrame with Extension Array incorrect aggregation #34520
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jreback · 2020-07-12T23:31:21Z

pandas/core/frame.py

+        def func(values):
+            if is_extension_array_dtype(values.dtype):
+                return extract_array(values)._reduce(name, skipna=skipna, **kwds)
+            else:


can we not use the exact same path for non-extension arrays?

these if/then for extension arrays are terrible

we'd have to put the analogous check in each of the nanops.nanfoo functions, which I know @jorisvandenbossche wouldn't like.

is func now effectively the same as blk_func on L8575?

Indeed, doing it here is IMO much cleaner than in each nanops function.

is func now effectively the same as blk_func on L8575?

Not exactly, the axis handling is different (also blk_func knows it gets an array, so the extract_array stuff is not needed).

I guess this is ok as a short-term check, but seeing these constant "if is_extension_array do one thing, else do something else" branches basically means the API is missing lots of things. I would much rather fix it than keep doing this kind of thing. This makes code extremely hard to read and understand.

I remember being very excited when we removed the is_ndarray check here and this became much simpler; now we are going backwards.
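
For readers following the thread, the branch under discussion can be sketched roughly as below: a named reduction is dispatched to the array's own `_reduce` when the values have an extension dtype, and otherwise falls back to a nan-aware NumPy function. `reduce_values` is a hypothetical helper, and the `np.nan*` functions are only a stand-in for pandas' internal `nanops` routines; this is not the code from the PR.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype


def reduce_values(values, name, skipna=True):
    """Hypothetical helper: reduce a 1-D array with the reduction called `name`."""
    if is_extension_array_dtype(values.dtype):
        # Extension arrays implement their own reductions via `_reduce`,
        # so missing values are handled by the array itself.
        return values._reduce(name, skipna=skipna)
    # Stand-in for pandas' internal nanops functions (an assumption of this sketch).
    func = getattr(np, "nan" + name) if skipna else getattr(np, name)
    return func(values)


# Masked integer array: the missing value is skipped by the array's own sum.
ea_total = reduce_values(pd.array([1, 2, None], dtype="Int64"), "sum")
# Plain float ndarray: NaN is skipped by the NumPy fallback.
np_total = reduce_values(np.array([1.0, np.nan, 2.0]), "sum")
```

The point of contention in the thread is where this `if`/`else` should live, not whether the two code paths are needed.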

jorisvandenbossche

Thanks, this might be a path to a short-term compromise ;)

One problem is that frame_apply path is only taken for if not self._is_homogeneous_type and filter_type is None: ....
So that means e.g. a DataFrame with only columns of the same EA dtype will still take the incorrect path. So maybe we can edit that condition to include something like or self._any_extension_type.

I am also not sure why the filter_type="bool" functions (any/all) cannot take this path. That might give problems for the "boolean" dtype.
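
The effect of the proposed condition change can be illustrated with public API only. In this sketch `takes_columnwise_path` is a hypothetical function that mirrors the condition quoted above; `_is_homogeneous_type` is approximated by counting distinct dtypes, and the `include_ea_check` flag toggles the suggested "any extension type" clause. It is an approximation for illustration, not pandas' internal logic.

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype


def takes_columnwise_path(df, filter_type=None, include_ea_check=True):
    """Hypothetical mirror of the condition discussed above (public API only)."""
    # Approximation of the private `_is_homogeneous_type` property.
    is_homogeneous = len(set(df.dtypes)) <= 1
    any_ea = any(is_extension_array_dtype(dtype) for dtype in df.dtypes)
    mixed_or_ea = (not is_homogeneous) or (include_ea_check and any_ea)
    return mixed_or_ea and filter_type is None


# All columns share one extension dtype: homogeneous, but EA-backed.
ea_only = pd.DataFrame(
    {
        "a": pd.array([1, None], dtype="Int64"),
        "b": pd.array([3, 4], dtype="Int64"),
    }
)
# Adding a float64 column makes the frame non-homogeneous.
mixed = ea_only.assign(c=[0.5, 1.5])

without_check = takes_columnwise_path(ea_only, include_ea_check=False)
with_check = takes_columnwise_path(ea_only)
```

Under these assumptions the EA-only frame is skipped by the original condition and picked up once the extra clause is added, while `filter_type="bool"` still bypasses the path either way.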

simonjayhawkins · 2020-07-13T12:26:49Z

So maybe we can edit that condition to include something like or self._any_extension_type.

using self._mgr.any_extension_types also fixes #32651 (but not checked to see if anything breaks)
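
Why a silent cast to float64 matters for #32651 can be shown with an integer that float64 cannot represent exactly. The snippet below is a small sketch, assuming a pandas version that includes this fix; the column name and values are made up for illustration.

```python
import pandas as pd

big = 2**53 + 1  # smallest positive integer not exactly representable as float64
df = pd.DataFrame({"a": pd.array([1, big], dtype="Int64")})

col_max = df["a"].max()    # Series reduction on the masked integer array
frame_max = df.max()["a"]  # DataFrame reduction over the same column
lossy = int(float(big))    # what a round trip through float64 would produce
```

If the DataFrame reduction went through float64, `frame_max` would come back as `lossy` rather than `big`.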

jbrockmendel · 2020-07-14T20:26:30Z

One problem is that frame_apply path is only taken for if not self._is_homogeneous_type and filter_type is None: ....

Yah, the longer-term solution I have in mind is #34714 that all cases will go through. Is tinkering with the is_homogeneous_type check necessary for this PR, or can we punt on that until after 1.1?

jorisvandenbossche · 2020-07-15T12:50:54Z

@jbrockmendel hope you don't mind, but I cherry-picked some commits with tests from me and Simon from the other PR onto this branch. Depending on whether we do the any_extension_type thing, we might need to xfail them, though

Is tinkering with the is_homogeneous_type check necessary for this PR, or can we punt on that until after 1.1?

I would personally prefer to do this here as well, as otherwise you get a rather surprising inconsistency in which underlying operation is actually run for DataFrames with EAs, depending on whether other dtypes are present or not.

I pushed the tiny change for that as well in the last commit, to see what CI says, but we can easily undo that again.
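
The consistency concern above can be checked with a small comparison: the same Int64 column is reduced once in a frame by itself and once next to a float64 column. This is an illustrative sketch with made-up data, not one of the tests added in this PR.

```python
import pandas as pd

ints = pd.array([1, 2, None], dtype="Int64")
ea_only = pd.DataFrame({"a": ints})
mixed = pd.DataFrame({"a": ints, "b": [0.5, 1.5, 2.5]})

# For each reduction, pair the result for column "a" from both frames.
results = {
    op: (getattr(ea_only, op)()["a"], getattr(mixed, op)()["a"])
    for op in ("sum", "min", "max")
}
```

If both frames go through the same underlying operation, each pair in `results` should compare equal regardless of which other dtypes are present.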

jorisvandenbossche · 2020-07-15T13:43:34Z

The one failure on the py37_np18 build seems to be unrelated .. Have we seen that elsewhere as well?

simonjayhawkins · 2020-07-15T14:50:06Z

The one failure on the py37_np18 build seems to be unrelated .. Have we seen that elsewhere as well?

yep, although AFAIK we don't have an issue for this. see #34906 (comment) and #35086 (comment)

if i see this failure on a PR, I normally just re-run.

TomAugspurger

LGTM, thanks.

simonjayhawkins · 2020-07-15T16:20:26Z

needs whatsnew for #32651 (can copy from #34210)

jorisvandenbossche · 2020-07-15T18:14:09Z

@simonjayhawkins I think the general whatsnew note added by Brock already covers that as well (can maybe just add the issue number)

TomAugspurger · 2020-07-15T19:03:43Z

Merging in a few hours if no objections (cc @jreback, I think you're the only one without an explicit +1)

jbrockmendel · 2020-07-15T21:48:29Z

@jorisvandenbossche your commits look good, thanks for the heads up

jreback · 2020-07-15T22:18:05Z

thanks @jbrockmendel

BUG: df.sum with Int64 dtype

9b84158

jreback requested changes Jul 12, 2020

jorisvandenbossche reviewed Jul 13, 2020

Merge branch 'master' of https://github.com/pandas-dev/pandas into bu…

15ab7d7

…g-_reduce

jbrockmendel and others added 5 commits July 14, 2020 13:28

whatnsew

c89da43

add test case for GH34520, copied from GH35112

261fa32

Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>

add test to ensure EA op is used for integer array

9ee9669

add test for GH32651, copied from GH34210

390b9bb

Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>

remove now duplicated test

312cb9c

jorisvandenbossche changed the title ~~BUG: df.sum with Int64 dtype~~ BUG: Use correct ExtensionArray reductions in DataFrame reductions Jul 15, 2020

add self._mgr.any_extension_types check

0f33353

jorisvandenbossche added this to the 1.1 milestone Jul 15, 2020

This was referenced Jul 15, 2020

ENH/PERF: enable column-wise reductions for EA-backed columns #32867

Closed

REGR: setting column with setitem should not modify existing array inplace #35266

Closed

TomAugspurger approved these changes Jul 15, 2020

TomAugspurger added the Numeric Operations (Arithmetic, Comparison, and Logical operations) and ExtensionArray (Extending pandas with custom dtypes or arrays) labels Jul 15, 2020

add issue number to whatsnew

babedb9

simonjayhawkins approved these changes Jul 15, 2020

jreback approved these changes Jul 15, 2020

jreback merged commit 595208b into pandas-dev:master Jul 15, 2020

jbrockmendel deleted the bug-_reduce branch July 15, 2020 22:44

simonjayhawkins mentioned this pull request Jul 16, 2020

DataFrame with Int64 columns casts to float64 with .max()/.min() #32651

Closed

fangchenli pushed a commit to fangchenli/pandas that referenced this pull request Jul 16, 2020

BUG: Use correct ExtensionArray reductions in DataFrame reductions (p…

3de5714

…andas-dev#35254)

Inline review comments, in order: jreback (Jul 12, 2020), jbrockmendel (Jul 13, 2020), simonjayhawkins (Jul 13, 2020), jorisvandenbossche (Jul 13, 2020), jreback (Jul 14, 2020)
