
BUG: inconsistent groupby.apply behaviour depending on column dtypes #43206

Closed
od-crypto opened this issue Aug 25, 2021 · 28 comments
Labels: Apply (Apply, Aggregate, Transform, Map) · Bug · Groupby · Regression (Functionality that used to work in a prior pandas version)
Milestone

Comments

@od-crypto

I am seeing inconsistent groupby.apply behaviour: the resulting index depends on whether the frame contains a Timestamp-dtype column. MWE:

import pandas as pd
df = pd.DataFrame({
    'date_val': {
        0: pd.Timestamp('2017-01-01 00:00:00'),
        1: pd.Timestamp('2017-01-01 00:00:00'),
        2: pd.Timestamp('2017-02-01 00:00:00'),
        3: pd.Timestamp('2017-02-01 00:00:00')
    },
    'uid': {
        0: '1',
        1: '2',
        2: '1',
        3: '2'
    },
    'str_val': {
        0: '2017-01-01 00:00:00',
        1: '2017-01-01 00:00:00',
        2: '2017-02-01 00:00:00',
        3: '2017-02-01 00:00:00'
    }
})

df1 = df.groupby('uid', as_index=False)[['uid', 'str_val', 'date_val']].apply(lambda x: x.sort_values(by='str_val', ascending=True))
df2 = df.groupby('uid', as_index=False)[['uid', 'str_val']].apply(lambda x: x.sort_values(by='str_val', ascending=True))
df3 = df.groupby('uid', as_index=False)[['uid', 'date_val']].apply(lambda x: x.sort_values(by='date_val', ascending=True))

print(df1)
print(df2)
print(df3)

Output:

  uid              str_val   date_val
0   1  2017-01-01 00:00:00 2017-01-01
1   2  2017-01-01 00:00:00 2017-01-01
2   1  2017-02-01 00:00:00 2017-02-01
3   2  2017-02-01 00:00:00 2017-02-01
    uid              str_val
0 0   1  2017-01-01 00:00:00
  2   1  2017-02-01 00:00:00
1 1   2  2017-01-01 00:00:00
  3   2  2017-02-01 00:00:00
  uid   date_val
0   1 2017-01-01
1   2 2017-01-01
2   1 2017-02-01
3   2 2017-02-01

I expect the index to look like the second output (df2). The indexes in df1 and df3 seem strange.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 5f648bf
python : 3.8.0.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
Version : Darwin Kernel Version 18.7.0: Mon Apr 27 20:09:39 PDT 2020; root:xnu-4903.278.35~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.3.2
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.2
setuptools : 52.0.0.post20210125
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.26.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@od-crypto od-crypto added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 25, 2021
@simonjayhawkins
Member

Thanks @od-crypto for the report.

I expect the index to look like the second output (df2). The indexes in df1 and df3 seem strange.

That was the result on 1.2.5.

On master the second case now has the same index as the other two cases, a regular index.

I'm not sure, without investigating further, which is correct or whether the changes were intentional.

I will label this as a regression for now, pending further investigation.

@simonjayhawkins simonjayhawkins added Apply Apply, Aggregate, Transform, Map Groupby Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 25, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3.3 milestone Aug 25, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 25, 2021
@simonjayhawkins
Member

I'm not sure, without investigating further, which is correct or whether the changes were intentional.

For the 1st case:

first bad commit: [a3bb751] REF: back DatetimeBlock, TimedeltaBlock by DTA/TDA (#40456)

So the change was probably not intentional. @jbrockmendel, is #42921 (comment) relevant here?

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 25, 2021
@simonjayhawkins
Member

On master the second case now has the same index as the other two cases, a regular index.

first bad commit: [d037ff6] REF: remove libreduction.apply_frame_axis0 (#42992)

As this is a very recent change, we could consider reverting it.

@jbrockmendel
Member

Looks like this is driven by a difference between the cython and python code paths, likely involving how mutated is set.

@jbrockmendel
Member

Porting the mutation check from the now-deleted apply_frame_axis0 into BaseGrouper.apply seems to fix this. I still need to run the rest of the tests.

@jbrockmendel
Member

Looks like that broke 7 tests. @od-crypto any interest in trying to track this down?

@jbrockmendel
Member

OK, some progress here. Within each group, each of the columns is already sorted, so the applied sort_values doesn't do any reindexing; it therefore evaluates mutated=False, which I guess affects how the results are re-stacked, probably in _wrap_applied_output.

@od-crypto what does the [0, 1] index level in df2 correspond to? i.e. why is that what you expect?
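The mutation heuristic described here can be sketched outside of pandas internals. This is an illustrative guess at the shape of the check, not the actual BaseGrouper code: a result whose index matches the input group's index is treated as a non-mutating transform, and that classification changes how groupby.apply stacks the pieces back together.

```python
import pandas as pd

def looks_like_transform(group: pd.DataFrame, result: pd.DataFrame) -> bool:
    # Heuristic: if the applied function returns an object with the same
    # index as the input group, treat it as a transform (mutated=False).
    return result.index.equals(group.index)

group = pd.DataFrame({"val": [1, 2]}, index=[0, 2])

# Already sorted: sort_values returns the same index -> looks like a transform
print(looks_like_transform(group, group.sort_values("val")))  # True

# Reversed sort reorders the index -> looks like a mutation
print(looks_like_transform(group, group.sort_values("val", ascending=False)))  # False
```

This also explains why the OP's pre-sorted frame behaves differently from an unsorted one: for pre-sorted groups the applied sort is a no-op on the index.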

@od-crypto
Author

od-crypto commented Sep 3, 2021

@jbrockmendel which behaviour to choose, df1/df3 or df2, is a secondary question; what matters is that the behaviour is consistent in the end.
Yes, the dates were already sorted within each 'uid' group in the input df. However, if you change the input df so that it is NOT sorted by date, you obtain the df2 behaviour in all three cases. Just run the code below:

import pandas as pd
df = pd.DataFrame({
    'date_val': {
        0: pd.Timestamp('2017-02-01 00:00:00'),
        1: pd.Timestamp('2017-01-01 00:00:00'),
        2: pd.Timestamp('2017-01-01 00:00:00'),
        3: pd.Timestamp('2017-02-01 00:00:00')
    },
    'uid': {
        0: '1',
        1: '2',
        2: '1',
        3: '2'
    },
    'str_val': {
        0: '2017-02-01 00:00:00',
        1: '2017-01-01 00:00:00',
        2: '2017-01-01 00:00:00',
        3: '2017-02-01 00:00:00'
    }
})

dfg = df.groupby('uid', as_index=False)

df1 = dfg[['uid', 'str_val', 'date_val']].apply(lambda x: x.sort_values(by='str_val', ascending=True))
df2 = dfg[['uid', 'str_val']].apply(lambda x: x.sort_values(by='str_val', ascending=True))
df3 = dfg[['uid', 'date_val']].apply(lambda x: x.sort_values(by='date_val', ascending=True))

print(df1)
print(df2)
print(df3)

@od-crypto
Author

od-crypto commented Sep 3, 2021

@jbrockmendel answering your question above: the [0, 1] index level in df2 seems to correspond to the index of the 'uid' groups, sorted by 'uid'.

@jbrockmendel
Member

what matters is that the behaviour is consistent in the end

I agree. On master it is currently consistent, but with the result that you called strange in the OP. I'm trying to understand what the first level of the MultiIndex result corresponds to. Can you help me understand this?

@simonjayhawkins
Member

what matters is that the behaviour is consistent in the end

I agree. On master it is currently consistent ...

But to be clear, on released pandas it is not. So if we decide that the new behaviour is correct and is a bugfix rather than an API-breaking change, we may want to consider backporting #42992.

@od-crypto
Author

what matters is that the behaviour is consistent in the end

I agree. On master it is currently consistent, but with the result that you called strange in the OP. I'm trying to understand what the first level of the MultiIndex result corresponds to. Can you help me understand this?

@jbrockmendel, by "df1/df3 seem strange" I meant strange within the same pandas release, 1.3.2: when one feeds the code reported above an unsorted input df, one obtains the df2 behaviour in all three cases. The first level in the MultiIndex of df2 seems to "correspond to the index of the 'uid' groups, sorted by 'uid'". However, when choosing between the df1/df3 and df2 behaviours for the next release, one might consider voting for df1/df3, as backward compatibility is always a weighty argument.

@jbrockmendel
Member

The first level in the MultiIndex of df2 seems to "correspond to the index of the 'uid' groups, sorted by 'uid'."

The first level of df2.index in the OP is [0, 0, 1, 1] (only a single 0 and a single 1 are printed), and the uids are all either '1' or '2', so this doesn't make sense to me. Is there a chance you were referring to a different level?

@simonjayhawkins
Member

changing milestone to 1.3.5

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021
@simonjayhawkins
Member

However, if you change the input df so that it is NOT sorted by date, you obtain the df2 behaviour in all three cases. Just run the code below:

This is on master, so it does appear to invalidate the consistency argument for master.

So we can probably rule out backporting #42992 (#43206 (comment)). That PR actually fixed another regression, #41999, but was not backported.

So it looks like we effectively now have two regressions:

  1. on 1.3.x: an undocumented/unintentional change to the df1/df3 behaviour, from REF: back DatetimeBlock, TimedeltaBlock by DTA/TDA #40456
  2. on master: an undocumented/unintentional change to the df2 behaviour, from REF: remove libreduction.apply_frame_axis0 #42992

@simonjayhawkins simonjayhawkins added the Blocker for rc Blocking issue or pull request for release candidate label Nov 28, 2021
@simonjayhawkins
Member

I've added the blocker-for-rc label to raise the priority if this is not fixed for 1.3.5.

@simonjayhawkins
Member

changing milestone

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.5, 1.4 Dec 11, 2021
@jreback
Contributor

jreback commented Dec 23, 2021

We cannot block on this without a PR for the rc.

@jreback
Contributor

jreback commented Dec 30, 2021

@rhshadrach @jbrockmendel can you summarise the issue and what we need to decide here?

@rhshadrach
Member

rhshadrach commented Dec 31, 2021

I agree with previous comments that in the OP, the result for df1 and df3 should be the same as the reported result for df2. Currently on master, even df2 gives the wrong result.

The core issue here is that apply looks at the result and concludes "this is a transform" because the index is the same as the input, in an operation where this is not generally true for other inputs. It would be fixed by #34998.

A potential secondary issue is the resulting index containing the values 0 and 1 (ref: #43206 (comment)). This looked odd to me at first glance as well, but I believe it is correct behaviour. With as_index=False, the resulting index is a RangeIndex enumerating the groups. That behaviour is consistent across groupby.
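That last point is easy to verify with a plain reduction, where the behaviour is unambiguous. A minimal sketch with a toy frame (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"uid": ["1", "2", "1", "2"], "val": [10, 20, 30, 40]})

out = df.groupby("uid", as_index=False)["val"].sum()
print(out)
# With as_index=False the 'uid' values stay in a column, and the index is a
# RangeIndex enumerating the groups: 0 for uid '1', 1 for uid '2'.
print(list(out.index))  # [0, 1]
```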

@jbrockmendel
Member

This is a hard problem; my only real opinion here is that reinstating libreduction.apply_frame_axis0 (which git blame suggests would fix this) would cause more problems than it solves.

@jreback
Contributor

jreback commented Dec 31, 2021

Moving to 1.5; we cannot wait on this.

@jreback jreback modified the milestones: 1.4, 1.5 Dec 31, 2021
@jreback jreback removed the Blocker for rc Blocking issue or pull request for release candidate label Dec 31, 2021
@rhshadrach
Member

@jbrockmendel Are you thinking #34998 wouldn't fix this? Agreed on not reintroducing apply_frame_axis0.

@jbrockmendel
Member

Are you thinking #34998 wouldn't fix this?

I haven't looked at it closely. It's plausible.

@NumberPiOso
Contributor

Currently, running

df1 = df.groupby('uid', as_index=False)[['uid', 'str_val', 'date_val']].apply(lambda x: x.sort_values(by='str_val', ascending=True))
df2 = df.groupby('uid', as_index=False)[['uid', 'str_val']].apply(lambda x: x.sort_values(by='str_val', ascending=True))
df3 = df.groupby('uid', as_index=False)[['uid', 'date_val']].apply(lambda x: x.sort_values(by='date_val', ascending=True))

produces

FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.

and results in:

>>> print(df1)
  uid              str_val   date_val
0   1  2017-01-01 00:00:00 2017-01-01
1   2  2017-01-01 00:00:00 2017-01-01
2   1  2017-02-01 00:00:00 2017-02-01
3   2  2017-02-01 00:00:00 2017-02-01
>>> print(df2)
  uid              str_val
0   1  2017-01-01 00:00:00
1   2  2017-01-01 00:00:00
2   1  2017-02-01 00:00:00
3   2  2017-02-01 00:00:00
>>> print(df3)
  uid   date_val
0   1 2017-01-01
1   2 2017-01-01
2   1 2017-02-01
3   2 2017-02-01

So the behaviour is now consistent, which was the bug. I think this issue can be closed now.

@NumberPiOso
Contributor

However, the default behaviour will change in the future:

>>> df1 = df.groupby('uid', as_index=False, group_keys=True)[['uid', 'str_val', 'date_val']].apply(lambda x: x.sort_values(by='str_val', ascending=True))
>>> df2 = df.groupby('uid', as_index=False, group_keys=True)[['uid', 'str_val']].apply(lambda x: x.sort_values(by='str_val', ascending=True))
>>> df3 = df.groupby('uid', as_index=False, group_keys=True)[['uid', 'date_val']].apply(lambda x: x.sort_values(by='date_val', ascending=True))
>>> print(df1)
    uid              str_val   date_val
0 0   1  2017-01-01 00:00:00 2017-01-01
  2   1  2017-02-01 00:00:00 2017-02-01
1 1   2  2017-01-01 00:00:00 2017-01-01
  3   2  2017-02-01 00:00:00 2017-02-01
>>> print(df2)
    uid              str_val
0 0   1  2017-01-01 00:00:00
  2   1  2017-02-01 00:00:00
1 1   2  2017-01-01 00:00:00
  3   2  2017-02-01 00:00:00
>>> print(df3)
    uid   date_val
0 0   1 2017-01-01
  2   1 2017-02-01
1 1   2 2017-01-01
  3   2 2017-02-01

@NumberPiOso
Contributor

Please check @rhshadrach

@rhshadrach
Member

Thanks @NumberPiOso - agreed. This behavior can now be controlled by the user, is consistent, and will have the expected output in pandas 2.0. Closing.
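On pandas 1.5 and later, the two index shapes discussed in this thread can be selected explicitly via group_keys. A minimal sketch, assuming a toy frame; the column selection deliberately excludes the grouping column to sidestep the later include_groups deprecation of passing the group keys to the applied function:

```python
import pandas as pd

df = pd.DataFrame({"uid": ["1", "2", "1", "2"], "val": [4, 3, 2, 1]})

# group_keys=False: the result keeps the original row labels (a flat index)
flat = df.groupby("uid", group_keys=False)[["val"]].apply(lambda x: x.sort_values("val"))
print(flat.index.nlevels)  # 1

# group_keys=True: the group keys are prepended as an outer index level
# (the MultiIndex shape shown in the outputs above)
keyed = df.groupby("uid", group_keys=True)[["val"]].apply(lambda x: x.sort_values("val"))
print(keyed.index.nlevels)  # 2
```

Making the choice explicit in call sites avoids depending on the version-specific default.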


6 participants