Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: groupby.apply() inconsistently indexes its return value with the groupby-cols #44803

Closed
2 of 3 tasks
ypsah opened this issue Dec 7, 2021 · 10 comments · Fixed by #45509 or #45578
Closed
2 of 3 tasks

BUG: groupby.apply() inconsistently indexes its return value with the groupby-cols #44803

ypsah opened this issue Dec 7, 2021 · 10 comments · Fixed by #45509 or #45578
Assignees
Labels
Apply Apply, Aggregate, Transform, Map good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@ypsah
Copy link

ypsah commented Dec 7, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(dict(
    name=["Alice", "Bob", "Carl", "Dan", "Eve"],
    age=[20, 21, 20, 21, 20],
    height=[165, 175, 189, 182, 157],
    weight=[65, 72, 95, 83, 58],
)).set_index("name")

indexed_by_age = df.groupby(["age"]).apply(lambda group: group.copy())
print(indexed_by_age)

print()
print("=" * 30)
print()

indexed_only_by_name = df.groupby(["age"]).apply(lambda group: group)
print(indexed_only_by_name)

Issue Description

The above snippet produces the following output:

           age  height  weight
age name
20  Alice   20     165      65
    Carl    20     189      95
    Eve     20     157      58
21  Bob     21     175      72
    Dan     21     182      83

==============================

       age  height  weight
name
Alice   20     165      65
Bob     21     175      72
Carl    20     189      95
Dan     21     182      83
Eve     20     157      58

Where we can see that despite the return values of the apply-func being semantically identical (group vs. group.copy()), one ends up being indexed with age and name while the other only gets name.

Expected Behavior

I would have expected the two returned dataframes to be the same. I have a personal preference for the first option, with the age appearing in the index, but I think I would be able to work with both.


I know other inconsistencies have been reported:

But IMHO never as clearly as this. I'm hoping this reproducer helps diagnose the issue more easily.
Interestingly, the difference only creeps up when grouping_keys is set to True. Otherwise, we get the second output twice.

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.9.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.6-arch2-1
Version : #1 SMP PREEMPT Thu, 02 Dec 2021 15:47:09 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.5.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@ypsah ypsah added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 7, 2021
@v-liuwei
Copy link

v-liuwei commented Dec 19, 2021

I just met the same problem, and did some test on the latest released version v1.3.5.

It seems that the fast_apply set the mutated to True in the .copy() case while set it to False in the other case.

try:
sdata = splitter.sorted_data
result_values, mutated = splitter.fast_apply(f, sdata, group_keys)
except IndexError:
# This is a rare case in which re-running in python-space may
# make a difference, see test_apply_mutate.test_mutate_groups
pass

And the truth value of mutated (the same as not_indexed_same here)will affect the concatenation of applied results:

def _concat_objects(self, keys, values, not_indexed_same: bool = False):
from pandas.core.reshape.concat import concat
def reset_identity(values):
# reset the identities of the components
# of the values to prevent aliasing
for v in com.not_none(*values):
ax = v._get_axis(self.axis)
ax._reset_identity()
return values
if not not_indexed_same:
result = concat(values, axis=self.axis)
ax = (
self.filter(lambda x: True).axes[self.axis]
if self.dropna
else self._selected_obj._get_axis(self.axis)
)
# this is a very unfortunate situation
# we can't use reindex to restore the original order
# when the ax has duplicates
# so we resort to this
# GH 14776, 30667
if ax.has_duplicates and not result.axes[self.axis].equals(ax):
indexer, _ = result.index.get_indexer_non_unique(ax._values)
indexer = algorithms.unique1d(indexer)
result = result.take(indexer, axis=self.axis)
else:
result = result.reindex(ax, axis=self.axis, copy=False)
elif self.group_keys:
values = reset_identity(values)
if self.as_index:
# possible MI return case
group_keys = keys
group_levels = self.grouper.levels
group_names = self.grouper.names
result = concat(
values,
axis=self.axis,
keys=group_keys,
levels=group_levels,
names=group_names,
sort=False,
)
else:
# GH5610, returns a MI, with the first level being a
# range index
keys = list(range(len(values)))
result = concat(values, axis=self.axis, keys=keys)
else:
values = reset_identity(values)
result = concat(values, axis=self.axis)
name = self.obj.name if self.obj.ndim == 1 else self._selection
if isinstance(result, Series) and name is not None:
result.name = name
return result

I tried to disable the fast_apply (manually raise IndexError before it), then it went to the slow apply in python code:

for key, group in zipped:
object.__setattr__(group, "name", key)
# group might be modified
group_axes = group.axes
res = f(group)
if not _is_indexed_like(res, group_axes, axis):
mutated = True
result_values.append(res)

and the value of mutated and the printed results of the two cases became consistent.

Summary

I think there is a bug that the mutate checking of the fast apply(supported by apply_frame_axis0 from _libs.reduction.pyx) and the slow apply are inconsistent.

@ODemidenko
Copy link

I encounter a similar issue.
In this case whether group columns are added to the index depends on the amount fo rows returned by apply:
when it changes - group columns are added to the index. When it doesn't change - group columns aren't added.
The problem is, that the amount of rows may or may not change after applying exactly same code.

full_time_index = pd.Index(
        pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-10', '2020-01-15']),
        name='timestamp'
    )
group_cols = ['B']
df = pd.DataFrame({'A': [3, 1],
                   'B': ['foo', 'foo'],
                   'timestamp': full_time_index[::2]
                   }).set_index('timestamp')

def reindex_by_group(df, group_cols=group_cols, new_index=full_time_index):
    result = df.groupby(group_cols).apply(lambda d: d.reindex(new_index, fill_value=0)).drop(columns=group_cols)
    print(result.index.names)
    assert set(result.index.names) == set(['timestamp']+group_cols)

#this adds group column to index (desired behavior):
result = reindex_by_group(df, new_index=full_time_index)
result = reindex_by_group(df, new_index=full_time_index[:1])
result = reindex_by_group(df.iloc[:1], new_index=full_time_index)

#this fail:
result = reindex_by_group(df.iloc[:1], new_index=full_time_index[:1])
result = reindex_by_group(df, new_index=full_time_index[::2])

IMO, the most intuitive behavior is always adding group columns to index. The fact that 'transform' operations don't do it - was a major source of confustion for me and my colleagues.
And in the case above, in would be much more difficult to implement groupby with reindex, unless group columns are added to the index.

@mroeschke
Copy link
Member

This is consistent on the master branch now (which will be reflected in the 1.4 release). This was probably most likely made consistent by #42992

In [1]: import pandas as pd
   ...:
   ...: df = pd.DataFrame(dict(
   ...:     name=["Alice", "Bob", "Carl", "Dan", "Eve"],
   ...:     age=[20, 21, 20, 21, 20],
   ...:     height=[165, 175, 189, 182, 157],
   ...:     weight=[65, 72, 95, 83, 58],
   ...: )).set_index("name")
   ...:
   ...: indexed_by_age = df.groupby(["age"]).apply(lambda group: group.copy())
   ...: print(indexed_by_age)
   ...:
   ...: print()
   ...: print("=" * 30)
   ...: print()
   ...:
   ...: indexed_only_by_name = df.groupby(["age"]).apply(lambda group: group)
   ...: print(indexed_only_by_name)
       age  height  weight
name
Alice   20     165      65
Bob     21     175      72
Carl    20     189      95
Dan     21     182      83
Eve     20     157      58

==============================

       age  height  weight
name
Alice   20     165      65
Bob     21     175      72
Carl    20     189      95
Dan     21     182      83
Eve     20     157      58

Suppose could use a unit test to prevent a regression.

@mroeschke mroeschke added Apply Apply, Aggregate, Transform, Map good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 27, 2021
@aragard
Copy link

aragard commented Dec 27, 2021

take

@pratimugale
Copy link

May I work on this?

@aragard
Copy link

aragard commented Jan 16, 2022

Yes!

@aragard aragard removed their assignment Jan 16, 2022
@NumberPiOso
Copy link
Contributor

@pratimugale are you working on this? If not, I would like to take it

@pratimugale
Copy link

@NumberPiOso you can take it

@NumberPiOso
Copy link
Contributor

take

@NumberPiOso
Copy link
Contributor

NumberPiOso commented Jan 20, 2022

I did not understand if the output shown is the expected behaviour (#45476) . However, I will do the test to keep outputs similar, regardless of the results

This is consistent on the master branch now (which will be reflected in the 1.4 release). This was probably most likely made consistent by #42992

In [1]: import pandas as pd
   ...:
   ...: df = pd.DataFrame(dict(
   ...:     name=["Alice", "Bob", "Carl", "Dan", "Eve"],
   ...:     age=[20, 21, 20, 21, 20],
   ...:     height=[165, 175, 189, 182, 157],
   ...:     weight=[65, 72, 95, 83, 58],
   ...: )).set_index("name")
   ...:
   ...: indexed_by_age = df.groupby(["age"]).apply(lambda group: group.copy())
   ...: print(indexed_by_age)
   ...:
   ...: print()
   ...: print("=" * 30)
   ...: print()
   ...:
   ...: indexed_only_by_name = df.groupby(["age"]).apply(lambda group: group)
   ...: print(indexed_only_by_name)
       age  height  weight
name
Alice   20     165      65
Bob     21     175      72
Carl    20     189      95
Dan     21     182      83
Eve     20     157      58

==============================

       age  height  weight
name
Alice   20     165      65
Bob     21     175      72
Carl    20     189      95
Dan     21     182      83
Eve     20     157      58

Suppose could use a unit test to prevent a regression.

NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 20, 2022
@jreback jreback added this to the 1.5 milestone Jan 21, 2022
NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 21, 2022
NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 23, 2022
NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 23, 2022
Pending tasks from pandas-dev#45509
In line with issue pandas-dev#44803

Change test_groupby_copy from
`pandas/tests/groupby/test_groupby.py`
to `pandas/tests/groupby/test_apply_mutate.py`
mroeschke pushed a commit that referenced this issue Jan 24, 2022
Pending tasks from #45509
In line with issue #44803

Change test_groupby_copy from
`pandas/tests/groupby/test_groupby.py`
to `pandas/tests/groupby/test_apply_mutate.py`
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this issue Jul 13, 2022
Pending tasks from pandas-dev#45509
In line with issue pandas-dev#44803

Change test_groupby_copy from
`pandas/tests/groupby/test_groupby.py`
to `pandas/tests/groupby/test_apply_mutate.py`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
8 participants