BUG: `groupby.apply()` inconsistently indexes its return value with the groupby-cols #44803

ypsah · 2021-12-07T19:31:18Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(dict(
    name=["Alice", "Bob", "Carl", "Dan", "Eve"],
    age=[20, 21, 20, 21, 20],
    height=[165, 175, 189, 182, 157],
    weight=[65, 72, 95, 83, 58],
)).set_index("name")

indexed_by_age = df.groupby(["age"]).apply(lambda group: group.copy())
print(indexed_by_age)

print()
print("=" * 30)
print()

indexed_only_by_name = df.groupby(["age"]).apply(lambda group: group)
print(indexed_only_by_name)

Issue Description

The above snippet produces the following output:

           age  height  weight
age name
20  Alice   20     165      65
    Carl    20     189      95
    Eve     20     157      58
21  Bob     21     175      72
    Dan     21     182      83

==============================

       age  height  weight
name
Alice   20     165      65
Bob     21     175      72
Carl    20     189      95
Dan     21     182      83
Eve     20     157      58

Where we can see that despite the return values of the apply-func being semantically identical (group vs. group.copy()), one ends up being indexed with age and name while the other only gets name.

Expected Behavior

I would have expected the two returned dataframes to be the same. I have a personal preference for the first option, with the age appearing in the index, but I think I would be able to work with both.

I know other inconsistencies have been reported:

But IMHO never as clearly as this. I'm hoping this reproducer helps diagnose the issue more easily.
Interestingly, the difference only creeps up when grouping_keys is set to True. Otherwise, we get the second output twice.

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.9.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.6-arch2-1
Version : #1 SMP PREEMPT Thu, 02 Dec 2021 15:47:09 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.5.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

The text was updated successfully, but these errors were encountered:

v-liuwei · 2021-12-19T15:04:04Z

I just met the same problem, and did some test on the latest released version v1.3.5.

It seems that the fast_apply set the mutated to True in the .copy() case while set it to False in the other case.

pandas/pandas/core/groupby/ops.py

Lines 813 to 820 in 66e3805

    
           try: 
        
               sdata = splitter.sorted_data 
        
               result_values, mutated = splitter.fast_apply(f, sdata, group_keys) 
        
           except IndexError: 
        
               # This is a rare case in which re-running in python-space may 
        
               #  make a difference, see  test_apply_mutate.test_mutate_groups 
        
               pass

And the truth value of mutated (the same as not_indexed_same here)will affect the concatenation of applied results:

pandas/pandas/core/groupby/groupby.py

Lines 1003 to 1067 in 66e3805

    
           def _concat_objects(self, keys, values, not_indexed_same: bool = False): 
        
               from pandas.core.reshape.concat import concat 
        
               def reset_identity(values): 
        
                   # reset the identities of the components 
        
                   # of the values to prevent aliasing 
        
                   for v in com.not_none(*values): 
        
                       ax = v._get_axis(self.axis) 
        
                       ax._reset_identity() 
        
                   return values 
        
               if not not_indexed_same: 
        
                   result = concat(values, axis=self.axis) 
        
                   ax = ( 
        
                       self.filter(lambda x: True).axes[self.axis] 
        
                       if self.dropna 
        
                       else self._selected_obj._get_axis(self.axis) 
        
                   ) 
        
                   # this is a very unfortunate situation 
        
                   # we can't use reindex to restore the original order 
        
                   # when the ax has duplicates 
        
                   # so we resort to this 
        
                   # GH 14776, 30667 
        
                   if ax.has_duplicates and not result.axes[self.axis].equals(ax): 
        
                       indexer, _ = result.index.get_indexer_non_unique(ax._values) 
        
                       indexer = algorithms.unique1d(indexer) 
        
                       result = result.take(indexer, axis=self.axis) 
        
                   else: 
        
                       result = result.reindex(ax, axis=self.axis, copy=False) 
        
               elif self.group_keys: 
        
                   values = reset_identity(values) 
        
                   if self.as_index: 
        
                       # possible MI return case 
        
                       group_keys = keys 
        
                       group_levels = self.grouper.levels 
        
                       group_names = self.grouper.names 
        
                       result = concat( 
        
                           values, 
        
                           axis=self.axis, 
        
                           keys=group_keys, 
        
                           levels=group_levels, 
        
                           names=group_names, 
        
                           sort=False, 
        
                       ) 
        
                   else: 
        
                       # GH5610, returns a MI, with the first level being a 
        
                       # range index 
        
                       keys = list(range(len(values))) 
        
                       result = concat(values, axis=self.axis, keys=keys) 
        
               else: 
        
                   values = reset_identity(values) 
        
                   result = concat(values, axis=self.axis) 
        
               name = self.obj.name if self.obj.ndim == 1 else self._selection 
        
               if isinstance(result, Series) and name is not None: 
        
                   result.name = name 
        
               return result

I tried to disable the fast_apply （manually raise IndexError before it）, then it went to the slow apply in python code:

pandas/pandas/core/groupby/ops.py

Lines 847 to 855 in 66e3805

    
           for key, group in zipped: 
        
               object.__setattr__(group, "name", key) 
        
               # group might be modified 
        
               group_axes = group.axes 
        
               res = f(group) 
        
               if not _is_indexed_like(res, group_axes, axis): 
        
                   mutated = True 
        
               result_values.append(res)

and the value of mutated and the printed results of the two cases became consistent.

Summary

I think there is a bug that the mutate checking of the fast apply(supported by apply_frame_axis0 from _libs.reduction.pyx) and the slow apply are inconsistent.

ODemidenko · 2021-12-20T12:03:41Z

I encounter a similar issue.
In this case whether group columns are added to the index depends on the amount fo rows returned by apply:
when it changes - group columns are added to the index. When it doesn't change - group columns aren't added.
The problem is, that the amount of rows may or may not change after applying exactly same code.

full_time_index = pd.Index(
        pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-10', '2020-01-15']),
        name='timestamp'
    )
group_cols = ['B']
df = pd.DataFrame({'A': [3, 1],
                   'B': ['foo', 'foo'],
                   'timestamp': full_time_index[::2]
                   }).set_index('timestamp')

def reindex_by_group(df, group_cols=group_cols, new_index=full_time_index):
    result = df.groupby(group_cols).apply(lambda d: d.reindex(new_index, fill_value=0)).drop(columns=group_cols)
    print(result.index.names)
    assert set(result.index.names) == set(['timestamp']+group_cols)

#this adds group column to index (desired behavior):
result = reindex_by_group(df, new_index=full_time_index)
result = reindex_by_group(df, new_index=full_time_index[:1])
result = reindex_by_group(df.iloc[:1], new_index=full_time_index)

#this fail:
result = reindex_by_group(df.iloc[:1], new_index=full_time_index[:1])
result = reindex_by_group(df, new_index=full_time_index[::2])

IMO, the most intuitive behavior is always adding group columns to index. The fact that 'transform' operations don't do it - was a major source of confustion for me and my colleagues.
And in the case above, in would be much more difficult to implement groupby with reindex, unless group columns are added to the index.

mroeschke · 2021-12-27T20:26:22Z

This is consistent on the master branch now (which will be reflected in the 1.4 release). This was probably most likely made consistent by #42992

In [1]: import pandas as pd
   ...:
   ...: df = pd.DataFrame(dict(
   ...:     name=["Alice", "Bob", "Carl", "Dan", "Eve"],
   ...:     age=[20, 21, 20, 21, 20],
   ...:     height=[165, 175, 189, 182, 157],
   ...:     weight=[65, 72, 95, 83, 58],
   ...: )).set_index("name")
   ...:
   ...: indexed_by_age = df.groupby(["age"]).apply(lambda group: group.copy())
   ...: print(indexed_by_age)
   ...:
   ...: print()
   ...: print("=" * 30)
   ...: print()
   ...:
   ...: indexed_only_by_name = df.groupby(["age"]).apply(lambda group: group)
   ...: print(indexed_only_by_name)
       age  height  weight
name
Alice   20     165      65
Bob     21     175      72
Carl    20     189      95
Dan     21     182      83
Eve     20     157      58

==============================

       age  height  weight
name
Alice   20     165      65
Bob     21     175      72
Carl    20     189      95
Dan     21     182      83
Eve     20     157      58

Suppose could use a unit test to prevent a regression.

aragard · 2021-12-27T21:35:21Z

take

pratimugale · 2022-01-09T14:44:20Z

May I work on this?

aragard · 2022-01-16T21:40:35Z

Yes!

NumberPiOso · 2022-01-18T21:23:05Z

@pratimugale are you working on this? If not, I would like to take it

pratimugale · 2022-01-19T03:32:14Z

@NumberPiOso you can take it

NumberPiOso · 2022-01-20T19:26:57Z

take

NumberPiOso · 2022-01-20T19:31:06Z

I did not understand if the output shown is the expected behaviour (#45476) . However, I will do the test to keep outputs similar, regardless of the results

This is consistent on the master branch now (which will be reflected in the 1.4 release). This was probably most likely made consistent by #42992

In [1]: import pandas as pd
   ...:
   ...: df = pd.DataFrame(dict(
   ...:     name=["Alice", "Bob", "Carl", "Dan", "Eve"],
   ...:     age=[20, 21, 20, 21, 20],
   ...:     height=[165, 175, 189, 182, 157],
   ...:     weight=[65, 72, 95, 83, 58],
   ...: )).set_index("name")
   ...:
   ...: indexed_by_age = df.groupby(["age"]).apply(lambda group: group.copy())
   ...: print(indexed_by_age)
   ...:
   ...: print()
   ...: print("=" * 30)
   ...: print()
   ...:
   ...: indexed_only_by_name = df.groupby(["age"]).apply(lambda group: group)
   ...: print(indexed_only_by_name)
       age  height  weight
name
Alice   20     165      65
Bob     21     175      72
Carl    20     189      95
Dan     21     182      83
Eve     20     157      58

==============================

       age  height  weight
name
Alice   20     165      65
Bob     21     175      72
Carl    20     189      95
Dan     21     182      83
Eve     20     157      58

Suppose could use a unit test to prevent a regression.

In line with (pandas-dev#44803).

Pending tasks from pandas-dev#45509 In line with issue pandas-dev#44803

Pending tasks from pandas-dev#45509 In line with issue pandas-dev#44803 Change test_groupby_copy from `pandas/tests/groupby/test_groupby.py` to `pandas/tests/groupby/test_apply_mutate.py`

Pending tasks from #45509 In line with issue #44803 Change test_groupby_copy from `pandas/tests/groupby/test_groupby.py` to `pandas/tests/groupby/test_apply_mutate.py`

…v#45509)

Pending tasks from pandas-dev#45509 In line with issue pandas-dev#44803 Change test_groupby_copy from `pandas/tests/groupby/test_groupby.py` to `pandas/tests/groupby/test_apply_mutate.py`

ypsah added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 7, 2021

LucasG0 mentioned this issue Dec 25, 2021

BUG: inconsistent groupby behavior when groupby.apply return dataframe #45066

Closed

3 tasks

mroeschke added Apply Apply, Aggregate, Transform, Map good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 27, 2021

github-actions bot assigned aragard Dec 27, 2021

aragard removed their assignment Jan 16, 2022

ODemidenko mentioned this issue Jan 19, 2022

BUG: groupby.apply() inconsistently indexes its return value with the groupby- #45476

Closed

3 tasks

NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 20, 2022

TST: Add test inconsistency in group by (pandas-dev#44803)

7e3de8c

NumberPiOso mentioned this issue Jan 20, 2022

TST: Add test inconsistency in group by (#44803) #45509

Merged

3 tasks

github-actions bot assigned NumberPiOso Jan 20, 2022

jreback added this to the 1.5 milestone Jan 21, 2022

NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 21, 2022

Change equality assertion to tm.assert_frame_equal

b405140

In line with (pandas-dev#44803).

jreback closed this as completed in #45509 Jan 23, 2022

jreback pushed a commit that referenced this issue Jan 23, 2022

TST: Add test inconsistency in group by (#44803) (#45509)

94fbed6

NumberPiOso added a commit to NumberPiOso/pandas that referenced this issue Jan 23, 2022

TST: Change test_groupby_copy to apply_mutate.py

1bc27eb

Pending tasks from pandas-dev#45509 In line with issue pandas-dev#44803

NumberPiOso mentioned this issue Jan 23, 2022

TST: Change test_groupby_copy to apply_mutate.py #45578

Merged

1 task

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this issue Jul 13, 2022

TST: Add test inconsistency in group by (pandas-dev#44803) (pandas-de…

b3a49b5

…v#45509)

ypsah mentioned this issue Aug 24, 2023

BUG: df.groupby().apply() behaves inconsistently when there's only one group #54723

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

BUG: `groupby.apply()` inconsistently indexes its return value with the groupby-cols #44803

BUG: `groupby.apply()` inconsistently indexes its return value with the groupby-cols #44803

ypsah commented Dec 7, 2021 •

edited

Loading

INSTALLED VERSIONS

v-liuwei commented Dec 19, 2021 •

edited

Loading

Uh oh!

ODemidenko commented Dec 20, 2021

Uh oh!

mroeschke commented Dec 27, 2021

Uh oh!

aragard commented Dec 27, 2021

Uh oh!

pratimugale commented Jan 9, 2022

Uh oh!

aragard commented Jan 16, 2022

Uh oh!

NumberPiOso commented Jan 18, 2022

Uh oh!

pratimugale commented Jan 19, 2022

Uh oh!

NumberPiOso commented Jan 20, 2022

Uh oh!

NumberPiOso commented Jan 20, 2022 •

edited

Loading

Uh oh!

Uh oh!

BUG: groupby.apply() inconsistently indexes its return value with the groupby-cols #44803

BUG: groupby.apply() inconsistently indexes its return value with the groupby-cols #44803

Comments

ypsah commented Dec 7, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

v-liuwei commented Dec 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

ODemidenko commented Dec 20, 2021

Uh oh!

mroeschke commented Dec 27, 2021

Uh oh!

aragard commented Dec 27, 2021

Uh oh!

pratimugale commented Jan 9, 2022

Uh oh!

aragard commented Jan 16, 2022

Uh oh!

NumberPiOso commented Jan 18, 2022

Uh oh!

pratimugale commented Jan 19, 2022

Uh oh!

NumberPiOso commented Jan 20, 2022

Uh oh!

NumberPiOso commented Jan 20, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BUG: `groupby.apply()` inconsistently indexes its return value with the groupby-cols #44803

BUG: `groupby.apply()` inconsistently indexes its return value with the groupby-cols #44803

ypsah commented Dec 7, 2021 •

edited

Loading

v-liuwei commented Dec 19, 2021 •

edited

Loading

NumberPiOso commented Jan 20, 2022 •

edited

Loading