REGR: change in output of groupby.apply in 1.3.2 -> 1.3.3 #43568

Closed
jorisvandenbossche opened this issue Sep 14, 2021 · 9 comments
Labels: Bug, Groupby, Regression (Functionality that used to work in a prior pandas version)

Comments

@jorisvandenbossche
Member

From dask/dask#8137

One of the corner cases in groupby apply (discussed in other issues like #34998) has changed behaviour:

import pandas as pd
df = pd._testing.makeTimeDataFrame()
df.groupby(df.index.month).apply(lambda x: x.drop_duplicates())

In pandas 1.3.2 this gives:

                     A         B         C         D
1 2000-01-03 -1.261522 -0.200411 -0.746305 -0.661842
  2000-01-04  1.248556  0.256573 -1.401839  0.508941
  2000-01-05 -0.036682  0.109758 -0.759474 -0.601479
  2000-01-06 -0.778714 -0.932903 -0.544291 -0.986584
  2000-01-07 -0.567276  0.744036  1.981160  0.222988
  2000-01-10 -0.448542  0.221418  0.485706 -1.142561
  2000-01-11 -0.125143 -1.393151  0.879428  0.011297
  2000-01-12  0.550527 -1.373356  1.149654 -0.575065
  2000-01-13  0.269579 -0.017307 -1.023269  0.738274
  2000-01-14  0.757235 -1.875451 -0.751026  0.812741
  2000-01-17  2.456978  0.992319  0.945757 -2.468437
  2000-01-18 -2.132953  1.210491  0.150581 -1.861079
  2000-01-19  0.500947 -0.861651  0.412729 -1.274573
  2000-01-20  0.215823 -0.502341 -0.060564  0.439930
  2000-01-21  0.454649  1.188960  1.167487 -0.087031
  2000-01-24 -1.194599  0.709980  1.927664 -1.868195
  2000-01-25 -1.465017  1.187098  0.262209  0.312123
  2000-01-26 -0.010187 -0.624253 -0.186090 -0.126192
  2000-01-27 -0.520074 -0.189463 -0.379236 -0.259591
  2000-01-28 -1.179406 -0.169766 -1.731189  0.583444
  2000-01-31  0.677903  0.845305 -0.282444  0.807889
2 2000-02-01 -1.561435  2.068383 -0.500742 -0.040578
  2000-02-02 -1.357882  1.302612  1.105816 -0.688315
  2000-02-03  0.604455  0.637055 -0.296199  0.699753
  2000-02-04  0.102784 -0.786359 -0.598806  0.604410
  2000-02-07  0.977317  0.530884 -0.880909 -0.963008
  2000-02-08  1.948681  0.065753 -0.530815 -1.688043
  2000-02-09  1.461333  1.105021  1.039801 -0.144059
  2000-02-10  2.105116 -1.121452  0.076824 -0.334885
  2000-02-11  0.984281  0.858620  1.602277 -0.421881

while in pandas 1.3.3 it gives:

                   A         B         C         D
2000-01-03  0.131780  0.079102 -2.631289  0.969882
2000-01-04  0.381887  0.177194 -0.031367 -1.062184
2000-01-05 -1.299994  0.951530  0.806066  1.043698
2000-01-06 -0.669137  1.036442  0.762052 -0.475059
2000-01-07  0.498415 -0.511591 -0.500675  0.098846
2000-01-10 -1.313268  0.511975 -0.935800 -0.371694
2000-01-11  1.812837 -0.017126 -0.748976  1.217975
2000-01-12  0.236695  0.012316  0.319136  0.743945
2000-01-13 -1.128511  0.367611 -0.240936 -0.847221
2000-01-14 -2.170718  1.349021 -1.205040  1.210471
2000-01-17  0.220773  1.238868 -0.208188 -0.240763
2000-01-18 -0.949992  0.273480  0.863710  2.446306
2000-01-19  0.622379  1.386699 -1.181249  0.188620
2000-01-20 -1.340407 -0.523331 -1.794468  0.877138
2000-01-21  0.029993 -0.115333  0.358685  1.652006
2000-01-24  1.209907 -1.354522  0.883701  0.686492
2000-01-25 -0.840201  1.415816  0.396826  1.342700
2000-01-26  1.206150 -0.114443  0.011106  0.995629
2000-01-27 -0.505894  0.500736  0.004411  0.807632
2000-01-28  0.117852 -0.411066  1.315072  0.731249
2000-01-31 -0.329046 -1.921455  2.564603 -0.222591
2000-02-01  0.295899 -0.169977  0.162310 -0.554688
2000-02-02 -1.144224  0.530313 -0.530216  0.287826
2000-02-03  1.491748 -1.051309  1.414135 -0.332648
2000-02-04 -0.452243 -0.087787 -0.308278  0.681506
2000-02-07 -1.041728 -0.202066  0.044722  0.665914
2000-02-08  0.712994  1.547563  2.557823 -0.801031
2000-02-09  0.396298 -1.325411 -0.926420  0.738052
2000-02-10  0.460109  0.734418  0.416767 -1.199427
2000-02-11 -0.104655 -0.440354 -0.787402  0.357853

(note the difference in index)
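For anyone reproducing this on a pandas version where the private `pd._testing.makeTimeDataFrame` helper is not available, here is a minimal sketch that builds a similar frame from public API only. The seed, shape and column names are assumptions chosen to mimic the helper (30 business days, columns A to D). The number of index levels printed depends on the installed pandas version, so no expected output is shown.

```python
import numpy as np
import pandas as pd

# Assumption: 30 business days x 4 float columns, mimicking the private helper.
rng = np.random.default_rng(0)
index = pd.date_range("2000-01-03", periods=30, freq="B")
df = pd.DataFrame(rng.standard_normal((30, 4)), index=index, columns=list("ABCD"))

result = df.groupby(df.index.month).apply(lambda x: x.drop_duplicates())

# 2 levels: group keys prepended (month, date), as in the 1.3.2 output above.
# 1 level:  only the original DatetimeIndex, as in the 1.3.3 output above.
print(pd.__version__, result.index.nlevels)
```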

jorisvandenbossche added the Groupby and Regression (Functionality that used to work in a prior pandas version) labels Sep 14, 2021
jorisvandenbossche added this to the 1.3.4 milestone Sep 14, 2021
@jorisvandenbossche
Member Author

It seems this is caused by #43054

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Sep 22, 2021
@simonjayhawkins
Member

Before the backport PR, this had already changed on master.

first bad commit: [d037ff6] REF: remove libreduction.apply_frame_axis0 (#42992)

cc @jbrockmendel

@jbrockmendel
Member

IIUC the underlying cause is that the libreduction code used different logic to determine `mutated`: I think it did an `index is obj.index` check rather than an `index.equals(obj.index)` check. I haven't found a way to alter the non-cython mutated-determining code to retain the old behavior here without breaking other tests.
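To illustrate the distinction between the two checks, here is a small sketch. The two helper functions are hypothetical and only mirror the comparisons described above; they are not the actual pandas internals.

```python
import pandas as pd

idx = pd.date_range("2000-01-03", periods=5, freq="B")
copied = idx.copy()  # a new Index object holding the same values


def mutated_by_identity(result_index, group_index):
    # Hypothetical helper: anything that is not the very same object counts as mutated.
    return result_index is not group_index


def mutated_by_equality(result_index, group_index):
    # Hypothetical helper: only an index with different values counts as mutated.
    return not result_index.equals(group_index)


print(mutated_by_identity(copied, idx))  # True
print(mutated_by_equality(copied, idx))  # False
```

A function such as `x.drop_duplicates()` returns a new object whose index has equal values, so under the equality check it looks like a transform, whereas an identity-based check would treat it as mutated.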

@jorisvandenbossche
Member Author

jorisvandenbossche commented Sep 22, 2021

@simonjayhawkins that PR you link to is not included in 1.3.x. So while that might be the first commit on master, it's not the cause of the change in 1.3.x AFAIK

@simonjayhawkins
Member

Also xref #43206 for another groupby index-related change in behavior.

Also note that df.groupby(df.index.month).apply(lambda x: x.drop_duplicates()) and df.groupby(df.index.month).apply(lambda x: x) now give consistent results.
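A quick way to check that consistency on a given installation is sketched below. The frame construction is an assumption (public API only, same shape as the example in the report), and the result depends on the pandas version, so no expected output is shown.

```python
import numpy as np
import pandas as pd

# Assumption: same shape as the frame in the original report, built from public API.
rng = np.random.default_rng(0)
index = pd.date_range("2000-01-03", periods=30, freq="B")
df = pd.DataFrame(rng.standard_normal((30, 4)), index=index, columns=list("ABCD"))

grouped = df.groupby(df.index.month)
dedup = grouped.apply(lambda x: x.drop_duplicates())
ident = grouped.apply(lambda x: x)

# Consistent behaviour means both results carry the same number of index levels.
same_shape = dedup.index.nlevels == ident.index.nlevels
print(pd.__version__, dedup.index.nlevels, ident.index.nlevels, same_shape)
```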

@simonjayhawkins
Member

changing milestone to 1.3.5

simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021
@simonjayhawkins
Member

> IIUC the underlying cause is that the libreduction code used different logic to determine `mutated`: I think it did an `index is obj.index` check rather than an `index.equals(obj.index)` check. I haven't found a way to alter the non-cython mutated-determining code to retain the old behavior here without breaking other tests.

moving off 1.3.5 milestone.


@jorisvandenbossche feel free to add back milestone and blocker label if appropriate.

simonjayhawkins modified the milestones: 1.3.5, Contributions Welcome Nov 27, 2021
@simonjayhawkins
Member

> @simonjayhawkins that PR you link to is not included in 1.3.x. So while that might be the first commit on master, it's not the cause of the change in 1.3.x AFAIK

That PR fixed the regression in #41999 and was not backported. But this issue also persists on master, so yes, we can probably rule out those changes as directly related to this regression.

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@rhshadrach
Member

I'm now seeing the output as it was on 1.3.2. This was caused by #52660, as expected. Closing.
