
Aggregating groupby with multiple functions all return the same value #16904

Closed
HristoBuyukliev opened this issue Jul 13, 2017 · 9 comments
Labels: Groupby, Testing (pandas testing functions or related to the test suite)

@HristoBuyukliev

I want to aggregate a groupby object by two criteria: the count of observations and the count of nonzero observations:

# nonzero counts by shipment number
# (the outer parentheses let the method chain span several lines)
(data.groupby(['Shipment number'])['Shipment number']
     .agg({'count': lambda x: x.count(),
           'nonzero': lambda x: x.nonzero()[0].size})
     .sort_values('count'))

Problem description

This code is referenced in issue #7186, which is closed, but there appears to be a bug: all the output columns are identical. That is evident even in the output shown in #7186, and I'm shocked that nobody picked it up.
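
The thread never shows the data or says what the bug in the reporter's code was, so the following is only a minimal sketch on a hypothetical DataFrame. The `Quantity` column, the values, and the `nonzero_count` helper are assumptions. It uses named aggregation, which is available in newer pandas releases than the 0.20.3 reported here. It illustrates one way the two columns can come out identical: counting nonzero entries of the grouping key itself, where every entry in a group is the same nonzero key.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are assumptions for illustration.
data = pd.DataFrame({
    "Shipment number": [101, 101, 101, 202, 202, 303],
    "Quantity": [5, 0, 3, 0, 0, 7],
})


def nonzero_count(x):
    # Number of entries in the group that are not zero.
    return int(np.count_nonzero(np.asarray(x)))


# Aggregating the grouping key itself: every key is nonzero,
# so "count" and "nonzero" necessarily agree.
on_key = data.groupby("Shipment number")["Shipment number"].agg(
    count="count", nonzero=nonzero_count
)

# Aggregating a separate value column: the two columns can differ.
on_value = (
    data.groupby("Shipment number")["Quantity"]
    .agg(count="count", nonzero=nonzero_count)
    .sort_values("count")
)

print(on_key)
print(on_value)
```

If the two columns still match when aggregating a value column that does contain zeros, that would point at the aggregation rather than the input.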

Output of pd.show_versions()

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-31-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 1.0b3
sqlalchemy: 1.0.14
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.7.3
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

Could you include a code sample to create the DataFrame?

@dsm054
Contributor

dsm054 commented Jul 13, 2017

I don't understand. The output in #7186 is obviously not the same:

         three       two       one
b                                 
-253  0.156897  0.156897  0.156897
-216  0.452120  0.452120  0.452120
-191  0.893074  0.893074  0.893074
-178  1.170801  1.170801  1.170801
-177 -1.324476 -1.324476 -1.324476
-162  0.835708  1.241353  1.038531

They look so similar because the three columns are defined by [np.mean, lambda x: np.mean(x) + np.std(x), lambda x: np.mean(x) - np.std(x)], so the only difference is the std term. The groups are often so small that this term vanishes: for a single-element group np.std (ddof=0 by default) is 0, and the sample std shown below is NaN. Note that the following data comes from my own randomly generated dataset using that code, so it won't quite match the original:

In [126]: grp.std().head()
Out[126]: 
            a
b            
2.0  0.264902
3.0       NaN
4.0  0.224881
8.0       NaN
9.0       NaN

In [127]: grp.size().head()
Out[127]: 
b
2.0    4
3.0    1
4.0    2
8.0    1
9.0    1
dtype: int64
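
The effect of group size can be reproduced with a small sketch. The data, the helper names `mean_plus_std` and `mean_minus_std`, and the use of named aggregation are assumptions for illustration, not the code from #7186. The helpers convert each group to a NumPy array, so the std is the population std (ddof=0), which is 0 for a single-element group, while `grp.std()` uses ddof=1 and returns NaN there.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 2.0 has four rows, groups 3.0 and 8.0 have one each.
df = pd.DataFrame({
    "a": [0.1, 0.4, 0.2, 0.7, 1.5, -0.3],
    "b": [2.0, 2.0, 2.0, 2.0, 3.0, 8.0],
})
grp = df.groupby("b")["a"]


def mean_plus_std(x):
    arr = np.asarray(x, dtype=float)
    return arr.mean() + arr.std()  # ndarray.std defaults to ddof=0


def mean_minus_std(x):
    arr = np.asarray(x, dtype=float)
    return arr.mean() - arr.std()


result = grp.agg(one="mean", two=mean_plus_std, three=mean_minus_std)
sample_std = grp.std()  # ddof=1, so NaN for single-element groups

print(result)
print(sample_std)
```

Only the rows for groups with more than one element show three distinct values, which matches the pattern in the table from #7186.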

What am I missing?

@gfyoung
Member

gfyoung commented Jul 14, 2017

Regardless of how this turns out, it does seem like another test might be needed to shore this one up for good (@dsm054 : your examples would be a good starting point unless @HristoBuyukliev has additional code to provide).

@HristoBuyukliev
Author

Yeah, my bad. My code had a bug in it, and I didn't notice that the old issue's results do differ in places. I'm closing the issue now.

@gfyoung
Member

gfyoung commented Jul 14, 2017

Actually, I would like to keep the issue open until we can confirm coverage for this issue. @dsm054 , would you like to add your example code to tests?

@gfyoung gfyoung reopened this Jul 14, 2017
@gfyoung
Member

gfyoung commented Jul 14, 2017

@HristoBuyukliev : feel free to contribute a test on that front as well, so that you can remain assured that this won't be an issue for you again 😄

@dsm054
Contributor

dsm054 commented Jul 14, 2017

Yeah, I mean, say it turned out that when you have a numpy function and multiple lambdas in an agg call, the last lambda dominated the others for some reason. Would any of us really have been shocked? Surprised, maybe, but there's usually about a bug a week where I'm genuinely startled no one noticed it before.

I'll add a test this weekend if no one else gets to it.
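
A regression test along these lines might look like the sketch below. The test name and data are hypothetical, and this is not the test that exists (or would exist) in the pandas suite. It combines a mean with two lambdas in one `agg` call and checks each column against an independently computed expectation, so a lambda that "dominated" the others would make it fail.

```python
import numpy as np
import pandas as pd


def test_agg_mean_with_multiple_lambdas():
    # Hypothetical data: two groups with more than one element each.
    df = pd.DataFrame({"a": [1.0, 3.0, 2.0, 6.0, 10.0], "b": [1, 1, 2, 2, 2]})
    grp = df.groupby("b")["a"]

    result = grp.agg(
        one="mean",
        two=lambda x: np.mean(np.asarray(x)) + np.std(np.asarray(x)),
        three=lambda x: np.mean(np.asarray(x)) - np.std(np.asarray(x)),
    )

    # Build the expectation without agg; ddof=0 matches np.std on an array.
    mean = grp.mean()
    std = grp.std(ddof=0)
    expected = pd.DataFrame(
        {"one": mean, "two": mean + std, "three": mean - std}
    )

    pd.testing.assert_frame_equal(result, expected)
    # The lambda columns must not collapse into one another.
    assert not result["two"].equals(result["three"])


test_agg_mean_with_multiple_lambdas()
```

Named aggregation is used here so each lambda gets a distinct column name. It requires a newer pandas than the 0.20.3 discussed in this thread.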

@gfyoung
Member

gfyoung commented Jul 14, 2017

@dsm054 : Go right ahead and add it! You were the one who helped resolve this.

@gfyoung gfyoung added the Testing label (pandas testing functions or related to the test suite) Jul 14, 2017
@gfyoung gfyoung added this to the 0.21.0 milestone Jul 16, 2017
@jreback
Contributor

jreback commented Sep 23, 2017

Closing, but @dsm054, if you have a repro with a test, please comment (or open a new issue).

@jreback jreback closed this as completed Sep 23, 2017

5 participants