It's difficult to predict what DataFrame.groupby().apply() will return: #9867

Closed
Tracked by #13056
ruoyu0088 opened this issue Apr 13, 2015 · 10 comments

Comments

@ruoyu0088

I find it difficult to predict what DataFrame.groupby().apply() will return. The result depends on both the type of the object the function returns and the index of that object. For example:

import pandas as pd
df = pd.DataFrame({"a":[1, 2, 1, 2], "b":[1, 2, 3, 4], "c":[5, 6, 7, 8]})

When the function returns a DataFrame whose index is the same object as the index of its argument, there are no group keys in the result:

print(df.groupby("a").apply(lambda x: x))

The output is:

   a  b  c
0  1  1  5
1  2  2  6
2  1  3  7
3  2  4  8

If the index is not the same object, the result has group keys, even when the index values are the same:

print(df.groupby("a").apply(lambda x: x[:]))

The output is:

     a  b  c
a           
1 0  1  1  5
  2  1  3  7
2 1  2  2  6
  3  2  4  8

If the function returns Series objects whose indexes do not have the same values, the index of the result is a MultiIndex:

print(df.groupby("a").apply(lambda x: x.b + x.c))

The output is:

a   
1  0     6
   2    10
2  1     8
   3    12
dtype: int64

If all the Series objects have the same index values, each Series becomes a row of the result:

print(df.groupby("a").apply(lambda x: (x.b + x.c).reset_index(drop=True)))

The output is:

   0   1
a       
1  6  10
2  8  12

Here are more examples.

Because the index is the same object:

print(df.groupby("a").apply(lambda x: (x.b + x.c).to_frame()))

there are no group keys in the output:

    0
0   6
1   8
2  10
3  12

If we copy the return value, the index is no longer the same object:

print(df.groupby("a").apply(lambda x: (x.b + x.c).to_frame()[:]))

and the output contains group keys:

      0
a      
1 0   6
  2  10
2 1   8
  3  12

print(df.groupby("a").apply(lambda x: x[["b", "c"]]))

No group keys appear because the index object is the same (but using x[:] instead would add the group keys):

   b  c
0  1  5
1  2  6
2  3  7
3  4  8

This behavior does not seem to be documented anywhere.
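One way to take the guesswork out of cases like these is to say explicitly whether group keys are wanted. The sketch below assumes a pandas release in which the `group_keys` argument of `groupby` is honored by `apply` for functions that return a DataFrame; other releases may behave differently. Selecting the `b` and `c` columns is an illustrative choice to keep the grouping column out of each group, not something taken from the report.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [1, 2, 3, 4], "c": [5, 6, 7, 8]})

# Same function in both calls; only the explicit group_keys flag differs.
with_keys = df.groupby("a", group_keys=True)[["b", "c"]].apply(lambda x: x[:])
without_keys = df.groupby("a", group_keys=False)[["b", "c"]].apply(lambda x: x[:])

# Number of index levels: the group key adds an outer level when requested.
print(with_keys.index.nlevels)
print(without_keys.index.nlevels)
```

Passing the flag explicitly means the shape of the result no longer hinges on whether the returned index happens to be the same object as the input index.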

@jreback
Contributor

jreback commented Apr 13, 2015

docs are here: http://pandas-docs.github.io/pandas-docs-travis/groupby.html#flexible-apply
apply tries to figure out what you are doing and combine the results appropriately. If you think it would help, you could add a mini-table to a note in the docs, I suppose.

@jreback jreback added this to the Next Major Release milestone Apr 13, 2015
@jreback jreback changed the title It's difficult to predict what DataFrame.groupby().apply() will return: It's difficult to predict what DataFrame.groupby().apply() will return: Apr 13, 2015
@mbirdi

mbirdi commented Apr 13, 2015

I am at the Pandas Sprint. I am starting to work on this one, and trying to create a mini-table example.

@shoyer
Member

shoyer commented Apr 13, 2015

I agree, I think this could definitely use some cleaning up. In particular, the identity specific behavior seems highly questionable -- nothing else in numpy/pandas works like that.

The first step would be to thoroughly document the existing behavior and see whether it's possible to come up with a more consistent set of rules.
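To make the distinction concrete, here is a minimal sketch of the two ways an "is this the same index?" check could be written. The helper names `by_identity` and `by_value` are hypothetical and are not pandas internals; the point is only that two indexes with equal values need not be the same object.

```python
import pandas as pd


def by_identity(left, right):
    # True only when both names refer to the very same Index object.
    return left is right


def by_value(left, right):
    # True when the two indexes hold the same labels in the same order.
    return bool(left.equals(right))


original = pd.Index([0, 2])
alias = original             # another name for the same object
rebuilt = pd.Index([0, 2])   # a separate object with equal values

print(by_identity(alias, original), by_value(alias, original))
print(by_identity(rebuilt, original), by_value(rebuilt, original))
```

A rule based on `by_value` would treat `alias` and `rebuilt` alike, which is closer to how comparisons usually work elsewhere in numpy and pandas.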

@shoyer
Member

shoyer commented Apr 19, 2015

Link to a notebook from @mbirdi illustrating the current behavior: http://nbviewer.ipython.org/gist/mbirdi/05f8a83d340476e5f03a

@jreback
Contributor

jreback commented May 9, 2016

tracked in #13056

@ultrabosss

Hi, could you please confirm whether this issue is fixed in pandas 0.20.2? I am still seeing it. Sometimes groupby with a lambda returns something shaped like a pivot table, and sometimes it returns a normal dataframe.

@jreback
Contributor

jreback commented Jun 5, 2017

see #13056

you would have to show an example

@ultrabosss

Example 1:

image

I always need to unstack and reset the index before I can consume this result.

Example 2:
I get the same dataframe using sample(). Now the result is not shaped like a pivot table, so I do not need to use unstack.

image

Hope this helps.
Thanks

@jreback
Contributor

jreback commented Jun 5, 2017

pls create a new issue and show a copy-pastable example (NO images), showing pd.show_versions() as instructed.

@r-barnes

r-barnes commented Mar 1, 2019

I consider myself confused on this point, per my question here.
