Skip to content

QST: Related to working of as_index in pandas groupby #40591

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 tasks done
ninjakx opened this issue Mar 23, 2021 · 5 comments
Closed
2 tasks done

QST: Related to working of as_index in pandas groupby #40591

ninjakx opened this issue Mar 23, 2021 · 5 comments
Labels
Docs Duplicate Report Duplicate issue or pull request Groupby

Comments

@ninjakx
Copy link

ninjakx commented Mar 23, 2021

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.


Question about pandas

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df.A = df.agg('City_{0[A]}'.format, axis=1)
df.index = (df.index)*int(np.random.randint(6,23)**0.5)+int(np.random.randint(2,23))

This is my df:

 A B C D
City_2 3 7 3
City_0 4 8 9
City_1 1 2 1
City_5 5 0 9
City_5 1 6 0
City_0 3 8 6
City_7 2 6 6
City_1 6 0 2
City_8 6 2 4
City_2 2 5 6

If I do

df.groupby([df.index//5,df.A], as_index=True).mean()
or
df.groupby([df.index//5,df.A]).mean()

Result:

B C D
A
1 City_2 3 7 3
2 City_0 4 8 9
3 City_1 1 2 1
City_5 5 0 9
4 City_0 3 8 6
City_5 1 6 0
5 City_7 2 6 6
6 City_1 6 0 2
City_8 6 2 4
7 City_2 2 5 6

So If do .reset_index()

I will get this (that index with no column name will get name as level_0 ):

  level_0 A B C D
1 City_2 3 7 3
2 City_0 4 8 9
3 City_1 1 2 1
3 City_5 5 0 9
4 City_0 3 8 6
4 City_5 1 6 0
5 City_7 2 6 6
6 City_1 6 0 2
6 City_8 6 2 4
7 City_2 2 5 6

The same is not with the as_index=False
df.groupby([df.index//5,df.A], as_index=False).mean() will give me

 A B C D
City_2 3 7 3
City_0 4 8 9
City_1 1 2 1
City_5 5 0 9
City_0 3 8 6
City_5 1 6 0
City_7 2 6 6
City_1 6 0 2
City_8 6 2 4
City_2 2 5 6

I will lose the groupby index info.

My question:
shouldn't be as_index=False give df.index//5 column also. Or as_index is designed to work like this only.

@ninjakx ninjakx added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Mar 23, 2021
@ParfaitG
Copy link
Contributor

Possibly, the issue is that df.index//5 is not actually a column in the data frame but an external object. Whereas reset_index runs after groupby, as_index interacts with existing columns in data frame that potentially become indexes. To fix, consider assigning calculated object as a new column before groupby.

(Personally, I always call groupby on named columns. Additionally, I also avoid open arithmetic operators but use the functional form (Series.add, Series.mul, Series.floordiv, Series.pow, etc.) and avoid period referencing of columns: df['A'].)

agg = ( 
        df.assign(multi5 = np.floor_divide(df.index, 5))  # functional form of df.index//5
          .groupby(['multi5', 'A'], as_index=False)
          .mean()
      )

agg            # below output to differ w/o np.random.seed(###)
multi5 A B C D
0 2 City_5 7 2 9
1 3 City_2 0 5 1
2 3 City_3 2 9 3
3 4 City_2 2 9 5
4 4 City_9 1 8 9
5 5 City_9 7 3 1
6 6 City_3 4 0 8
7 6 City_7 9 8 8
8 7 City_6 6 5 8
9 7 City_7 3 3 9

@rhshadrach
Copy link
Member

rhshadrach commented Mar 25, 2021

Thanks @ninjakx for the report! From the groupby userguide docs:

Passing as_index=False will return the groups that you are aggregating over, if they are named columns.

Perhaps this could be added to the API docs as well.

@jorisvandenbossche
Copy link
Member

The full note in the documentation says:

Aggregation functions will not return the groups that you are aggregating over
if they are named columns, when as_index=True, the default. The grouped columns will
be the indices of the returned object.

Passing as_index=False will return the groups that you are aggregating over, if they are
named columns.

I find this not very clear. Also the part about as_index=True uses the phrase "if they are named columns", but in that case the group keys are always set as the index, no? (regardless whether they are present as column or not, or whether they are named or not)
So from that you could assume that the same condition holds for as_index=False. In addition it is not also not super explicit what a "named column" is (one could think of a Series with a name also as a "named column", it's not clear whether this is about being "present" in the calling dataframe)

@jorisvandenbossche
Copy link
Member

And regardless of the (unclear) documentation, I also don't understand the rationale of this behaviour if it is intentional. Why wouldn't we always make the group values a column with as_index=False?

@rhshadrach rhshadrach added the Duplicate Report Duplicate issue or pull request label Nov 8, 2022
@rhshadrach
Copy link
Member

Closing in favor of #49519

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Duplicate Report Duplicate issue or pull request Groupby
Projects
None yet
Development

No branches or pull requests

5 participants