Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow as_index=False on SeriesGroupBy #36507

Open
rhshadrach opened this issue Sep 20, 2020 · 19 comments
Open

ENH: Allow as_index=False on SeriesGroupBy #36507

rhshadrach opened this issue Sep 20, 2020 · 19 comments
Labels
Enhancement Groupby Needs Discussion Requires discussion from core team before further action

Comments

@rhshadrach
Copy link
Member

Would make the code:

s.to_frame(s.name).groupby(groups, as_index=False).sum()

Become:

s.groupby(groups, as_index=False).sum()

The change should just amount to calling a function like:

def adjust_for_as_index(self, result: FrameOrSeries) -> DataFrame:
    if result.ndim == 1:
        result = result.to_frame(result.name)
    self._insert_inaxis_grouper_inplace(result)
    result.index = np.arange(len(result))

in various methods of SeriesGroupBy (and could be used in DataFrameGroupBy).

This would close #13666 and make #35443 to be just about error reporting for the axis=1 case.

@erfannariman
Copy link
Member

I can give this a shot, just not sure where to place the function. Would that be in generic.py as well? Since it can be used in DataFrameGroupBy as well.

@erfannariman
Copy link
Member

take

@rhshadrach
Copy link
Member Author

@erfannariman you may want to wait and see if there is general agreement to expand the API as is being proposed here. No problem with putting up a PR before that, but just be aware that the idea as a whole may be opposed.

@erfannariman
Copy link
Member

@erfannariman you may want to wait and see if there is general agreement to expand the API as is being proposed here. No problem with putting up a PR before that, but just be aware that the idea as a whole may be opposed.

Sure, will unassign for now.

@erfannariman erfannariman removed their assignment Sep 21, 2020
@dsaxton dsaxton removed the Needs Triage Issue that has not been reviewed by a pandas team member label Sep 21, 2020
@phofl
Copy link
Member

phofl commented Oct 1, 2020

As it is now, there is a bug when having a Series and setting as_index=False or the documentation is not clear. When doing

df = pd.DataFrame(
    {"key": ["x", "y", "z", "x", "y", "z"], "val": [1.0, 0.8, 2.0, 3.0, 3.6, 0.75]}
).set_index("key")

grouped = df.groupby("key", as_index=False)["val"]
result = grouped.agg("min")

The index vanishes instead of doing nothing.

    val
0  1.00
1  0.80
2  0.75

Is probably not relevant, if we extend the API. If not, this should be fixed or the docs should be more specific

@rhshadrach rhshadrach added the Needs Discussion Requires discussion from core team before further action label Oct 4, 2020
@rhshadrach
Copy link
Member Author

@WillAyd @jreback @jorisvandenbossche @jbrockmendel Any thoughts here?

@jbrockmendel
Copy link
Member

As it is now, there is a bug when having a Series and setting as_index=False or the documentation is not clear. When doing

To the extent that this can be addressed separately, we should do that. i.e. i don't want to have to think about two things at the same time

IIRC there was a discussion of deprecating as_index altogether, but im having trouble finding it.

@rhshadrach
Copy link
Member Author

@jbrockmendel Issue is #35860, I closed upon seeing mild opposition that ultimately included myself.

@WillAyd
Copy link
Member

WillAyd commented Oct 8, 2020

I am -0 with maybe a slight -1 on this, only because this keyword isn't very clearly defined and should probably be deprecated instead of invested in further. I assume what you are looking for is just df.groupby(...).reset_index() at the end of the day right?

@rhshadrach
Copy link
Member Author

@WillAyd Please feel free to reopen #35860 if you think deprecating as_index warrants further discussion. To separate this issue from that, it seems best to me to assume here that as_index will not be deprecated. Assuming that, are you still -0/-1 here?

@WillAyd
Copy link
Member

WillAyd commented Oct 19, 2020

Yea I'd still be -1 on investing a lot in this keyword. I'll reopen the other issue

@rhshadrach
Copy link
Member Author

rhshadrach commented Dec 28, 2022

This is getting in the way of cleaning up groupby (e.g. #46944). The issue is that when as_index=False, we return a DataFrameGroupBy instead of a SeriesGroupBy upon selecting a single column. So you wind up in what I think is the unexpected case where you have a DataFrameGroupBy with self.obj being a Series. This also gives rise to a difference between _selected_obj and _obj_with_exclusions.

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
gb = df.groupby('a', as_index=False)['b']
with gb._group_selection_context():
    # This is a Series
    print(gb._selected_obj)
    # 0    3
    # 1    4
    # 2    5
    # Name: b, dtype: int64

    # This is a DataFrame
    print(gb._obj_with_exclusions)
    #    b
    # 0  3
    # 1  4
    # 2  5

I don't believe our test coverage for this case is very good and I'd guess that this creates inconsistencies in results or even leads to outright errors.

Because supporting as_index=False in SeriesGroupBy is straightforward and will lead to cleaner and more maintainable code overall, I think we should pursue this even if we eventually decide to remove as_index altogether.

cc @WillAyd

@WillAyd
Copy link
Member

WillAyd commented Dec 28, 2022

Sounds good @rhshadrach - let's run with it

@anmyachev
Copy link
Contributor

Because supporting as_index=False in SeriesGroupBy is straightforward and will lead to cleaner and more maintainable code overall, I think we should pursue this even if we eventually decide to remove as_index altogether.

I also find this behavior more consistent.

In pandas 2.0.2, SeriesGroupBy is allowed to take as_index=False for example in _gotitem method:

return SeriesGroupBy(
subset,
self.keys,
level=self.level,
grouper=self.grouper,
exclusions=self.exclusions,
selection=key,
as_index=self.as_index,
sort=self.sort,

I see no reason to throw an exception here:

pandas/pandas/core/series.py

Lines 2078 to 2079 in 027adc5

if not as_index:
raise TypeError("as_index=False only valid with DataFrame")

@rhshadrach are you still planning to make this change?

@rhshadrach
Copy link
Member Author

rhshadrach commented Jun 14, 2023

@anmyachev - in the time since my previous comment, as_index is now implemented throughout SeriesGroupBy but can only be accessed indirectly via DataFrameGroupBy.

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
gb = df.groupby('a', as_index=False)['b']
print(gb.sum())
#    a  b
# 0  1  7
# 1  2  5

I would be supportive of adding as_index to SeriesGroupBy; it seems generally useful. A PR implementing this would need thorough tests.

@kbruegge
Copy link

I just stumbled upon this. Trying as_index=False in a series groupby.
The docs seem to say that its a supported flag for series groupby as well. Am I reading the docs wrong?

TypeError: as_index=False only valid with DataFrame

Pandas 2.2.0
Python 3.11.5
Linux

@rhshadrach
Copy link
Member Author

rhshadrach commented Apr 17, 2024

The docs, as you quoted in your comment, say you cannot use as_index=False with SeriesGroupBy.

@kbruegge
Copy link

Hey thanks for the reply. I took another close look. I attached a screenshot from the webpage. At least to me it looks like at first glance it should be supported.

Screenshot 2024-07-16 at 12 09 05

Now taking a closer look at the parameter description seems to mention the fact that its "Only relevant for DataFrame input." Is that what you're referring to? It seems a little confusing to me personally but maybe I'm missing something. Any clarification would be greatly appreciated.

Screenshot 2024-07-16 at 12 11 33

@rhshadrach
Copy link
Member Author

Now taking a closer look at the parameter description seems to mention the fact that its "Only relevant for DataFrame input." Is that what you're referring to?

Yes.

It seems a little confusing to me personally but maybe I'm missing something. Any clarification would be greatly appreciated.

I am not understanding what you are finding confusing, could you expand on this any further?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Groupby Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

8 participants