@Zrahay Zrahay commented Oct 25, 2025

Description

This PR fixes issue #62778 where groupby aggregation methods (like mean(), sum(), std(), etc.) were incorrectly accepting non-boolean values for the numeric_only parameter.

Problem

The numeric_only parameter was only being checked for truthiness/falsiness, allowing invalid inputs like lists, strings, or integers to be passed without raising an error.
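
To make the failure mode concrete, here is a minimal sketch of a truthiness-only check. `select_columns_sketch` is a hypothetical stand-in written for illustration, not pandas code; it only mimics the pattern described above, where any truthy or falsy object is silently treated as True or False.

```python
def select_columns_sketch(columns, numeric_only=False):
    """Hypothetical stand-in for code that only tests numeric_only for truthiness."""
    # `columns` maps column name -> whether that column is numeric.
    if numeric_only:  # any truthy object takes this branch, not just True
        return [name for name, is_numeric in columns.items() if is_numeric]
    return list(columns)


columns = {"B": True, "C": False}

# A non-empty list is truthy, so it is silently treated like numeric_only=True.
print(select_columns_sketch(columns, ["B"]))  # ['B']

# An empty string is falsy, so it is silently treated like numeric_only=False.
print(select_columns_sketch(columns, ""))  # ['B', 'C']
```

Neither call raises, which is why a mistake such as passing a column list positionally can go unnoticed.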

Solution

  • Added explicit type validation in _cython_agg_general() method in pandas/core/groupby/groupby.py
  • Raises ValueError with message "numeric_only accepts only Boolean values" when a non-boolean value is provided
  • Added comprehensive test case in pandas/tests/groupby/test_reductions.py

Example

Before (incorrect behavior):

import pandas as pd
df = pd.DataFrame({"A": range(5), "B": range(5)})
df.groupby(["A"]).mean(["B"])  # No error: the list is silently accepted as numeric_only

After (correct behavior):

df.groupby(["A"]).mean(["B"])  # Raises ValueError: numeric_only accepts only Boolean values

Valid usage still works:

df.groupby(["A"]).mean()  # Works
df.groupby(["A"]).mean(numeric_only=True)  # Works
df.groupby(["A"]).mean(numeric_only=False)  # Works

- Add type check for numeric_only parameter in _cython_agg_general
- Raise ValueError if numeric_only is not a boolean
- Add test case for validation
- Closes pandas-dev#62778
@Zrahay Zrahay requested a review from rhshadrach as a code owner October 25, 2025 20:43
Author

Zrahay commented Oct 25, 2025

If you have any feedback or spot any issues here, please let me know. It's my first time contributing to this repo!

Member

@rhshadrach rhshadrach left a comment

Thanks for the PR! Can you also add a note to the whatsnew for 3.0.

I've also updated your title to adhere to the contribution guidelines: https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#making-a-pull-request

Comment on lines 1760 to 1761
if(isinstance(numeric_only, bool)):
    data = self._get_data_to_aggregate(numeric_only=numeric_only, name=how)
Member

Can you negate this condition and raise here? Then the rest of this function can go untouched, which makes the diff smaller. It also decreases the amount of indentation needed, improving readability.

Also use is_bool from pandas.core.dtypes.common. We should accept e.g. np.bool here.
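
The two requests above (raise early instead of wrapping the body in the check, and use `is_bool` so NumPy booleans are accepted) can be sketched roughly as follows. `aggregate_sketch` is a hypothetical stand-in for the method under review, not the actual pandas code. The `try`/`except` fallback is an assumption added only so the snippet runs without pandas installed. The error message is the one quoted in the PR description.

```python
try:
    # Public re-export of the helper the reviewer refers to.
    from pandas.api.types import is_bool
except ImportError:
    # Assumption: fallback so the sketch runs without pandas. Unlike the
    # pandas helper, it does not recognise NumPy booleans.
    def is_bool(obj):
        return isinstance(obj, bool)


def aggregate_sketch(values, numeric_only=False):
    """Hypothetical stand-in for the method under review."""
    # Guard clause: reject invalid input up front so the body below
    # keeps its original indentation and the diff stays small.
    if not is_bool(numeric_only):
        raise ValueError("numeric_only accepts only Boolean values")

    if numeric_only:
        values = [v for v in values if isinstance(v, (int, float))]
    return values


print(aggregate_sketch([1, "x", 2.5], numeric_only=True))  # [1, 2.5]

for bad in (["B"], "True", 1, None):
    try:
        aggregate_sketch([1, 2], numeric_only=bad)
    except ValueError as err:
        print(f"{bad!r}: {err}")
```

Compared with nesting the existing body under `if isinstance(numeric_only, bool):`, the guard clause leaves every line after it unchanged.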

# that goes through SeriesGroupBy

data = self._get_data_to_aggregate(numeric_only=numeric_only, name=how)
# Check to confirm numeric_only is fed either True or False and no other data type
Member

This comment repeats the code; can you remove it?

):
# Note: we never get here with how="ohlc" for DataFrameGroupBy;
# that goes through SeriesGroupBy
# that goes through SeriesGroupBy
Member

Can you revert this change.

Member

Still needs to be reverted in order to minimize the diff.

@rhshadrach rhshadrach changed the title from "[BUG]: Validate numeric_only parameter in groupby aggregations" to "BUG: Validate numeric_only parameter in groupby aggregations" Oct 25, 2025
@rhshadrach rhshadrach added the Groupby, Bug, and Error Reporting (Incorrect or improved errors from pandas) labels Oct 25, 2025
@rhshadrach rhshadrach added this to the 3.0 milestone Oct 25, 2025
…mment on line 1759 and reverted the comment change on line 1757
Author

Zrahay commented Oct 25, 2025

Hey there @rhshadrach! I've made all three changes you requested. Please let me know if there's any other issue with my PR.

Thanks!

Author

Zrahay commented Oct 25, 2025

> Thanks for the PR! Can you also add a note to the whatsnew for 3.0.

I'm sorry to ask but where do I find whatsnew for 3.0?

@rhshadrach
Member

> I'm sorry to ask but where do I find whatsnew for 3.0?

No problem! See here: https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#documenting-your-code

Author

Zrahay commented Oct 25, 2025

@rhshadrach Just added the note in the respective file. Do let me know if there's something else left from my end.

Cheers!

Comment on lines 1525 to 1528
"""
Test that numeric_only parameter only accepts boolean values.
See GH#62778
"""
Member

Suggested change
"""
Test that numeric_only parameter only accepts boolean values.
See GH#62778
"""
# GH#62778

"""
df = pd.DataFrame({"A": range(5), "B": range(5)})

# These test cases should raise a ValueError
Member

Can you remove this comment? It repeats the code.

Comment on lines 1542 to 1545
# These test cases should work absolutely fine
df.groupby(["A"]).mean()
df.groupby(["A"]).mean(numeric_only=True)
df.groupby(["A"]).mean(numeric_only=False)
Member

Can you remove these; the test suite has many tests for numeric_only being specified or not specified, these are not increasing our test coverage.

Author

So should I remove this whole function from the test file or just these 3 lines of code?

Member

@rhshadrach rhshadrach Oct 26, 2025

Just these four lines.

Author

@rhshadrach Made the suggested changes. I had already reverted the comment change in groupby.py but forgot to stage the file with git add, so it didn't show up in the earlier changes here.

This version should be fine. Please let me know if any other changes are needed.

Member

@rhshadrach rhshadrach left a comment

Looking good!

- Bug in :meth:`.DataFrameGroupBy.groups` and :meth:`.SeriesGroupBy.groups` would fail when the groups were :class:`Categorical` with an NA value (:issue:`61356`)
- Bug in :meth:`.DataFrameGroupBy.groups` and :meth:`.SeriesGroupby.groups` that would not respect groupby argument ``dropna`` (:issue:`55919`)
- Bug in :meth:`.DataFrameGroupBy.median` where nat values gave an incorrect result. (:issue:`57926`)
- Bug in :meth:`.DataFrameGroupBy` reductions where boolean-valued inputs were mishandled in the Cython aggregation path (``_cython_agg_general``); adding an ``is_bool`` check fixes incorrect results for some bool inputs. (:issue:`62778`)
Member

Only mention public things in the whatsnew. Also, "Boolean-valued inputs" makes it sound like there was an issue with True and False.

Suggested change
- Bug in :meth:`.DataFrameGroupBy` reductions where boolean-valued inputs were mishandled in the Cython aggregation path (``_cython_agg_general``); adding an ``is_bool`` check fixes incorrect results for some bool inputs. (:issue:`62778`)
- Bug in :meth:`.DataFrameGroupBy` reductions where non-Boolean values were allowed for the ``numeric_only`` argument; passing a non-Boolean value will now raise (:issue:`62778`)

Author

Sure, let me correct that.

Author

@rhshadrach Made the required changes!

Member

@rhshadrach rhshadrach left a comment

lgtm, ping on green.

Development

Successfully merging this pull request may close these issues.

BUG: groupby.<reduction>(numeric_only=) does not validate non-bool arguments