-
-
Notifications
You must be signed in to change notification settings - Fork 19.2k
BUG: Validate numeric_only parameter in groupby aggregations #62842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add type check for numeric_only parameter in _cython_agg_general - Raise ValueError if numeric_only is not a boolean - Add test case for validation - Closes pandas-dev#62778
|
If there are any feedbacks or issues here, please do let me know. It's my first time contributing to this repo! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Can you also add a note to the whatsnew for 3.0.
I've also updated your title to adhere to the contribution guidelines: https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#making-a-pull-request
pandas/core/groupby/groupby.py
Outdated
| if(isinstance(numeric_only, bool)): | ||
| data = self._get_data_to_aggregate(numeric_only=numeric_only, name=how) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change negate this condition and raise here; then the rest of this function can go untouched which makes the diff smaller. It also decreases the amount of indentation needed, improving readability.
Also use is_bool from pandsa.core.dtypes.common. We should accept e.g. np.bool here.
pandas/core/groupby/groupby.py
Outdated
| # that goes through SeriesGroupBy | ||
|
|
||
| data = self._get_data_to_aggregate(numeric_only=numeric_only, name=how) | ||
| # Check to confirm numeric_only is fed either True or False and no other data type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment repeats the code, can you remove it.
pandas/core/groupby/groupby.py
Outdated
| ): | ||
| # Note: we never get here with how="ohlc" for DataFrameGroupBy; | ||
| # that goes through SeriesGroupBy | ||
| # that goes through SeriesGroupBy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you revert this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still needs to be reverted in order to minimize the diff.
…mment on line 1759 and reverted the comment change on line 1757
|
Hey there @rhshadrach ! I've made all the 3 changes you had requested. Please let me know if there's any other issue with anything in my PR. Thanks! |
I'm sorry to ask but where do I find whatsnew for 3.0? |
No problem! See here: https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#documenting-your-code |
|
@rhshadrach Just added the note in the respective file. Do let me know if there's something else left from my end. Cheers! |
pandas/core/groupby/groupby.py
Outdated
| ): | ||
| # Note: we never get here with how="ohlc" for DataFrameGroupBy; | ||
| # that goes through SeriesGroupBy | ||
| # that goes through SeriesGroupBy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still needs to be reverted in order to minimize the diff.
| """ | ||
| Test that numeric_only parameter only accepts boolean values. | ||
| See GH#62778 | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| """ | |
| Test that numeric_only parameter only accepts boolean values. | |
| See GH#62778 | |
| """ | |
| # GH#62778 |
| """ | ||
| df = pd.DataFrame({"A": range(5), "B": range(5)}) | ||
|
|
||
| # These test cases should raise a ValueError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove this comment, it repeats the code.
| # These test cases should work absolutely fine | ||
| df.groupby(["A"]).mean() | ||
| df.groupby(["A"]).mean(numeric_only=True) | ||
| df.groupby(["A"]).mean(numeric_only=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove these; the test suite has many tests for numeric_only being specified or not specified, these are not increasing our test coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should I remove this whole function from the test file or just these 3 lines of code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just these four lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach Made the suggested changes. I had made the changes in the comment inside groupby.py file but forgot to add it while doing git add, hence, it didn't show up in the changes here.
This version should be fine. Please let me know if any other changes are there.
… from test_reductions.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
doc/source/whatsnew/v3.0.0.rst
Outdated
| - Bug in :meth:`.DataFrameGroupBy.groups` and :meth:`.SeriesGroupBy.groups` would fail when the groups were :class:`Categorical` with an NA value (:issue:`61356`) | ||
| - Bug in :meth:`.DataFrameGroupBy.groups` and :meth:`.SeriesGroupby.groups` that would not respect groupby argument ``dropna`` (:issue:`55919`) | ||
| - Bug in :meth:`.DataFrameGroupBy.median` where nat values gave an incorrect result. (:issue:`57926`) | ||
| - Bug in :meth:`.DataFrameGroupBy` reductions where boolean-valued inputs were mishandled in the Cython aggregation path (``_cython_agg_general``); adding an ``is_bool`` check fixes incorrect results for some bool inputs. (:issue:`62778`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only mention public things in the whatsnew. Also "Boolean-valued inputs" makes it sounds like there was an issue with True and False.
| - Bug in :meth:`.DataFrameGroupBy` reductions where boolean-valued inputs were mishandled in the Cython aggregation path (``_cython_agg_general``); adding an ``is_bool`` check fixes incorrect results for some bool inputs. (:issue:`62778`) | |
| - Bug in :meth:`.DataFrameGroupBy` reductions where non-Boolean values were allowed for the ``numeric_only`` argument; passing a non-Boolean value will now raise (:issue:`62778`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let me correct that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach Made the required changes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, ping on green.
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.Description
This PR fixes issue #62778 where
groupbyaggregation methods (likemean(),sum(),std(), etc.) were incorrectly accepting non-boolean values for thenumeric_onlyparameter.Problem
The
numeric_onlyparameter was only being checked for truthiness/falsiness, allowing invalid inputs like lists, strings, or integers to be passed without raising an error.Solution
_cython_agg_general()method inpandas/core/groupby/groupby.pyValueErrorwith message "numeric_only accepts only Boolean values" when a non-boolean value is providedpandas/tests/groupby/test_reductions.pyExample
Before (incorrect behavior):
After (correct behavior):
Valid usage still works: