Skip to content

API: numeric_only in Series ops behavior #47500

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Tracked by #46560
rhshadrach opened this issue Jun 24, 2022 · 7 comments
Closed
Tracked by #46560

API: numeric_only in Series ops behavior #47500

rhshadrach opened this issue Jun 24, 2022 · 7 comments
Labels
API - Consistency Internal Consistency of API/Behavior Needs Discussion Requires discussion from core team before further action Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply
Milestone

Comments

@rhshadrach
Copy link
Member

rhshadrach commented Jun 24, 2022

Part of #46560

For Series and SeriesGroupBy ops (and perhaps others like resample, rolling, window, expanding, ewm), there is an inconsistency when numeric_only is passed to ops that have it as an argument:

ser = pd.Series([1, 2, 3])

print(ser.mean(numeric_only=True))
# Raises `NotImplementedError: Series.mean does not implement numeric_only.`

print(ser.mean(numeric_only=False))
# 2.0

print(ser.groupby([0, 0, 1]).mean(numeric_only=True))
# 0    1.5
# 1    3.0
# dtype: float64

print(ser.groupby([0, 0, 1]).mean(numeric_only=False))
# 0    1.5
# 1    3.0
# dtype: float64

I see three possible options:

  1. Ops raise NotImplementedError if numeric_only is truthy (the Series behavior)
  2. Ops ignore numeric_only when the dtype is numeric (the SeriesGroupBy behavior)
  3. Ops raise if anything is passed (I don't know of any ops that act like this currently)

I am in favor of 1 - I view passing in numeric_only=True as erroneous; it doesn't make sense to request this from a Series even if it is a no-op. On the other hand, I view passing in numeric_only=False as "requesting nothing", though perhaps it is slightly odd. Option 1 would also allow the default to be False across all ops (in pandas 2.0).

@rhshadrach rhshadrach added Needs Discussion Requires discussion from core team before further action API - Consistency Internal Consistency of API/Behavior Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply labels Jun 24, 2022
@mroeschke
Copy link
Member

When interpreting the Include only float, int, boolean columns docstring for this Series case, I would interpret numeric_only as a filter that could behave like:

  1. A Series has no columns therefore this should raise a NotImplementedError if numeric_only=True/False (so case 3)
  2. A Series is like a 1D columns therefore the behavior should be like a 1 column DataFrame

e.g.

In [9]: pd.DataFrame(["foo", "bar"]).mean(numeric_only=True)
Out[9]: Series([], dtype: float64)

In [10]: pd.DataFrame(["foo", "bar"]).mean(numeric_only=False)
TypeError: Could not convert ['foobar'] to numeric

In [11]: pd.DataFrame([1, 2]).mean(numeric_only=False)
Out[11]:
0    1.5
dtype: float64

In [12]: pd.DataFrame([1, 2]).mean(numeric_only=True)
Out[12]:
0    1.5
dtype: float64

IMO given the docstring I think users could be a little surprised Series([1, 2, 3]).mean(numeric_only=True) raises because "hey the data there in numeric"

@mroeschke
Copy link
Member

Side note regarding the Include only float, int, boolean **columns** docstring for numeric_only have you explored the effect of when axis=1 is specified with these reduction functions?

@rhshadrach
Copy link
Member Author

IMO given the docstring I think users

Thanks @mroeschke; I think we should potentially consider modifying the docstring (for 2.0). I am wondering if given that option changes your preferences here.

have you explored the effect of when axis=1 is specified with these reduction functions?

A quick check and I'm seeing Series._reduce, Series.groupby, Series._accum_func, Series.fillna all using _get_axis_number which raises ValueError: No axis named 1 for object type Series. I think the question might be more interesting for DataFrame (which is maybe what you're asking?). I've added it to #46560 and plan to circle back with a more detailed look.

@mroeschke
Copy link
Member

I think we should potentially consider modifying the docstring (for 2.0). I am wondering if given that option changes your preferences here.

Most definitely. I haven't dove deeply into the numeric_only topic as you have, but currently IMO numeric_only appears as it should act like "filter based on these types before applying the reduction" which might be an incorrect view.

I think the question might be more interesting for DataFrame (which is maybe what you're asking?)

Yes, this is more specifically what I was asking. I think for Series the ValueError: No axis named 1 for object type Series is expected.

@rhshadrach
Copy link
Member Author

That makes sense. Allowing numeric_only=True to work on Series with numeric dtype also allows a user to have an op that will raise if a dtype is non-numeric, which perhaps could be useful. ALso, it's a smaller (and easier) change. I'm going to go with your suggestion.

@rhshadrach
Copy link
Member Author

rhshadrach commented Jul 4, 2022

From #47561 (review) and #47561 (comment) I ran a git bisect

730b307 is the first bad commit
commit 730b307
Author: Richard Shadrach 45562402+rhshadrach@users.noreply.github.com
Date: Wed Jun 8 12:25:05 2022 -0400

ENH: Add numeric_only to window ops (#47265)

On both main and 1.4.x, ops like sum raise on any args and ignore any kwargs except for "axis", "dtype", "out" (numeric_only does not go into kwargs). If those three kwargs are provided, the op raises.

I think we should deprecate unused args/kwargs; but it's less clear how to handle numeric_only. If we are to still include numeric_only in rolling ops for 1.5, I do think they should raise when used with Series and non-numeric dtypes. Technically this is a breaking change, but one where I think an exception is warranted (pun intended).

@rhshadrach
Copy link
Member Author

This was discussed in the July dev meeting and consensus was that raising in this situation for 1.5 is preferred.

zhengruifeng pushed a commit to apache/spark that referenced this issue Aug 21, 2023
…0 and enabling tests

### What changes were proposed in this pull request?

This PR proposes to match the behavior with pandas 2.0.0 and above for stat functions, such as `sum`, `quantile`, `prod`, etc. See pandas-dev/pandas#41480 and pandas-dev/pandas#47500 for more detail.

### Why are the changes needed?

To match the behavior to latest pandas.

### Does this PR introduce _any_ user-facing change?

Yes, the behaviors for stat funcs are now matched with pandas 2.0.0 and above.

### How was this patch tested?

Enabling & updating the existing UTs.

Closes #42526 from itholic/pandas_stat.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
valentinp17 pushed a commit to valentinp17/spark that referenced this issue Aug 24, 2023
…0 and enabling tests

### What changes were proposed in this pull request?

This PR proposes to match the behavior with pandas 2.0.0 and above for stat functions, such as `sum`, `quantile`, `prod`, etc. See pandas-dev/pandas#41480 and pandas-dev/pandas#47500 for more detail.

### Why are the changes needed?

To match the behavior to latest pandas.

### Does this PR introduce _any_ user-facing change?

Yes, the behaviors for stat funcs are now matched with pandas 2.0.0 and above.

### How was this patch tested?

Enabling & updating the existing UTs.

Closes apache#42526 from itholic/pandas_stat.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this issue Mar 2, 2024
…0 and enabling tests

### What changes were proposed in this pull request?

This PR proposes to match the behavior with pandas 2.0.0 and above for stat functions, such as `sum`, `quantile`, `prod`, etc. See pandas-dev/pandas#41480 and pandas-dev/pandas#47500 for more detail.

### Why are the changes needed?

To match the behavior to latest pandas.

### Does this PR introduce _any_ user-facing change?

Yes, the behaviors for stat funcs are now matched with pandas 2.0.0 and above.

### How was this patch tested?

Enabling & updating the existing UTs.

Closes apache#42526 from itholic/pandas_stat.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Needs Discussion Requires discussion from core team before further action Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply
Projects
None yet
Development

No branches or pull requests

2 participants