Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: clip where the bound is a series with NA values returns NA #40420

Closed
zhangyingmath opened this issue Mar 13, 2021 · 7 comments · Fixed by #40927
Closed

BUG: clip where the bound is a series with NA values returns NA #40420

zhangyingmath opened this issue Mar 13, 2021 · 7 comments · Fixed by #40927
Assignees
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@zhangyingmath
Copy link

zhangyingmath commented Mar 13, 2021

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.2.3'

In [3]: tmp = pd.DataFrame({'f': [1,2]})

In [4]: import numpy as np

In [5]: tmp['h'] = np.nan
#This is the expected behaviour
In [6]: tmp['f'].clip(0, np.inf)
Out[6]: 
0    1
1    2
Name: f, dtype: int64
# This seems wrong:
In [7]: tmp['f'].clip(0, tmp['h'])
Out[7]: 
0   NaN
1   NaN
Name: f, dtype: float64
@dsaxton
Copy link
Member

dsaxton commented Mar 14, 2021

This output seems correct to me. Why would an upper bound of np.nan behave the same way as np.inf?

@zhangyingmath
Copy link
Author

Hi @dsaxton, please allow me to break my comments in two layers:

  1. There is an inconsistency in pandas clip: as pointed out in the thread above, if you do df[col].clip(0, np.nan), it ignore the NA. Or, as you said, it gives you identical result as df[col].clip(0, np.inf). No clipping happens. However, if you do df[col].clip(0, df[col2]) where df[col2] is nothing but NA, then you get NA in return. The NA propagate.

  2. Personally, I would prefer that NA values get ignored in clipping, regardless if it is a float or as part of a series/array. In pandas, by default, max, min, sum etc functions all ignore NA, so I have built up such an expectation. Personally I would feel it counter-intuitive and maybe even confusing if clip breaks this rule.

@dsaxton
Copy link
Member

dsaxton commented Mar 15, 2021

Ah, I didn't realize Series.clip(0, np.nan) does indeed ignore the np.nan (in your scalar example you're using np.inf as the upper bound not np.nan). That is a bug I would say:

[ins] In [7]: s
Out[7]: 
0    1
1    2
2    3
dtype: int64

[ins] In [8]: s.clip(0, np.nan)
Out[8]: 
0    1
1    2
2    3
dtype: int64

[ins] In [9]: s.clip(0, [np.nan, np.nan, np.nan])
Out[9]: 
0   NaN
1   NaN
2   NaN
dtype: float64

However, I think the correct behavior would be the last case where the result is NaN. If we're clipping with an unknown or invalid upper bound, the only correct answer would be unknown / invalid. I don't think this case is really analogous to taking the min or max of a Series containing NaN, since then it is the input where NaN is ignored, rather than an argument that defines what operation we want to perform.

@dsaxton dsaxton added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations labels Mar 15, 2021
@dsaxton
Copy link
Member

dsaxton commented Mar 15, 2021

Actually, according to an old issue it looks like your preference was actually the original intention: #17276. So if we were to stay true to that then the "correct" behavior would be to not return NaN for an array-like upper bound.

@zhangyingmath
Copy link
Author

Thanks @dsaxton for the information!

  1. You are exactly right. Sorry there was a typo in my original post. I meant to use np.nan, not np.inf as a scalar inside .clip().
  2. I think we are in consensus that no matter which choice we make, having consistent behavior between NA scalar or as part of a series is desirable.
  3. In the case of having NA in the input, should we return NA, or not do any clipping? In addition to staying compatible to original intention in BUG: clip should handle null values #17276, I would make an additional argument here for "no clipping": consider the following example:
    df = pd.DataFrame({'f': [1, 2], 'g': [np.nan, np.nan]})
    I would think that we want df['f'].clip(upper=df['g']) to give the same result as df[['f', 'g']].min(axis=1), since they are mathematically equivalent operations.

@dsaxton
Copy link
Member

dsaxton commented Mar 16, 2021

My guess would be that always treating np.nan as "no upper bound" (regardless of whether it's an array or scalar) would be the right way to fix this for now. If you wanted to work on a pull request, the part where it seems we're handling NaN is around here:

if not is_list_like(lower) and np.any(isna(lower)):

@DriesSchaumont
Copy link
Member

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants