BUG: min on non-numeric data with nans #4147
Comments
@jreback Think we should fix this on the pandas side; can you think of a cheaper/more efficient way to do this? Sticking this at the top of min/max/more?
|
from core/nanops.py
|
can prob also do this regardless of skipna (I don't think that makes sense for |
I think there might just be a typo in |
maybe not |
@jreback fancy throwing two cents to the numpy issue/discussion? Will have another go making the "easy change" this evening :). Bit confusing as |
yeh, that was how it was as far as I can remember; prob should change it, as it's only internal to this module |
Related to #4006. |
why do u think these are related? |
In that a solution to this is likely to solve the other. They only differ in how null is represented in the series. |
your solution to the other issue looks ok |
Will have another go at this at the weekend. (Maybe I don't understand your "easy change" @jreback ) |
hah! look at core/nanops |
If this is changed, |
def not an easy change :) i'll put up a pr, but this is a terrible hack around the insanity of 'a' > inf and 'a' > -inf == True # ??? |
I think you might need to do something like this, e.g. order strings/non-strings separately: https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L144, take the min of both, then apply some heuristic |
The bug behaves differently now.
In [4]: s = pd.Series(['alpha', np.nan, 'charlie', 'delta'])
In [5]: s
Out[5]:
0 alpha
1 NaN
2 charlie
3 delta
dtype: object
In [6]: s.min()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/Users/facaiyan/Workshop/pandas/pandas/core/nanops.py in f(values, axis, skipna, **kwds)
99 else:
--> 100 result = alt(values, axis=axis, skipna=skipna, **kwds)
101 except Exception:
/Users/facaiyan/Workshop/pandas/pandas/core/nanops.py in reduction(values, axis, skipna)
439 else:
--> 440 result = getattr(values, meth)(axis)
441
/Users/facaiyan/Library/anaconda3/lib/python3.5/site-packages/numpy/core/_methods.py in _amin(a, axis, out, keepdims)
28 def _amin(a, axis=None, out=None, keepdims=False):
---> 29 return umr_minimum(a, axis, None, out, keepdims)
30
TypeError: unorderable types: str() <= float() |
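For anyone reading the traceback above: the missing entry is stored as a float NaN, and Python 3 refuses to order a str against a float, which is what the reduction trips over. A minimal sketch of that comparison in plain Python (the helper name `is_orderable` is hypothetical, used only for illustration, and is not part of pandas or numpy):

```python
nan = float("nan")  # stand-in for the missing entry in the object Series


def is_orderable(a, b):
    """Hypothetical helper: True if `a <= b` can be evaluated at all."""
    try:
        a <= b
    except TypeError:
        # Python 3 raises here for str vs float instead of picking an order.
        return False
    return True


print(is_orderable("alpha", "charlie"))  # str vs str
print(is_orderable("alpha", nan))        # str vs float
print(is_orderable(1.0, nan))            # float vs float
```

Only the mixed str/float pair fails, so a min over an object array that still contains the NaN cannot complete.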
@ningchi yes, this now needs to mask the nulls first |
What do you mean by mask here? I know for floats and strings right now the nulls get masked to INF for min and -INF for max. What would we do for the strings though, what kind of values did you have in mind? Are pull requests welcome? |
@mrpoor pull requests are certainly welcome to fix this. Masking means that those values are not used when calculating the min or max (like |
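To make the masking idea concrete, here is a rough sketch, assuming the nulls are simply filtered out before the reduction rather than replaced by a sentinel such as inf. The function `masked_min` is hypothetical and is not how `core/nanops.py` is actually written; it only uses `pd.isna` to build the mask.

```python
import numpy as np
import pandas as pd


def masked_min(values):
    """Hypothetical helper: min of object data, ignoring NaN/None entries."""
    arr = np.asarray(values, dtype=object)
    mask = pd.isna(arr)   # True where the entry is missing
    kept = arr[~mask]     # only the non-null values take part in the min
    if kept.size == 0:
        return np.nan     # nothing left to compare
    return min(kept)


print(masked_min(["alpha", np.nan, "charlie", "delta"]))
```

Because the nulls never reach the comparison, no str is ever ordered against a float, and no string-valued sentinel is needed.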
Where would you put the tests? Does test_timeseries.py sound good? |
This is not timeseries related, so you can put them somewhere in |
With a Categorical Series, I see this bug for
import pandas as pd
from pandas.api.types import CategoricalDtype as CD
x = pd.Series(list("abcaa"), dtype=CD(ordered = True))
print(x.min(), x.max()) # => a c
xn = x.copy()
xn[1] = None
print(xn.min(), xn.max()) # => NaN c
y = pd.Series(list("cabdd"), dtype=CD(ordered = True))
print(pd.DataFrame({"x": x, "y": y}).min(axis = 1)) # => a a b a a
print(pd.DataFrame({"x": x, "y": y}).max(axis = 1)) # => c b c d d
print(pd.DataFrame({"xn": xn, "y": y}).min(axis = 1)) # => all NaN
print(pd.DataFrame({"xn": xn, "y": y}).max(axis = 1)) # => all NaN |
@Kodiologist could this be because the df isn't separated into multiple elements/indices? (New member, sorry if I'm behind the ball but I'm trying to catch up!) |
I'm afraid I don't understand your question. The DataFrames in my example do have multiple elements. But I don't know much about pandas internals, anyway. I've only submitted one PR, which was back in 2015. |
Gotcha, the issue was still open and I was curious if it's been resolved! I'll poke around a little more and see if I can be more accurate and specific
|
related #4279
related #5967
Min doesn't seem to work as expected with NaNs and non-numeric data:
The hack/workaround is to exclude them; perhaps we should special-case this in the code:
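A minimal sketch of that workaround, assuming an object-dtype Series like the one discussed in this thread (the original snippet is not preserved here, so this is a reconstruction of the idea, not the reporter's exact code):

```python
import numpy as np
import pandas as pd

s = pd.Series(["alpha", np.nan, "charlie", "delta"], dtype=object)

# Drop the missing entries explicitly, then reduce over what is left.
non_null = s.dropna()
smallest = non_null.min()
largest = non_null.max()
print(smallest, largest)
```

With the NaN removed up front, only strings are compared with each other.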
From discussion on this pandas SO question. I also posted a corresponding numpy issue: numpy/numpy#3508.