Skip to content

BUG: min on non-numeric data with nans #4147

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
hayd opened this issue Jul 6, 2013 · 26 comments
Open

BUG: min on non-numeric data with nans #4147

hayd opened this issue Jul 6, 2013 · 26 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Reduction Operations sum, mean, min, max, etc.

Comments

@hayd
Copy link
Contributor

hayd commented Jul 6, 2013

related #4279
related #5967

Min doesn't seem to work as expected with NaNs and non-numeric data:

In [1]: s = pd.Series(['alpha', np.nan, 'charlie', 'delta'])

In [2]: s
Out[2]:
0      alpha
1        NaN
2    charlie
3      delta
dtype: object

In [3]: s.min()  # by default docstring suggests this should skipnan
Out[3]: inf

The hack/workaround is to exclude them, perhaps we should special case this in the code:

In [4]: s[s.notnull()].min()
Out[4]: 'alpha'

From discussion on this pandas SO question. and I also posted a corresponding numpy issue:
numpy/numpy#3508
.

@hayd
Copy link
Contributor Author

hayd commented Jul 6, 2013

@jreback Think we should fix this on the pandas side, can you think of a cheaper/efficient way to do this.

Sticking this at the top of min/max/more?

if s.dtype == object and np.nan in s.values:
   s = s[s.notnull()]

@jreback
Copy link
Contributor

jreback commented Jul 6, 2013

from core/nanops.py
easy change

@bottleneck_switch()
def nanmin(values, axis=None, skipna=True):
    values, mask, dtype = _get_values(values, skipna, fill_value_typ = '+inf')

    # numpy 1.6.1 workaround in Python 3.x
    if values.dtype == np.object)
_
            if sys.version_info[0] >= 3):  # pragma: no cover
                import __builtin__
                if values.ndim > 1:
                   apply_ax = axis if axis is not None else 0
                   result = np.apply_along_axis(__builtin__.min, apply_ax, values)
                else:
                   result = __builtin__.min(values)

             ##### add something like this #####
             else:
                 result = values[mask].min()
    else:
        if ((axis is not None and values.shape[axis] == 0)
                or values.size == 0):
            result = com.ensure_float(values.sum(axis))
            result.fill(np.nan)
        else:
            result = values.min(axis)

    result = _wrap_results(result,dtype)
    return _maybe_null_out(result, axis, mask)

@jreback
Copy link
Contributor

jreback commented Jul 6, 2013

also can prob also do this if regardless of skipna (I don't think that makes sense for object type anyhow)

@hayd
Copy link
Contributor Author

hayd commented Jul 6, 2013

I think there might just be a typo in nanops._get_values (a isnull rather than a notnull).

@hayd
Copy link
Contributor Author

hayd commented Jul 6, 2013

maybe not

@hayd
Copy link
Contributor Author

hayd commented Jul 8, 2013

@jreback fancy throwing two cents to the numpy issue/discussion?

Will have another go making the "easy change" this evening :). Bit confusing as _get_values seems to return the opposite mask then you'd expect (i.e. the null values are True).

@jreback
Copy link
Contributor

jreback commented Jul 8, 2013

yeh that was how it was as far as i can remember

prob should change it as its only internal to this module

@miketkelly
Copy link

Related to #4006.

@jreback
Copy link
Contributor

jreback commented Jul 14, 2013

why do u think these are related?

@miketkelly
Copy link

In that a solution to this is likely to solve the other. They only differ in how null is represented in the series.

@jreback
Copy link
Contributor

jreback commented Jul 14, 2013

your solution to the other issue looks ok
but this is different, and a numpy bug that needs working around

@hayd
Copy link
Contributor Author

hayd commented Aug 22, 2013

Will have another go at this at the weekend. (Maybe I don't understand your "easy change" @jreback )

@jreback
Copy link
Contributor

jreback commented Aug 22, 2013

hah!

look at core/nanops

@jankatins
Copy link
Contributor

If this is changes, Categorical.min also needs to change (currently it ignores skipna)

@cpcloud
Copy link
Member

cpcloud commented Jun 24, 2014

def not an easy change :)

i'll put up a pr, but this is a terrible hack around the insanity of

'a' > inf and 'a' > -inf == True  # ???

@jreback
Copy link
Contributor

jreback commented Jun 24, 2014

I think you might need to do something like this, e.g. order strings/non-strings separately: https://github.com/pydata/pandas/blob/master/pandas/core/algorithms.py#L144

take the min of both then do some heuristc

@jreback jreback removed this from the Next Major Release milestone Apr 21, 2016
@jreback jreback modified the milestones: 0.18.2, Next Major Release Jul 6, 2016
@facaiy
Copy link
Contributor

facaiy commented Aug 13, 2016

The bug behaves differently now.

In [4]: s = pd.Series(['alpha', np.nan, 'charlie', 'delta'])

In [5]: s
Out[5]:
0      alpha
1        NaN
2    charlie
3      delta
dtype: object

In [6]: s.min()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/facaiyan/Workshop/pandas/pandas/core/nanops.py in f(values, axis, skipna, **kwds)
     99                 else:
--> 100                     result = alt(values, axis=axis, skipna=skipna, **kwds)
    101             except Exception:

/Users/facaiyan/Workshop/pandas/pandas/core/nanops.py in reduction(values, axis, skipna)
    439         else:
--> 440             result = getattr(values, meth)(axis)
    441

/Users/facaiyan/Library/anaconda3/lib/python3.5/site-packages/numpy/core/_methods.py in _amin(a, axis, out, keepdims)
     28 def _amin(a, axis=None, out=None, keepdims=False):
---> 29     return umr_minimum(a, axis, None, out, keepdims)
     30

TypeError: unorderable types: str() <= float()

@jreback
Copy link
Contributor

jreback commented Aug 15, 2016

@ningchi yes, this now needs to mask the nulls first

@gregorylivschitz
Copy link

@jreback

What do you mean by mask here? I know for floats and strings right now the nulls get masked to INF for min and -INF for max.

What would we do for the strings though, what kind of values did you have in mind?

Are pull requests welcome?

@jorisvandenbossche
Copy link
Member

@mrpoor pull requests are certainly welcome to fix this. With masking is meant that those values are not used when calculating the min or max (like s[s.notnull()].min() in user API)

@gregorylivschitz
Copy link

@jorisvandenbossche

Where would you put the tests? Does test_timeseries.py sound good?

@jorisvandenbossche
Copy link
Member

This is not timeseries related, so you can put somewhere in pandas/tests/series (possibly test_analytics.py, you should look where the existing tests for min() are located)

@Kodiologist
Copy link
Contributor

With a Categorical Series, I see this bug for min but not for max. With DataFrames of Categoricals, on the other hand, it's visible for both min and max.

import pandas as pd
from pandas.api.types import CategoricalDtype as CD

x = pd.Series(list("abcaa"), dtype=CD(ordered = True))
print(x.min(), x.max())    # => a c
xn = x.copy()
xn[1] = None
print(xn.min(), xn.max())  # => NaN c
y = pd.Series(list("cabdd"), dtype=CD(ordered = True))
print(pd.DataFrame({"x": x, "y": y}).min(axis = 1))   # => a a b a a
print(pd.DataFrame({"x": x, "y": y}).max(axis = 1))   # => c b c d d
print(pd.DataFrame({"xn": xn, "y": y}).min(axis = 1)) # => all NaN
print(pd.DataFrame({"xn": xn, "y": y}).max(axis = 1)) # => all NaN

@Remnan13
Copy link

@Kodiologist could this be because the df isn't separated into multiple elements/indices? (New member, sorry if I'm behind the ball but I'm trying to catch up!)

@Kodiologist
Copy link
Contributor

I'm afraid I don't understand your question. The DataFrames in my example do have multiple elements. But I don't know much about pandas internals, anyway. I've only submitted one PR, which was back in 2015.

@Remnan13
Copy link

Remnan13 commented Oct 18, 2018 via email

@jbrockmendel jbrockmendel added the Reduction Operations sum, mean, min, max, etc. label Sep 22, 2020
@jbrockmendel jbrockmendel removed the Numeric Operations Arithmetic, Comparison, and Logical operations label Dec 28, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Reduction Operations sum, mean, min, max, etc.
Projects
None yet
Development

No branches or pull requests