-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Series.argmax() fails with np.inf #16449
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -486,23 +486,23 @@ def reduction(values, axis=None, skipna=True): | |
nanmax = _nanminmax('max', fill_value_typ='-inf') | ||
|
||
|
||
@disallow('O') | ||
def nanargmax(values, axis=None, skipna=True): | ||
""" | ||
Returns -1 in the NA case | ||
""" | ||
values, mask, dtype, _ = _get_values(values, skipna, fill_value_typ='-inf', | ||
isfinite=True) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would actually be ok with NOT allowing this for strings at all. IOW you can use the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure what you mean by “it doesn't work now anyhow”? I think the build failures were just a linting error that I missed locally; trying again. I think the argument for allowing it with strings is consistency: right now, the Pandas version of data = ['spam', 'eggs']
data_a = np.array(data, dtype='object')
data_s = pd.Series(data)
max(data) # -> 'spam'
data_a.max() # -> 'spam'
data_s.max() # -> 'spam'
# So far so good
data.index(max(data)) # -> 0
data_a.argmax() # -> 0
data_s.argmax() #-> ValueError: could not convert string to float: 'eggs' This is confusing; we expect data_s.argmax() to return 0 just like everything else. After this PR, it does return 0, as expected. It's true that if we change the example above to use mixed data types, like data2 = ['spam', 'eggs', None]
data2_a = np.array(data2, dtype='object')
data2_s = pd.Series(data2)
max(data2) # -> 'spam' under Py2, TypeError under Py3
data2_a.max() # -> 'spam' under Py2, TypeError under Py3
data2_s.max() # -> 'spam' under Py2, TypeError under Py3
data2.index(max(data2)) # -> 0 under Py2, TypeError under Py3
data2_a.argmax() # -> 0 under Py2, TypeError under Py3 Now, with the current release of Pandas, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would be ok with allowing this, but it HAS to work with missing values. Everything else does and no reason for this not too. you simply mask, apply function (e.g. argmax), then replace the missing by -1 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it better to put in the fix for inf, then in a followup fix the strings. |
||
values, mask, dtype, _ = _get_values(values, skipna, fill_value_typ='-inf') | ||
result = values.argmax(axis) | ||
result = _maybe_arg_null_out(result, axis, mask, skipna) | ||
return result | ||
|
||
|
||
@disallow('O') | ||
def nanargmin(values, axis=None, skipna=True): | ||
""" | ||
Returns -1 in the NA case | ||
""" | ||
values, mask, dtype, _ = _get_values(values, skipna, fill_value_typ='+inf', | ||
isfinite=True) | ||
values, mask, dtype, _ = _get_values(values, skipna, fill_value_typ='+inf') | ||
result = values.argmin(axis) | ||
result = _maybe_arg_null_out(result, axis, mask, skipna) | ||
return result | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1857,3 +1857,50 @@ def test_op_duplicate_index(self): | |
result = s1 + s2 | ||
expected = pd.Series([11, 12, np.nan], index=[1, 1, 2]) | ||
assert_series_equal(result, expected) | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you add another tests which asserts the behavior for strings There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this is covered now, starting on line 1898 below |
||
@pytest.mark.parametrize( | ||
"test_input,error_type", | ||
[ | ||
(pd.Series([]), ValueError), | ||
|
||
# For strings, or any Series with dtype 'O' | ||
(pd.Series(['foo', 'bar', 'baz']), TypeError), | ||
(pd.Series([(1,), (2,)]), TypeError), | ||
|
||
# For mixed data types | ||
( | ||
pd.Series(['foo', 'foo', 'bar', 'bar', None, np.nan, 'baz']), | ||
TypeError | ||
), | ||
] | ||
) | ||
def test_assert_argminmax_raises(self, test_input, error_type): | ||
""" | ||
Cases where ``Series.argmax`` and related should raise an exception | ||
""" | ||
with pytest.raises(error_type): | ||
test_input.argmin() | ||
with pytest.raises(error_type): | ||
test_input.argmin(skipna=False) | ||
with pytest.raises(error_type): | ||
test_input.argmax() | ||
with pytest.raises(error_type): | ||
test_input.argmax(skipna=False) | ||
|
||
def test_argminmax_with_inf(self): | ||
# For numeric data with NA and Inf (GH #13595) | ||
s = pd.Series([0, -np.inf, np.inf, np.nan]) | ||
|
||
assert s.argmin() == 1 | ||
assert np.isnan(s.argmin(skipna=False)) | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if you do
I suspect this will fail. Can you test with this as well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, I didn't even know about this! So based on the notes on the options documentation page, it looks like Pandas used to treat floating point +/- inf, as well as nan, as missing values by default, and this behavior was changed at some point. That's a big help; will check this and update. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback can you confirm this is the expected behavior? s = pd.Series([0, -np.inf, np.inf, np.nan])
with pd.option_context('mode.use_inf_as_null', True):
s.argmax() # --> 0
s.argmax(skipna=False) # --> nan There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This looks like it's working; see line 1891 below |
||
assert s.argmax() == 2 | ||
assert np.isnan(s.argmax(skipna=False)) | ||
|
||
# Using old-style behavior that treats floating point nan, -inf, and | ||
# +inf as missing | ||
with pd.option_context('mode.use_inf_as_na', True): | ||
assert s.argmin() == 0 | ||
assert np.isnan(s.argmin(skipna=False)) | ||
assert s.argmax() == 0 | ||
np.isnan(s.argmax(skipna=False)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually we need to have a mini-section in api breaking changes that min/max now don't work on
object
dtypes.and if you can add a couple of tests (does any exiting ones break because of the
disallow('O')
changes?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay; I wrote something for the breaking API changes section. Ultimately, though, the difference is just that
argmax
used to throwValueError
and now throwsTypeError
with string data; is that an appropriate change for the breaking section?I added a couple of other tests for Series with object dtype. I haven't seen any existing tests break as a result of this change.