
BUG: CategoricalIndex.searchsorted doesn't return a scalar if input was scalar #21019

Closed
wants to merge 6 commits into from
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v0.23.1.txt
@@ -88,7 +88,8 @@ Indexing
- Bug in :meth:`MultiIndex.set_names` where error raised for a ``MultiIndex`` with ``nlevels == 1`` (:issue:`21149`)
- Bug in :class:`IntervalIndex` constructors where creating an ``IntervalIndex`` from categorical data was not fully supported (:issue:`21243`, :issue:`21253`)
- Bug in :meth:`MultiIndex.sort_index` which was not guaranteed to sort correctly with ``level=1``; this was also causing data misalignment in particular :meth:`DataFrame.stack` operations (:issue:`20994`, :issue:`20945`, :issue:`21052`)
-
- Bug in :func:`CategoricalIndex.searchsorted` where the method did not return a scalar when the input value was a scalar (:issue:`21019`)
Contributor:
can you move to 0.23.2

- Bug in :class:`CategoricalIndex` where slicing beyond the range of the data raised a ``KeyError`` (:issue:`21019`)

I/O
^^^
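For reference, a minimal reproduction of the two entries above (a sketch based on the issue report; the "before" behaviour is the reported bug, the "after" behaviour is what this change is meant to produce):

>>> import pandas as pd
>>> ci = pd.CategoricalIndex(list('aabcde'), ordered=True)
>>> ci.searchsorted('c')   # before the fix: array([3]); after: 3
>>> df = pd.DataFrame({'A': range(3)},
...                   index=pd.CategoricalIndex(list('aab'),
...                                             categories=list('abcde'),
...                                             ordered=True, name='B'))
>>> df.loc['a':'d']        # before the fix: KeyError; after: the full frame
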
2 changes: 2 additions & 0 deletions pandas/core/arrays/categorical.py
@@ -1342,6 +1342,8 @@ def searchsorted(self, value, side='left', sorter=None):

if -1 in values_as_codes:
raise ValueError("Value(s) to be inserted must be in categories.")
if is_scalar(value):
Contributor:
would rather do this in pandas/core/base.py/searchsorted

use is_scalar rather than a numpy function
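
(is_scalar here refers to the pandas helper, importable as pandas.api.types.is_scalar; a quick sketch of what it treats as scalar:)

>>> import numpy as np
>>> from pandas.api.types import is_scalar
>>> is_scalar('apple'), is_scalar(1), is_scalar(np.int8(1))
(True, True, True)
>>> is_scalar(['apple']), is_scalar(np.array(1))
(False, False)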

Member (Author):
  • the issue is rather with the helper function _get_codes_for_values which always returns an array. I didn't want to change it there since the way it is written right now only works for array like objects. In base.py we're already calling searchsorted directly on the numpy array, i.e. it obeys the in/output shape
  • I'm using is_scalar here, is this wrong? Are you referring to the np.asscalar? I couldn't find a suitable pandas function for that (other than ~ values[0])
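
(A small illustration of the point above, not the pandas internals themselves: numpy's searchsorted already mirrors the shape of its input, so the scalar-ness is only lost once the value has been converted to an array of codes.)

>>> import numpy as np
>>> arr = np.array([0, 0, 1, 2])
>>> arr.searchsorted(1)      # scalar in -> scalar out
2
>>> arr.searchsorted([1])    # array in -> array out
array([2])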

Contributor:
ok I c, change here is ok
don't use np.asscalar, rather use .item()

Contributor:
As I found out in #21699, numpy.searchsorted doesn't like Python ints, but needs numpy ints to achieve its speed.

>>> n = 1_000_000
>>> c = pd.Categorical(list('a' * n + 'b' * n + 'c' * n), ordered=True)
>>> %timeit c.codes.searchsorted(1)  # python int
7 ms ± 24.7 µs per loop
>>> c.codes.dtype
int8
>>> %timeit c.codes.searchsorted(np.int8(1))
2.46 µs ± 82.4 ns per loop

So the scalar version should be values_as_codes = values_as_codes[0] to avoid speed loss.
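
(For comparison, what the two unwrapping options actually return; a quick sketch, exact timings aside:)

>>> import numpy as np
>>> codes = np.array([2], dtype=np.int8)
>>> type(codes[0])        # keeps the codes' numpy dtype
<class 'numpy.int8'>
>>> type(codes.item())    # plain Python int
<class 'int'>

So indexing with [0] hands searchsorted a value that already matches the codes' dtype, which is presumably what avoids the slowdown shown in the timings above, while .item() produces a Python int again.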

values_as_codes = values_as_codes.item()

return self.codes.searchsorted(values_as_codes, side=side,
sorter=sorter)
3 changes: 2 additions & 1 deletion pandas/core/indexes/category.py
@@ -432,13 +432,14 @@ def get_loc(self, key, method=None):
>>> monotonic_index.get_loc('b')
slice(1, 3, None)

>>> non_monotonic_index = p.dCategoricalIndex(list('abcb'))
>>> non_monotonic_index = pd.CategoricalIndex(list('abcb'))
>>> non_monotonic_index.get_loc('b')
array([False, True, False, True], dtype=bool)
"""
codes = self.categories.get_loc(key)
if (codes == -1):
raise KeyError(key)

return self._engine.get_loc(codes)

def get_value(self, series, key):
6 changes: 3 additions & 3 deletions pandas/tests/categorical/test_analytics.py
@@ -86,9 +86,9 @@ def test_searchsorted(self):
# Searching for single item argument, side='left' (default)
res_cat = c1.searchsorted('apple')
res_ser = s1.searchsorted('apple')
exp = np.array([2], dtype=np.intp)
tm.assert_numpy_array_equal(res_cat, exp)
tm.assert_numpy_array_equal(res_ser, exp)
exp = np.intp(2)
assert res_cat == exp
assert res_ser == exp

# Searching for single item array, side='left' (default)
res_cat = c1.searchsorted(['bread'])
79 changes: 72 additions & 7 deletions pandas/tests/indexing/test_categorical.py
@@ -627,15 +627,80 @@ def test_reindexing(self):
lambda: self.df2.reindex(['a'], limit=2))

def test_loc_slice(self):
# slicing
# not implemented ATM
# GH9748
df = DataFrame(
{"A": range(0, 6)},
index=CategoricalIndex(list("aabcde"), name="B"),
)

# slice on an unordered categorical using in-sample, connected edges
result = df.loc["b":"d"]
expected = df.iloc[2:5]
assert_frame_equal(result, expected)

pytest.raises(TypeError, lambda: self.df.loc[1:5])
# Slice the entire dataframe
result = df.loc["a":"e"]
assert_frame_equal(result, df)
result_iloc = df.iloc[0:6]
assert_frame_equal(result_iloc, result)

# check if the result is identical to an ordinary index
df_non_cat_index = df.copy()
df_non_cat_index.index = df_non_cat_index.index.astype(str)
result = df.loc["a":"e"]
result_non_cat = df_non_cat_index.loc["a": "e"]
result.index = result.index.astype(str)
assert_frame_equal(result_non_cat, result)

@pytest.mark.parametrize(
"content",
[list("aab"), list("bbc"), list('bbc')],
ids=["right_edge", "left_edge", "both_edges"],
)
def test_loc_beyond_edge_slicing(self, content):
"""
This test ensures that no `KeyError` is raised if trying to slice
beyond the edges of known, ordered categories.

see GH21019
"""
# This dataframe might be a slice of a larger categorical
# (i.e. more categories are known than there are in the column)

ordered_df = DataFrame(
{"A": range(0, 3)},
index=CategoricalIndex(
content, categories=list("abcde"), name="B", ordered=True
),
)

# Although the edge is not within the slice, this should fall back
# to searchsorted slicing since the category is known and the index
# is ordered. Since we're selecting a value larger/lower than the
# right/left edge we should get the original slice again.
result = ordered_df.loc["a": "d"]
assert_frame_equal(result, ordered_df)

Contributor:

can you also test the left edge as well

# Ensure that index based slicing gives the same result
result_iloc = ordered_df.iloc[0:4]
assert_frame_equal(result, result_iloc)


# result = df.loc[1:5]
# expected = df.iloc[[1,2,3,4]]
# assert_frame_equal(result, expected)
with pytest.raises(KeyError):
# If the category is not known, there is nothing we can do
ordered_df.loc["a":"z"]

unordered_df = ordered_df.copy()
unordered_df.index = unordered_df.index.as_unordered()
with pytest.raises(KeyError):
# This operation previously succeeded for an ordered index. Since
# this index is no longer ordered, we cannot perform out of range
# slicing / searchsorted
unordered_df.loc["a": "d"]

def test_boolean_selection(self):
