
[BUG] loc behavior differs from pandas when a duplicated index is requested #8693

Closed
Tracked by #12793
brandon-b-miller opened this issue Jul 8, 2021 · 2 comments · Fixed by #13625
Labels: bug (Something isn't working), Python (Affects Python cuDF API.)

Comments

@brandon-b-miller
Contributor

brandon-b-miller commented Jul 8, 2021

Describe the bug
When we have a series with a duplicated index such as

x = cudf.Series([1,2,3,4], index=[0,1,1,2])

and we try to fetch the items for that duplicated label, we get only the first matching element rather than a Series of all elements with that label:

x.loc[1]
# 2

Steps/Code to reproduce bug
See above

Expected behavior
We should match pandas, which returns every row whose index label equals the requested one:

x.to_pandas().loc[1]
# 1    2
# 1    3
# dtype: int64
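
For reference, the pandas behavior described above can be checked without a GPU. The snippet below is a minimal sketch using plain pandas only; the variable names are illustrative and not taken from the report.

```python
import pandas as pd

# Same data as in the report, built directly in pandas.
s = pd.Series([1, 2, 3, 4], index=[0, 1, 1, 2])

unique_hit = s.loc[0]   # label occurs once: pandas returns a scalar
dup_hit = s.loc[1]      # label occurs twice: pandas returns a Series with both rows
list_hit = s.loc[[1]]   # list-like key: pandas always returns a Series

print(unique_hit)
print(dup_hit.tolist())
print(list_hit.tolist())
```

The expected cuDF result for `x.loc[1]` is the analogue of `dup_hit`, not a single scalar.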

Environment overview (please complete the following information)

  • Environment location: [Bare-metal]
  • Method of cuDF install: [source]


@brandon-b-miller brandon-b-miller added bug Something isn't working Python Affects Python cuDF API. labels Jul 8, 2021
@brandon-b-miller brandon-b-miller self-assigned this Jul 8, 2021
@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@shwina
Contributor

shwina commented Nov 22, 2022

This is because of an "optimization" for scalar inputs to loc (the `if _is_scalar_or_zero_d_array(arg):` branch), which we should probably get rid of.

A workaround is to pass the label in a list:

s.loc[[0]] 
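
To make the difference concrete, here is a plain-Python model of the two behaviors. It is only a sketch: `loc_first_match` and `loc_all_matches` are hypothetical helpers written for illustration and are not cuDF's actual code path.

```python
def loc_first_match(index, values, label):
    """Toy model of a scalar fast path: stop at the first matching label."""
    for pos, lab in enumerate(index):
        if lab == label:
            return values[pos]
    raise KeyError(label)


def loc_all_matches(index, values, label):
    """Toy model of the pandas-compatible result: gather every matching row."""
    hits = [values[pos] for pos, lab in enumerate(index) if lab == label]
    if not hits:
        raise KeyError(label)
    # Return a scalar only when the label occurs exactly once.
    return hits[0] if len(hits) == 1 else hits


index, values = [0, 1, 1, 2], [1, 2, 3, 4]
first_only = loc_first_match(index, values, 1)   # drops the second row
gathered = loc_all_matches(index, values, 1)     # keeps both rows
print(first_only, gathered)
```

The first helper reproduces the symptom in the report (only `2` comes back), while the second returns both values for the duplicated label.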

wence- added a commit to wence-/cudf that referenced this issue Jun 27, 2023
Now that we have indices_of available on index objects, use that when
looking up scalar values in an index for loc-based indexing.

- Closes rapidsai#8693
rapids-bot bot pushed a commit that referenced this issue Jun 30, 2023
…3625)

Much of the index-specific search code for label-based lookup only worked in the case where the index was sorted in ascending order and/or the requested slice had the same step sign as the index. To fix this, handle ascending and descending sorted indices specially, as well as refactoring to remove unused codepaths.

The core idea is to push `find_first` and `find_last` to being generic column methods (in doing so we remove the `closest` argument, but that was only used to produce pandas-incompatible index behaviour).

Lookup in indices then uses `get_slice_bound` (that can be specialised for index types) that uses first (if applicable) `searchsorted` and then `find_first/last`.
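
The lookup strategy described above (binary search when the index is sorted, otherwise first/last occurrence) can be sketched in plain Python. This is an illustration under stated assumptions, not the cuDF implementation: `slice_bound`, `is_ascending`, and `is_descending` are hypothetical names, the labels are a plain list, and raising `KeyError` for a missing label in an unsorted index is an assumption of this sketch.

```python
from bisect import bisect_left, bisect_right


def is_ascending(labels):
    return all(a <= b for a, b in zip(labels, labels[1:]))


def is_descending(labels):
    return all(a >= b for a, b in zip(labels, labels[1:]))


def slice_bound(labels, label, side):
    """Return the position bounding `label` on the given side ("left" or "right")."""
    if side not in ("left", "right"):
        raise ValueError(side)
    n = len(labels)
    if is_ascending(labels):
        # Sorted ascending: a binary search gives the bound directly.
        if side == "left":
            return bisect_left(labels, label)
        return bisect_right(labels, label)
    if is_descending(labels):
        # Sorted descending: search the reversed list and mirror the position.
        rev = labels[::-1]
        if side == "left":
            return n - bisect_right(rev, label)
        return n - bisect_left(rev, label)
    # Unsorted: fall back to the first or last occurrence of the label.
    if label not in labels:
        raise KeyError(label)
    if side == "left":
        return labels.index(label)          # position of the first occurrence
    return n - labels[::-1].index(label)    # one past the last occurrence


asc = slice(slice_bound([0, 1, 1, 2], 1, "left"), slice_bound([0, 1, 1, 2], 1, "right"))
desc = slice(slice_bound([5, 3, 3, 1], 3, "left"), slice_bound([5, 3, 3, 1], 3, "right"))
print(asc, desc)
```

With both bounds in hand, the rows for a duplicated label in a sorted index are the half-open range between the left and right bounds.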

While we're here, since we now check the sortedness properties of Index objects, turn them into `@cached_property` (partially addressing the request of #13357).
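
As a rough illustration of why caching the sortedness check helps, the sketch below uses `functools.cached_property` on a hypothetical `ToyIndex` class (not a cuDF type). The `scans` counter exists only to show that the scan runs once; the sketch assumes the labels are never mutated after construction.

```python
from functools import cached_property


class ToyIndex:
    def __init__(self, labels):
        self._labels = list(labels)
        self.scans = 0  # counts how often the full scan actually runs

    @cached_property
    def is_monotonic_increasing(self):
        # Computed on first access, then stored on the instance.
        self.scans += 1
        return all(a <= b for a, b in zip(self._labels, self._labels[1:]))


idx = ToyIndex([0, 1, 1, 2])
first = idx.is_monotonic_increasing
second = idx.is_monotonic_increasing  # served from the cache, no rescan
print(first, second, idx.scans)
```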

- Closes #8693
- Closes #12833

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #13625
3 participants