[ENH]: Reworking of iloc and loc indexing #12793

Open
17 of 38 tasks
wence- opened this issue Feb 16, 2023 · 5 comments
Assignees
Labels
improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.

Comments

@wence-
Contributor

wence- commented Feb 16, 2023

Status quo

Indexing of dataframes and series happens through six user-facing routes:

  • DataFrame.__setitem__/DataFrame.__getitem__
  • DataFrame.iloc.__setitem__/DataFrame.iloc.__getitem__
  • DataFrame.loc.__setitem__/DataFrame.loc.__getitem__
  • Series.__setitem__/Series.__getitem__
  • Series.iloc.__setitem__/Series.iloc.__getitem__
  • Series.loc.__setitem__/Series.loc.__getitem__
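
For reference, a minimal sketch of the six routes using plain pandas (the behaviour cuDF aims to match). The frame, series, and variable names are made up for illustration and are not taken from the cuDF code base:

```python
import pandas as pd

# Illustrative data only; labels "x", "y", "z" are arbitrary.
df = pd.DataFrame({"a": [10, 20, 30], "b": [1.0, 2.0, 3.0]}, index=["x", "y", "z"])
ser = pd.Series([10, 20, 30], index=["x", "y", "z"])

# Reads: one per __getitem__ route.
col = df["a"]                      # DataFrame.__getitem__: select a column by label
cell_by_position = df.iloc[0, 1]   # DataFrame.iloc.__getitem__: row 0, column 1
cell_by_label = df.loc["y", "a"]   # DataFrame.loc.__getitem__: row "y", column "a"
item = ser["y"]                    # Series.__getitem__: label lookup
item_by_position = ser.iloc[2]     # Series.iloc.__getitem__: third element
item_by_label = ser.loc["x"]       # Series.loc.__getitem__: label lookup

# Writes: one per __setitem__ route.
df["c"] = [7, 8, 9]                # DataFrame.__setitem__: add a new column
df.iloc[0, 0] = 11                 # DataFrame.iloc.__setitem__
df.loc["z", "b"] = 3.5             # DataFrame.loc.__setitem__
ser["x"] = 0                       # Series.__setitem__
ser.iloc[1] = -1                   # Series.iloc.__setitem__
ser.loc["z"] = 5                   # Series.loc.__setitem__
```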

These all have slightly different semantics (to match pandas behaviour), but there is still quite a lot of (possibly unnecessary) code duplication, and there are a number of bugs around indexing. Many of the bugs appear to arise because the business logic of slicing, gathering by mask, and indexing is intertwined with error handling and with determining exactly what to slice. Logic is also effectively repeated between the loc and iloc versions for both DataFrame and Series.

It would be nice to reduce the number of different paths into indexing. Perhaps it is a pipe dream to share code between Series and DataFrame (since a DataFrame is not just a collection of Series), but it feels like it should be possible to share more between iloc, loc, and __setitem__/__getitem__.
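
One way to picture that kind of sharing is the following sketch in plain Python. It is not cuDF code: `SliceSpec`, `MaskSpec`, `GatherSpec`, `parse_positional_key`, `parse_label_key`, and `apply_spec` are hypothetical names, plain lists stand in for columns, labels are assumed unique, and scalar keys are deliberately left out. The idea is that key validation and error handling happen once, up front, and both the positional and the label-based entry points then funnel into the same small set of operations:

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class SliceSpec:
    start: int
    stop: int
    step: int


@dataclass(frozen=True)
class MaskSpec:
    mask: Tuple[bool, ...]


@dataclass(frozen=True)
class GatherSpec:
    positions: Tuple[int, ...]


IndexingSpec = Union[SliceSpec, MaskSpec, GatherSpec]


def parse_positional_key(key, n: int) -> IndexingSpec:
    """Validate an iloc-style key (slice or list) against length n."""
    if isinstance(key, slice):
        return SliceSpec(*key.indices(n))
    key = list(key)
    if key and all(isinstance(k, bool) for k in key):
        if len(key) != n:
            raise IndexError("boolean mask has the wrong length")
        return MaskSpec(tuple(key))
    positions = []
    for k in key:
        if not -n <= k < n:
            raise IndexError(f"position {k} is out of bounds for length {n}")
        positions.append(k % n)  # normalise negative positions
    return GatherSpec(tuple(positions))


def parse_label_key(key, labels) -> IndexingSpec:
    """Translate a loc-style list of labels into positions (unique labels assumed)."""
    lookup = {label: i for i, label in enumerate(labels)}
    missing = [k for k in key if k not in lookup]
    if missing:
        raise KeyError(missing)
    return GatherSpec(tuple(lookup[k] for k in key))


def apply_spec(values, spec: IndexingSpec):
    """The only place that touches the data; it never needs to raise."""
    if isinstance(spec, SliceSpec):
        return [values[i] for i in range(spec.start, spec.stop, spec.step)]
    if isinstance(spec, MaskSpec):
        return [v for v, keep in zip(values, spec.mask) if keep]
    return [values[i] for i in spec.positions]


values = ["p", "q", "r", "s"]
labels = ["a", "b", "c", "d"]
reversed_values = apply_spec(values, parse_positional_key(slice(None, None, -1), len(values)))
masked_values = apply_spec(values, parse_positional_key([True, False, True, False], len(values)))
by_label = apply_spec(values, parse_label_key(["d", "a"], labels))
```

With this split, a label-based lookup differs from a positional one only in its parsing step; a setitem path could, under the same assumptions, reuse the parsed specs as well.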

Related issues:

iloc bugs

  1. bug
  2. Python bug improvement
    wence-
  3. bug improvement
    wence-
  4. Python bug improvement
    wence-
  5. Python bug improvement
    wence-
  6. Python bug improvement
    wence-
  7. Python bug
    wence-
  8. 0 - Backlog Python bug
    wence-

Index bugs

  1. Python bug
    wence-

loc bugs

  1. Python bug
    wence-
  2. wence-
  3. Python bug inactive-90d
    brandon-b-miller
  4. wence-
  5. Python bug
    galipremsagar
  6. 0 - Waiting on Author question
  7. 0 - Backlog Python bug
  8. Python bug
  9. 0 - Backlog Python feature request
  10. Python bug improvement
    wence-
  11. Python bug improvement
    wence-
  12. bug improvement
    wence-
  13. Python bug improvement
  14. Python bug improvement
    wence-
  15. Python bug improvement
    wence-
  16. Python bug
    wence-
  17. 2 - In Progress Python bug
    wence-
  18. 2 - In Progress Python bug
    wence-
  19. 2 - In Progress Python bug
    wence-

Views vs. copies

  1. Python bug
  2. shwina
  3. 2 - In Progress

Other (mostly dtype-related)

  1. Python feature request
  2. Python feature request
  3. Python bug
    brandon-b-miller mroeschke
  4. 2 - In Progress Python feature request improvement proposal question

Your issue here.

As we can see from this classification, loc-based indexing is definitely the harder nut to crack. Most of the issues are provoked by edge cases in which the values used for indexing are not in the index.
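
To make that class of edge case concrete, here is a small sketch of how plain pandas (version 1.0 or later) treats labels that are missing from the index. The series and its labels are invented for illustration, and the snippet shows only pandas, not cuDF's current behaviour:

```python
import pandas as pd

ser = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Reading a missing scalar label raises KeyError.
try:
    ser.loc["z"]
    scalar_raised = False
except KeyError:
    scalar_raised = True

# Reading a list that contains any missing label also raises KeyError.
try:
    ser.loc[["a", "z"]]
    list_raised = False
except KeyError:
    list_raised = True

# A slice whose endpoint is missing is accepted when the index is sorted:
# it selects the labels that fall inside the range.
sliced = ser.loc["b":"z"]

# Writing to a missing label does not raise; it enlarges the series.
ser.loc["z"] = 4
```

So the same missing label is an error, a valid range bound, or a request to grow the object, depending on the key type and on whether the access is a read or a write.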

@wence- wence- added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function labels Feb 16, 2023
@wence- wence- self-assigned this Feb 16, 2023
@vyasr
Contributor

vyasr commented Feb 16, 2023

Thank you for putting in the effort to collect these issues!

@mmccarty
Contributor

cc @mroeschke

@vyasr
Contributor

vyasr commented May 14, 2024

@wence- @mroeschke how many of these issues become moot once COW becomes the default behavior in pandas 3.0? If many, I think that is probably the better path for us to follow on those issues rather than trying to fix issues that we know are going away.

@wence-
Contributor Author

wence- commented May 14, 2024

Unfortunately, I don't think CoW fixes most of the issues here. They mostly have nothing to do with views vs. copies. Most of them arise because the desugaring step from the top-level syntax is not handled compatibly with pandas.
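
As an illustration of what "desugaring" means in this context, the sketch below uses plain pandas with an invented frame. It shows how one piece of surface syntax maps onto different underlying operations depending on the key type, which is the step an implementation has to reproduce:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [1.0, 2.0, 3.0]}, index=["x", "y", "z"])

# DataFrame.__getitem__ dispatches on the kind of key it receives.
one_column = df["a"]                      # a label selects a column
two_columns = df[["b", "a"]]              # a list of labels selects columns
first_rows = df[0:2]                      # a slice selects rows
masked_rows = df[[True, False, True]]     # a boolean mask selects rows

# With loc, a bare row key is shorthand for (row key, all columns).
row_short = df.loc["y"]
row_full = df.loc["y", :]
```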

@vyasr
Contributor

vyasr commented May 15, 2024

Sad, but not surprising. Just wanted to check, since over the past week I closed a couple of related issues that COW definitely does fix, but many others didn't look like they would be fixed.
