Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search for an item in a list (or array), part 2 #20626

Open
itamarst opened this issue Jan 8, 2025 · 5 comments · May be fixed by #21192
Open

Search for an item in a list (or array), part 2 #20626

itamarst opened this issue Jan 8, 2025 · 5 comments · May be fixed by #21192
Labels
enhancement New feature or an improvement of an existing feature

Comments

@itamarst
Copy link
Contributor

itamarst commented Jan 8, 2025

Description

Now that Expr.index_of is a thing, the next step is adding something similar to lists.

Potential APIs

Thing is, there are actually multiple different multiple APIs people might want:

Option 1: For each list in the Series/Column, find the index of the value in that list:

>>> pl.Series([[1, 3], [2, 4], [5, 1]).list.index_of_or_maybe_some_other_name(1)
shape: (3,)
Series: '' [i64]
[
        0
        null
        1
]

This is what @thobai requested when they filed #5503.

Option 1B: Like option 1, but supports a expression-oriented version

df.select(pl.col("list_of_floats").index_of_or_maybe_some_other_name(pl.col("floats"))

and search each list in the list_of_floats column for the corresponding value taken from the floats column. I think this is just a superset of option 1, and is what a G-Research user would like.

Option 2: Find the index of the first list that contains the value:

>>> pl.Series([[1, 3], [2, 4], [5, 1]).list.index_of_or_maybe_some_other_name(1)
0

Option 2 can be implemented on top of option 1, albeit inefficiently, so providing it as an exposed feature is an optimization, not a necessity. Perhaps this API if ever implemented should be called .list.first_index_of().

Initial questions for maintainers

  1. Which option should be implemented? I'm inclined towards 1B since that is the most general purpose, and meets requests of multiple users.
  2. What should the API be called? I'm inclined towards calling option 1/1B .list.index_of(), and option 2 .list.first_index_of().
  3. For implementation of option 1/1B, amortized_iter() seems like the easy approach?
@itamarst itamarst added the enhancement New feature or an improvement of an existing feature label Jan 8, 2025
@coastalwhite
Copy link
Collaborator

I, personally, think both Option 1A/1B and Option 2 have their place, maybe under different names.

Some names that might work.

  • index_of_in: Option 1A/1B
  • index_of_with: Option 2

As for implementation: Option 2 can be implemented with Series.index_of on the underlying values (make sure to propagate the nulls) and then a binary search on the offsets.

@orlp
Copy link
Collaborator

orlp commented Jan 8, 2025

Option 1:

>>> df = pl.DataFrame({"x": [[1, 3], [2, 4], [5, 1]], "y": [3, 4, 5]})
>>> df.select(pl.col.x.list.eval(pl.element().index_of(1)).list.first())
shape: (3, 1)
┌──────┐
│ x    │
│ ---  │
│ u32  │
╞══════╡
│ 0    │
│ null │
│ 1    │
└──────┘

Option 1B:

>>> df.with_row_index().select(pl.col.x.explode().index_of(pl.col.y.first()).over("index"))
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ u32 │
╞═════╡
│ 1   │
│ 1   │
│ 0   │
└─────┘

Option 2:

>>> df.with_row_index().filter(pl.col.x.list.contains(1)).select(pl.col.index.first())
shape: (1, 1)
┌───────┐
│ index │
│ ---   │
│ u32   │
╞═══════╡
│ 0     │
└───────┘

@itamarst
Copy link
Contributor Author

itamarst commented Jan 9, 2025

@orlp are you suggesting those as an implementation strategy, or implying that it's not worth doing since it's already possible?

@orlp
Copy link
Collaborator

orlp commented Jan 10, 2025

@itamarst Neither (perhaps a hint of the latter), just showing how you could do it in today's Polars.

@mcrumiller
Copy link
Contributor

mcrumiller commented Feb 11, 2025

I'm a bit late here, why isn't 1/1A just called list.index_of? index_of_in sounds pretty odd to me.

For option 2, what about list.find? 'find' is used in many languages to determine the positional index of an item. It often is accompanied by a parameter k to locate the first k instances of the item.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants