Search for an item in a list (or array), part 2 #20626

itamarst · 2025-01-08T13:44:39Z

Description

Now that Expr.index_of is a thing, the next step is adding something similar to lists.

Potential APIs

Thing is, there are actually multiple different APIs people might want:

Option 1: For each list in the Series/Column, find the index of the value in that list:

>>> pl.Series([[1, 3], [2, 4], [5, 1]]).list.index_of_or_maybe_some_other_name(1)
shape: (3,)
Series: '' [i64]
[
        0
        null
        1
]

This is what @thobai requested when they filed #5503.

Option 1B: Like option 1, but supports an expression-oriented version:

df.select(pl.col("list_of_floats").index_of_or_maybe_some_other_name(pl.col("floats")))

and search each list in the list_of_floats column for the corresponding value taken from the floats column. I think this is just a superset of option 1, and is what a G-Research user would like.
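To pin down the intended semantics of options 1 and 1B, here is a minimal pure-Python sketch. It is a reference model only, not a Polars implementation: `index_of_each` is a hypothetical helper name, and returning null for a null list is an assumption. Searching for a null needle is not modeled.

```python
from itertools import repeat
from typing import Any, Iterable, List, Optional, Sequence


def index_of_each(
    lists: Sequence[Optional[Sequence[Any]]],
    needles: Iterable[Any],
) -> List[Optional[int]]:
    """Row-wise search: position of the i-th needle in the i-th list, or None."""
    out: List[Optional[int]] = []
    for lst, needle in zip(lists, needles):
        if lst is None:
            out.append(None)  # assumption: a null list yields null
            continue
        # Index of the first match, or None if the list has no match.
        out.append(next((i for i, v in enumerate(lst) if v == needle), None))
    return out


rows = [[1, 3], [2, 4], [5, 1]]

# Option 1: a single literal needle, broadcast to every row.
option_1 = index_of_each(rows, repeat(1))

# Option 1B: one needle per row, as if taken from another column.
option_1b = index_of_each(rows, [3, 4, 5])

print(option_1)   # [0, None, 1]
print(option_1b)  # [1, 1, 0]
```

Seen this way, option 1 is option 1B with the needle broadcast to every row.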

Option 2: Find the index of the first list that contains the value:

>>> pl.Series([[1, 3], [2, 4], [5, 1]]).list.index_of_or_maybe_some_other_name(1)
0

Option 2 can be implemented on top of option 1, albeit inefficiently, so providing it as an exposed feature is an optimization, not a necessity. Perhaps this API, if ever implemented, should be called .list.first_index_of().
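A rough pure-Python sketch of that layering, assuming option 1 returns null for lists that do not contain the value. `first_list_containing` is a hypothetical name used only for illustration.

```python
from typing import List, Optional


def first_list_containing(per_list_index: List[Optional[int]]) -> Optional[int]:
    """Row number of the first non-null option 1 result, or None if all are null."""
    for row, idx in enumerate(per_list_index):
        if idx is not None:
            return row
    return None


# Option 1 result for [[1, 3], [2, 4], [5, 1]] searched for 1.
option_1_result = [0, None, 1]
option_2_result = first_list_containing(option_1_result)
print(option_2_result)  # 0
```

The inefficiency is that option 1 has already searched every list by the time this scan runs, whereas a dedicated implementation could stop at the first match.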

Initial questions for maintainers

Which option should be implemented? I'm inclined towards 1B since that is the most general purpose, and meets requests of multiple users.
What should the API be called? I'm inclined towards calling option 1/1B .list.index_of(), and option 2 .list.first_index_of().
For implementation of option 1/1B, amortized_iter() seems like the easy approach?


coastalwhite · 2025-01-08T13:55:59Z

I, personally, think both Option 1A/1B and Option 2 have their place, maybe under different names.

Some names that might work:

index_of_in: Option 1A/1B
index_of_with: Option 2

As for implementation: Option 2 can be implemented with Series.index_of on the underlying values (make sure to propagate the nulls) and then a binary search on the offsets.
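A minimal pure-Python sketch of that strategy, assuming an Arrow-style layout: a flat `values` buffer, an `offsets` array that starts at 0 and has one more entry than there are lists, and a per-list `validity` mask. The function name and the plain-list inputs are hypothetical, and the linear `list.index` call merely stands in for `Series.index_of` on the underlying values.

```python
from bisect import bisect_right
from typing import Any, List, Optional


def first_list_index(
    values: List[Any],
    offsets: List[int],
    validity: List[bool],
    needle: Any,
) -> Optional[int]:
    """Index of the first non-null list containing needle, or None."""
    start = 0
    while True:
        try:
            # Stand-in for Series.index_of on the flat values.
            flat = values.index(needle, start)
        except ValueError:
            return None
        # Binary search: the last list whose start offset is <= flat.
        # bisect_right also steps past empty lists sharing that offset.
        row = bisect_right(offsets, flat) - 1
        if validity[row]:
            return row
        # The hit sits in a slot masked by a null list: resume after that list.
        start = offsets[row + 1]


values = [1, 3, 2, 4, 5, 1]   # flattened [[1, 3], [2, 4], [5, 1]]
offsets = [0, 2, 4, 6]
all_valid = [True, True, True]

print(first_list_index(values, offsets, all_valid, 1))  # 0
print(first_list_index(values, offsets, all_valid, 4))  # 1
```

The retry loop is one way to handle the null propagation mentioned above: a null list may still own slots in the flat values, so a hit inside one has to be skipped rather than reported.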

orlp · 2025-01-08T20:56:48Z

Option 1:

>>> df = pl.DataFrame({"x": [[1, 3], [2, 4], [5, 1]], "y": [3, 4, 5]})
>>> df.select(pl.col.x.list.eval(pl.element().index_of(1)).list.first())
shape: (3, 1)
┌──────┐
│ x    │
│ ---  │
│ u32  │
╞══════╡
│ 0    │
│ null │
│ 1    │
└──────┘

Option 1B:

>>> df.with_row_index().select(pl.col.x.explode().index_of(pl.col.y.first()).over("index"))
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ u32 │
╞═════╡
│ 1   │
│ 1   │
│ 0   │
└─────┘

Option 2:

>>> df.with_row_index().filter(pl.col.x.list.contains(1)).select(pl.col.index.first())
shape: (1, 1)
┌───────┐
│ index │
│ ---   │
│ u32   │
╞═══════╡
│ 0     │
└───────┘

itamarst · 2025-01-09T16:39:23Z

@orlp are you suggesting those as an implementation strategy, or implying that it's not worth doing since it's already possible?

orlp · 2025-01-10T13:23:40Z

@itamarst Neither (perhaps a hint of the latter), just showing how you could do it in today's Polars.

mcrumiller · 2025-02-11T16:05:03Z

I'm a bit late here, why isn't 1/1A just called list.index_of? index_of_in sounds pretty odd to me.

For option 2, what about list.find? 'find' is used in many languages to determine the positional index of an item. It often is accompanied by a parameter k to locate the first k instances of the item.

itamarst added the enhancement label Jan 8, 2025

itamarst mentioned this issue Jan 15, 2025

.list.index_of_in() architectural review PR #20733

Closed

coastalwhite mentioned this issue Feb 7, 2025

Index of element in list type #21138

Closed

itamarst linked a pull request Feb 11, 2025 that will close this issue

feat: Add list.index_of_in() to Expr and Series #21192

Open
