feat(api): add selectors for easier selection of columns #5307

cpcloud · 2023-01-24T00:57:23Z

Convenient column selectors, shamelessly cribbed from https://dplyr.tidyverse.org/reference/select.html.

github-actions · 2023-01-24T01:06:40Z

Test Results

      43 files       43 suites 1h 48m 23s ⏱️
13 792 tests 10 309 ✔️   3 483 💤 0 ❌
47 711 runs 35 164 ✔️ 12 547 💤 0 ❌

Results for commit 23c977f.

♻️ This comment has been updated with latest results.

kszucs · 2023-01-24T11:32:28Z

Nice! Are you still working on this?

cpcloud · 2023-01-24T11:45:04Z

@kszucs Just finished up the first iteration of the docs. It's ready for review now.

jreback · 2023-01-24T11:53:00Z

pretty neat @cpcloud

ibis/expr/selectors.py

kszucs

Nice API addition!

krzysztof-kwitt · 2023-01-24T17:54:06Z

I wonder if this API will be easy to use in the real world, so I'm looking at the Ibis code that uses column groups. Is it possible to rewrite this piece of code to use this new API?

ibis/docs/backends/app/backend_info_app.py

Lines 155 to 158 in 318179a

    supported_backend_count = sum(
        getattr(table_expr, backend_name).ifelse(1, 0)
        for backend_name in current_backend_names
    )

cpcloud · 2023-01-24T18:14:17Z

I don't think so. That bit of code isn't filtering the columns by name or property in any way; it's looking at every column in the table.

cpcloud · 2023-01-24T19:03:58Z

You could do something like:

sum(map(t.__getitem__, t.select(s.of_type("bool")).columns))

cpcloud · 2023-01-24T19:05:04Z

I wonder if it makes sense to define TableExpr.__iter__ to iterate over the column objects, which would make the computation much shorter:

sum(t.select(s.of_type("bool")))
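To make the idea concrete, here is a minimal pure-Python sketch of why iterating over column objects lets the built-in sum() work. The Column and Table classes and the select_of_type helper are toy stand-ins invented for illustration, not Ibis classes, and nothing here reflects how Ibis would actually implement it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Toy stand-in for a column expression; holds plain Python values."""

    name: str
    dtype: str
    values: tuple

    def __add__(self, other):
        # Element-wise addition with another column or with a scalar.
        if isinstance(other, Column):
            other_values = other.values
        else:
            other_values = (other,) * len(self.values)
        summed = tuple(a + b for a, b in zip(self.values, other_values))
        return Column("sum", "int", summed)

    # sum() starts from the integer 0, so `0 + column` has to work as well.
    __radd__ = __add__


class Table:
    """Toy table whose __iter__ yields column objects rather than names."""

    def __init__(self, *columns):
        self._columns = {c.name: c for c in columns}

    def __iter__(self):
        return iter(self._columns.values())

    def select_of_type(self, dtype):
        # Hypothetical helper standing in for t.select(s.of_type(dtype)).
        return Table(*(c for c in self if c.dtype == dtype))


t = Table(
    Column("duckdb", "bool", (True, False, True)),
    Column("sqlite", "bool", (True, True, False)),
    Column("op", "string", ("sum", "mean", "max")),
)

# Per-row count of True values across the boolean columns only.
per_row_count = sum(t.select_of_type("bool"))
print(per_row_count.values)
```

The one subtlety is that sum() begins with 0, so the column type needs a reflected add for the first step to succeed.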

cpcloud · 2023-01-24T19:08:19Z

I know some systems are starting to allow concise syntax for doing whole-table operations. Maybe we can revisit adding aggregations and some kind of "apply an expression to every column" method, so you can do things like:

t.select(s.of_type("bool")).cast("int").sum()  # single row table, cast applies to every column
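As a rough illustration of what "applies to every column" could mean, the sketch below uses a toy Table backed by plain Python lists. The class, the predicate-based select, and the is_bool helper are all assumptions made up for this example; they only mimic the shape of the proposed chain.

```python
class Table:
    """Toy table: a mapping of column name to a list of Python values."""

    def __init__(self, data):
        self.data = dict(data)

    def select(self, predicate):
        # Keep only the columns whose values satisfy the predicate
        # (a stand-in for a selector such as s.of_type("bool")).
        return Table({k: v for k, v in self.data.items() if predicate(v)})

    def cast(self, pytype):
        # Table-wide cast: applied to every column.
        return Table({k: [pytype(x) for x in v] for k, v in self.data.items()})

    def sum(self):
        # Table-wide aggregation: a single-row table with one total per column.
        return Table({k: [sum(v)] for k, v in self.data.items()})


def is_bool(values):
    return all(isinstance(x, bool) for x in values)


t = Table(
    {
        "duckdb": [True, False, True],
        "sqlite": [True, True, False],
        "op": ["sum", "mean", "max"],
    }
)

result = t.select(is_bool).cast(int).sum()
print(result.data)
```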

cpcloud · 2023-01-25T13:55:45Z

@mik-laj You might find https://dplyr.tidyverse.org/articles/colwise.html interesting. Those examples are perhaps a good place to start experimenting with APIs for summarizing things across a subset of the columns in a table.

I can imagine something like:

# group by a, b and count all string columns excluding a and b
t.group_by([_.a, _.b]).summarize(s.across(s.of_type("string") & ~s.contains(("a", "b")), [_.count()]))

# desugared into this
# selector = s.across(s.of_type("string") & ~s.contains(("a", "b")), [_.count()])
t.group_by([_.a, _.b]).agg(**{f"{col.get_name()}_count": col.count() for col in selector.expand(t)})
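A small self-contained model of the composition and desugaring steps, for anyone who wants to play with the idea without a backend. The Selector class below is a toy written for this comment, operating on a plain name-to-type dict; its of_type, contains, and expand only imitate the names used above and are not the Ibis implementations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Selector:
    """Toy selector: a predicate over (column name, column type)."""

    predicate: Callable[[str, str], bool]

    def __and__(self, other):
        return Selector(
            lambda name, dtype: self.predicate(name, dtype)
            and other.predicate(name, dtype)
        )

    def __invert__(self):
        return Selector(lambda name, dtype: not self.predicate(name, dtype))

    def expand(self, schema):
        # Resolve the selector against a schema into concrete column names.
        return [name for name, dtype in schema.items() if self.predicate(name, dtype)]


def of_type(wanted):
    return Selector(lambda name, dtype: dtype == wanted)


def contains(needles):
    # Substring match on the column name.
    return Selector(lambda name, dtype: any(n in name for n in needles))


schema = {
    "a": "string",
    "b": "string",
    "city": "string",
    "notes": "string",
    "price": "float64",
}

selector = of_type("string") & ~contains(("a", "b"))

# "Desugaring": one named aggregation per selected column.
aggs = {f"{name}_count": ("count", name) for name in selector.expand(schema)}
print(aggs)
```

Note that a substring-based contains also excludes any column whose name merely includes "a" or "b", which is why the toy schema avoids names like "name".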

jcmkk3 · 2023-01-25T15:48:53Z

It might also be worth taking a look at the polars API for this. Basically, the pl.col selector allows selecting multiple columns in a few different ways, then when you perform methods on that selection it will apply to all columns that were selected. dplyr has to approach the problem a bit differently since the data transformations are functions instead of methods.

# group by a, b and count all string columns excluding a and b
t.groupby(["a", "b"]).agg(pl.col(pl.Utf8).exclude(["a", "b"]).count())

cpcloud · 2023-01-25T15:51:06Z

Neat, didn't know pl.col did all of that!

cpcloud · 2023-01-25T15:55:49Z

Can you comment more on why the method/function distinction matters for anything other than the API design (as opposed to being an implementation detail)?

I would expect that the fundamentals of the problem are unrelated to how a thing is called, e.g., a function vs a method on an object.

jcmkk3 · 2023-01-25T17:02:52Z

I think that you're generally right in that it is a design constraint and also a difference in style between the languages. R can do object-oriented programming, but it is not the norm (especially in the tidyverse ecosystem).

The dplyr package doesn't own the compute functions like Ibis does so it makes a lot of sense for them to implement the across helper as a higher order function.

I personally think that something like the polars approach is more natural in Python (minus the Utf8 type that should probably be Str). It could still be a function like across that originally selects multiple columns, but the transformations/computations would be applied through method chaining just like working with a single column.
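Roughly what I have in mind, as a toy sketch: a function picks the columns, and every method chained afterwards is forwarded to each of them. The Col and Across classes and the across function here are hypothetical names for illustration only; they are not from Ibis, polars, or dplyr.

```python
class Col:
    """Toy single-column expression that records chained method calls."""

    def __init__(self, name, ops=()):
        self.name = name
        self.ops = tuple(ops)

    def cast(self, dtype):
        return Col(self.name, self.ops + (f"cast({dtype})",))

    def sum(self):
        return Col(self.name, self.ops + ("sum()",))

    def __repr__(self):
        return ".".join((self.name,) + self.ops)


class Across:
    """Hypothetical multi-column selection that forwards calls to each column."""

    def __init__(self, cols):
        self.cols = list(cols)

    def __getattr__(self, method):
        # Only reached for attributes Across itself lacks, e.g. cast or sum.
        def forward(*args, **kwargs):
            return Across(getattr(c, method)(*args, **kwargs) for c in self.cols)

        return forward


def across(*names):
    return Across(Col(n) for n in names)


# Chaining reads the same as it would for a single column.
exprs = across("x", "y").cast("int").sum().cols
print(exprs)
```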

ksuarez1423 · 2023-01-25T21:44:50Z

While across seems to fulfill the need quite nicely, I agree that the Polars approach feels more natural in Python. When I went to try the selector with an aggregate, my instinct was to call t.select(s.numeric()).min(), because of its similarity to t["col"].min(). From my perspective, pl.col() matches that experience well enough for it to feel natural.

Table-wide operations also make sense from a user perspective, because it means if I have a fully numeric table, I could do t.sum() or even t.all().sum() if that would make code easier. Then, that makes t.select(s.numeric()).sum() a very easy counterpart -- I'd be able to fairly easily think, "oh, t.select() gets me a subset that's a table, so all the table-wide things work!"

cpcloud · 2023-01-25T23:04:41Z

pl.col has a bit too much going on for my taste but I do agree that method chaining to build up the selectors is the way to go.

jcmkk3 · 2023-01-26T01:31:20Z

I agree with this as well. I think that starting with a specific function for performing an operation on multiple columns would stand out more when scanning the code.

cpcloud added the feature, expressions, and ux labels Jan 24, 2023

cpcloud force-pushed the selectors branch 3 times, most recently from 1e4e3f0 to 2bbd2e8 January 24, 2023 11:15

cpcloud force-pushed the selectors branch 3 times, most recently from 2258d69 to a9c61cb January 24, 2023 11:43

cpcloud marked this pull request as ready for review January 24, 2023 11:44

cpcloud added this to the 4.1 milestone Jan 24, 2023

cpcloud added the docs label Jan 24, 2023

feat(api): add selectors for easier selection of columns

beffd6b

cpcloud force-pushed the selectors branch from a9c61cb to 5fe79f3 January 24, 2023 11:55

chore(selectors): add docstrings and examples

23c977f

cpcloud force-pushed the selectors branch from 5fe79f3 to 23c977f January 24, 2023 11:56

kszucs reviewed Jan 24, 2023

kszucs approved these changes Jan 24, 2023

kszucs merged commit 133cf6f into ibis-project:master Jan 24, 2023

cpcloud deleted the selectors branch January 24, 2023 13:22

eitsupi mentioned this pull request Apr 4, 2023

tidyselect like column selection and update PRQL/prql#2386

Open
