feat(api): add selectors for easier selection of columns #5307

Merged: 2 commits into ibis-project:master from the selectors branch, Jan 24, 2023

cpcloud
Member

@cpcloud cpcloud commented Jan 24, 2023

Convenient column selectors, shamelessly cribbed from https://dplyr.tidyverse.org/reference/select.html.

@cpcloud cpcloud added feature Features or general enhancements expressions Issues or PRs related to the expression API ux User experience related issues labels Jan 24, 2023
@github-actions
Contributor

github-actions bot commented Jan 24, 2023

Test Results

Files: 43 · Suites: 43 · Duration: 1h 48m 23s
Tests: 13 792 total, 10 309 ✔️ passed, 3 483 💤 skipped, 0 failed
Runs: 47 711 total, 35 164 ✔️ passed, 12 547 💤 skipped, 0 failed

Results for commit 23c977f.

♻️ This comment has been updated with latest results.

@cpcloud cpcloud force-pushed the selectors branch 3 times, most recently from 1e4e3f0 to 2bbd2e8 Compare January 24, 2023 11:15
@kszucs
Member

kszucs commented Jan 24, 2023

Nice! Are you still working on this?

@cpcloud cpcloud force-pushed the selectors branch 3 times, most recently from 2258d69 to a9c61cb Compare January 24, 2023 11:43
@cpcloud cpcloud marked this pull request as ready for review January 24, 2023 11:44
@cpcloud
Member Author

cpcloud commented Jan 24, 2023

@kszucs Just finished up the first iteration of the docs. It's ready for review now.

@cpcloud cpcloud added this to the 4.1 milestone Jan 24, 2023
@cpcloud cpcloud added the docs Documentation related issues or PRs label Jan 24, 2023
@jreback
Contributor

jreback commented Jan 24, 2023

pretty neat @cpcloud

Member

@kszucs kszucs left a comment

Nice API addition!

@kszucs kszucs merged commit 133cf6f into ibis-project:master Jan 24, 2023
@cpcloud cpcloud deleted the selectors branch January 24, 2023 13:22
@krzysztof-kwitt
Contributor

I wonder whether this API will be easy to use in the real world, so I'm looking at Ibis code that uses column groups. Is it possible to rewrite this piece of code to use this new API?

supported_backend_count = sum(
    getattr(table_expr, backend_name).ifelse(1, 0)
    for backend_name in current_backend_names
)

@cpcloud
Member Author

cpcloud commented Jan 24, 2023

I wonder whether this API will be easy to use in the real world, so I'm looking at Ibis code that uses column groups. Is it possible to rewrite this piece of code to use this new API?

supported_backend_count = sum(
    getattr(table_expr, backend_name).ifelse(1, 0)
    for backend_name in current_backend_names
)

I don't think so, because that bit of code isn't filtering the columns by name or property in any way; it's looking at every column in the table.

@cpcloud
Member Author

cpcloud commented Jan 24, 2023

You could do something like:

sum(map(t.__getitem__, t.select(s.of_type("bool")).columns))
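
To make the idea concrete without depending on Ibis itself, here is a minimal pure-Python sketch of "pick the boolean columns, then add them up row by row". The `table` dict, the `of_type` helper, and the column names are hypothetical stand-ins for illustration, not the Ibis API.

```python
# Toy stand-in for a table: column name -> (type name, values). Not Ibis.
table = {
    "duckdb": ("bool", [True, False, True]),
    "sqlite": ("bool", [True, True, False]),
    "name": ("string", ["a", "b", "c"]),
}


def of_type(table, type_name):
    """Return the names of the columns whose declared type matches."""
    return [name for name, (dtype, _) in table.items() if dtype == type_name]


bool_columns = of_type(table, "bool")

# Row-wise sum: for each row, count how many of the selected columns are True.
row_counts = [
    sum(row) for row in zip(*(table[name][1] for name in bool_columns))
]
```

The selection step only looks at the schema, while the summation only looks at the selected values, which is roughly the split the one-liner above relies on.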

@cpcloud
Member Author

cpcloud commented Jan 24, 2023

I wonder if it makes sense to define TableExpr.__iter__ to iterate over the column objects, which would make the computation much shorter:

sum(t.select(s.of_type("bool")))
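
As a rough illustration of what that would require, the sketch below uses hypothetical `Column` and `Table` classes (not Ibis types). The one subtle point is that the built-in `sum` starts from the integer `0`, so the column type has to support `0 + column` as well as `column + column`.

```python
class Column:
    """Toy column: a name plus a list of values."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)

    def __add__(self, other):
        if isinstance(other, Column):
            # Element-wise addition of two columns.
            summed = [a + b for a, b in zip(self.values, other.values)]
            return Column(f"({self.name} + {other.name})", summed)
        # Adding a scalar to every value.
        return Column(self.name, [a + other for a in self.values])

    # sum() begins with 0 + first_column, which Python dispatches to __radd__.
    __radd__ = __add__


class Table:
    """Toy table whose iteration yields Column objects."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def __iter__(self):
        return (Column(name, values) for name, values in self.columns.items())


flags = Table({"duckdb": [True, False, True], "sqlite": [True, True, False]})
total = sum(flags)  # a Column holding the row-wise sum
```

One design consequence worth weighing: once a table is iterable, `list(t)` and tuple unpacking also start yielding columns, which may or may not be what users expect.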

@cpcloud
Member Author

cpcloud commented Jan 24, 2023

I know some systems are starting to allow concise syntax for doing whole-table operations. Maybe we can revisit adding aggregations and some kind of "apply this expression to every column" method, so you can do things like:

t.select(s.of_type("bool")).cast("int").sum()  # single row table, cast applies to every column
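
A minimal sketch of what "every column" semantics could look like, assuming a hypothetical toy `Table` class (this is not how Ibis implements anything): `cast` maps over all columns and `sum` reduces each column to a single value, giving a one-row table.

```python
class Table:
    """Toy table: maps column names to lists of values."""

    # Hypothetical type names, chosen only for this example.
    _CASTS = {"int": int, "float": float, "str": str}

    def __init__(self, columns):
        self.columns = dict(columns)

    def cast(self, type_name):
        """Cast every column to the named type."""
        convert = self._CASTS[type_name]
        return Table(
            {name: [convert(v) for v in values] for name, values in self.columns.items()}
        )

    def sum(self):
        """Reduce every column to its sum, producing a single-row table."""
        return Table({name: [sum(values)] for name, values in self.columns.items()})


flags = Table({"duckdb": [True, False, True], "sqlite": [True, True, True]})
totals = flags.cast("int").sum()
```

Because both methods return a `Table`, they chain in the same way the proposed expression does.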

@cpcloud
Member Author

cpcloud commented Jan 25, 2023

@mik-laj You might find https://dplyr.tidyverse.org/articles/colwise.html interesting. Those examples are perhaps a good place to start experimenting with APIs for summarizing things across a subset of the columns in a table.

I can imagine something like:

# group by a, b and count all string columns excluding a and b
t.group_by([_.a, _.b]).summarize(s.across(s.of_type("string") & ~s.contains(("a", "b")), [_.count()]))

# desugared into this
# selector = s.across(s.of_type("string") & ~s.contains(("a", "b")), [_.count()])
t.group_by([_.a, _.b]).agg(**{f"{col.get_name()}_count": col.count() for col in selector.expand(t)})
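
The selector algebra in that example (combining with `&`, negating with `~`, then expanding against a table) can be sketched in plain Python. Everything here is an assumption made for illustration: the `Selector` class, the substring semantics of `contains`, and the schema are hypothetical and are not taken from the Ibis source.

```python
class Selector:
    """Toy selector: wraps a predicate over (column name, type name)."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __and__(self, other):
        return Selector(lambda n, d: self.predicate(n, d) and other.predicate(n, d))

    def __invert__(self):
        return Selector(lambda n, d: not self.predicate(n, d))

    def expand(self, schema):
        """Return the column names in `schema` that satisfy the predicate."""
        return [n for n, d in schema.items() if self.predicate(n, d)]


def of_type(type_name):
    return Selector(lambda n, d: d == type_name)


def contains(needles):
    # Assumed semantics: matches if any needle is a substring of the name.
    return Selector(lambda n, d: any(needle in n for needle in needles))


schema = {"a": "string", "b": "string", "city": "string", "zip": "string", "x": "int64"}
selected = (of_type("string") & ~contains(("a", "b"))).expand(schema)

# "Desugaring": one named count aggregation per selected column.
aggregations = {f"{name}_count": ("count", name) for name in selected}
```

Python has no `!` operator for negation, so a selector type has to hook `~` through `__invert__` as shown.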

@jcmkk3

jcmkk3 commented Jan 25, 2023

@mik-laj You might find https://dplyr.tidyverse.org/articles/colwise.html interesting. Those examples are perhaps a good place to start experimenting with APIs for summarizing things across a subset of the columns in a table.

It might also be worth taking a look at the polars API for this. Basically, the pl.col selector allows selecting multiple columns in a few different ways, then when you perform methods on that selection it will apply to all columns that were selected. dplyr has to approach the problem a bit differently since the data transformations are functions instead of methods.

# group by a, b and count all string columns excluding a and b
t.groupby(["a", "b"]).agg(pl.col(pl.Utf8).exclude(["a", "b"]).count())

@cpcloud
Member Author

cpcloud commented Jan 25, 2023

Neat, didn't know pl.col did all of that!

@cpcloud
Member Author

cpcloud commented Jan 25, 2023

Can you comment more on why the method/function distinction matters for anything other than the API design (as opposed to being an implementation detail)?

I would expect that the fundamentals of the problem are unrelated to how a thing is called, e.g., a function vs a method on an object.

@jcmkk3

jcmkk3 commented Jan 25, 2023

Can you comment more on why the method/function distinction matters for anything other than the API design (as opposed to being an implementation detail)?

I think that you're generally right in that it is a design constraint, as well as a difference in style between the languages. R can do object-oriented programming, but it is not the norm (especially in the tidyverse ecosystem).

The dplyr package doesn't own the compute functions like Ibis does so it makes a lot of sense for them to implement the across helper as a higher order function.

I personally think that something like the polars approach is more natural in Python (minus the Utf8 type that should probably be Str). It could still be a function like across that originally selects multiple columns, but the transformations/computations would be applied through method chaining just like working with a single column.
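
One way to picture "select several columns, then chain methods as if it were one column" is a wrapper that forwards each method call to every member. This is only a sketch under stated assumptions: the `Col` and `Cols` classes are hypothetical, and neither polars nor Ibis is claimed to work this way internally.

```python
class Col:
    """Toy single column with a couple of aggregations."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)

    def count(self):
        # Count non-null values.
        return sum(v is not None for v in self.values)

    def min(self):
        return min(v for v in self.values if v is not None)


class Cols:
    """Toy multi-column selection that forwards method calls to each column."""

    def __init__(self, cols):
        self.cols = list(cols)

    def __getattr__(self, method):
        # Only reached for names not found normally, e.g. .count or .min.
        def call(*args, **kwargs):
            return {c.name: getattr(c, method)(*args, **kwargs) for c in self.cols}

        return call


selection = Cols([Col("city", ["Oslo", None, "Lima"]), Col("zip", [150, 15001, 15001])])
counts = selection.count()
minimums = selection.min()
```

The call site reads the same whether one column or many were selected, which is the ergonomic point being made above.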

@ksuarez1423
Contributor

While across seems to fulfill the need quite nicely, I agree that the Polars approach feels more natural in Python. When I went to try the selector with an aggregate, my instinct was to call t.select(s.numeric()).min(), because of its similarity to t["col"].min(). From my perspective, pl.col() matches that experience well enough for it to feel natural.

Table-wide operations also make sense from a user perspective, because it means if I have a fully numeric table, I could do t.sum() or even t.all().sum() if that would make code easier. Then, that makes t.select(s.numeric()).sum() a very easy counterpart -- I'd be able to fairly easily think, "oh, t.select() gets me a subset that's a table, so all the table-wide things work!"

@cpcloud
Member Author

cpcloud commented Jan 25, 2023

pl.col has a bit too much going on for my taste but I do agree that method chaining to build up the selectors is the way to go.

@jcmkk3

jcmkk3 commented Jan 26, 2023

pl.col has a bit too much going on for my taste but I do agree that method chaining to build up the selectors is the way to go.

I agree with this as well. I think that starting with a specific function for performing an operation on multiple columns would stand out more when scanning the code.
