feat(api): add selectors for easier selection of columns #5307
Nice! Are you still working on this?
@kszucs Just finished up the first iteration of the docs. It's ready for review now.
pretty neat @cpcloud
Nice API addition!
I wonder whether this API will be easy to use in the real world, so I'm looking at the Ibis code that uses column groups. Is it possible to rewrite this piece of code to use the new API? ibis/docs/backends/app/backend_info_app.py Lines 155 to 158 in 318179a
I don't think so, because that bit of code isn't filtering the columns by name or property in any way; it's looking at every column in the table.
You could do something like: sum(map(t.__getitem__, t.select(s.of_type("bool")).columns))
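To make the expression above concrete, here is a minimal, library-free sketch of what it computes. `SCHEMA`, `ROWS`, and the helpers `of_type`, `select_columns`, and `row_sums` are hypothetical stand-ins for an Ibis table and `s.of_type`, not Ibis's API; the only point is that the selector picks columns by type and the sum then adds those columns together row by row.

```python
# Toy stand-in for a table: a schema mapping column name -> type name,
# plus rows stored as dicts. None of this is Ibis code.
SCHEMA = {"name": "string", "has_sql": "bool", "has_window": "bool", "n_ops": "int64"}
ROWS = [
    {"name": "duckdb", "has_sql": True, "has_window": True, "n_ops": 300},
    {"name": "pandas", "has_sql": False, "has_window": True, "n_ops": 250},
]


def of_type(type_name):
    """Return a predicate that matches columns whose type is `type_name`."""
    return lambda column, column_type: column_type == type_name


def select_columns(schema, predicate):
    """Return the names of the columns matched by `predicate`, in schema order."""
    return [col for col, typ in schema.items() if predicate(col, typ)]


def row_sums(rows, columns):
    """Sum the selected columns within each row (True counts as 1)."""
    return [sum(int(row[col]) for col in columns) for row in rows]


bool_columns = select_columns(SCHEMA, of_type("bool"))
totals = row_sums(ROWS, bool_columns)
print(bool_columns, totals)
```

In this toy data the selection keeps only the two boolean columns, so each row's total is the number of boolean flags that are set.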
I wonder if it makes sense to define sum(t.select(s.of_type("bool")))
I know some systems are starting to allow concise syntax for doing whole-table operations. Maybe we can revisit adding aggregations and some kind of t.select(s.of_type("bool")).cast("int").sum() # single row table, cast applies to every column
@mik-laj You might find https://dplyr.tidyverse.org/articles/colwise.html interesting. Those examples are perhaps a good place to start experimenting with APIs for summarizing things across a subset of the columns in a table. I can imagine something like:
# group by a, b and count all string columns excluding a and b
t.group_by([_.a, _.b]).summarize(s.across(s.of_type("string") & ~s.contains(("a", "b")), [_.count()]))
# desugared into this
# selector = s.across(s.of_type("string") & ~s.contains(("a", "b")), [_.count()])
t.group_by([_.a, _.b]).agg(**{f"{col}_count": col.count() for col in selector.expand(t)})
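The desugaring idea above can be sketched without any library. In this hypothetical model a selector is a predicate over `(name, type)` pairs that supports `&` and `~`, and `summarize_across` is an invented helper that expands the selector and names each output `{column}_{function}`. Grouping is left out to keep the sketch short, and none of these names should be read as the API proposed in this PR.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Selector:
    """Hypothetical selector: a predicate over (column name, column type)."""

    predicate: Callable[[str, str], bool]

    def __and__(self, other):
        # Match columns accepted by both selectors.
        return Selector(lambda n, t: self.predicate(n, t) and other.predicate(n, t))

    def __invert__(self):
        # Match columns rejected by this selector.
        return Selector(lambda n, t: not self.predicate(n, t))

    def expand(self, schema):
        """Return the matching column names, in schema order."""
        return [n for n, t in schema.items() if self.predicate(n, t)]


def of_type(type_name):
    return Selector(lambda n, t: t == type_name)


def contains(needles):
    # Assumed semantics: the column name contains any of the given substrings.
    return Selector(lambda n, t: any(needle in n for needle in needles))


def count(values):
    """Count non-null values, like SQL COUNT(column)."""
    return sum(v is not None for v in values)


def summarize_across(rows, schema, selector, funcs):
    """Apply every function to every selected column, naming results col_func."""
    return {
        f"{col}_{func.__name__}": func([row[col] for row in rows])
        for col in selector.expand(schema)
        for func in funcs
    }


SCHEMA = {"a": "string", "b": "string", "c": "string", "d": "int64", "e": "string"}
ROWS = [
    {"a": "x", "b": "y", "c": "p", "d": 1, "e": None},
    {"a": "x", "b": "z", "c": None, "d": 2, "e": None},
    {"a": "w", "b": "y", "c": "q", "d": 3, "e": "r"},
]

selector = of_type("string") & ~contains(("a", "b"))
result = summarize_across(ROWS, SCHEMA, selector, [count])
print(result)
```

Keeping the selector as a plain composable predicate means the summarizing step only ever sees a list of column names, which is the same separation the desugared `agg(**{...})` form relies on.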
It might also be worth taking a look at the polars API for this. Basically, the
# group by a, b and count all string columns excluding a and b
t.groupby(["a", "b"]).agg(pl.col(pl.Utf8).exclude(["a", "b"]).count())
Neat, didn't know |
Can you comment more on why the method/function distinction matters for anything other than the API design (as opposed to being an implementation detail)? I would expect that the fundamentals of the problem are unrelated to how a thing is called, e.g., a function vs a method on an object.
I think that you're generally right in that it is a design constraint and also differences in style between the languages. R can do object oriented programming, but it is not the norm (especially in the tidyverse ecosystem). The dplyr package doesn't own the compute functions like Ibis does so it makes a lot of sense for them to implement the I personally think that something like the polars approach is more natural in python (minus the |
While Table-wide operations also make sense from a user perspective, because it means if I have a fully numeric table, I could do |
I agree with this as well. I think that starting with a specific function for performing an operation on multiple columns would stand out more when scanning the code. |
Convenient column selectors, shamelessly cribbed from https://dplyr.tidyverse.org/reference/select.html.