tidyselect like column selection and update #2386
Comments
Excellent write up @eitsupi !
(Edit: I don't think these are that different on reflection.) I think the requirements are similar: either PRQL needs to know the column names itself, or the DB needs to support the abstraction. Without a DB abstraction, the SQL for Names is fairly concise, while the SQL for Values is quite verbose, since it requires listing a set of conditions. Overall, I'm open to these. My take is that they're more complicated than standard PRQL because of those constraints. When we have DB cohesion, they'll be more powerful, since providing the input columns won't be necessary. My focus for the project would still be to make the basic language more robust (we're still finding occasional bugs, many of which remain!), but there's no reason we can't do these in parallel. What are others' thoughts?
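To illustrate why the Values case gets verbose without a DB-side abstraction, here is a minimal Python sketch that writes out one `IS NOT NULL` condition per column. The function name `not_null_filter` and the table and column names are hypothetical, and this is not how PRQL or dbplyr actually generates SQL.

```python
def not_null_filter(table, columns):
    """Build a SELECT that keeps rows where every listed column is non-null."""
    if not columns:
        raise ValueError("at least one column name is required")
    # One condition per column: this is the part that grows with the schema.
    conditions = " AND ".join(f'"{name}" IS NOT NULL' for name in columns)
    return f'SELECT * FROM "{table}" WHERE {conditions}'


# Hypothetical table and columns, for illustration only.
print(not_null_filter("tab", ["a", "b", "c"]))
```

The caller has to supply every column name up front, which is the constraint described above.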
I quite like the way DuckDB does this. Not having thought about it too much yet, to me it seems that this veers into the territory of macros. I was thinking that we will probably need those anyway, as I was thinking about them in the context of functions generalising over idents (maybe that can be done differently, but that's how I've been thinking about it so far).
Are these really different? My understanding from what I've read about
The part of this that seems a bit magical to me is how it knows where the expansion stops.
I would propose implementing this with a macro named columns!:
select [min columns! price_*]
Is that clear enough or does it need parentheses?
select [min (columns! price_*)]
What if you have multiple functions?
select [columns! price_* | abs | min, columns! price_* | abs | max]
Could this be written better? Looking at the second case:
filter (columns! price_*) != null
I'm not sure about any of this. Just exploring some ideas really. Thoughts?
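One possible reading of the `price_*` pipeline idea, sketched in Python: expand the pattern against a known column list, then wrap each matching column in the listed functions, innermost first. The names `expand_columns` and `apply_pipeline`, the schema, and the expansion semantics are all assumptions for illustration; nothing here reflects an agreed PRQL design.

```python
from fnmatch import fnmatchcase


def expand_columns(pattern, schema):
    """Return the schema columns matching a glob pattern, in schema order."""
    return [name for name in schema if fnmatchcase(name, pattern)]


def apply_pipeline(pattern, functions, schema):
    """Wrap every matching column in the given functions, innermost first."""
    expressions = []
    for name in expand_columns(pattern, schema):
        expr = f'"{name}"'
        for func in functions:
            expr = f"{func.upper()}({expr})"
        expressions.append(expr)
    return expressions


# Hypothetical schema, for illustration only.
schema = ["id", "price_usd", "price_eur", "qty"]
print(apply_pipeline("price_*", ["abs", "min"], schema))
```

Under this reading the expansion stops at the pipeline boundary: each matched column gets the whole chain of functions applied to it, yielding one expression per column.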
Agree with this. Without knowledge of the schema, PRQL could only translate to the DuckDB syntax (or equivalent) where that's available. Once we are able to operate on a cached schema definition then we could do the column name expansions at compilation time.
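A rough Python sketch of those two paths, assuming a hypothetical `select_excluding` helper: with no schema it can only emit DuckDB's `SELECT * EXCLUDE (...)`, and with a cached schema it expands to an explicit column list that any database would accept. The table and column names are made up, and this is not the PRQL compiler's actual behaviour.

```python
def select_excluding(table, excluded, schema=None):
    """Translate 'select everything except these columns' into SQL.

    schema is an optional list of known column names (hypothetical cache).
    """
    if schema is None:
        # No schema knowledge: defer the expansion to the database (DuckDB syntax).
        names = ", ".join(f'"{name}"' for name in excluded)
        return f'SELECT * EXCLUDE ({names}) FROM "{table}"'
    unknown = set(excluded) - set(schema)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    # Schema known: expand at compile time into an explicit column list.
    kept = [name for name in schema if name not in set(excluded)]
    if not kept:
        raise ValueError("every column was excluded")
    columns = ", ".join(f'"{name}"' for name in kept)
    return f'SELECT {columns} FROM "{table}"'


print(select_excluding("tab", ["foo"]))
print(select_excluding("tab", ["foo"], schema=["foo", "bar", "baz"]))
```

The second form also lets the compiler reject a misspelled column before the query reaches the database.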
No, on reflection, not really! I've updated my comment. Thanks. What's the difference between a macro and a function? A function operates on a column, and a macro generates column names? I would vote quite strongly against introducing a totally new language concept like macros given the current state of the language & project, since adding this will add to the maintenance and complexity burden. (I'm a big +1 on things like the module system & type system. I'm not saying "no new features", but we should be balancing the cost of new features with our confidence that they're going to make the language more useful.)
In Rust, the difference is that macros operate at the syntactic level, before any type checking or anything semantic. But this is already very close to our functions. We may be able to get this working without new syntax:
This does require more parentheses and knowledge of functional programming, but it is not magic. For beginners, we could have a snippet to copy-paste / library / package that would contain useful stuff like:
This comment was marked as outdated.
I'd guess that Am I missing something, why is this relevant?
My apologies, my misunderstanding!
Column selection with tidyselect is a very powerful feature of dplyr.
https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html
Similar features have recently been introduced in ibis.
ibis-project/ibis#5307
There is also a dbt package that mimics this functionality.
https://github.com/emilyriederer/dbtplyr
I believe that many tidyselect features can be implemented only when the schema is known, but DuckDB has COLUMNS(), with which some of them can be implemented (duckdb/duckdb#6621).
So I am wondering if it is possible to partially implement this in PRQL, like select ![foo] -> SELECT * EXCLUDE foo.
The following examples use dplyr and duckdb on R.
DuckDB is the unreleased edge version (0.8.0). (Installed from R-universe)
Filter
The most common use would be to remove rows that contain null in either column.
The tidyverse also has a dedicated function called tidyr::drop_na, but since the introduction of dplyr::across in dplyr 1.0, this can be done with dplyr alone.
(if_all and if_any are functions derived from across.)
Using dbplyr, we can see that the dplyr query is converted to SQL like the following.
The next version of DuckDB can execute the same operation by using COLUMNS().
It seems equivalent to dplyr's filter(if_all()).
Select
Columns can be selected using regular expressions.
In DuckDB, this could be done as follows.
Update columns
We can update the selected columns.
Unfortunately, it is difficult to manipulate column names in DuckDB with COLUMNS(*).
Select columns using lambda function
Note that tidyselect can also be used to select columns using lambda functions, but this is not supported by dbplyr.
c.f. https://news.ycombinator.com/item?id=30067462