improvement on programmatically substituting columns in expressions #2655
Comments
of course |
I like it. Here's a similar idea from #633 by Synergist:
|
Yes -- absolutely. The original news item in 1.10.2 (Jan 2017) covers this; I've bolded the relevant sentence. I wasn't sure about it at the time and hoped for feedback before continuing. Seems like it's a go, then. IIRC, without looking at the code yet, it's not that hard, as there's already a substitution of all appearances of the symbol. My recent tweet here seems to have yielded a positive response, too. |
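For readers catching up, the `..` prefix in `j` that the 1.10.2 news item introduced behaves like this minimal sketch (my example, not from the thread):

```r
library(data.table)

DT <- data.table(x = 1:3, y = letters[1:3])
cols <- "x"
# '..' tells [.data.table to look 'cols' up one level, in the calling
# scope, instead of treating it as a column name
DT[, ..cols]
```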
Nice! I like it too. It'd also, when implemented, take care of issues like `dt[x > x]`, which could then be done as `dt[x > ..x]`. |
@MichaelChirico I assumed that's the case based on #2620 |
At one stage I think there was a suggestion to use a single |
Looking at #697 (comment), I guess the exception for a single symbol in |
Hm, yes. The difference there is that a single symbol in |
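The distinction under discussion can be seen in current behavior (my illustration): a bare character value in `i` triggers a keyed join, while a `..`-prefixed symbol in `j` selects columns by name:

```r
library(data.table)

DT <- data.table(id = c("a", "b", "c"), v = 1:3, key = "id")
key_val <- "b"
DT[key_val]     # character in i: join on the key column, returns the "b" row
cols <- "v"
DT[, ..cols]    # ..symbol in j: select the column named in 'cols'
```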
Looking forward to this one! Seems like this has potential to replace a boatload of workarounds. One pipe dream of mine might look like the following, where both expressions would give the same output. (As I started to type this out, I did start to question whether this is even realistic.)

set.seed(1234)
DT <- data.table(foo = rep(LETTERS[1:2],8),
bar = rep(letters[17:20],4),
month = rep(month.name[1:4],4),
day = seq_len(16),
yyy = rnorm(16,mean = 0.5,1.5))
## Expression 1: with hard coded columns
DT[yyy > 0,.(NewCol = paste(month, day),
bar,
yyy), by = foo]
## Expression 2: everything passed by reference
A = "yyy"
B = 0
C = "NewCol"
D = "month"
E = "day"
G = "foo"
H = c("bar","yyy")
DT[..A > ..B , .(..C = paste(..D, ..E),
..H), by = ..G]
However, even as I write this I'm starting to see some potential hang-ups, in particular with how the prefix would be resolved. In the case below, it does seem like there is potential ambiguity in whether each name is meant as a column or as a value:

A = "foo" ## intended column name
B = "xyz" ## intended literal character string for comparison
DT[..A == ..B]

Whatever gets implemented, I'm looking forward to it! Some relevant Stack Overflow questions (with lots of responses from contributors here):
|
Ok, now in master, just for symbols used in `j` so far. Looking at @msummersgill's comment more closely just now, there seem to be two different ideas:
I had been thinking only of case 1 so far; that's what @arunsrinivasan raised this issue about, and @franknarf1 and @MichaelChirico agreed with further examples. That's what the now-merged PR does. But @msummersgill is suggesting something completely different (case 2) that I hadn't considered before. That's a very neat idea and I like it. Do we need a different prefix for case 2, then, and do that too? And now I see @msummersgill already detailed these two cases at the end of his comment, but I didn't digest that before. |
Looks like we can't have `..` on the left-hand side of `=` in `j`. So, taking Matt S's example, how about:

DT[ " @A > @B , .(@C = paste(@D, @E), @H), by = @G " ]

The rule would be (a little like the first argument of fread, which inspects its input to decide how to treat it) that if the single argument is a string containing @ tokens, it is treated as a meta-query. More examples:

col = "x"
DT[, ..col] # select column "x" (even if "cols" is a column name too)
DT[, c(..col, "y")] # select columns "x" and "y"
thresh = 6.5
DT[ x > ..thresh ] # select rows where "x" column > 6.5
DT[ "@col > y" ] # same as DT [ x > y ] i.e. comparison of two columns
anyString = "..thresh"
DT[ "@col > @anyString" ] # same as DT[ x > ..thresh ]

It's now actually an advantage that the meta-query is a string: it stands out visually, and syntax highlighting colors it differently. |
I don't like the idea of one string with replacements of @ tokens. |
I agree with Jan re preferring expressions over strings. One other idea: add a function like g() that marks its argument as a name to look up in the calling scope.

I don't know anything about feasibility, but something like this would be easy to understand and work with, I imagine. Related to #1712 (comment) and #633. Just to clarify, here are translations of Matt's examples:
|
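The hypothetical g() isn't implemented; today get() performs a comparable scoped lookup inside i and j, which is roughly what such translations would look like (my sketch):

```r
library(data.table)

DT <- data.table(x = 1:5, y = 6:10)
var1 <- "x"
# hypothetical: DT[g(var1) > 3, .(sum_y = sum(y))]
# with current syntax, get() resolves the column named by var1:
DT[get(var1) > 3, .(sum_y = sum(y))]
```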
The idea of having a single string in i with @ substitution has some drawbacks. While those drawbacks certainly might be worth the trade-off, one more objective speed bump I see at the moment is the use of a single symbol for several distinct cases. It seems like one of the underlying challenges here is that column names in a data.table can be used in the external environment with a completely different definition.

@mattdowle I've tried to address your questions below by adding some usage examples illustrating each of the possible cases. By "external" I mean that the intended target of the evaluated name is explicitly not a column within the data.table; let me know if there's anything else I can try to better clarify!

Thinking about manipulations on a single table only (joins introduce a whole new set of potential overlaps in names; not sure if that should be considered in this scope), here's a list of some of the potential ways I (and perhaps other users) might like to be able to define different targets explicitly to avoid ambiguity. To me it would make intuitive sense to extend the usage of the `..` prefix.
Examples:

# Case --------------------------------
# 1 2
DT[foo == "A"] # Cases 1 & 2: Literal column name, and a literal value
col = "foo" # Case 3: Column Name defined externally
val = "A" # Case 4: Value defined externally
# Case --------------------------------
# 3 4
DT[..col == ..ext.val] # DT[foo == "A"]
cols = c("foo","bar") # Case 5: List of column names defined externally
# Case --------------------------------
# 5
DT[, ..cols] # DT[, .(foo,bar)]
newcol = "nextday" # Case 6: Name of new column created in DT, defined externally
# Case --------------------------------
# 6 1 2
DT[, ..ext.newcol := day + 1] # DT[, nextday := day + 1]
newcols = c("nextday", "nextday2") # Case 7: List of names for multiple new columns created in DT, defined externally
# Case --------------------------------
# 7 1 2 1 2
DT[, ..ext.newcols := .(day + 1, day + 2)] # DT[, c("nextday", "nextday2") := .(day + 1, day + 2)]
## Putting it all together (in a totally nonsensical way for illustration only)
col = "foo"
col2 = "bar"
col3 = "day"
newgroup = "baz"
val = "A"
val2 = 1
newcols = c("nextday", "fooday")
printcols = c("baz","nextday","fooday")
# Case -------------------------------------------------------------------------------------------------------------
# 3 4 7 3 4 3 3 6 3 5
DT[..col == ..ext.val, ..ext.newcols := .(..col3 + ..ext.val2, paste0(..col, ..col3)), by = .(..ext.newgroup = ..col2)][, ..printcols]
#DT[foo == "A", c("nextday", "fooday") := .(day + 1, paste0(foo, day)), by = .(baz = bar)][,.(baz, nextday, fooday)]
From what I understand, is the prefix symbol limited to what will parse as a valid object name in base R? If so, that constrains the options to cover cases 4, 6, and 7. As alluded to by others, it does seem that a single prefix for all of cases 3-7 could potentially work if the column names (cases 3 and 6) were quoted when defined in the external environment, tidyeval style? Not too familiar with the framework; will do some more reading to see if some of the same design decisions would be useful here. |
Let's take the comments from the top, in order :
Can you please show an example of how easy it is to use?
In fact they won't be able to pass a string at all, because that would need a variable. A variable in
Can you provide one example of fragility please, so I can understand.
I have done programmatic things like this in the past when I was a user of data.table. But honestly, I never really liked it.
Because I've just extended
Let's revisit after Jan has responded to points above.
The trouble is how to deal with it internally. I like the token replacement because it's an explicit syntax that allows you to say: ok, this query is programmatic, but up-front programmatically defined once. After that, it runs exactly like a regular query. That's clean to reason about in my view. There's a clear separation of logic: i) the first step that defines the query to run, then ii) that query runs. If we have g() and eval() and bquote() and things like that, then they run throughout the internals and make things hard. I remember once writing why I liked and recommended the EVAL = parse(text=paste(...)) approach: because it resulted in a query that could be optimized cleanly, just as if the regular simple query had been typed in the first place.
True. After a few sleeps, I'm still liking it. 'Normal' in this case is using bquote and eval, etc, so the comparison should be to that since 'normal' data.table syntax doesn't have this ability.
Those things aren't going to work anyway when writing the meta-query (because the column name used will be flexibly defined.) So maybe it's good that this is with a string literal and therefore colored differently in the editor so it stands out. Even looking at the colors here in GitHub above in my comment showing the examples, it looks quite good to my eyes already. The meta-query stands out in blue with quotes around, while the true variables in calling scope are highlighted in orange. It looks similar in my editor locally too. I don't follow much of the rest of your comment. I need to see actual code examples. I can't translate that table to code because I'm not 100% clear what you mean by 'external', for example.
Yes. That idea is transitional, after several years if it goes well, strict=TRUE would be the case always and not optional. It is not just for advanced users, though. In fact, the opposite. It's to save accidents by non-advanced users. Beginners will need to use |
Not bquote actually, but a language solution, below.

There is no fragility in this solution; querying DT by a character expression is, in turn, well handled by eval-parse. Neither the language nor the string solution is really user friendly, but both are powerful, base R, stable, well-defined and tested APIs.

library(data.table)
set.seed(1234)
DT <- data.table(foo = rep(LETTERS[1:2],8),
bar = rep(letters[17:20],4),
month = rep(month.name[1:4],4),
day = seq_len(16),
yyy = rnorm(16,mean = 0.5,1.5))
## Expression 1: with hard coded columns
DT[yyy > 0,.(NewCol = paste(month, day),
bar,
yyy), by = foo] -> ans
## Expression 2: everything passed by reference
A = "yyy"
B = 0
C = "NewCol"
D = "month"
E = "day"
G = "foo"
H = c("bar","yyy")
#DT[..A > ..B , .(..C = paste(..D, ..E),
# ..H), by = ..G]
# foo NewCol bar yyy
"# ---- language ----"
qc = as.call(list(
as.name("["),
x = quote(DT),
i = call(">", as.name(A), B),
j = as.call(c(list(as.name(".")), setNames(list(call("paste", as.name(D), as.name(E))), C), lapply(H, as.name))),
by = as.name(G)
))
eval(qc) -> ans2
"# ---- string ----"
ci = paste(A, B, sep=" > ")
cj = paste0(".(", C, " = paste(", D, ", ", E, "), ", paste(H,collapse=", "),")")
cby = paste0("list(", G, ")") # this could be simplified
DT[eval(parse(text=ci)), eval(parse(text=cj)), by = eval(parse(text=cby))] -> ans3
# another
cdt = paste0("DT[", paste(c(ci, cj, cby), collapse=","), "]")
eval(parse(text=cdt)) -> ans4
identical(ans, ans2) # T
identical(ans, ans3) # T
identical(ans, ans4) # T

What actually seems more reasonable to me is a helper function for string construction by substituting @ tokens:

query = helper("@A > @B, .(@C = paste(@D, @E), @H), @G")
DT[text = query]
# or alternative api
DT[~query]

which might be tricky because it is not obvious whether such an argument is a meta-query or a regular expression in i.
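A minimal sketch of such a helper (hypothetical; the name helper and the token rules are my assumptions) could do plain text substitution of @ tokens from the calling scope before parsing:

```r
library(data.table)

# Hypothetical helper(): replace each @name token in 's' with the value of
# 'name' in the calling scope; vectors are collapsed with commas.
# Naive sketch: tokens sharing a prefix (e.g. @A and @AB) would need care.
helper <- function(s, envir = parent.frame()) {
  toks <- unique(regmatches(s, gregexpr("@[[:alnum:]._]+", s))[[1]])
  for (tok in toks) {
    val <- get(substring(tok, 2L), envir = envir)
    s <- gsub(tok, paste(val, collapse = ", "), s, fixed = TRUE)
  }
  s
}

A <- "yyy"; B <- 0
helper("@A > @B")   # builds the string "yyy > 0"
```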
agree |
Thanks for the response; that makes sense. One objection to the @ prefix is that it seems to be designed only to cover the single-variable use case. What if we want to pass a vector of columns that is not a subset of the table's names? My other objection is what Jan mentioned: "Yet for production code I would still construct call rather than operate on string as it adds extra sanity check for syntax." I like R's syntax checking that comes with writing code as code, as opposed to writing code inside a string. And for production code, I wouldn't really trust any code I'd written into a string, I guess. Another variation on the same idea: extend the `..` prefix so it looks up the variable in the calling scope. |
Jan seems to be almost there, now agreeing about @ token substitution, albeit as a helper. (Those quote(), call() and as.call() are all hard work for me, let alone users.) The @ tokens would be checked to be variables in calling scope, type character, and length 1 (error if not). One extra "if" to the documentation is pretty good, I'd say. It would explain what a meta-query is, and it's simple. Much simpler than explaining the alternatives, for which there is established confusion and bug reports. The reason for not passing in a variable containing the meta-query is so that it is clear it is a meta-query right there. We don't want whether it is a meta-query or not to depend on a variable which is defined higher up in the code, or passed in. It should look like a meta-query. Frank - ok I see your thinking on g() now. But if g() is to be substituted, may as well just use token substitution. The reader of such code would need to know what g() was and what it did. I think we need a more dramatic switch in look-and-feel to more clearly indicate meta-query mode.
It's designed to cover anything. Completely flexible. Straight forward token substitution to build any data.table query. Verbose mode would print the query that has been constructed. If that query doesn't parse, R will still catch that. But at run -time. Which is what is required anyway. It is a dynamic query.
What's the full example, please, and let's explore that.
I would trust the token substitution method for production more than a muddle of quote(), eval() and bquote(). |
I am thinking of, for example, summarizing multiple groups of variables, programmatically selecting input columns and naming output columns:
So, "@" here is doing two different types of substitutions:
Is that what you had in mind? I ask because I only noticed single-column examples so far and wasn't sure if it would also cover this use-case. Re my version of "current syntax" above, I could write get/mget instead of subsetting .SD, but tend to try to avoid those. Also, I guess there is a way with current syntax to make sure GForce still kicks in in this example, but I guess it would be tedious. Good point re simplicity trumping other concerns; thanks for explaining. I'm convinced that I could/would use this. |
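For comparison, one current-syntax way to write this kind of multi-group summary subsets .SD (my sketch; the column groupings follow the mtcars example discussed in this thread, and GForce will generally not apply to this form):

```r
library(data.table)

DT <- as.data.table(mtcars)
min_em  <- c("disp", "hp")   # columns summarized with min()
mean_em <- c("drat", "wt")   # columns summarized with mean()

# Build the per-group result as one named list by subsetting .SD
DT[, c(setNames(lapply(.SD[, min_em,  with = FALSE], min),  paste0("min.",  min_em)),
       setNames(lapply(.SD[, mean_em, with = FALSE], mean), paste0("mean.", mean_em))),
   by = am]
```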
Great example. So, given :
This is a tough one! To do all of this is maybe stretching what I had in mind. But let's give it a go. First thought is to perhaps codify the requirement into one definition rather than lots of variable names. As itself a data.table, for fun. > def = data.table( fun = c("min", "mean", "max"),
prefix = c("min", "mean", "max"),
cols=list (c("disp","hp"), c("drat", "wt"), c("qsec") )
> def
fun prefix cols
<char> <char> <list>
1: min min disp,hp
2: mean mean drat,wt
3: max max qsec Now expand the definition. (There must be a better way of doing this step.) > expanded = def[, paste0(prefix,".",cols[[1]],"=",fun,"(",cols[[1]],")"), by=1:3]
> expanded
: V1
<int> <char>
1: 1 min.disp=min(disp)
2: 1 min.hp=min(hp)
3: 2 mean.drat=mean(drat)
4: 2 mean.wt=mean(wt)
5: 3 max.qsec=max(qsec)
> J = paste(expanded$V1,collapse=", ")
> J
[1] "min.disp=min(disp), min.hp=min(hp), mean.drat=mean(drat), mean.wt=mean(wt), max.qsec=max(qsec)"

and then just use it:

> DT[ " , .(@J), by=@by_em " ]
...
Meta-query expanded to: DT[, .(min.disp=min(disp), min.hp=min(hp), mean.drat=mean(drat), mean.wt=mean(wt), max.qsec=max(qsec)), by=am]
Part of the difficulty here is that the requirement is not "these cols by all these funs using the fun as a prefix"; otherwise perhaps it would be easier. I realize it doesn't satisfy the separately-named max column in the original example. As one of the people who wrote the internal optimizations, I'm confident that the expanded meta-query in this case will be efficient and optimized, and I'm happy to work on doing better on queries like this. Whereas I am not at all happy or comfortable optimizing the version with lots of quote() and eval(). When we look later at this line in isolation, perhaps written by someone else:

DT[ " , .(@J), by=@by_em " ]

Yes, we need to know it is a meta query. But as long as we know that, we can see what it does at a glance. |
Definitely; the alternatives I was showing are for an experienced R programmer, not a beginner.

Almost completely flexible. Completely flexible is R eval-parse or operating on language objects. The example query used above is not that obvious:

DT["@A > @B, .(@C = paste(@D, @E), @H), @G"]

A direct translation of that should map to:

DT[i = yyy > 0, j = .(NewCol = paste(month, day), list(bar, yyy)), by = foo]

and not to:

DT[i = yyy > 0, j = .(NewCol = paste(month, day), bar, yyy), by = foo]

Of course the second one is expected in this case, but it requires splicing the length-2 vector H into j rather than substituting it as a single value.

It is not a muddle; it is a powerful and admittedly complex R metaprogramming feature. Yes, it is hard to write unless you use R metaprogramming regularly. It is not harder to debug than a string: the same way you log a character string, you log the quoted call. I never had any issues from using language objects. I fully agree on making a new, more user-friendly way for metaprogramming, but I would keep it as minimal as possible and re-use base R eval-parse or language objects as much as possible. The "great example" above, using the computing-on-the-language R feature:

library(data.table)
DT = as.data.table(mtcars)
build.j = function(min_em, min_em_prefix, mean_em, mean_em_prefix, max_it, max_it_name)
as.call(c(
list(as.name(".")),
lapply(setNames(min_em, paste(min_em_prefix, min_em, sep=".")), function(col) call("min", as.name(col))),
lapply(setNames(mean_em, paste(mean_em_prefix, mean_em, sep=".")), function(col) call("mean", as.name(col))),
setNames(list(call("max", as.name(max_it))), max_it_name)
))
qj = build.j(
min_em = c("disp", "hp"),
min_em_prefix = "min",
mean_em = c("drat", "wt"),
mean_em_prefix = "mean",
max_it = "qsec",
max_it_name = "mymax"
)
by_em = "am"
print(qj)
#.(min.disp = min(disp), min.hp = min(hp), mean.drat = mean(drat), mean.wt = mean(wt), mymax = max(qsec))
DT[, eval(qj), by_em] -> ans1
# as single call
qcall = as.call(list(as.name("["), quote(DT), substitute(), qj, as.name(by_em)))
print(qcall)
#DT[, .(min.disp = min(disp), min.hp = min(hp), mean.drat = mean(drat), mean.wt = mean(wt), mymax = max(qsec)), am]
eval(qcall) -> ans2 |
Another thing is security when doing eval-parse on user-supplied strings.

library(data.table)
DT = as.data.table(mtcars)
col = "am"
# eval-parse way
eval(parse(text=paste0("DT[1L,.(",col,")]")))
# am
#1: 1
# language way
eval(as.call(list(as.name("["), quote(DT), quote(1L), call(".", as.name(col)))))
# am
#1: 1
# eval-parse abuse
writeLines("cat('harmful code here\n')", "somescript.R")
col = "{source(\"somescript.R\"); am}"
eval(parse(text=paste0("DT[1L,.(",col,")]")))
#harmful code here
# V1
#1: 1
# try language abuse
eval(as.call(list(as.name("["), quote(DT), quote(1L), call(".", as.name(col)))))
#Error in eval(jsub, SDenv, parent.frame()) :
# object '{source("somescript.R"); am}' not found

This is not much of a problem for data.table users; more for developers who want to utilize data.table inside their apps, giving a UI to third parties. |
I edited my previous post to include some examples and clarifications. Let me know if it's still unclear.

On the topic of ["@..."]

I think continued steps towards full non-standard evaluation deserve careful thought. If you implemented this today, I would probably use it for some cases.

I do use data.table in every shiny application I write, but I'm definitely on the fence about this one and curious to hear what others have to say.

Testing current development branch

Vector vs data.table result

This may be a reasonable design decision, but this isn't the behavior I expected:

DT = data.table(x=1:3, y=2:4)
var = "x"
DT[, x] ## Returns a vector
DT[, ..var] ## Returns a one-column data.table with column x

Unexpected evaluation behavior

DT = data.table(x=1:3, y=2:4)
var = "x"
DT[, paste(..var)] # ..var evaluates as column x
# x
# 1: 1
# 2: 2
# 3: 3
DT[, paste(..var, x)] # ..var evaluates as string "x"
# [1] "x 1" "x 2" "x 3"
DT[, paste("blah",..var)] # ..var is concatenated by paste
# Error in `[.data.table`(DT, , paste("blah", ..var)) :
# column(s) not found: blah x
x = "foo"
var = "x"
dt[, paste("blah",get(..var))] # get(..var) "redirected" to global variable x
# Error in `[.data.table`(dt, , paste("blah", get(..var))) :
# column(s) not found: blah foo
dt[, .(paste("blah",..var))] # When wrapped in .() behavior is as expected
# V1
# 1: blah x
dt[, .(paste("blah",get(..var)))] # When wrapped in .() behavior with get() is also as expected
# V1
# 1: blah 1
# 2: blah 2
# 3: blah 3
Still need to use
|
@msummersgill One advantage of a separate symbol is that we can be sure that we're getting a column:
Similarly, regarding security, it seems like only letting @ select columns would be safe, so...
And developers shouldn't write functions like
Not trying to disagree here, but to clarify, the better way would be
@mattdowle So your objection is to using language objects throughout the internals? It's cheap talk since I've never contributed to the project, but I imagine that #852 would lower the barrier to potential contributors like me; I may be misunderstanding that issue or what you're saying here, though. |
I changed the title because the whole discussion deviated heavily from simple column selection.

Another alternative could be providing an interface similar to:

DT[, .(out_col = fun(in_col)), by=by_col,
   env = .(out_col="mpg_max", fun="max", in_col="mpg", by_col="cyl")]

All elements of env would be substituted into the query before evaluation. Working example using base R, except for LHS names, which are a tricky thing:

x = list()
class(x) = "cl"
"[.cl" = function(x, i, j, env) {
if (!missing(env) && length(env)) {
stopifnot(is.list(env), sapply(env, is.character))
env = lapply(env, as.symbol)
isub = eval(substitute(substitute(i, env)))
jsub = eval(substitute(substitute(j, env)))
} else {
isub = substitute(i)
jsub = substitute(j)
}
list(isub=isub, jsub=jsub)
}
x[var1==5L & var2%in%c("a","b"),
.(fvar1=fun(var1, na.rm=TRUE), charhead=head(var2, 1L))]
#$isub
#var1 == 5L & var2 %in% c("a", "b")
#$jsub
#.(fvar1 = fun(var1, na.rm = TRUE), charhead = head(var2, 1L))
x[var1==5L & var2%in%c("a","b"),
.(fvar1=fun(var1, na.rm=TRUE), charhead=head(var2, 1L)),
env = list(var1="myIntCol", var2="myCharCol", fun="sum")]
#$isub
#myIntCol == 5L & myCharCol %in% c("a", "b")
#$jsub
#.(fvar1 = sum(myIntCol, na.rm = TRUE), charhead = head(myCharCol, 1L))

Thanks to @ggrothendieck for assisting on how to handle LHS names:

x = list()
class(x) = "cl"
"[.cl" = function(x, i, j, env) {
if (!missing(env) && length(env)) {
stopifnot(is.list(env), sapply(env, is.character), sapply(env, nzchar))
env = lapply(env, as.symbol)
isub = if (!missing(i)) eval(substitute(substitute(i, env)))
jsub = if (!missing(j)) eval(substitute(substitute(j, env)))
# substitute names?
if (any(w <- names(jsub) %in% names(env))) {
names(jsub)[w] = unlist(lapply(env[names(jsub)[w]], as.character))
}
} else {
isub = substitute(i)
jsub = substitute(j)
}
list(isub=isub, jsub=jsub)
}
x[, .(fvar1=list(fvar2 = val1)),
env = list(fvar1="outer", fvar2="inner", val1="val_column")]
#$isub
#NULL
#$jsub
#.(outer = list(fvar2 = val_column))

Unfortunately we need to recursively traverse the expression to look for nested substitutions of LHS names. RHS is well handled by substitute(). |
Recursive traversal to substitute names in nested calls has been addressed in PR #4304. |
This issue turns out to be more generic than just column selection. |
Any chance y'all have a timeline for when this will be released to production? |
There are some issues asking for a release date already; best to follow those. |
This issue is resolved by the new env argument from #4304, already in the master branch. If any of the examples in this issue are not covered by the new functionality, please let us know so we can re-open this issue. |
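For completeness, the merged env argument from #4304 (available in recent data.table versions; exact behavior may vary by release) lets the earlier examples be written as:

```r
library(data.table)

DT <- as.data.table(mtcars)
# names listed in env= are substituted into i, j and by before evaluation
DT[filter_col > filter_val, .(out = fun(in_col)), by = by_col,
   env = list(filter_col = "mpg", filter_val = 30,
              fun = "max", in_col = "hp", by_col = "cyl")]
```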
Now we can subset cols using the `..` notation conveniently. Since `..var` has a very special meaning, we could also allow it inside other expressions? Currently, it errors as follows:
# Error in eval(jsub, SDenv, parent.frame()) : object '..cols' not found
Haven't really given it a lot of thought. Just came across it and wanted to file it as an issue so that it doesn't get lost.