Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

programming on data.table #4304

Merged
merged 39 commits into from
May 10, 2021
Merged
Show file tree
Hide file tree
Changes from 32 commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
5b8f98b
working draft of substitute names
jangorecki Mar 13, 2020
838f00d
substitute2 draft
jangorecki Mar 14, 2020
a2b876b
add example from #2655
jangorecki Mar 14, 2020
994ce4a
! on scalar not vector
MichaelChirico Mar 14, 2020
7816631
handle AsIs so char and symbols can be both used in env arg
jangorecki Mar 17, 2020
6084f28
non-symbol handling
jangorecki Mar 17, 2020
472d6d6
data.table query env support, rm AsIs class from env input
jangorecki Mar 17, 2020
8167ef2
manual and tests
jangorecki Mar 17, 2020
fd44346
tests roadmap
jangorecki Mar 17, 2020
247ed7a
export substitute2 and is.AsIs, minor manual improvements
jangorecki Mar 18, 2020
048e370
move and improve tests
jangorecki Mar 18, 2020
d60152d
manual typo
jangorecki Mar 18, 2020
fed2553
simplifies API, export less
jangorecki Mar 22, 2020
fd5582e
rework inner loop, tests, docs
jangorecki Mar 22, 2020
c1fbdfb
tests
jangorecki Mar 22, 2020
ffb97d4
man wording
jangorecki Mar 22, 2020
e6f2203
minor tests adjustments
jangorecki Mar 22, 2020
507e253
add use cases contributed by @renkun-ken
jangorecki Mar 22, 2020
e99ae00
solve known issues about get
jangorecki Mar 22, 2020
9e61025
improve test for renkun-ken use case
jangorecki Mar 22, 2020
b498263
minor fix for proper scoping
jangorecki Mar 22, 2020
b0094f7
more tests addressing existing issues
jangorecki Mar 22, 2020
9878dbf
add use case by #@tdeenes
jangorecki Mar 23, 2020
57514ed
missing env
jangorecki Mar 23, 2020
4c1001d
env test and man
jangorecki Mar 25, 2020
24f4a07
fix test num reference
jangorecki Mar 26, 2020
783f3a9
news entry and minor doc changes
jangorecki Mar 27, 2020
9a31604
test can now test itself
jangorecki Mar 29, 2020
5d34753
enlist, recursive enlisting
jangorecki Mar 31, 2020
28e240f
programming on data.table vignette
jangorecki Mar 31, 2020
d056afe
NEWS wording
jangorecki Mar 31, 2020
a42e3d2
fix rmd yaml index entry
jangorecki Apr 1, 2020
5c47c09
typos and wording
jangorecki Apr 8, 2020
e869026
empty input to substitute2
jangorecki May 27, 2020
09e0499
empty input to env more robust
jangorecki May 27, 2020
0ebef14
Merge branch 'master' into programming
jangorecki Jan 15, 2021
73ca1ca
Merge branch 'master' into programming
mattdowle May 9, 2021
c67d75f
move news item up
mattdowle May 9, 2021
d00f292
merge follow-up to pass programming.Rraw
mattdowle May 9, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .Rbuildignore
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
^.*\.Rproj$
^\.Rproj\.user$
^\.idea$
^\.libs$

^.*\.dll$

Expand Down
1 change: 1 addition & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ export(nafill)
export(setnafill)
export(.Last.updated)
export(fcoalesce)
export(substitute2)

S3method("[", data.table)
S3method("[<-", data.table)
Expand Down
31 changes: 31 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,37 @@ unit = "s")

14. Added support for `round()` and `trunc()` to extend functionality of `ITime`. `round()` and `trunc()` can be used with argument units: "hours" or "minutes". Thanks to @JensPederM for the suggestion and PR.

15. New interface for _programming on data.table_ has been added. It is built using base R `substitute`-like interface via new `env` argument to `[.data.table`. For details of substitution see new vignette *programming on data.table* and `?substitute2` manual.

```r
DT = data.table(x = 1:5, y = 5:1)

# parameters
in_col_name = "x"
fun = "sum"
fun_arg1 = "na.rm"
fun_arg1val = TRUE
out_col_name = "sum_x"

# parameterized query
#DT[, .(out_col_name = fun(in_col_name, fun_arg1=fun_arg1val))]

# desired query
DT[, .(sum_x = sum(x, na.rm=TRUE))]

# new interface
DT[, .(out_col_name = fun(in_col_name, fun_arg1=fun_arg1val)),
env = list(
in_col_name = "x",
fun = "sum",
fun_arg1 = "na.rm",
fun_arg1val = TRUE,
out_col_name = "sum_x"
)]
```

Addresses [#2655](https://github.com/Rdatatable/data.table/issues/2655) any many other linked issues. Thanks to numerous users for filling requests for a better flexibility in parameterizing data.table queries.

## BUG FIXES

1. A NULL timezone on POSIXct was interpreted by `as.IDate` and `as.ITime` as UTC rather than the session's default timezone (`tz=""`) , [#4085](https://github.com/Rdatatable/data.table/issues/4085).
Expand Down
37 changes: 32 additions & 5 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -125,7 +125,7 @@ replace_dot_alias = function(e) {
}
}

"[.data.table" = function (x, i, j, by, keyby, with=TRUE, nomatch=getOption("datatable.nomatch", NA), mult="all", roll=FALSE, rollends=if (roll=="nearest") c(TRUE,TRUE) else if (roll>=0) c(FALSE,TRUE) else c(TRUE,FALSE), which=FALSE, .SDcols, verbose=getOption("datatable.verbose"), allow.cartesian=getOption("datatable.allow.cartesian"), drop=NULL, on=NULL)
"[.data.table" = function (x, i, j, by, keyby, with=TRUE, nomatch=getOption("datatable.nomatch", NA), mult="all", roll=FALSE, rollends=if (roll=="nearest") c(TRUE,TRUE) else if (roll>=0) c(FALSE,TRUE) else c(TRUE,FALSE), which=FALSE, .SDcols, verbose=getOption("datatable.verbose"), allow.cartesian=getOption("datatable.allow.cartesian"), drop=NULL, on=NULL, env=NULL)
{
# ..selfcount <<- ..selfcount+1 # in dev, we check no self calls, each of which doubles overhead, or could
# test explicitly if the caller is [.data.table (even stronger test. TO DO.)
Expand All @@ -151,12 +151,26 @@ replace_dot_alias = function(e) {
if (!missing(keyby)) {
if (!missing(by)) stop("Provide either by= or keyby= but not both")
if (missing(j)) { warning("Ignoring keyby= because j= is not supplied"); keyby=NULL; }
by=bysub=substitute(keyby)
if (is.null(env)) by=bysub=substitute(keyby) else {
by = bysub = eval(substitute(
substitute2(.keyby, env),
list(.keyby = substitute(keyby))
))
if (verbose) cat("Argument 'by' after substitute: ", paste(deparse(bysub, width.cutoff=500L), collapse=" "), "\n", sep="")
}
keyby=TRUE
# Assign to 'by' so that by is no longer missing and we can proceed as if there were one by
} else {
if (!missing(by) && missing(j)) { warning("Ignoring by= because j= is not supplied"); by=NULL; }
by=bysub= if (missing(by)) NULL else substitute(by)
if (missing(by)) by=bysub=NULL else {
if (is.null(env)) by=bysub=substitute(by) else {
by = bysub = eval(substitute(
substitute2(.by, env),
list(.by = substitute(by))
))
if (verbose) cat("Argument 'by' after substitute: ", paste(deparse(bysub, width.cutoff=500L), collapse=" "), "\n", sep="")
}
}
keyby=FALSE
}
bynull = !missingby && is.null(by) #3530
Expand Down Expand Up @@ -213,7 +227,14 @@ replace_dot_alias = function(e) {
av = NULL
jsub = NULL
if (!missing(j)) {
jsub = replace_dot_alias(substitute(j))
if (is.null(env)) jsub = substitute(j) else {
jsub = eval(substitute(
substitute2(.j, env),
list(.j = substitute(j))
))
if (verbose) cat("Argument 'j' after substitute: ", paste(deparse(jsub, width.cutoff=500L), collapse=" "), "\n", sep="")
}
jsub = replace_dot_alias(jsub)
root = if (is.call(jsub)) as.character(jsub[[1L]])[1L] else ""
if (root == ":" ||
(root %chin% c("-","!") && jsub[[2L]] %iscall% '(' && jsub[[2L]][[2L]] %iscall% ':') ||
Expand Down Expand Up @@ -291,8 +312,14 @@ replace_dot_alias = function(e) {
dupdiff = function(x, y) x[!x %chin% y]

if (!missing(i)) {
if (is.null(env)) isub = substitute(i) else {
isub = eval(substitute(
substitute2(.i, env),
list(.i = substitute(i))
))
if (verbose) cat("Argument 'i' after substitute: ", paste(deparse(isub, width.cutoff=500L), collapse=" "), "\n", sep="")
}
xo = NULL
isub = substitute(i)
if (identical(isub, NA)) {
# only possibility *isub* can be NA (logical) is the symbol NA itself; i.e. DT[NA]
# replace NA in this case with NA_integer_ as that's almost surely what user intended to
Expand Down
76 changes: 76 additions & 0 deletions R/programming.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
is.AsIs = function(x) {
inherits(x, "AsIs")
}
rm.AsIs = function(x) {
cl = oldClass(x)
oldClass(x) = cl[cl!="AsIs"]
x
}
list2lang = function(x) {
if (!is.list(x))
stop("'x' must be a list")
if (is.AsIs(x))
return(rm.AsIs(x))
asis = vapply(x, is.AsIs, FALSE)
char = vapply(x, is.character, FALSE)
to.name = !asis & char
if (any(to.name)) { ## turns "my_name" character scalar into `my_name` symbol, for convenience
if (any(non.scalar.char <- vapply(x[to.name], length, 0L)!=1L)) {
stop("Character objects provided in the input are not scalar objects, if you need them as character vector rather than a name, then wrap each into 'I' call: ",
paste(names(non.scalar.char)[non.scalar.char], collapse=", "))
}
x[to.name] = lapply(x[to.name], as.name)
}
if (isTRUE(getOption("datatable.enlist", TRUE))) { ## recursively enlist for nested lists, see note section in substitute2 manual
islt = vapply(x, is.list, FALSE)
to.enlist = !asis & islt
if (any(to.enlist)) {
x[to.enlist] = lapply(x[to.enlist], enlist)
}
}
if (any(asis)) {
x[asis] = lapply(x[asis], rm.AsIs)
}
x
}
enlist = function(x) {
if (!is.list(x))
stop("'x' must be a list")
if (is.AsIs(x))
return(rm.AsIs(x))
as.call(c(quote(list), list2lang(x)))
}

substitute2 = function(expr, env) {
if (missing(env)) {
stop("'env' must not be missing")
} else if (is.null(env)) {
# null is fine, will be escaped few lines below
} else if (is.environment(env)) {
env = as.list(env, all.names=TRUE, sorted=TRUE)
jangorecki marked this conversation as resolved.
Show resolved Hide resolved
} else if (!is.list(env)) {
stop("'env' must be a list or an environment")
}
if (!length(env)) {
return(substitute(expr))
}
env.names = names(env)
if (is.null(env.names)) {
stop("'env' argument does not have names")
} else if (!all(nzchar(env.names))) {
stop("'env' argument has zero char names")
} else if (anyNA(env.names)) {
stop("'env' argument has NA names")
} else if (anyDuplicated(env.names)) {
stop("'env' argument has duplicated names")
}
# character to name/symbol, and list to list call
env = list2lang(env)
# R substitute
expr.sub = eval(substitute(
substitute(.expr, env),
env = list(.expr = substitute(expr))
))
# substitute call argument names
.Call(Csubstitute_call_arg_namesR, expr.sub, env)
}
11 changes: 6 additions & 5 deletions R/test.data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -160,7 +160,7 @@ test.data.table = function(script="tests.Rraw", verbose=FALSE, pkg=".", silent=F
timings = env$timings
DT = head(timings[-1L][order(-time)], 10L) # exclude id 1 as in dev that includes JIT
if ((x<-sum(timings[["nTest"]])) != ntest) {
warning("Timings count mismatch:",x,"vs",ntest) # nocov
warning("Timings count mismatch: ",x," vs ",ntest) # nocov
}
cat("10 longest running tests took ", as.integer(tt<-DT[, sum(time)]), "s (", as.integer(100*tt/(ss<-timings[,sum(time)])), "% of ", as.integer(ss), "s)\n", sep="")
print(DT, class=FALSE)
Expand Down Expand Up @@ -243,6 +243,7 @@ test = function(num,x,y=TRUE,error=NULL,warning=NULL,message=NULL,output=NULL,no
# iv) if warning is supplied, y is checked to equal x, and x should result in a warning message matching the pattern
# v) if output is supplied, x is evaluated and printed and the output is checked to match the pattern
# num just needs to be numeric and unique. We normally increment integers at the end, but inserts can be made using decimals e.g. 10,11,11.1,11.2,12,13,...
# num=0 to escape global failure tracking so we can test behaviour of test function itself: test(1.1, test(0, TRUE, FALSE), FALSE, output="1 element mismatch")
# Motivations:
# 1) we'd like to know all tests that fail not just stop at the first. This often helps by revealing a common feature across a set of
# failing tests
Expand All @@ -256,7 +257,7 @@ test = function(num,x,y=TRUE,error=NULL,warning=NULL,message=NULL,output=NULL,no
prevtest = get("prevtest", parent.frame())
nfail = get("nfail", parent.frame()) # to cater for both test.data.table() and stepping through tests in dev
whichfail = get("whichfail", parent.frame())
assign("ntest", get("ntest", parent.frame()) + 1L, parent.frame(), inherits=TRUE) # bump number of tests run
assign("ntest", get("ntest", parent.frame()) + if (num>0) 1L else 0L, parent.frame(), inherits=TRUE) # bump number of tests run
lasttime = get("lasttime", parent.frame())
timings = get("timings", parent.frame())
memtest = get("memtest", parent.frame())
Expand All @@ -265,7 +266,7 @@ test = function(num,x,y=TRUE,error=NULL,warning=NULL,message=NULL,output=NULL,no
foreign = get("foreign", parent.frame())
showProgress = get("showProgress", parent.frame())
time = nTest = NULL # to avoid 'no visible binding' note
on.exit( {
if (num>0) on.exit( {
now = proc.time()[3L]
took = now-lasttime # so that prep time between tests is attributed to the following test
assign("lasttime", now, parent.frame(), inherits=TRUE)
Expand Down Expand Up @@ -326,7 +327,7 @@ test = function(num,x,y=TRUE,error=NULL,warning=NULL,message=NULL,output=NULL,no
fwrite(mem, "memtest.csv", append=TRUE, verbose=FALSE) # nocov
}
fail = FALSE
if (.test.data.table) {
if (.test.data.table && num>0) {
if (num<prevtest+0.0000005) {
# nocov start
cat("Test id", numStr, "is not in increasing order\n")
Expand Down Expand Up @@ -439,7 +440,7 @@ test = function(num,x,y=TRUE,error=NULL,warning=NULL,message=NULL,output=NULL,no
}
# nocov end
}
if (fail && .test.data.table) {
if (fail && .test.data.table && num>0) {
# nocov start
assign("nfail", nfail+1L, parent.frame(), inherits=TRUE)
assign("whichfail", c(whichfail, numStr), parent.frame(), inherits=TRUE)
Expand Down
Loading