
Subset.i #4139

Closed
wants to merge 10 commits into from

Conversation

ColeMiller1
Contributor

Adds two new helper functions for use in i: filter.at(cols, logic, by, all.vars = TRUE) and top.n(n, wt, by, ties = FALSE). Closes #4133 and #3804. WIP

The helper functions assist end-users by making the by argument available in i, while also allowing developers to optimize for the most likely use cases. For example, depending on the arguments provided, top.n can behave like head(dt, n), use the optimized non-grouping form dt[order(-wt)[1:n]], or fall back to the performant grouped form dt[dt[order(-wt), .I[1:n], by = grp]$V1].
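For concreteness, the three idioms that top.n would dispatch to can be written out directly against iris (a sketch of the equivalences described above; the dispatch itself would be internal to the proposed helper):

```r
library(data.table)
dt <- as.data.table(iris)
n <- 3L

## no wt, no by: effectively head()
head(dt, n)

## wt, no by: order once and take the first n positions
dt[order(-Sepal.Length)[1:n]]

## wt and by: compute row numbers of the top n per group, then one subset
dt[dt[order(-Sepal.Length), .I[1:n], by = Species]$V1]
```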

See the filter.at.R and top.n.R test scripts for many use cases, which I would turn into a vignette if this is ultimately merged, but here are a few examples:

filter.at:

set.seed(123)
dt <- data.table(replicate(3, sample(c(TRUE, FALSE), 1e2, replace = TRUE)))
cols <- c('V1', 'V2', 'V3')

# to show that all calls give identical results, not to compare performance:
bench::mark(dt[filter.at(cols = TRUE, logic = x)],
            dt[filter.at(c('V1','V2','V3'), x)],
            dt[filter.at(patterns('V'), logic = x)],
            dt[filter.at(cols, x)],
            dt[filter.at(V1:V3, x)],
            dt[V1 & V2 & V3] #creates index with default options
            )

# see https://stackoverflow.com/questions/58570110/how-to-delete-rows-for-leading-and-trailing-nas-by-group-in-r
df1 <- data.frame(ID = rep(c("C1001", "C1008", "C1009", "C1012"), each = 17),
                  Year = rep(1996:2012, 4),
                  x1 = floor(runif(68, 20, 75)),
                  x2 = floor(runif(68, 1, 100)))

# introduce leading / trailing NAs
df1[cbind(c(1:5, 18:23, 35:42, 49:51, 66:68), rep(c(3, 4, 4, 3, 3), c(5, 6, 8, 3, 3)))] <- NA

# introduce "gap" NAs
set.seed(123)
df1$x1[rbinom(68, 1, 0.1) == 1] <- NA
df1$x2[rbinom(68, 1, 0.1) == 1] <- NA

setDT(df1)

## all the same result
df1[filter.at(c('x1','x2'), !(rleid(x) %in% c(1, max(rleid(x))) & is.na(x)), ID)]
df1[filter.at(x1:x2, !(rleid(x) %in% c(1, max(rleid(x))) & is.na(x)), ID)]
df1[filter.at(patterns('x'), !(rleid(x) %in% c(1, max(rleid(x))) & is.na(x)), ID)]

fx = function (x) {!(rleid(x) %in% c(1, max(rleid(x))) & is.na(x))}
df1[filter.at(x1:x2, fx, ID)]

top.n:

dt <- as.data.table(iris)
options(datatable.verbose=TRUE)

##dt[top.n(n, wt, by, ties = FALSE)]

## no wt
dt[top.n(3)] 
dt[top.n(-3)] 

dt[top.n(3, by = Species)]
dt[top.n(-3, by = Species)]
dt[top.n(3, by = Species, ties = TRUE)] #ties ignored

## wt
dt[top.n(3, Sepal.Length)]
dt[top.n(3, Sepal.Length, ties = TRUE)]

dt[top.n(3, Sepal.Length, Species)]
dt[top.n(3, Sepal.Length, Species, ties = TRUE)]

#0.5 GB groupby benchmark from h2o.ai
N = 10000000
K = 100
set.seed(108)

DT = data.table(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
  id4 = sample(K, N, TRUE),                          # large groups (int)
  id5 = sample(K, N, TRUE),                          # large groups (int)
  id6 = sample(N/K, N, TRUE),                        # small groups (int)
  v1 =  sample(5, N, TRUE),                          # int in range [1,5]
  v2 =  sample(5, N, TRUE),                          # int in range [1,5]
  v3 =  round(runif(N,max=100),4)                    # numeric e.g. 23.5749
)

bench::mark(
DT[order(-v3), .(largest2_v3 = head(v3, 2L)), by = id6],
DT[top.n(2L, v3, id6), .(id6, largest2_v3 = v3)]
)

# A tibble: 2 x 13
  expression                                                min median `itr/sec` mem_alloc
  <bch:expr>                                              <bch> <bch:>     <dbl> <bch:byt>
1 DT[order(-v3), .(largest2_v3 = head(v3, 2L)), by = id6] 3.55s  3.55s     0.282     157MB
2 DT[top.n(2L, v3, id6), .(id6, largest2_v3 = v3)]        1.63s  1.63s     0.615     162MB

@jangorecki jangorecki added the WIP label Dec 27, 2019
@jangorecki jangorecki self-requested a review December 27, 2019 06:20
# for anonymous functions vs. functions in the environment
is_fx = any(as.character(logic) == 'x') || length(as.character(logic)) > 1

# Predicates with rle and cumsum do not work with chaining i.e.
Member
@jangorecki jangorecki Dec 27, 2019

What if there is a user-defined function that behaves like rle or cumsum? How are we going to handle that?

Contributor Author

The is_fx flag is poorly named. Mainly, the optimization applies when the user supplies an anonymous function. A previously defined function is evaluated using Reduce. That is, with fx = function(x) cumsum(x) > 3, dt[filter.at(TRUE, fx)] would not be optimized in a for loop. However, dt[filter.at(TRUE, some_function(x))] would be optimized as long as some_function is not one of 'rle|cumsum|min|max|sum'. TO DO: rename is_fx to is_anon_fx for clarity, but I am interested in what you are working on before resolving.
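For context, the non-optimized Reduce path described here corresponds roughly to the following base-R pattern (a sketch only, not the PR's actual internals):

```r
library(data.table)

dt <- data.table(V1 = c(1, 5, 2, 8), V2 = c(2, 1, 9, 4))
fx <- function(x) cumsum(x) > 3

## apply the predefined function to each selected column, then
## AND the per-column logical vectors row-wise to get the i subset
keep <- Reduce(`&`, lapply(dt[, .(V1, V2)], fx))
dt[keep]
```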

tests/filter.at.R (outdated, resolved review thread)
R/data.table.R (outdated, resolved review thread)
R/data.table.R (outdated, resolved review thread)
if (getOption("datatable.optimize")>=1L) assign("order", forder, ienv)
isub = tryCatch(eval(.massagei(l[['i']]), x, ienv), error=function(e) .checkTypos(e, names_x))
} else if (is.null(l$by)) {
isub = do.call(`[.data.table`, l)
Member

It is generally a bad idea to call [.data.table from inside [.data.table. I recall there was a single(?) place where we already did that, and we wanted to get rid of it. If there is another way, then we should consider using that instead.

Contributor Author

line 2003 - i = x[i, which=TRUE]. I can untangle the calls without by - just need to include extra code to handle the .SDcols argument. I would need to dig very deeply to remove the call with the by but I will look into it. It would be easier if #852 were further along to make the various by and j checks more modular.

Contributor Author

@jangorecki I have taken this comment under consideration. I am starting to look into C methods for a quicker topn, which should assist me with filter.at. Plus, the former is an actual feature request with upvotes.

For example, using forderv can get head by group with:

dt = data.table(V1 = sample(50, 500, TRUE))
tmp = data.table:::forderv(dt[['V1']], sort = FALSE, retGrp = TRUE)
dt[tmp[attr(tmp, 'starts')]]

But obviously, we allocate a large integer vector just to immediately subset it. A new method that returns only the vector of rows actually used should be more performant.

Also, I appreciate your time and comments. I will likely close this in a couple of days: while it is speedy as is, I acknowledge it is somewhat hacky.

@jangorecki
Member

jangorecki commented Dec 27, 2019

Thanks for the PR, it looks interesting. I put some initial comments just from looking at the code. I like your test scripts; I do exactly the same in the initial development of new features. Removing the bench dependency from those scripts will help resolve the failing CI builds.
I have a WIP branch that optimizes some of the cases covered by filter.at; I will try to push it soon so we can work out the overlapping scope of the two branches.

@codecov

codecov bot commented Dec 27, 2019

Codecov Report

Merging #4139 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4139      +/-   ##
==========================================
+ Coverage   99.41%   99.42%   +<.01%     
==========================================
  Files          72       72              
  Lines       13909    13977      +68     
==========================================
+ Hits        13828    13896      +68     
  Misses         81       81
Impacted Files Coverage Δ
R/data.table.R 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb22a2c...f3082c2. Read the comment docs.

Member
@jangorecki jangorecki left a comment

I am not sure about those helper functions. They just make the API for efficient operations more user-friendly, but eventually, once we optimize the current API well, they will become redundant.


#' @param cols required; can accept all values that .SDcols accepts
#' @param logic required; a function or unquoted text that results in logic evaluation. The
#' unquoted text must include `x` as an argument.
Member

unquoted text must include x as an argument.

Why did you choose such an API? I haven't seen it anywhere in base R or the packages that I am using. IMO it is better to at least expect function(x) ...logic...; then x is a naturally expected object in the logic. A user may have an x variable defined in their code and expect it to be found in the current scope. This is how scoping in R should behave; the proposed API works against it.
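The scoping concern can be illustrated side by side (a hypothetical sketch; filter.at is the helper proposed by this PR and does not exist in released data.table, so the calls are shown as comments):

```r
x <- 10  # a user's own variable named `x` in the calling environment

## proposed API: the bare `x` in the expression is a placeholder for
## "the current column", silently shadowing the user's `x` above,
## which works against normal R lexical scoping
# dt[filter.at(cols, x > 5)]

## function-based API: `x` is an explicit formal argument, so normal
## scoping applies and there is no ambiguity with the caller's `x`
# dt[filter.at(cols, function(x) x > 5)]
```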

Contributor Author

purrr uses the ~ .x shorthand; I admit that I may have taken it to the extreme.

Member
@jangorecki jangorecki Dec 31, 2019

The purrr approach is counterintuitive, because ~ has a quite different meaning in base R. And yes, your approach goes even further...

Development

Successfully merging this pull request may close these issues.

Filter Helper Functions
3 participants