
Subset.i #4139

Closed
wants to merge 10 commits into from

Conversation

ColeMiller1
Contributor

Adds two new helper functions for use in i: filter.at(cols, logic, by, all.vars = TRUE) and top.n(n, wt, by, ties = FALSE). Closes #4133 and #3804. WIP

The helper functions assist end-users by making the by argument available in i, while also allowing developers to optimize for the most likely use cases. For example, depending on the arguments provided, top.n can behave like head(dt, n), use the optimized non-grouping form dt[order(-wt)[1:n]], or fall back to the performant grouped form dt[dt[order(-wt), .I[1:n], by = grp]$V1].
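For concreteness, the three idioms that top.n would dispatch to can be written out directly against iris (a sketch of the equivalences described above; the dispatch itself would be internal to the proposed helper):

```r
library(data.table)
dt <- as.data.table(iris)
n <- 3L

## no wt, no by: effectively head()
head(dt, n)

## wt, no by: order once and take the first n positions
dt[order(-Sepal.Length)[1:n]]

## wt and by: compute row numbers of the top n per group, then one subset
dt[dt[order(-Sepal.Length), .I[1:n], by = Species]$V1]
```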

See the filter.at.R and top.n.R test scripts for many use cases, which I would turn into a vignette if this is ultimately merged, but here are a few examples:

filter.at:

set.seed(123)
dt <- data.table(replicate(3, sample(c(TRUE, FALSE), 1e2, replace = TRUE)))
cols <- c('V1', 'V2', 'V3')

# to show that all calls give identical results, not to compare performance:
bench::mark(dt[filter.at(cols = TRUE, logic = x)],
            dt[filter.at(c('V1','V2','V3'), x)],
            dt[filter.at(patterns('V'), logic = x)],
            dt[filter.at(cols, x)],
            dt[filter.at(V1:V3, x)],
            dt[V1 & V2 & V3] #creates index with default options
            )

# see https://stackoverflow.com/questions/58570110/how-to-delete-rows-for-leading-and-trailing-nas-by-group-in-r
df1 <- data.frame(ID = rep(c("C1001", "C1008", "C1009", "C1012"), each = 17),
                  Year = rep(1996:2012, 4),
                  x1 = floor(runif(68, 20, 75)),
                  x2 = floor(runif(68, 1, 100)))

# introduce leading / trailing NAs
df1[cbind(c(1:5, 18:23, 35:42, 49:51, 66:68), rep(c(3, 4, 4, 3, 3), c(5, 6, 8, 3, 3)))] <- NA

# introduce "gap" NAs
set.seed(123)
df1$x1[rbinom(68, 1, 0.1) == 1] <- NA
df1$x2[rbinom(68, 1, 0.1) == 1] <- NA

setDT(df1)

## all the same result
df1[filter.at(c('x1','x2'), !(rleid(x) %in% c(1, max(rleid(x))) & is.na(x)), ID)]
df1[filter.at(x1:x2, !(rleid(x) %in% c(1, max(rleid(x))) & is.na(x)), ID)]
df1[filter.at(patterns('x'), !(rleid(x) %in% c(1, max(rleid(x))) & is.na(x)), ID)]

fx = function (x) {!(rleid(x) %in% c(1, max(rleid(x))) & is.na(x))}
df1[filter.at(x1:x2, fx, ID)]

top.n:

dt <- as.data.table(iris)
options(datatable.verbose=TRUE)

##dt[top.n(n, wt, by, ties = FALSE)]

## no wt
dt[top.n(3)] 
dt[top.n(-3)] 

dt[top.n(3, by = Species)]
dt[top.n(-3, by = Species)]
dt[top.n(3, by = Species, ties = TRUE)] #ties ignored

## wt
dt[top.n(3, Sepal.Length)]
dt[top.n(3, Sepal.Length, ties = TRUE)]

dt[top.n(3, Sepal.Length, Species)]
dt[top.n(3, Sepal.Length, Species, ties = TRUE)]

#0.5 GB groupby benchmark from h2o.ai
N = 10000000
K = 100
set.seed(108)

DT = data.table(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
  id4 = sample(K, N, TRUE),                          # large groups (int)
  id5 = sample(K, N, TRUE),                          # large groups (int)
  id6 = sample(N/K, N, TRUE),                        # small groups (int)
  v1 =  sample(5, N, TRUE),                          # int in range [1,5]
  v2 =  sample(5, N, TRUE),                          # int in range [1,5]
  v3 =  round(runif(N,max=100),4)                    # numeric e.g. 23.5749
)

bench::mark(
DT[order(-v3), .(largest2_v3 = head(v3, 2L)), by = id6],
DT[top.n(2L, v3, id6), .(id6, largest2_v3 = v3)]
)

# A tibble: 2 x 13
  expression                                                min median `itr/sec` mem_alloc
  <bch:expr>                                              <bch> <bch:>     <dbl> <bch:byt>
1 DT[order(-v3), .(largest2_v3 = head(v3, 2L)), by = id6] 3.55s  3.55s     0.282     157MB
2 DT[top.n(2L, v3, id6), .(id6, largest2_v3 = v3)]        1.63s  1.63s     0.615     162MB

@jangorecki jangorecki added the WIP label Dec 27, 2019
@jangorecki jangorecki self-requested a review December 27, 2019 06:20
# for anonymous functions vs. functions in the environment
is_fx = any(as.character(logic) == 'x') || length(as.character(logic)) > 1

# Predicates with rle and cumsum do not work with chaining i.e.
Member
@jangorecki jangorecki Dec 27, 2019

What if there is a user-defined function that behaves like rle or cumsum? How are we going to handle that?

Contributor Author

The is_fx flag is poorly named. Mainly, the optimization applies when the user supplies an anonymous function. A previously defined function is evaluated using Reduce. That is, with fx = function(x) cumsum(x) > 3, dt[filter.at(TRUE, fx)] would not be optimized in a for loop. However, dt[filter.at(TRUE, some_function(x))] would be optimized as long as some_function is not one of 'rle|cumsum|min|max|sum'. TO DO: rename is_fx to is_anon_fx for clarity, but I am interested in what you are working on before resolving.
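For context, the non-optimized Reduce path described here corresponds roughly to the following base-R pattern (a sketch only, not the PR's actual internals):

```r
library(data.table)

dt <- data.table(V1 = c(1, 5, 2, 8), V2 = c(2, 1, 9, 4))
fx <- function(x) cumsum(x) > 3

## apply the predefined function to each selected column, then
## AND the per-column logical vectors row-wise to get the i subset
keep <- Reduce(`&`, lapply(dt[, .(V1, V2)], fx))
dt[keep]
```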

tests/filter.at.R (outdated, resolved review thread)
R/data.table.R (outdated, resolved review thread)
R/data.table.R (outdated, resolved review thread)
if (getOption("datatable.optimize")>=1L) assign("order", forder, ienv)
isub = tryCatch(eval(.massagei(l[['i']]), x, ienv), error=function(e) .checkTypos(e, names_x))
} else if (is.null(l$by)) {
isub = do.call(`[.data.table`, l)
Member

It is generally a bad idea to call [.data.table from inside [.data.table. I recall there was a single(?) place where we already did that, and we wanted to get rid of it. If there is another way, then we should consider using that instead.

Contributor Author

line 2003 - i = x[i, which=TRUE]. I can untangle the calls without by - just need to include extra code to handle the .SDcols argument. I would need to dig very deeply to remove the call with the by but I will look into it. It would be easier if #852 were further along to make the various by and j checks more modular.

Contributor Author

@jangorecki I have taken this comment under consideration. I am starting to look into C methods for a quicker topn, which should assist me with filter.at. Plus, the former is an actual feature request with upvotes.

For example, using forderv can get head by group with:

dt = data.table(V1 = sample(50, 500, TRUE))
tmp = data.table:::forderv(dt[['V1']], sort = FALSE, retGrp = TRUE)
dt[tmp[attr(tmp, 'starts')]]

But obviously, we allocate a large integer vector just to immediately subset it. A new method that returns only the vector of rows actually used should be more performant.

Also, I appreciate your time and comments. I will likely close this in a couple of days: while it is speedy as is, I acknowledge it is somewhat hacky.

@jangorecki
Member

jangorecki commented Dec 27, 2019

Thanks for the PR, it looks interesting. I put some initial comments just from looking at the code. I like your test scripts; I do exactly the same in the initial development of new features. Removing the bench dependency from those scripts will help resolve the failing CI builds.
I have a WIP branch that optimizes some of the cases covered by filter.at; I will try to push it soon so we can work out the overlapping scope of the two branches.

@codecov

codecov bot commented Dec 27, 2019

Codecov Report

Merging #4139 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4139      +/-   ##
==========================================
+ Coverage   99.41%   99.42%   +<.01%     
==========================================
  Files          72       72              
  Lines       13909    13977      +68     
==========================================
+ Hits        13828    13896      +68     
  Misses         81       81
Impacted Files Coverage Δ
R/data.table.R 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb22a2c...f3082c2. Read the comment docs.

Member
@jangorecki jangorecki left a comment

I am not sure about those helper functions. They just make the API for efficient operations more user-friendly, but eventually, once we optimize the current API well, they will become redundant.


#' @param cols required; can accept all values that .SDcols accepts
#' @param logic required; a function or unquoted text that results in logic evaluation. The
#' unquoted text must include `x` as an argument.
Member

unquoted text must include x as an argument.

Why did you choose such an API? I haven't seen it anywhere in base R or the packages that I am using. IMO it is better to at least expect function(x) ...logic...; then x is a naturally expected object in the logic. A user may have an x variable defined in their code and expect it to be found in the current scope. This is how scoping in R should behave; the proposed API works against it.
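The scoping concern can be illustrated side by side (a hypothetical sketch; filter.at is the helper proposed by this PR and does not exist in released data.table, so the calls are shown as comments):

```r
x <- 10  # a user's own variable named `x` in the calling environment

## proposed API: the bare `x` in the expression is a placeholder for
## "the current column", silently shadowing the user's `x` above,
## which works against normal R lexical scoping
# dt[filter.at(cols, x > 5)]

## function-based API: `x` is an explicit formal argument, so normal
## scoping applies and there is no ambiguity with the caller's `x`
# dt[filter.at(cols, function(x) x > 5)]
```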

Contributor Author

purrr uses the ~ .x shorthand; I admit that I may have taken it to the extreme.

Member
@jangorecki jangorecki Dec 31, 2019

The purrr approach is counterintuitive, because ~ has a quite different meaning in base R. And yes, your approach goes even further...

Development

Successfully merging this pull request may close these issues.

Filter Helper Functions
3 participants