Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Having arg #4412

Draft
wants to merge 17 commits into
base: master
Choose a base branch
from
Draft

Having arg #4412

wants to merge 17 commits into from

Conversation

ColeMiller1
Copy link
Contributor

@ColeMiller1 ColeMiller1 commented May 2, 2020

Too early to say it addresses #788. Very much WIP. I am hoping for general feedback from @jangorecki before continuing this path.

Per the initial FR, this includes a new having argument that requires each group to return a logical vector of length one. Right now only gForce functions and primitive functions are allowed - I can work on PRs for any, all, which.max, and which.min gforce funs which would be helpful for this.

New vecseq_having

The subsetting workhorse is vecseq_having. The new function returns an integer vector with additional attributes if retGrps is true.

starts = c(1L, 3L, 6L)
lens = c(2L, 3L, 1L)
o = c(1L, 4L, 2L, 5L, 6L, 3L)

.Call(data.table:::Cvecseq_having, starts, lens, having = c(FALSE, TRUE, TRUE), retGrpArg = TRUE, o__ = o)
## [1] 2 5 6 3
## attr(,"starts")
## [1] 1 4
## attr(,"grplen")
## [1] 3 1
## attr(,"maxgrpn")
## [1] 3

.Call(data.table:::Cvecseq_having, starts, lens, having = c(FALSE, TRUE, TRUE), retGrpArg = FALSE, o__ = o)
## [1] 2 5 6 3

New recursive parser

Current GForce optimizations go one deep. That is, mean(x) will be optimized while mean(x == 2L) would not be. To account for this, a new function evaluates an expression to determine if it is a gfun, is.primitive, a name and whether it exists inside or outside of the environment. This allows for mean(x ==2L) to be optimized as well as mean(x) > 3 & .N > 5L

Performance

library(data.table)

n = 1e5
grps = 1e4
set.seed(123L)
dt = data.table(x = sample(grps, n, TRUE), y = runif(n))

invisible(dt[, lapply(.SD, sum), by = x]) ##warm-up

setDTthreads(1L)

bench::mark(
  dt[, .SD, by = x, having = .N > 5L],
  dt[dt[, .I[.N > 5L], by = x]$V1],
  dt[, if (.N > 5L) .SD, by = x],
  dt[, .SD[.N > 5L], by = x],
  time_unit = 's'
)
##   expression                              min  median `itr/sec` mem_alloc
##  <bch:expr>                            <dbl>   <dbl>     <dbl> <bch:byt>
## 1 dt[, .SD, by = x, having = .N > 5L] 0.00487 0.00536   177.       2.15MB
## 2 dt[dt[, .I[.N > 5L], by = x]$V1]    0.0146  0.0154     62.4      3.41MB
## 3 dt[, if (.N > 5L) .SD, by = x]      0.302   0.302       3.31     2.89MB
## 4 dt[, .SD[.N > 5L], by = x]          1.89    1.89        0.529   83.91MB

setDTthreads(2L)

##   expression                              min  median `itr/sec` mem_alloc
##   <bch:expr>                            <dbl>   <dbl>     <dbl> <bch:byt>
## 1 dt[, .SD, by = x, having = .N > 5L] 0.00507 0.00560   144.       2.15MB
## 2 dt[dt[, .I[.N > 5L], by = x]$V1]    0.0147  0.0159     45.7      3.25MB
## 3 dt[, if (.N > 5L) .SD, by = x]      0.655   0.655       1.53     2.89MB
## 4 dt[, .SD[.N > 5L], by = x]          2.15    2.15        0.465    83.9MB

bench::mark(
  new_hav = dt[,
               j = .SD,
               by = x,
               having = .N < 2L | sum(y) > 11 | median(y) < 0.7
               ],
  use_I = dt[dt[, .I[.N < 2L | sum(y) > 11 | median(y) < 0.7], by = x]$V1],
  time_unit = 's'
)

##   expression    min median `itr/sec` mem_alloc
##   <bch:expr>  <dbl>  <dbl>     <dbl> <bch:byt>
## 1 new_hav    0.0102 0.0112    81.4      6.97MB
## 2 use_I      1.21   1.21       0.829    3.21MB

To do:

  • Fix irows subset
  • Fix by cols for correct order and any ad hoc columns
  • Allow for more than just .SD in j
  • Allow more functions to be evaluated
  • Create method to allow for non-grouping evaluations such as rleid(x) < 5 which would evaluate to a logical vector equal to the number of rows in the data.table.
  • Documentation
  • Tests

@ColeMiller1 ColeMiller1 added the WIP label May 2, 2020
@codecov
Copy link

codecov bot commented May 2, 2020

Codecov Report

Merging #4412 into master will decrease coverage by 0.11%.
The diff coverage is 87.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4412      +/-   ##
==========================================
- Coverage   99.61%   99.49%   -0.12%     
==========================================
  Files          72       72              
  Lines       13917    14047     +130     
==========================================
+ Hits        13863    13976     +113     
- Misses         54       71      +17     
Impacted Files Coverage Δ
src/vecseq.c 88.65% <84.72%> (-11.35%) ⬇️
R/data.table.R 99.73% <91.07%> (-0.27%) ⬇️
src/init.c 100.00% <100.00%> (ø)
src/assign.c 99.84% <0.00%> (-0.16%) ⬇️
R/foverlaps.R 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dd7609e...11c1fd8. Read the comment docs.

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks promising but I would wait for closing API discussions better before implementing

R/data.table.R Outdated
Comment on lines 3185 to 3188
if (is.atomic(e) || exists(as.character(e), env)) {
ans = e
} else {
ans = eval(e, env)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please put comments of an example e in both cases

src/vecseq.c Outdated Show resolved Hide resolved
@ColeMiller1
Copy link
Contributor Author

@jangorecki The byval would be best subset in C for this having discussion. The byval is a list. Are there any good routes for subsetting a list in C?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants