Having arg #4412

ColeMiller1 · 2020-05-02T23:11:43Z

Too early to say it addresses #788. Very much WIP. I am hoping for general feedback from @jangorecki before continuing this path.

Per the initial FR, this adds a new `having` argument that requires each group to return a logical vector of length one. Right now only GForce functions and primitive functions are allowed. I can work on PRs for GForce versions of `any`, `all`, `which.max`, and `which.min`, which would be helpful here.

New `vecseq_having`

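The subsetting workhorse is `vecseq_having`. It returns an integer vector of row indices for the groups that pass `having`; if `retGrps` is true, the result also carries the attributes `starts`, `grplen`, and `maxgrpn`, which describe the retained groups.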

starts = c(1L, 3L, 6L)
lens = c(2L, 3L, 1L)
o = c(1L, 4L, 2L, 5L, 6L, 3L)

.Call(data.table:::Cvecseq_having, starts, lens, having = c(FALSE, TRUE, TRUE), retGrpArg = TRUE, o__ = o)
## [1] 2 5 6 3
## attr(,"starts")
## [1] 1 4
## attr(,"grplen")
## [1] 3 1
## attr(,"maxgrpn")
## [1] 3

.Call(data.table:::Cvecseq_having, starts, lens, having = c(FALSE, TRUE, TRUE), retGrpArg = FALSE, o__ = o)
## [1] 2 5 6 3
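
To make the output above easier to follow, here is a pure-R reference sketch of what `vecseq_having` appears to compute, inferred only from the two calls shown. The function name `vecseq_having_ref` and its argument names are hypothetical, and the real C implementation may handle edge cases (an empty `o`, zero-length groups) differently.

```r
# Hypothetical pure-R reference for the behaviour shown above; not the PR's C code.
vecseq_having_ref = function(starts, lens, having, ret_grps = FALSE, o = NULL) {
  keep = which(having)                       # groups that pass the filter
  # positions (in grouped order) covered by each retained group
  idx = unlist(lapply(keep, function(g) seq.int(starts[g], length.out = lens[g])),
               use.names = FALSE)
  if (is.null(idx)) idx = integer()
  # assumption: when an ordering vector is supplied, map positions back to rows
  ans = if (length(o)) o[idx] else idx
  if (ret_grps) {
    kept_lens = lens[keep]
    # new group starts are relative to the filtered result
    attr(ans, "starts")  = as.integer(cumsum(c(1L, kept_lens))[seq_along(kept_lens)])
    attr(ans, "grplen")  = kept_lens
    attr(ans, "maxgrpn") = if (length(kept_lens)) max(kept_lens) else 0L
  }
  ans
}

starts = c(1L, 3L, 6L)
lens   = c(2L, 3L, 1L)
o      = c(1L, 4L, 2L, 5L, 6L, 3L)
res    = vecseq_having_ref(starts, lens, c(FALSE, TRUE, TRUE), ret_grps = TRUE, o = o)
```

Group 1 (positions 1 to 2 of `o`) is dropped, so the result holds the rows of groups 2 and 3, and `starts` is recomputed relative to the filtered vector.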

New recursive parser

Current GForce optimizations go only one level deep: `mean(x)` is optimized, while `mean(x == 2L)` is not. To account for this, a new function walks an expression recursively and checks whether each element is a gfun, a primitive (`is.primitive`), or a name, and whether that name exists inside or outside of the environment. This allows `mean(x == 2L)` to be optimized, as well as `mean(x) > 3 & .N > 5L`.
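
The idea of the recursive check can be sketched in plain R as follows. This is a simplified illustration, not the PR's parser: `is_optimizable` and `hypothetical_gfuns` are made-up names, the whitelist is an assumption, and names are accepted only if they are columns or `.N` (the PR also considers names found outside the table's environment).

```r
# Assumed whitelist of GForce-style functions, for illustration only.
hypothetical_gfuns = c("sum", "mean", "median", "min", "max", "prod",
                       "var", "sd", "first", "last", "head", "tail")

# Recursively walk an unevaluated expression and report whether every piece
# is a constant, a known column, or a call to a whitelisted/primitive function.
is_optimizable = function(e, cols, gfuns = hypothetical_gfuns) {
  if (is.atomic(e)) return(TRUE)                       # constants such as 2L
  if (is.name(e)) return(as.character(e) %in% c(cols, ".N"))
  if (is.call(e)) {
    fn = e[[1L]]
    if (!is.name(fn)) return(FALSE)                    # e.g. (function(x) x)(y)
    fname = as.character(fn)
    ok_fn = fname %in% gfuns ||
      (exists(fname, mode = "function") && is.primitive(get(fname, mode = "function")))
    args_ok = vapply(as.list(e)[-1L], is_optimizable, logical(1L),
                     cols = cols, gfuns = gfuns)
    return(ok_fn && all(args_ok))
  }
  FALSE
}

nested   = is_optimizable(quote(mean(x == 2L)), cols = "x")
compound = is_optimizable(quote(mean(x) > 3 & .N > 5L), cols = "x")
rejected = is_optimizable(quote(paste(x)), cols = "x")
```

Here `nested` and `compound` are `TRUE` because `==`, `>`, and `&` are primitives and `mean` is on the whitelist, while `rejected` is `FALSE` because `paste` is neither.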

Performance

library(data.table)

n = 1e5
grps = 1e4
set.seed(123L)
dt = data.table(x = sample(grps, n, TRUE), y = runif(n))

invisible(dt[, lapply(.SD, sum), by = x]) ##warm-up

setDTthreads(1L)

bench::mark(
  dt[, .SD, by = x, having = .N > 5L],
  dt[dt[, .I[.N > 5L], by = x]$V1],
  dt[, if (.N > 5L) .SD, by = x],
  dt[, .SD[.N > 5L], by = x],
  time_unit = 's'
)
##   expression                              min  median `itr/sec` mem_alloc
##   <bch:expr>                            <dbl>   <dbl>     <dbl> <bch:byt>
## 1 dt[, .SD, by = x, having = .N > 5L] 0.00487 0.00536   177.       2.15MB
## 2 dt[dt[, .I[.N > 5L], by = x]$V1]    0.0146  0.0154     62.4      3.41MB
## 3 dt[, if (.N > 5L) .SD, by = x]      0.302   0.302       3.31     2.89MB
## 4 dt[, .SD[.N > 5L], by = x]          1.89    1.89        0.529   83.91MB

setDTthreads(2L) ## then re-run the same bench::mark() call as above

##   expression                              min  median `itr/sec` mem_alloc
##   <bch:expr>                            <dbl>   <dbl>     <dbl> <bch:byt>
## 1 dt[, .SD, by = x, having = .N > 5L] 0.00507 0.00560   144.       2.15MB
## 2 dt[dt[, .I[.N > 5L], by = x]$V1]    0.0147  0.0159     45.7      3.25MB
## 3 dt[, if (.N > 5L) .SD, by = x]      0.655   0.655       1.53     2.89MB
## 4 dt[, .SD[.N > 5L], by = x]          2.15    2.15        0.465    83.9MB
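
For readers comparing the benchmarked expressions, the following small check illustrates that the three idioms available without this PR select the same rows in the same order. It uses a made-up six-row table and a threshold of `.N > 1L` chosen only for illustration; the proposed `having` form is mentioned in a comment rather than run, since it exists only on this branch.

```r
library(data.table)

# Toy data: group 1 has 3 rows, group 2 has 2 rows, group 3 has 1 row.
dt_small = data.table(x = c(1L, 2L, 1L, 2L, 1L, 3L),
                      y = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

via_I  = dt_small[dt_small[, .I[.N > 1L], by = x]$V1]  # row indices, then subset
via_if = dt_small[, if (.N > 1L) .SD, by = x]          # NULL for failing groups
via_SD = dt_small[, .SD[.N > 1L], by = x]              # zero rows for failing groups

# Proposed in this PR (not run here): dt_small[, .SD, by = x, having = .N > 1L]
```

All three keep groups 1 and 2 and drop group 3; they differ only in how much work is done per group, which is what the timings above measure.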

bench::mark(
  new_hav = dt[,
               j = .SD,
               by = x,
               having = .N < 2L | sum(y) > 11 | median(y) < 0.7
               ],
  use_I = dt[dt[, .I[.N < 2L | sum(y) > 11 | median(y) < 0.7], by = x]$V1],
  time_unit = 's'
)

##   expression    min median `itr/sec` mem_alloc
##   <bch:expr>  <dbl>  <dbl>     <dbl> <bch:byt>
## 1 new_hav    0.0102 0.0112    81.4      6.97MB
## 2 use_I      1.21   1.21       0.829    3.21MB

To do:

Fix irows subset
Fix by cols for correct order and any ad hoc columns
Allow for more than just .SD in j
Allow more functions to be evaluated
Create method to allow for non-grouping evaluations such as rleid(x) < 5 which would evaluate to a logical vector equal to the number of rows in the data.table.
Documentation
Tests

codecov · 2020-05-02T23:43:29Z

Codecov Report

Merging #4412 into master will decrease coverage by 0.11%.
The diff coverage is 87.59%.

@@            Coverage Diff             @@
##           master    #4412      +/-   ##
==========================================
- Coverage   99.61%   99.49%   -0.12%     
==========================================
  Files          72       72              
  Lines       13917    14047     +130     
==========================================
+ Hits        13863    13976     +113     
- Misses         54       71      +17

Impacted Files	Coverage Δ
src/vecseq.c	`88.65% <84.72%> (-11.35%)`	⬇️
R/data.table.R	`99.73% <91.07%> (-0.27%)`	⬇️
src/init.c	`100.00% <100.00%> (ø)`
src/assign.c	`99.84% <0.00%> (-0.16%)`	⬇️
R/foverlaps.R	`100.00% <0.00%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update dd7609e...11c1fd8.

jangorecki

Looks promising, but I would wait for the API discussions to be settled before implementing further.

jangorecki · 2020-05-03T06:24:54Z

R/data.table.R

+    if (is.atomic(e) || exists(as.character(e), env)) {
+      ans = e
+    } else {
+      ans = eval(e, env)


Please add a comment giving an example of `e` for each of the two cases.

src/vecseq.c

Pretty stable.

ColeMiller1 · 2020-05-20T01:15:39Z

@jangorecki For `having`, `byval` would be best subset in C. Since `byval` is a list, are there any good routes for subsetting a list in C?

ColeMiller1 added 4 commits May 2, 2020 17:25

Update data.table.R

d8a3230

Add Having vecseq

09f6d1f

Update init.c

024c579

Update data.table.h

4ef8209

ColeMiller1 added the WIP label May 2, 2020

ColeMiller1 added 2 commits May 2, 2020 19:28

Update data.table.Rd

b8ff31e

Update data.table.Rd

64290ee

jangorecki reviewed May 3, 2020

ColeMiller1 mentioned this pull request May 3, 2020

forderv grouping order #4418

Closed

ColeMiller1 added 7 commits May 4, 2020 08:26

Update data.table.R

5be2170

Update vecseq.c

9f6266c

More efficient combining of byvals & sdvars

c1df082

Pretty stable.

Skip recalculating f__ and lens__ when all grps TRUE

ca15c61

Initial tests

fd492f2

Allow selection of columns in j with having

d3ef0f2

More tests and updated errors.

bb3c58d

ColeMiller1 added 4 commits May 20, 2020 21:12

allow i subset to be used

3f6f000

include o__ in c; try to make more efficient

8e5f40f

Update tests.Rraw

f74adfd

Update tests.Rraw

11c1fd8

ColeMiller1 mentioned this pull request Feb 2, 2021

.SD[i] could be optimized better and more generally #4886

Open

ColeMiller1 mentioned this pull request Apr 16, 2023

GForce optimisation could be more smart #3815

Open

MichaelChirico removed the WIP label Feb 19, 2024

MichaelChirico marked this pull request as draft February 19, 2024 04:27
