Speedup DT[,.N,by=]. Currently evals j per group. #1251

mattdowle · 2015-08-07T00:36:24Z

require(data.table)
DT = data.table(a=1:1e8, b=1:2)
DT[,.N,by=a,verbose=TRUE]
# Detected that j uses these columns: <none> 
# Finding groups (bysameorder=FALSE) ... done in 0.757secs. bysameorder=TRUE and o__ is length 0
# Optimization is on but left j unchanged (single plain symbol): '.N'
# Starting dogroups ... 
#   memcpy contiguous groups took 12.143s for 100000000 groups
#   eval(j) took 13.800s for 100000000 calls
# done dogroups in 55.253 secs
#                a N
# 1e+00:         1 1
# 2e+00:         2 1
# 3e+00:         3 1
# 4e+00:         4 1
# 5e+00:         5 1
#    ---            
# 1e+08:  99999996 1
# 1e+08:  99999997 1
# 1e+08:  99999998 1
# 1e+08:  99999999 1
# 1e+08: 100000000 1
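
To illustrate why evaluating j once per group is wasteful for a plain count, here is a minimal base R sketch. It is only an analogy under stated assumptions, not data.table's internal code: `split()` plus `vapply()` stands in for the per-group evaluation, and `match()` plus `tabulate()` stands in for getting all group sizes in one vectorised pass.

```r
# Illustrative only: base R stand-ins, NOT data.table internals.
a <- c(3L, 1L, 2L, 3L, 3L, 1L)

# Per-group evaluation: one function call for every group,
# loosely analogous to eval(j) being called once per group.
per_group <- vapply(split(a, a), length, integer(1))

# Single pass: map each value to a group index, then count
# how often each index occurs. No per-group call is needed.
groups   <- sort(unique(a))
one_pass <- tabulate(match(a, groups), nbins = length(groups))

# Both approaches give the same group sizes.
stopifnot(identical(unname(per_group), one_pass))
```

With 1e8 single-row groups, as in the example above, the per-group route means 1e8 separate calls, which is where the time goes.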

arunsrinivasan · 2015-08-11T15:26:32Z

Note that the only case that isn't optimised is when j is just .N. When .N is combined with other GForce-able functions, it works fine (I remember implementing that). I'd missed this case...

require(data.table)
dt = data.table(x=rep(1:3, each=2), y=1:6)
options(datatable.verbose=TRUE)
dt[, .(.N, mean(y)), by=x]
# Detected that j uses these columns: y 
# Finding groups (bysameorder=FALSE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
# lapply optimization is on, j unchanged as 'list(.N, mean(y))'
# GForce optimized j to 'list(.N, gmean(y))'
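
For anyone who wants to check this without turning on the global option, a small sketch using the per-call `verbose=` argument (the same argument used in the first comment). The column names `n` and `avg` are made up here for illustration, and the exact wording of the verbose messages may differ between data.table versions.

```r
library(data.table)

dt <- data.table(x = rep(1:3, each = 2), y = 1:6)

# verbose=TRUE on the call itself prints the optimisation messages
# for this query only, leaving options(datatable.verbose) untouched.
# 'n' and 'avg' are illustrative column names, not from the issue.
res <- dt[, .(n = .N, avg = mean(y)), by = x, verbose = TRUE]

# Each of the three groups has two rows; group means are 1.5, 3.5, 5.5.
stopifnot(identical(res$n, c(2L, 2L, 2L)))
stopifnot(isTRUE(all.equal(res$avg, c(1.5, 3.5, 5.5))))
```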

arunsrinivasan · 2015-09-26T13:49:04Z

Now I get:

require(data.table)
DT = data.table(a=1:1e8, b=1:2)
options(datatable.optimize=1L) # no GForce
system.time(DT[, .(.N), by=a])
#    user  system elapsed
#  25.598   0.801  26.882
system.time(DT[, .N, by=a])
#    user  system elapsed
#  15.395   1.056  16.832

options(datatable.optimize=Inf) # yes GForce
system.time(DT[, .(.N), by=a])
#    user  system elapsed
#   1.620   0.675   2.306
system.time(DT[, .N, by=a])
#    user  system elapsed
#   1.583   0.673   2.259
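
A scaled-down sketch of the same comparison, for anyone who wants to try it without 1e8 rows. It assumes 1e5 rows is enough to see a difference. It also restores the previous `datatable.optimize` value afterwards and checks that both settings return the same counts. No timings are quoted since they depend on machine and version.

```r
library(data.table)

# Much smaller than the 1e8 rows above, so it runs quickly.
DT <- data.table(a = 1:1e5, b = 1:2)

# Remember the current setting so it can be restored afterwards.
old <- options(datatable.optimize = 1L)   # no GForce
t_off <- system.time(r_off <- DT[, .N, by = a])

options(datatable.optimize = Inf)         # GForce allowed
t_on <- system.time(r_on <- DT[, .N, by = a])

options(old)                              # restore the previous setting

# The optimisation level should change only the speed, not the result.
stopifnot(identical(r_off$a, r_on$a), identical(r_off$N, r_on$N))

print(rbind(optimize_1 = t_off, optimize_Inf = t_on))
```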

mattdowle added this to the v1.9.8 milestone Aug 7, 2015

arunsrinivasan added the enhancement label Aug 11, 2015

arunsrinivasan self-assigned this Sep 26, 2015

arunsrinivasan closed this as completed in cd756e2 Sep 26, 2015

arunsrinivasan added a commit that referenced this issue Sep 26, 2015

Moved #1251 to Features in README.

a29390f
