Speedup DT[,.N,by=]. Currently evals j per group. #1251

Closed
mattdowle opened this issue Aug 7, 2015 · 2 comments

@mattdowle
Member

require(data.table)
DT = data.table(a=1:1e8, b=1:2)
DT[,.N,by=a,verbose=TRUE]
# Detected that j uses these columns: <none> 
# Finding groups (bysameorder=FALSE) ... done in 0.757secs. bysameorder=TRUE and o__ is length 0
# Optimization is on but left j unchanged (single plain symbol): '.N'
# Starting dogroups ... 
#   memcpy contiguous groups took 12.143s for 100000000 groups
#   eval(j) took 13.800s for 100000000 calls
# done dogroups in 55.253 secs
#                a N
# 1e+00:         1 1
# 2e+00:         2 1
# 3e+00:         3 1
# 4e+00:         4 1
# 5e+00:         5 1
#    ---            
# 1e+08:  99999996 1
# 1e+08:  99999997 1
# 1e+08:  99999998 1
# 1e+08:  99999999 1
# 1e+08: 100000000 1
@mattdowle mattdowle added this to the v1.9.8 milestone Aug 7, 2015
@arunsrinivasan
Member

Note that the only case that isn't optimised is when j is just .N on its own. When .N is combined with other functions that are GForce-able, it works fine (I remember implementing that). I'd missed this case...

require(data.table)
dt = data.table(x=rep(1:3, each=2), y=1:6)
options(datatable.verbose=TRUE)
dt[, .(.N, mean(y)), by=x]
# Detected that j uses these columns: y 
# Finding groups (bysameorder=FALSE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
# lapply optimization is on, j unchanged as 'list(.N, mean(y))'
# GForce optimized j to 'list(.N, gmean(y))'

@arunsrinivasan arunsrinivasan self-assigned this Sep 26, 2015
@arunsrinivasan
Member

Now I get:

require(data.table)
DT = data.table(a=1:1e8, b=1:2)
options(datatable.optimize=1L) # no GForce
system.time(DT[, .(.N), by=a])
#    user  system elapsed
#  25.598   0.801  26.882
system.time(DT[, .N, by=a])
#    user  system elapsed
#  15.395   1.056  16.832

options(datatable.optimize=Inf) # yes GForce
system.time(DT[, .(.N), by=a])
#    user  system elapsed
#   1.620   0.675   2.306
system.time(DT[, .N, by=a])
#    user  system elapsed
#   1.583   0.673   2.259
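
For anyone who wants to try the comparison without allocating 1e8 rows, here is a smaller sketch of the same check. The table size and the names `res_plain` and `res_gforce` are illustrative assumptions, not from the benchmark above. It only confirms that both optimization levels return the same counts; timings at this scale won't mean much.

```r
library(data.table)

# Small illustrative table: 1000 groups of 3 rows each (sizes are an assumption).
n_groups = 1000L
DT = data.table(a = rep(seq_len(n_groups), each = 3L),
                b = rep(1:2, length.out = 3L * n_groups))

# Count rows per group with GForce disabled, then with full optimization.
old = options(datatable.optimize = 1L)
res_plain = DT[, .N, by = a]
options(datatable.optimize = Inf)
res_gforce = DT[, .N, by = a]
options(old)  # restore the previous setting

# Both code paths should agree on the per-group counts.
stopifnot(isTRUE(all.equal(res_plain, res_gforce)))
```

Adding `verbose=TRUE` to either call, as in the first comment, prints which optimization path j went through.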
