[R-Forge #5754] GForce functions and row- + col-wise operations on .SD #523

Open
10 of 20 tasks
arunsrinivasan opened this issue Jun 8, 2014 · 13 comments
Labels
feature request GForce issues relating to optimized grouping calculations (GForce)

Comments

@arunsrinivasan
Member

Submitted by: Arun; Assigned to: Nobody; R-Forge link

For GForce

  • gsum, gmean
  • .N
  • gmin, gmax
  • median
  • head(.SD, 1), tail(.SD, 1), last(x)
  • [ for length-1 subsets
  • gvar
  • gsd
  • gprod
  • .SD[which.min()], .SD[which.max()]
  • guniqueN
  • gpaste??
  • quantile
  • covariance
  • correlation
  • kurtosis
  • skewness

When GForce is upgraded to work with :=:

  • cumulative functions
  • rolling / window functions

Utility function

  • lead, lag

When given a vector of offsets, they should return a list. That is,

x <- 1:5
lag(x, 1:2)
# [[1]]
# [1] NA  1  2  3  4
# 
# [[2]]
# [1] NA NA  1  2  3
@arunsrinivasan arunsrinivasan changed the title [R-Forge #5754] Implement functions for row-wise and col-wise operations on .SD [R-Forge #5754] GForce functions and row- + col-wise operations on .SD Jun 8, 2014
@arunsrinivasan arunsrinivasan mentioned this issue Jun 15, 2014
17 tasks
@arunsrinivasan
Member Author

Benchmarks for gmin and gmax on data just big enough to highlight the difference.

Data:

require(data.table)
set.seed(2L)
k = 1e4
n = 1e6
is_na = TRUE
dt <- setDT(lapply(1:100, function(x) sample(c(1:k, if(is_na) NA_integer_), n, TRUE)))

min, no na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, min), by=V1])
#  user  system elapsed 
#  0.533   0.012   0.547 

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, min), by=V1])
#   user  system elapsed 
#  4.698   0.025   4.761 

identical(ans1, ans2) # [1] TRUE

min, with na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, min, na.rm=TRUE), by=V1])
#   user  system elapsed 
#  0.481   0.016   0.568 

# without
options(datatable.optimize=1L) 
system.time(ans2 <- dt[, lapply(.SD, function(x) min(x, na.rm=TRUE)), by=V1])
#   user  system elapsed 
#  5.623   0.023   5.791 

identical(ans1, ans2) # [1] TRUE

max, no na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, max), by=V1])
#   user  system elapsed 
#  0.536   0.014   0.585 

# without 
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, max), by=V1])
#   user  system elapsed 
#  5.069   0.029   5.351 

identical(ans1, ans2) # [1] TRUE

max, with na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, max, na.rm=TRUE), by=V1])
#   user  system elapsed 
#  0.517   0.011   0.546 

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, function(x) max(x, na.rm=TRUE)), by=V1])
#   user  system elapsed 
#   5.862   0.025   6.064 
identical(ans1, ans2) # [1] TRUE

And here's a comparison putting everything together:

options(datatable.optimize=2L)
system.time(ans1 <- dt[, c(lapply(.SD, sum), lapply(.SD, mean), 
                             lapply(.SD, min), lapply(.SD, max), .N), by=V1])
#   user  system elapsed 
#  2.463   0.018   2.575 

options(datatable.optimize=1L)
system.time(ans2 <- dt[, c(lapply(.SD, sum), lapply(.SD, mean), 
                    lapply(.SD, min), lapply(.SD, max), .N), by=V1])
#   user  system elapsed 
# 11.840   0.034  11.987 

identical(ans1, ans2) # [1] TRUE
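
To confirm that a given j expression is actually being GForce optimised, rather than inferring it from timings, one option is to pass verbose=TRUE. The sketch below uses a small made-up table (the names small, g and x are only for illustration) and saves and restores the datatable.optimize option around the comparison. The exact wording of the verbose output may differ between data.table versions.

```r
library(data.table)

set.seed(1L)
# Toy data: 5 groups, 100 rows (illustrative only, not the benchmark data above)
small <- data.table(g = sample(5L, 100L, TRUE), x = sample(100L, 100L, TRUE))

old <- options(datatable.optimize = Inf)   # allow all optimisation levels
# verbose = TRUE reports how j was optimised
with_gforce <- small[, .(mn = min(x), mx = max(x)), by = g, verbose = TRUE]

options(datatable.optimize = 1L)           # level-1 optimisation only, no GForce
without_gforce <- small[, .(mn = min(x), mx = max(x)), by = g]

options(old)                               # restore the previous setting
identical(with_gforce, without_gforce)
```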

@arunsrinivasan arunsrinivasan added this to the v1.9.6 milestone Sep 24, 2014
@arunsrinivasan arunsrinivasan modified the milestones: v1.9.6, v1.9.8 Oct 10, 2014
@matthieugomez
Contributor

Ideally, quantile, cov & corr would be great.

@arunsrinivasan
Member Author

lead/lag implemented as shift(). See #965.
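
For reference, a minimal sketch of the list-returning behaviour requested in the original post, using shift(). The vector x is the same toy input as above.

```r
library(data.table)

x <- 1:5
# A vector of offsets yields one lagged vector per offset, returned as a list.
lags <- shift(x, n = 1:2, type = "lag")
# type = "lead" shifts in the opposite direction; a single offset returns a vector.
leads <- shift(x, n = 1L, type = "lead")
```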

@arunsrinivasan
Member Author

gmedian always returns a numeric (double) result, so there is no need to wrap the call in the annoying as.numeric(), and it is very fast. Using the same data as above:

without na.rm = TRUE

system.time(ans1 <- dt[, lapply(.SD, median), by=V1])
#    user  system elapsed 
#   1.562   0.007   1.574 
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(median(x))), by=V1])
#    user  system elapsed 
#  23.013   0.336  23.638 
identical(ans1, ans2)
# [1] TRUE

with na.rm = TRUE

system.time(ans1 <- dt[, lapply(.SD, median, na.rm=TRUE), by=V1])
#    user  system elapsed 
#   1.739   0.014   1.787 
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(median(x, na.rm=TRUE))), by=V1])
#    user  system elapsed 
#   24.201   0.749  25.217 
identical(ans1, ans2)
# [1] TRUE
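
The reason as.numeric() is needed in the non-GForce version is that base median() on integer input returns an integer for odd-length groups and a double for even-length groups. A small sketch of that, with a made-up two-group table (DT, g and x are illustrative names) and assuming GForce is enabled, as it is by default:

```r
library(data.table)

# Base R: the result type depends on whether the length is odd or even
typeof(median(1:3))   # integer
typeof(median(1:4))   # double

# One odd-sized and one even-sized group of integers
DT <- data.table(g = c("a", "a", "a", "b", "b", "b", "b"), x = 1:7)

# With GForce (default), the grouped median comes back as double for every group
res <- DT[, .(med = median(x)), by = g]
typeof(res$med)
```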

@arunsrinivasan
Member Author

Benchmarks for head and tail:

options(datatable.optimize=Inf)
system.time(ans1 <- dt[, head(.SD, 1), by=V1]) # gforce optimised
# 0.03 seconds

options(datatable.optimize=1)
system.time(ans2 <- dt[, head(.SD, 1), by=V1]) # level-1 optimisation
# 10 seconds

options(datatable.optimize=0)
system.time(ans3 <- dt[, head(.SD, 1), by=V1]) # no optimisation
# 45 seconds

# restore optimisation
options(datatable.optimize=Inf)

This works with subsets in i as well.

@arunsrinivasan
Member Author

Benchmark for [

options(datatable.optimize=Inf)
system.time(ans1 <- dt[, .SD[2], by=V1]) # gforce optimised
# 0.03 seconds

options(datatable.optimize=1L)
system.time(ans2 <- dt[, .SD[2], by=V1]) # level-1 optimisation
# 1.75 seconds

options(datatable.optimize=0L)
system.time(ans3 <- dt[, .SD[2], by=V1]) # no optimisation
# 41 seconds

# restore optimisation
options(datatable.optimize=Inf)

This works with subsets in i as well.

@jangorecki
Member

Any plans to optimize head(.SD, 2) or .SD[1:2]?
IMO there could be tons of cases worth optimizing, so it may be better to first work on making data.table more modular and extensible, so that future optimizations are cleaner and easier to contribute.

@arunsrinivasan
Member Author

var, sd and prod are now GForce optimised as well.

var

# with GForce (default)
system.time(ans1 <- dt[, lapply(.SD, var, na.rm=TRUE), by=V1])
#    user  system elapsed 
#   1.273   0.010   1.294 

# without
system.time(ans2 <- dt[, lapply(.SD, stats::var, na.rm=TRUE), by=V1])
#    user  system elapsed 
#  27.106   0.369  27.635 

all.equal(ans1, ans2) # [1] TRUE

sd

# with GForce (default)
system.time(ans1 <- dt[, lapply(.SD, sd, na.rm=TRUE), by=V1])
#    user  system elapsed 
#   1.227   0.007   1.242 

# without
system.time(ans2 <- dt[, lapply(.SD, stats::sd, na.rm=TRUE), by=V1])
#    user  system elapsed 
#  28.428   0.406  29.172 

all.equal(ans1, ans2) # [1] TRUE

@MichaelChirico
Member

Could is.na perhaps be added to the list? I've been seeing a few examples of this recently.

@franknarf1
Contributor

This may be a bad fit for GForce, but it would be nice to have an optimized version of :, perhaps. It's common that people want to use that with single-row groups, like: https://stackoverflow.com/a/44664086

@franknarf1
Contributor

How about any and all? Example from SO:

library(data.table)
household <-  c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3)
trip      <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9)
brand     <- c(1,2,3,4,5,6,7,5,1,6,8,9,9,2,8,1,3,4,5,6,7,8,9,1,1,2,3,4,1,5,6,7,1,8,9,2)
DT <- data.table(household,trip,brand)

DT[, loyal_brand := 
  .SD[.(household = household, trip = trip - 1L, brand = brand), on=.(household, trip, brand), .N, by=.EACHI]$N > 0L
]

DT[, .(loyal = any(loyal_brand)), by=.(household, trip)]

Seems any is analogous to max; and all to min.
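
A rough sketch of that analogy, on a made-up table rather than the SO data: with the flag stored as 0/1 integers (an assumption made here so that plain max()/min() calls can be used in j), a group maximum of 1 corresponds to any() and a group minimum of 1 to all(). It assumes no NAs in the flag. Whether GForce kicks in for the first aggregation can be checked with verbose=TRUE.

```r
library(data.table)

DT2 <- data.table(g    = c(1L, 1L, 2L, 2L, 3L, 3L),
                  flag = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE))

# Assumption: keep the flag as 0/1 integers so j is a plain max()/min() call
DT2[, flag_int := as.integer(flag)]
res <- DT2[, .(mx = max(flag_int), mn = min(flag_int)), by = g]

# Convert the aggregates back to logicals in a second, ungrouped step
res[, `:=`(any_flag = mx == 1L, all_flag = mn == 1L)]

# Reference result using any()/all() directly
check <- DT2[, .(any_flag = any(flag), all_flag = all(flag)), by = g]
```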

@MichaelChirico MichaelChirico added the GForce issues relating to optimized grouping calculations (GForce) label Feb 25, 2019
@jangorecki jangorecki removed the Medium label Apr 2, 2020
@kdkavanagh

Any idea whether seq could be converted to a GForce function? A common use case would be:

df[,list(positionInGroup = 1:.N), by=list(grp)]

@franknarf1
Contributor

@kdkavanagh FYI, for that you can create a column with df[, v := rowid(grp)]
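
A small sketch comparing the two approaches on toy data (the table and the column names v and w are illustrative). rowid() is computed without a by= and gives the same within-group position as the grouped 1:.N idiom:

```r
library(data.table)

df <- data.table(grp = c("a", "b", "a", "b", "a"))

# Position within group, no grouping step in j
df[, v := rowid(grp)]

# The grouped equivalent from the question above
df[, w := seq_len(.N), by = grp]

identical(df$v, df$w)
```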

7 participants