[R-Forge #5754] GForce functions and row- + col-wise operations on .SD #523

arunsrinivasan · 2014-06-08T13:15:25Z

arunsrinivasan · 2014-06-18T20:22:42Z

Benchmarks for gmin and gmax on data just big enough to highlight the difference.

Data:

require(data.table)
set.seed(2L)
k = 1e4
n = 1e6
is_na = TRUE
dt <- setDT(lapply(1:100, function(x) sample(c(1:k, if(is_na) NA_integer_), n, TRUE)))

min, no na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, min), by=V1])
#  user  system elapsed 
#  0.533   0.012   0.547 

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, min), by=V1])
#   user  system elapsed 
#  4.698   0.025   4.761 

identical(ans1, ans2) # [1] TRUE

min, with na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, min, na.rm=TRUE), by=V1])
#   user  system elapsed 
#  0.481   0.016   0.568 

# without
options(datatable.optimize=1L) 
system.time(ans2 <- dt[, lapply(.SD, function(x) min(x, na.rm=TRUE)), by=V1])
#   user  system elapsed 
#  5.623   0.023   5.791 

identical(ans1, ans2) # [1] TRUE

max, no na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, max), by=V1])
#   user  system elapsed 
#  0.536   0.014   0.585 

# without 
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, max), by=V1])
#   user  system elapsed 
#  5.069   0.029   5.351 

identical(ans1, ans2) # [1] TRUE

max, with na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, max, na.rm=TRUE), by=V1])
#   user  system elapsed 
#  0.517   0.011   0.546 

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, function(x) max(x, na.rm=TRUE)), by=V1])
#   user  system elapsed 
#  5.862   0.025   6.064 

identical(ans1, ans2) # [1] TRUE

And here's a comparison putting everything together:

options(datatable.optimize=2L)
system.time(ans1 <- dt[, c(lapply(.SD, sum), lapply(.SD, mean), 
                             lapply(.SD, min), lapply(.SD, max), .N), by=V1])
#   user  system elapsed 
#  2.463   0.018   2.575 

options(datatable.optimize=1L)
system.time(ans2 <- dt[, c(lapply(.SD, sum), lapply(.SD, mean), 
                             lapply(.SD, min), lapply(.SD, max), .N), by=V1])
#   user  system elapsed 
# 11.840   0.034  11.987 

identical(ans1, ans2) # [1] TRUE
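To check which path a particular query takes, verbose=TRUE can help. The following is a minimal sketch on a small made-up table (DT, g and v are illustrative names, not the benchmark data). The exact wording of the verbose message is an assumption based on data.table releases I am familiar with and may differ between versions.

```r
library(data.table)

# Small illustrative table (hypothetical names, not the benchmark data)
DT <- data.table(g = rep(1:3, each = 4L),
                 v = c(5L, 3L, 8L, 1L, 9L, 2L, 7L, 4L, 6L, 10L, 12L, 11L))

# With GForce: capture the verbose log while computing the result
old <- options(datatable.optimize = 2L)
log_gforce <- capture.output(
  res_gforce <- DT[, lapply(.SD, min), by = g, verbose = TRUE]
)

# Without GForce: same query at optimisation level 1
options(datatable.optimize = 1L)
log_plain <- capture.output(
  res_plain <- DT[, lapply(.SD, min), by = g, verbose = TRUE]
)
options(old)  # restore the previous setting

# Assumption: the rewrite is reported on a line containing "GForce optimized j"
gforce_used <- any(grepl("GForce optimized j", log_gforce, fixed = TRUE))
```

Inspecting log_gforce and log_plain side by side shows what j was rewritten to, if anything, at each level.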

matthieugomez · 2014-11-12T04:10:02Z

Ideally, quantile, cov & corr would be great.

arunsrinivasan · 2015-01-07T19:19:18Z

lead/lag implemented as shift(). See #965.
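For reference, a minimal sketch of lead/lag within groups using shift(). The table and column names are made up for illustration.

```r
library(data.table)

# Hypothetical example table
DT <- data.table(g = c("a", "a", "a", "b", "b"), v = 1:5)

# Lag and lead by one position within each group;
# positions with no neighbour are filled with NA (the default fill)
DT[, `:=`(v_lag  = shift(v, 1L, type = "lag"),
          v_lead = shift(v, 1L, type = "lead")), by = g]
```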

arunsrinivasan · 2015-10-30T17:33:47Z

gmedian always returns a numeric result, so there is no need to wrap the call in the annoying as.numeric(), and it is very fast. Using the same data as above:

without `na.rm = TRUE`

system.time(ans1 <- dt[, lapply(.SD, median), by=V1])
#    user  system elapsed 
#   1.562   0.007   1.574 
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(median(x))), by=V1])
#    user  system elapsed 
#  23.013   0.336  23.638 
identical(ans1, ans2)
# [1] TRUE

with `na.rm = TRUE`

system.time(ans1 <- dt[, lapply(.SD, median, na.rm=TRUE), by=V1])
#    user  system elapsed 
#   1.739   0.014   1.787 
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(median(x, na.rm=TRUE))), by=V1])
#    user  system elapsed 
#   24.201   0.749  25.217 
identical(ans1, ans2)
# [1] TRUE
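The as.numeric() wrapper matters because base median() on an integer vector returns an integer for odd-length input and a double for even-length input, so result types can differ between groups. A small sketch with made-up data (DT, g and v are illustrative names), assuming GForce is applied to a plain median(v) call at optimisation level 2:

```r
library(data.table)

# Hypothetical table: group 1 has an even number of rows, group 2 an odd number
DT <- data.table(g = c(1L, 1L, 2L, 2L, 2L), v = c(1L, 2L, 10L, 20L, 30L))

# Base R: the result type depends on the length of the input
type_even <- typeof(median(c(1L, 2L)))         # mean of the two middle values
type_odd  <- typeof(median(c(10L, 20L, 30L)))  # the middle value itself

# GForce path: one consistent type across groups
old <- options(datatable.optimize = 2L)
ans_gforce <- DT[, .(med = median(v)), by = g]
options(old)

# Explicit coercion, evaluated group by group
ans_wrapped <- DT[, .(med = as.numeric(median(v))), by = g]
```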

arunsrinivasan · 2015-11-08T14:09:44Z

Benchmarks for head and tail:

options(datatable.optimize=Inf)
system.time(ans1 <- dt[, head(.SD, 1), by=V1]) # gforce optimised
# 0.03 seconds

options(datatable.optimize=1)
system.time(ans2 <- dt[, head(.SD, 1), by=V1]) # level-1 optimisation
# 10 seconds

options(datatable.optimize=0)
system.time(ans3 <- dt[, head(.SD, 1), by=V1]) # no optimisation
# 45 seconds

# restore optimisation
options(datatable.optimize=Inf)

Works with subsets in i as well.
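A small sketch of the forms this covers, on a made-up table (names are illustrative): first and last row per group, and the same with a subset in i.

```r
library(data.table)

# Hypothetical example table
DT <- data.table(g = rep(c("a", "b", "c"), each = 3L),
                 x = 1:9,
                 y = letters[1:9])

# First and last row of each group
first_rows <- DT[, head(.SD, 1), by = g]
last_rows  <- DT[, tail(.SD, 1), by = g]

# With a subset in i: first row per group among rows where x > 2
first_sub <- DT[x > 2L, head(.SD, 1), by = g]
```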

arunsrinivasan · 2015-11-08T14:50:31Z

Benchmark for `[`:

options(datatable.optimize=Inf)
system.time(ans1 <- dt[, .SD[2], by=V1]) # gforce optimised
# 0.03 seconds

options(datatable.optimize=1L)
system.time(ans2 <- dt[, .SD[2], by=V1]) # level-1 optimisation
# 1.75 seconds

options(datatable.optimize=0L)
system.time(ans3 <- dt[, .SD[2], by=V1]) # no optimisation
# 41 seconds

# restore optimisation
options(datatable.optimize=Inf)

Works with subsets in i as well.
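As a quick sanity check, the result should not depend on the optimisation level. A minimal sketch on a made-up table where every group has at least two rows (names are illustrative):

```r
library(data.table)

# Hypothetical example table; each group has three rows
DT <- data.table(g = rep(c("a", "b"), each = 3L),
                 x = 1:6,
                 y = c(2.5, 1.5, 3.5, 6.5, 5.5, 4.5))

# Second row per group with all optimisations enabled
old <- options(datatable.optimize = Inf)
second_fast <- DT[, .SD[2], by = g]

# Same query with optimisation switched off
options(datatable.optimize = 0L)
second_slow <- DT[, .SD[2], by = g]
options(old)  # restore the previous setting

same <- isTRUE(all.equal(second_fast, second_slow))
```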

jangorecki · 2015-12-08T16:21:59Z

Any plans to optimise head(.SD, 2) or .SD[1:2]?
IMO there could be tons of cases worth optimising, so it may be better to work on making data.table more modular first, so that any future optimisation is cleaner and easier to contribute.

arunsrinivasan · 2016-02-04T22:29:01Z

var, sd and prod are now GForce optimised as well.

var

# with GForce (default)
system.time(ans1 <- dt[, lapply(.SD, var, na.rm=TRUE), by=V1])
#    user  system elapsed 
#   1.273   0.010   1.294 

# without
system.time(ans2 <- dt[, lapply(.SD, stats::var, na.rm=TRUE), by=V1])
#    user  system elapsed 
#  27.106   0.369  27.635 

all.equal(ans1, ans2) # [1] TRUE

sd

# with GForce (default)
system.time(ans1 <- dt[, lapply(.SD, sd, na.rm=TRUE), by=V1])
#    user  system elapsed 
#   1.227   0.007   1.242 

# without
system.time(ans2 <- dt[, lapply(.SD, stats::sd, na.rm=TRUE), by=V1])
#    user  system elapsed 
#  28.428   0.406  29.172 

all.equal(ans1, ans2) # [1] TRUE
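A small self-contained check along the same lines, including prod, on made-up data (names are illustrative). Following the benchmarks above, the namespaced calls are used as the reference; whether they bypass GForce in every data.table version is an assumption.

```r
library(data.table)

set.seed(1L)
# Hypothetical table: four groups of five values, with one NA in the last group
DT <- data.table(g = rep(1:4, each = 5L), v = c(rnorm(19), NA))

# Plain calls, eligible for GForce
by_group <- DT[, .(v_var  = var(v, na.rm = TRUE),
                   v_sd   = sd(v, na.rm = TRUE),
                   v_prod = prod(v, na.rm = TRUE)), by = g]

# Reference: namespaced calls evaluated group by group
ref <- DT[, .(v_var  = stats::var(v, na.rm = TRUE),
              v_sd   = stats::sd(v, na.rm = TRUE),
              v_prod = base::prod(v, na.rm = TRUE)), by = g]

agree <- isTRUE(all.equal(by_group, ref))
```

all.equal() is used rather than identical() because the two paths may differ in the last few floating-point digits.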


MichaelChirico · 2017-01-13T22:30:02Z

Could is.na perhaps be added to the list? I've been seeing a few examples of this recently.

franknarf1 · 2017-06-21T16:29:03Z

This may be a bad fit for GForce, but it would be nice to have an optimized version of :, perhaps. It's common that people want to use that with single-row groups, like: https://stackoverflow.com/a/44664086

franknarf1 · 2018-02-20T16:45:36Z

How about any and all? Example from SO:

library(data.table)
household <-  c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3)
trip      <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9)
brand     <- c(1,2,3,4,5,6,7,5,1,6,8,9,9,2,8,1,3,4,5,6,7,8,9,1,1,2,3,4,1,5,6,7,1,8,9,2)
DT <- data.table(household,trip,brand)

DT[, loyal_brand := 
  .SD[.(household = household, trip = trip - 1L, brand = brand), on=.(household, trip, brand), .N, by=.EACHI]$N > 0L
]

DT[, .(loyal = any(loyal_brand)), by=.(household, trip)]

Seems any is analogous to max; and all to min.
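To illustrate the analogy, here is a sketch of a possible workaround, not an existing GForce feature: store the flag as a 0/1 integer and use max/min in place of any/all. It assumes the flag has no NA values, since any() and max() treat NA differently. The table and column names are made up.

```r
library(data.table)

# Hypothetical example table with a logical flag (no NAs)
DT <- data.table(g = c(1L, 1L, 2L, 2L, 3L, 3L),
                 flag = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE))

# Reference result with any/all
ref <- DT[, .(any_flag = any(flag), all_flag = all(flag)), by = g]

# Workaround sketch: 0/1 integer column, then max for any and min for all
DT[, flag_int := as.integer(flag)]
tmp <- DT[, .(any_int = max(flag_int), all_int = min(flag_int)), by = g]
alt <- tmp[, .(g, any_flag = any_int == 1L, all_flag = all_int == 1L)]
```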

kdkavanagh · 2020-09-29T02:41:55Z

Any idea whether seq could be converted to a GForce function? A common use case would be:

df[,list(positionInGroup = 1:.N), by=list(grp)]

franknarf1 · 2020-09-29T07:16:38Z

@kdkavanagh FYI, for that you can create a column with df[, v := rowid(grp)]
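A minimal sketch showing that the two give the same positions, on a made-up table with the column names used above:

```r
library(data.table)

# Hypothetical example table with interleaved groups
df <- data.table(grp = c("a", "b", "a", "a", "b"), val = 1:5)

# Grouped version: position within each group
df[, positionInGroup := seq_len(.N), by = grp]

# Ungrouped version: rowid() computes the same positions in one vectorised call
df[, v := rowid(grp)]
```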

arunsrinivasan changed the title ~~[R-Forge #5754] Implement functions for row-wise and col-wise operations on .SD~~ [R-Forge #5754] GForce functions and row- + col-wise operations on .SD Jun 8, 2014

arunsrinivasan mentioned this issue Jun 15, 2014

Homepage #695

Closed

17 tasks

arunsrinivasan added a commit that referenced this issue Jun 18, 2014

gmin and gmax done. Partially address #5754 (git #523)

0af4511

arunsrinivasan added this to the v1.9.6 milestone Sep 24, 2014

arunsrinivasan added the Medium label Sep 25, 2014

arunsrinivasan modified the milestones: v1.9.6, v1.9.8 Oct 10, 2014

arunsrinivasan added a commit that referenced this issue Jan 26, 2015

GForce min/max for characters, #523.

a18d624

arunsrinivasan added a commit that referenced this issue Oct 30, 2015

gforce now optimises 'median' as well, #523.

d2f7d63

arunsrinivasan added a commit that referenced this issue Oct 30, 2015

Minor: s/max/median in error messages, #523.

a6950a2

arunsrinivasan added a commit that referenced this issue Nov 8, 2015

head(.SD, 1) and tail(.SD,1) are gforce optimised, #523.

e615532

arunsrinivasan added a commit that referenced this issue Nov 8, 2015

.SD[val] and col[val] optimised with GForce, #523.

751baff

arunsrinivasan added a commit that referenced this issue Feb 4, 2016

var, sd and prod functions are all GForce optimised for speed/memory, #…

adc139f

…523.

arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Mar 20, 2016

mattdowle removed this from the Candidate milestone May 10, 2018

AndreMikulec mentioned this issue Jan 10, 2019

period.apply on xts object takes about 50x longer than running it on it's core data and recasting. joshuaulrich/xts#278

Open

MichaelChirico added the GForce issues relating to optimized grouping calculations (GForce) label Feb 25, 2019

brodieG mentioned this issue Jun 10, 2019

GForce should be able to work with := as well. #1414

Closed

kdkavanagh mentioned this issue Oct 30, 2019

Column subset GForce optimization not applied #4014

Closed

jangorecki removed the Medium label Apr 2, 2020

myoung3 mentioned this issue Jul 1, 2021

GForce optimization for head/tail with arbitrary n #5060

Closed

ben-schwen mentioned this issue Oct 9, 2021

gshift as gforce optimized shift #5205

Merged

7 tasks

ben-schwen mentioned this issue Feb 17, 2022

Fast 'groups' of individual rows #1004

Closed

