Further optimisation of `.SD` in `j` #735

arunsrinivasan · 2014-07-15T20:46:00Z

In #370 .SD was optimised internally for cases like:

require(data.table)
DT = data.table(id=c(1,1,1,2,2,2), x=1:6, y=7:12, z=13:18)
DT[, c(sum(x), lapply(.SD, mean)), by=id]
#    id V1 x  y  z
#1:  1  6 2  8 14
#2:  2 15 5 11 17

You can see that it's optimised by turning verbose on:

options(datatable.verbose=TRUE)
DT[, c(sum(x), lapply(.SD, mean)), by=id]
# Finding groups (bysameorder=FALSE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
# lapply optimization changed j from 'c(sum(x), lapply(.SD, mean))' to 'list(sum(x), mean(x), mean(y), mean(z))'
# GForce optimized j to 'list(gsum(x), gmean(x), gmean(y), gmean(z))'
options(datatable.verbose=FALSE)
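
One way to sanity-check the rewrite reported in the verbose output is to compare the original call against a hand-written version of the rewritten j. This is only a sketch: the names opt and manual are made up for illustration, and the x, y, z names in the hand-written list are added so that the column names line up with the output above.

```r
library(data.table)

DT = data.table(id=c(1,1,1,2,2,2), x=1:6, y=7:12, z=13:18)

# Original expression; the lapply optimisation may rewrite j internally.
opt = DT[, c(sum(x), lapply(.SD, mean)), by=id]

# Hand-written equivalent of the rewritten j. The names x, y, z are
# supplied by hand so the result columns match the original call.
manual = DT[, list(sum(x), x=mean(x), y=mean(y), z=mean(z)), by=id]

# Compare the two results.
all.equal(opt, manual)
```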

However, this expression is not always optimised. For example,

options(datatable.verbose=TRUE)
DT[, c(.SD[1], lapply(.SD, mean)), by=id]
# Finding groups (bysameorder=FALSE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0
# lapply optimization is on, j unchanged as 'c(.SD[1], lapply(.SD, mean))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# ...
#    id x  y  z x  y  z
#1:  1 1  7 13 2  8 14
#2:  2 4 10 16 5 11 17
options(datatable.verbose=FALSE)

This is because .SD cases are a little trickier to optimise. To begin with, if .SD has j as well, then it can't be optimised:

DT[, c(xx=.SD[1, x], lapply(.SD, mean)), by=id]
#    id xx x  y  z
#1:  1  1 2  8 14
#2:  2  4 5 11 17

As far as I understand, the above expression cannot be rewritten as list(..).

And even when there's no j, .SD can have i arguments of type integer, numeric, logical, expressions and even data.tables. For example:

DT[, c(.SD[x > 1 & y > 9][1], lapply(.SD, mean)), by=id]
#    id  x  y  z x  y  z
#1:  1 NA NA NA 2  8 14
#2:  2  4 10 16 5 11 17

If we optimised this naively, it would turn into:

DT[, list(x=x[x>1 & y > 9][1], y=y[x>1 & y>9][1], z=z[x>1 & y>9][1], x=mean(x), y=mean(y), z=mean(z)), by=id]
#    id  x  y  z x  y  z
#1:  1 NA NA NA 2  8 14
#2:  2  4 10 16 5 11 17

which is not really efficient, as it evaluates the filter expression (a vector scan) once per column and so gets slower as the number of columns grows. A better way to do it would be:

DT[, {tmp = x > 1 & y > 9; list(x=x[tmp][1], y=y[tmp][1], z=z[tmp][1], x=mean(x), y=mean(y), z=mean(z))}, by=id]
#    id  x  y  z x  y  z
#1:  1 NA NA NA 2  8 14
#2:  2  4 10 16 5 11 17

which is a little tricky to implement.
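
One way to avoid spelling out every column by hand is to compute the filter once per group and reuse it inside lapply(.SD, ...). A minimal sketch, assuming the filter only refers to columns visible in j. The names via_sd and via_tmp and the anonymous function are illustrative, and this is not a description of how data.table implements the optimisation internally.

```r
library(data.table)

DT = data.table(id=c(1,1,1,2,2,2), x=1:6, y=7:12, z=13:18)

# Reference: subset .SD per group, then take the first row.
via_sd = DT[, .SD[x > 1 & y > 9][1], by=id]

# Evaluate the filter once per group, then index each column with it.
via_tmp = DT[, {
  tmp = x > 1 & y > 9
  lapply(.SD, function(col) col[tmp][1L])
}, by=id]

# Compare the two results.
all.equal(via_sd, via_tmp)
```

The filter is evaluated once per group here, regardless of how many columns .SD has.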

If i is a join, then it must not be optimised either, and so on.

Basically, .SD and .SD[...] should be optimised one-by-one, optimising for each scenario:

Optimise (for possible cases):

All of these throw an error at the moment:

DT[, c(data.table(.), lapply(.SD, ...)), by=.]
DT[, c(as.data.table(.), lapply(.SD, ...)), by=.]
DT[, c(data.frame(.), lapply(.SD, ...)), by=.]
DT[, c(as.data.frame(.), lapply(.SD, ...)), by=.]

Note that all these can occur on the right side of lapply(.SD, ...) as well.

arunsrinivasan · 2014-10-09T12:40:47Z

Fixed #861.

arunsrinivasan · 2014-11-16T00:07:19Z

Refer to #952 for an example from @mgahan where optimising .SD by using .I is faster.

eantonya · 2015-05-04T20:14:33Z

Some .SD[i, j] expressions can also be optimized (not sure how worthwhile that would be, though). E.g. I think this works:

d[a, .SD[i, j], b] is equivalent to d[d[a, .I[i], b]$V1, j, b]
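
The equivalence can be checked on a toy table. A minimal sketch with the outer filter a left out for simplicity, i taken to be which.max(x) and j taken to be .(y). These choices, the table d, and the names direct, rows and via_I are all assumptions for illustration.

```r
library(data.table)

d = data.table(b=c(1,1,1,2,2,2), x=1:6, y=7:12)

# Direct form: subset .SD within each group, then evaluate j on the subset.
direct = d[, .SD[which.max(x), .(y)], by=b]

# .I form: collect the matching row numbers per group first,
# then subset d once and evaluate j by group.
rows = d[, .I[which.max(x)], by=b]$V1
via_I = d[rows, .(y), by=b]

# Compare the two results.
all.equal(direct, via_I)
```

The .I form avoids building a .SD subset for every group, which is presumably where any speed-up would come from.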

franknarf1 · 2017-01-25T15:40:22Z

A further idea: could .SD[, ..cols] be treated in the same way as .SD for the purposes of applying GForce?

I ran into this on SO:

library(data.table)
set.seed(1)
DT <- data.table(C1=c("a","b","b"),
                 C2=round(rnorm(4),4),
                 C3=1:12,
                 C4=9:12)

sum_cols <- c("C2","C3")
mean_cols <- c("C3","C4")

# this gets optimized:
DT[, c(
  .N, 
  sum = lapply(.SD, sum)
), by=C1, .SDcols=sum_cols, verbose = TRUE]

# but this does not:
DT[, c(
  .N, 
  sum = lapply(.SD[, ..sum_cols], sum), 
  mean = lapply(.SD[, ..mean_cols], mean)
), by=C1, verbose = TRUE]

Hm, just noticed that the "lapply optimization" strips my sum = prefixes for the output columns in the first case above. It would be nice to have those prefixes put back in after-the-fact. Not sure if that's a worthwhile feature request or not...

arunsrinivasan added a commit that referenced this issue Aug 5, 2014

DT[, c(.SD, …), by=.] optimised, #735 task list 1,2.

f7f2cf3

arunsrinivasan added a commit that referenced this issue Aug 5, 2014

More .SD optimisations #735 task list 3, 4 and 5

b3a29b9

.SD[1], .SD[1L], head(.SD, 1) in `j` alone or along with c(..) are now optimised for speed internally.

arunsrinivasan mentioned this issue Aug 29, 2014

add a 'having' parameter to [.data.table #788

Open

arunsrinivasan added a commit that referenced this issue Oct 9, 2014

Fixed bug in mean -> fastmean during #735. Added test.

f1cad52

arunsrinivasan mentioned this issue Nov 16, 2014

[Question] Speed of .SD[1] #952

Closed

arunsrinivasan mentioned this issue Oct 20, 2015

subsetting n rows by .SD is O(e^n) #1400

Closed

arunsrinivasan added the performance label Oct 20, 2015

jangorecki mentioned this issue Nov 28, 2015

[R-Forge #2330] Optimize .SD[i] query to keep the elegance but make it faster unchanged. #613

Closed

arunsrinivasan mentioned this issue Mar 18, 2016

Bug report - RSession Hangs #1470

Closed

franknarf1 mentioned this issue Feb 16, 2017

gfirst(.SD) throws an error about not using head(.SD, n), but the latter works #2030

Closed

franknarf1 mentioned this issue Dec 27, 2017

Siginificant performance difference between tail(.SD,1) and .SD[.N] #2538

Closed

MichaelChirico added the GForce issues relating to optimized grouping calculations (GForce) label Feb 25, 2019

franknarf1 mentioned this issue Jun 25, 2019

translation vignette feedback tidyverse/dtplyr#73

Closed

smingerson mentioned this issue Aug 31, 2020

New symbol .D to refer to x in i #4685

Open

ColeMiller1 mentioned this issue Nov 17, 2020

Slow .SD[.N] compare to last(.SD) with groupby #4809

Open

This was referenced Jan 31, 2021

Improve translation of grouped filter() tidyverse/dtplyr#176

Closed

.SD[i] could be optimized better and more generally #4886

Open

ben-schwen mentioned this issue Oct 9, 2021

gshift as gforce optimized shift #5205

Merged

7 tasks
