[Question] Speed of .SD[1] #952

mgahan · 2014-11-14T16:01:45Z

One feature of data.table that I find myself using quite often is the .SD[1] or .SD[.N] functionality. It can be quite useful (especially in tandem with the .SDcols parameter). However, I notice that it is one of the slower operations I use in data.table. Here is an example:

Build some fake data

require(data.table) #data.table_1.9.5
set.seed(1)
data <- matrix(rnorm(50000000),ncol=5)
data <- as.data.table(data)

Create ID

data[ , ID := sample(1:2000000, nrow(data), replace=TRUE)]

Slow way

slow.way <- data[ ,.SD[1], by=ID]

Faster way

data[ , Indx := 1:.N, by=ID]
fast.way <- data[Indx==1]

Timings

| Method | Time1 | Time2 | Time3 |
|---|---|---|---|
| slow.way | 12.04091 | 11.38218 | 10.96066 |
| fast.way | 5.65221 | 5.76312 | 6.38286 |

I much prefer method number 1 (the slow way) for documentation purposes, but it can really slow down my workflow, so I usually use method number 2. I understand this might be unavoidable since .SD[1] offers more flexibility than filtering on the first row. Does anyone else run into this situation?
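For anyone who wants to reproduce the comparison, here is a minimal sketch on a much smaller data set. The names `dt`, `first_sd`, `first_idx`, `t_sd` and `t_idx` are illustrative and not from the original post, and the sizes are reduced so it runs quickly. It times both approaches with `system.time()` and checks that they return the same rows; timings will vary by machine and data.table version.

```r
library(data.table)

set.seed(1)
# Assumption: a far smaller table than the 10 million rows in the post
n <- 1e5
dt <- as.data.table(matrix(rnorm(n * 5), ncol = 5))
dt[, ID := sample(1:2000, n, replace = TRUE)]

# Approach 1: take the first row of .SD within each group
t_sd <- system.time(first_sd <- dt[, .SD[1], by = ID])

# Approach 2: add a within-group row counter, then filter on it
t_idx <- system.time({
  dt[, Indx := seq_len(.N), by = ID]
  first_idx <- dt[Indx == 1]
})

# Drop the helper column so both results hold the same columns
dt[, Indx := NULL]
first_idx[, Indx := NULL]

# by= puts ID first in the grouped result, so align column and row order
setcolorder(first_idx, names(first_sd))
setorder(first_sd, ID)
setorder(first_idx, ID)

same_rows <- isTRUE(all.equal(first_sd, first_idx, check.attributes = FALSE))
print(rbind(slow.way = t_sd, fast.way = t_idx))
```

Note that the second approach modifies the table by reference, which is why the helper column is removed afterwards.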

sessionInfo()
#R version 3.0.1 (2013-05-16)
#Platform: x86_64-redhat-linux-gnu (64-bit)
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C              
# [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C     
#
#attached base packages:
#[1] stats     graphics  grDevices utils     datasets  methods   base 
#
#other attached packages:
#[1] data.table_1.9.5
#
#loaded via a namespace (and not attached):
#[1] chron_2.3-44 tools_3.0.1


arunsrinivasan · 2014-11-14T17:24:51Z

Hi, did you read this?

arunsrinivasan · 2014-11-14T18:50:31Z

Why delete? I was pointing more or less to "What should the report contain?" here - familiarise yourself with writing code blocks using GitHub Flavoured Markdown. Comment the lines that should be commented, etc.

arunsrinivasan · 2014-11-16T00:05:11Z

Thanks @mgahan. There have been some internal optimisations of .SD. But seems like those can be further improved by using .I. Have linked to #735. Will close this one (as a close duplicate).
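As a rough illustration of the `.I` idea mentioned above, the following is a minimal sketch and not necessarily what the internal optimisation does. `.I` holds the row numbers of each group in the original table, so the first (or last) row number per group can be collected and then used for a single subset. The table `dt` and its columns `x` and `y` are made up for this example.

```r
library(data.table)

set.seed(1)
# Hypothetical small table for illustration
dt <- data.table(ID = sample(1:5, 20, replace = TRUE),
                 x  = rnorm(20),
                 y  = rnorm(20))

# .I[1] is the row number in dt of each group's first row;
# the unnamed result column is auto-named V1
first_rows <- dt[, .I[1], by = ID]$V1
via_I_first <- dt[first_rows]

# Same idea for the last row of each group, mirroring .SD[.N]
last_rows <- dt[, .I[.N], by = ID]$V1
via_I_last <- dt[last_rows]

# Reference results using .SD directly
via_SD_first <- dt[, .SD[1], by = ID]
via_SD_last  <- dt[, .SD[.N], by = ID]
```

Unlike the `Indx` approach, this does not add a helper column to the table.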

mgahan changed the title ~~Speed of .SD[1]~~ [Question] Speed of .SD[1] Nov 14, 2014

arunsrinivasan closed this as completed Nov 16, 2014

arunsrinivasan added the feature request label Nov 16, 2014

arunsrinivasan mentioned this issue Nov 16, 2014

Further optimisation of .SD in j #735
