[Question] Speed of .SD[1] #952

Closed
mgahan opened this issue Nov 14, 2014 · 3 comments

mgahan commented Nov 14, 2014

One feature of data.table that I find myself using quite often is the .SD[1] or .SD[.N] functionality. It can be quite useful (especially in tandem with the .SDcols parameter). However, I notice that it is one of the slower operations I use in data.table. Here is an example:

Build some fake data

library(data.table)  # data.table_1.9.5
set.seed(1)
data <- matrix(rnorm(50000000),ncol=5)
data <- as.data.table(data)

Create ID

data[ , ID := sample(1:2000000, nrow(data), replace=TRUE)]

Slow way

slow.way <- data[ ,.SD[1], by=ID]

Faster way

data[ , Indx := seq_len(.N), by=ID]
fast.way <- data[Indx==1]
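
As a sanity check that the two approaches return the same rows, here is a minimal sketch on a much smaller table. The names `dt`, `slow`, `fast`, `cols`, and `same` are illustrative and not from the original report, and the table size is reduced so the check runs quickly.

```r
library(data.table)

set.seed(1)
# Small stand-in for the 10-million-row table in the report (assumed size)
dt <- as.data.table(matrix(rnorm(5000), ncol = 5))
dt[, ID := sample(1:200, nrow(dt), replace = TRUE)]

# First row per group via .SD
slow <- dt[, .SD[1], by = ID]

# First row per group via a helper index column
dt[, Indx := seq_len(.N), by = ID]
cols <- names(slow)                       # ID first, then V1..V5
fast <- dt[Indx == 1, cols, with = FALSE]

# Compare column by column; both results list groups in order of first appearance
same <- all(mapply(identical, slow, fast))
print(same)
```

Selecting `cols` drops the helper column and puts `ID` first, so the comparison is not affected by column order.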

Timings

| Method | Time1 | Time2 | Time3 |
|---|---|---|---|
| slow.way | 12.04091 | 11.38218 | 10.96066 |
| fast.way | 5.65221 | 5.76312 | 6.38286 |
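
The report does not say how these timings were taken. One plausible way to collect comparable numbers is base R's `system.time()`, sketched below on a reduced table. The size `n`, the ID range, and the names `t_slow` and `t_fast` are assumptions for illustration, and absolute times will differ by machine and data.table version.

```r
library(data.table)

set.seed(1)
n <- 1e5  # assumed size, reduced from the report's 1e7 rows
dt <- as.data.table(matrix(rnorm(n * 5), ncol = 5))
dt[, ID := sample(1:20000, n, replace = TRUE)]

# Elapsed wall-clock time for the .SD[1] approach
t_slow <- system.time(slow <- dt[, .SD[1], by = ID])[["elapsed"]]

# Elapsed time for the helper-column approach (both steps are timed together)
t_fast <- system.time({
  dt[, Indx := seq_len(.N), by = ID]
  fast <- dt[Indx == 1]
})[["elapsed"]]

print(c(slow.way = t_slow, fast.way = t_fast))
```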

I much prefer method number 1 (the slow way) for documentation purposes, but it can really slow down my workflow, so I usually use method number 2. I understand this might be unavoidable since .SD[1] offers more flexibility than filtering on the first row. Does anyone else run into this situation?
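
When all columns are wanted, one readable alternative that avoids the helper column is `unique()` with its `by` argument, which keeps the first row for each value of the grouping column (or the last with `fromLast = TRUE`). This is only a sketch on a small, made-up table; it was not proposed in the thread, and it does not offer the column selection that .SDcols provides.

```r
library(data.table)

set.seed(1)
# Small illustrative table (assumed size)
dt <- as.data.table(matrix(rnorm(5000), ncol = 5))
dt[, ID := sample(1:200, nrow(dt), replace = TRUE)]

# First row of each ID, in original row order
first_rows <- unique(dt, by = "ID")

# Last row of each ID
last_rows <- unique(dt, by = "ID", fromLast = TRUE)
```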

sessionInfo()
#R version 3.0.1 (2013-05-16)
#Platform: x86_64-redhat-linux-gnu (64-bit)
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C              
# [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C     
#
#attached base packages:
#[1] stats     graphics  grDevices utils     datasets  methods   base 
#
#other attached packages:
#[1] data.table_1.9.5
#
#loaded via a namespace (and not attached):
#[1] chron_2.3-44 tools_3.0.1 

@arunsrinivasan (Member)

Hi, did you read this?

mgahan changed the title from "Speed of .SD[1]" to "[Question] Speed of .SD[1]" on Nov 14, 2014

@arunsrinivasan (Member)

Why delete? I was pointing more or less to "What should the report contain?" here - familiarise yourself with writing code blocks using GitHub Flavored Markdown. Comment the lines that should be commented, etc.

@arunsrinivasan (Member)

Thanks @mgahan. There have been some internal optimisations of .SD. But seems like those can be further improved by using .I. Have linked to #735. Will close this one (as a close duplicate).
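
The .I idea mentioned above can also be applied by hand: compute the row number of the first (or last) row in each group with `.I`, then subset the table once by those row numbers. The sketch below uses a small, made-up table; the names `first_idx`, `last_idx`, and the column label `row` are illustrative, and this is not necessarily how the internal optimisation is implemented.

```r
library(data.table)

set.seed(1)
# Small illustrative table (assumed size)
dt <- as.data.table(matrix(rnorm(5000), ncol = 5))
dt[, ID := sample(1:200, nrow(dt), replace = TRUE)]

# .I holds the row numbers of the current group within the full table
first_idx <- dt[, list(row = .I[1]), by = ID]$row
last_idx  <- dt[, list(row = .I[.N]), by = ID]$row

# A single subset by row number replaces the per-group .SD[1] / .SD[.N]
first_rows <- dt[first_idx]
last_rows  <- dt[last_idx]

# Restricting columns, similar in spirit to .SDcols
first_subset <- dt[first_idx, c("ID", "V1", "V2"), with = FALSE]
```

The results keep the table's original column order, so `ID` is the last column here rather than the first as in the `by=ID` output.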
