test and confirm new parallel subset performance #3175
Comments
The following script tests subsetting by integer row ids and times each subset.
vim dt-parallel-subset.R
args = as.integer(commandArgs(TRUE))
th = args[1L]
N = args[2L]
K = 100L
get_i = function(n.out, n.in) {
n.out = as.integer(n.out)
n.in = as.integer(n.in)
set.seed(n.out)
sample(n.in, n.out)
}
library(data.table)
cat(sprintf("# datagen %s rows\n", N))
set.seed(108)
DT = data.table(
id1 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
id2 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
id4 = sample(K, N, TRUE), # large groups (int)
id5 = sample(K, N, TRUE), # large groups (int)
id6 = sample(N/K, N, TRUE), # small groups (int)
v1 = sample(5, N, TRUE), # int in range [1,5]
v2 = sample(5, N, TRUE), # int in range [1,5]
v3 = sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
cat(sprintf("# setDTthreads(%s)\n", th))
setDTthreads(th)
cat("# 0 row (first `[` call overhead):\n")
system.time(ans<-DT[0L])
cat("# 1 row:\n")
i = get_i(1L, nrow(DT))
system.time(ans<-DT[i])
cat("# 2 rows:\n")
i = get_i(2L, nrow(DT))
system.time(ans<-DT[i])
cat("# 5 rows:\n")
i = get_i(5L, nrow(DT))
system.time(ans<-DT[i])
cat("# 10% of rows:\n")
i = get_i(nrow(DT)*0.1, nrow(DT))
system.time(ans<-DT[i])
q("no")
Rscript dt-parallel-subset.R 1 1e6
Timings coming soon.
1th 1e7
20th 1e7
1th 1e8
20th 1e8
1th 1e9
20th 1e9
During the timings above I observed that a team of threads was started even for the 1, 2, and 5 row subsets. Still, it did not result in noticeable overhead: all subsets of 1, 2, and 5 rows took 0.000-0.001 s.
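Since a single `system.time` call only resolves to about a millisecond, one way to look closer at very small subsets is to repeat the call many times and average. The following is a minimal sketch, assuming a much smaller table than the ones used above so that it runs quickly; `time_per_subset` is a hypothetical helper, not part of data.table.

```r
library(data.table)

# Hypothetical helper (not part of data.table): average elapsed seconds
# per DT[i] call, measured over `reps` repetitions.
time_per_subset = function(DT, i, reps = 1000L) {
  elapsed = system.time(for (r in seq_len(reps)) DT[i])[["elapsed"]]
  elapsed / reps
}

set.seed(108)
# Assumption: a small 1e6-row table stands in for the 1e7-1e9 row tables above.
DT = data.table(id = sample(100L, 1e6, TRUE), v = runif(1e6))
i = sample(nrow(DT), 5L)  # 5 random row ids

for (th in c(1L, 2L)) {
  setDTthreads(th)
  cat(sprintf("threads=%d: %.3g s per 5-row subset\n",
              getDTthreads(), time_per_subset(DT, i)))
}
```

The absolute numbers depend on the machine; only the relative difference between thread settings is of interest here.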
The checks above each used a single subset operation. I see a noticeable difference when I loop over the subset operation.
library(data.table)
m = matrix(1L, nrow=1e8, ncol=10)
DT = as.data.table(m)
setDTthreads(20)
system.time(for (i in 1:1000) DT[i,])
# user system elapsed
# 4.210 0.000 0.229
setDTthreads(1)
system.time(for (i in 1:1000) DT[i,])
# user system elapsed
# 0.107 0.007 0.114
@mattdowle does this qualify for reopening?
PR #4484 closes this one. Using v1.12.8 to confirm Jan's result:
> m = matrix(1L, nrow=1e8, ncol=10)
> DT = as.data.table(m)
> setDTthreads(0)
> system.time(for (i in 1:1000) DT[i,])
user system elapsed
1.512 0.000 0.143
> setDTthreads(1)
> system.time(for (i in 1:1000) DT[i,])
user system elapsed
0.083 0.000 0.083
With #4484:
> setDTthreads(0)
> system.time(for (i in 1:1000) DT[i,])
user system elapsed
0.071 0.000 0.071
> setDTthreads(1)
> system.time(for (i in 1:1000) DT[i,])
user system elapsed
0.072 0.000 0.072
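The comparison above can be collected into one table per thread setting. User time is CPU time summed over all threads of the process, so a user time well above elapsed time (as in the 20-thread and `setDTthreads(0)` runs before the fix) suggests that threads are kept busy on every call. This is a rough sketch, assuming a smaller matrix than the 1e8-row one above; `bench_threads` is a hypothetical helper, not part of data.table.

```r
library(data.table)

# Hypothetical helper (not part of data.table): run the single-row subset
# loop under each thread setting and record user and elapsed seconds.
bench_threads = function(DT, threads, reps = 1000L) {
  old = getDTthreads()
  on.exit(setDTthreads(old))  # restore the caller's thread setting
  rbindlist(lapply(threads, function(th) {
    setDTthreads(th)
    tm = system.time(for (i in seq_len(reps)) DT[i, ])
    data.table(threads = getDTthreads(),
               user    = tm[["user.self"]],
               elapsed = tm[["elapsed"]])
  }))
}

# Assumption: 1e6 rows instead of 1e8 so the sketch runs quickly.
DT = as.data.table(matrix(1L, nrow = 1e6, ncol = 10))
print(bench_threads(DT, threads = c(1L, 2L)))
```

The `threads` column reports `getDTthreads()` rather than the requested value, because data.table may cap the request at what the machine allows.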
Matt commented :
data.table/src/subset.c
Lines 27 to 30 in 1847500