Selecting from data.table by row is very slow #3735
The main reason is not about using row numbers to select the rows or not. It's that the loop invokes data.table's subsetting call too many times: data.table is fast thanks to internal optimization, which comes with a fixed cost on every call. If a loop over all the rows is unavoidable, I suggest you use a per-row mapping such as purrr::pmap() instead:

df <- data.frame(v1 = runif(1e3), v2 = runif(1e3))
cal <- function(row) {
row$v1 + 1
}
res1 <- res2 <- res3 <- res4 <- double(nrow(df))
t <- proc.time()
for (r in 1:nrow(df)) {
res1[r] <- cal(df[r, ])
}
data.table::timetaken(t)
#> [1] "0.110s elapsed (0.090s cpu)"
dt <- data.table::as.data.table(df)
t <- proc.time()
for (r in 1:nrow(dt)) {
res2[r] <- cal(dt[r, ])
}
data.table::timetaken(t)
#> [1] "0.510s elapsed (0.470s cpu)"
t <- proc.time()
res3 <- purrr::pmap_dbl(dt, function(...) {
cal(list(...))
})
data.table::timetaken(t)
#> [1] "0.030s elapsed (0.010s cpu)"
all.equal(res2, res1)
#> [1] TRUE
all.equal(res3, res1)
#> [1] TRUE

Created on 2019-07-31 by the reprex package (v0.2.1)
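A minimal sketch (not part of the original comment, added for illustration) that isolates this fixed per-call cost: the same single-row subset repeated many times, so the measured difference is pure call overhead rather than data size.

library(data.table)
df <- data.frame(v1 = runif(1e3), v2 = runif(1e3))
dt <- as.data.table(df)
system.time(for (i in 1:1e4) df[1L, ])  # data.frame: cheap per call
system.time(for (i in 1:1e4) dt[1L, ])  # data.table: noticeably more overhead per call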
Confirming what @shrektan wrote. Anyway, I think we should be able to speed up such things pretty easily.
When selecting a single row by its integer index it makes sense to switch to single-threaded mode, so setting the thread count with setDTthreads(1L) helps, as the benchmark below shows:

library(data.table)
set.seed(108)
n = 1e5
df = data.frame(v1 = runif(n), v2 = runif(n))
dt1 = data.table(v1 = runif(n), v2 = runif(n))
dt2 = data.table(v1 = runif(n), v2 = runif(n))
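# frow(): take rows by integer index through data.table's internal CsubsetDT routine,
# bypassing the overhead of a full `[.data.table` call; safe=TRUE adds a bounds check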
frow = function(x, irows, safe=FALSE) {
stopifnot(is.data.table(x), is.integer(irows), length(irows)>0L, is.logical(safe), length(safe)==1L, !is.na(safe))
if (safe) stopifnot(all(between(irows, 1L, nrow(x))))
.Call(data.table:::CsubsetDT, x, irows, seq_along(x))
}
do = function(row) row[["v1"]]+1
system.time(for (r in 1:n) do(df[r, ]))
# user system elapsed
# 3.693 0.003 3.697
setDTthreads(4L)
system.time(for (r in 1:n) do(dt1[r, ]))
# user system elapsed
# 73.497 0.299 19.205
system.time(for (r in 1:n) do(frow(dt2, r)))
# user system elapsed
# 21.125 0.128 5.488
system.time(for (r in 1:n) do(frow(dt2, r, safe=TRUE)))
# user system elapsed
# 28.016 0.179 7.294
setDTthreads(1L)
system.time(for (r in 1:n) do(dt1[r, ]))
# user system elapsed
# 12.619 0.128 12.749
system.time(for (r in 1:n) do(frow(dt2, r)))
# user system elapsed
# 3.538 0.040 3.579
system.time(for (r in 1:n) do(frow(dt2, r, safe=TRUE)))
# user system elapsed
# 4.923 0.088 5.012

This could be handled internally and transparently, but it requires a bit of a rewrite.
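In the meantime, a minimal sketch (an assumption, not from the thread) of the manual workaround: drop to one thread for the duration of a row-by-row loop and restore the previous setting afterwards. with_single_thread is a hypothetical helper name, not a data.table function.

with_single_thread = function(expr) {
  old = getDTthreads()                  # remember the current setting
  setDTthreads(1L)                      # single-row subsets run faster single-threaded
  on.exit(setDTthreads(old), add = TRUE)
  force(expr)                           # evaluate the loop, then restore threads on exit
}
res = with_single_thread(
  vapply(seq_len(nrow(dt1)), function(r) do(dt1[r, ]), numeric(1))
)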
Some progress towards this issue has been made in #4484, but the overhead of …
Just promoting the idea of using by = 1:nrow(dt) (res4 in the code below). Also, @chnynf, are you on Windows? Your high system.time numbers reflect my experience on Windows.

library(data.table) ## 1.12.8
setDTthreads(1L)
df <- data.frame(v1 = runif(1e3), v2 = runif(1e3))
cal <- function(row) row$v1 + 1
res1 <- res2 <- res3 <- res4 <- double(nrow(df))
t <- proc.time()
for (r in 1:nrow(df)) {
res1[r] <- cal(df[r, ])
}
data.table::timetaken(t)
#> [1] "0.050s elapsed (0.030s cpu)"
dt <- data.table::as.data.table(df)
t <- proc.time()
for (r in 1:nrow(dt)) {
res2[r] <- cal(dt[r, ])
}
data.table::timetaken(t)
#> [1] "0.240s elapsed (0.210s cpu)"
t <- proc.time()
res3 <- purrr::pmap_dbl(dt, function(...) {
cal(list(...))
})
data.table::timetaken(t)
#> [1] "0.060s elapsed (0.040s cpu)"
t <- proc.time()
res4 <- dt[, cal(.SD), by = 1:nrow(dt)]$V1
data.table::timetaken(t)
#> [1] "0.010s elapsed (0.000s cpu)"
all.equal(res2, res1)
#> [1] TRUE
all.equal(res3, res1)
#> [1] TRUE
all.equal(res4, res1)
#> [1] TRUE
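A small follow-up sketch (an assumption, not part of the original comment): the same by = 1:nrow(dt) pattern also works when the per-row function returns several values, since each element of the returned list becomes a column. cal2 is a hypothetical example function.

cal2 <- function(row) list(sum = row$v1 + row$v2, diff = row$v1 - row$v2)
out <- dt[, cal2(.SD), by = 1:nrow(dt)]  # one group per row; list elements become columns
head(out)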
Yes, the test was on Windows. I tried your approach on my Windows machine and it is much faster. Thank you guys for working on this!
I have a similar problem. In this code:

library(data.table)
parameters <- list(types = c(p1 = "r", p2 = "r", p3 = "r", dummy = "c"),
digits = 4)
n <- 10000
newConfigurations <- data.table(p1 = runif(n), p2 = runif(n), p3 = runif(n),
dummy = sample(c("d1", "d2"), n, replace=TRUE))
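# Rescale the real-valued parameters of one configuration so they sum to 1,
# rounding all but the first to the requested number of digits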
repair_sum2one <- function(configuration, parameters)
{
isreal <- names(which(parameters$types[colnames(configuration)] == "r"))
digits <- parameters$digits[isreal]
c_real <- unlist(configuration[isreal])
c_real <- c_real / sum(c_real)
c_real[-1] <- round(c_real[-1], digits[-1])
c_real[1] <- 1 - sum(c_real[-1])
configuration[isreal] <- c_real
return(configuration)
}
j <- colnames(newConfigurations)
for (i in seq_len(nrow(newConfigurations)))
  set(newConfigurations, i, j = j, value = repair_sum2one(as.data.frame(newConfigurations[i]), parameters))

More than half the time is spent in …
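A minimal sketch (an assumption, not part of the original comment) of one way to reduce that cost: take a single plain data.frame copy before the loop, do the per-row reads on it with `[.data.frame`, and keep set() only for the in-place writes.

newConfigurationsDF <- as.data.frame(newConfigurations)  # one conversion, outside the loop
for (i in seq_len(nrow(newConfigurations))) {
  repaired <- repair_sum2one(newConfigurationsDF[i, ], parameters)
  set(newConfigurations, i, j = j, value = as.list(repaired))
}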
I'm working on an R project that involves applying fairly complicated functions across a data.table or data.frame by rows.
In cases where vectorizing is not a good option, one might need to loop through the rows, and that's when I realized that selecting by row number from a data.table is actually much slower than from a data.frame.
I guess selecting by row number is not a recommended practice for data.table? Or would the team be interested in looking into this and optimizing the performance?
I have more details about my test here.