Windows 10 is faster with -fno-openmp than setDTthreads(1) on many repeated calls #4527
Still working on this - I tried the throttle-threads PR but it still did not help. What is an example where OpenMP should shine? Using 1.12.8:

library(data.table)
x = sample(1e7L)
setDTthreads(1L)
bench::mark(data.table:::forder(x),
            order(x),
            min_iterations = 10L)
## # A tibble: 2 x 13
##   expression               min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
##   <bch:expr>             <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>
## 1 data.table:::forder(x) 923ms  929ms      1.07    38.1MB    0.459     7     3
## 2 order(x)               354ms  355ms      2.79    38.1MB    1.19      7     3

setDTthreads(2L)
##   expression               min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
##   <bch:expr>             <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>
## 1 data.table:::forder(x) 614ms  657ms      1.50    38.1MB    0.376     8     2
## 2 order(x)               359ms  368ms      2.73    38.1MB    1.17      7     3
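To see where OpenMP should help, a quick sweep over thread counts on the same `forder` call can be run. This is a sketch using only the functions already shown above; absolute timings will vary by machine, and `data.table:::forder` is an internal function whose interface may change.

```r
library(data.table)

x <- sample(1e7L)

# Time the internal parallel sort under different thread settings.
# With input this large, more threads should reduce elapsed time
# unless per-call OpenMP overhead dominates.
for (n in c(1L, 2L, 4L)) {
  setDTthreads(n)
  cat(sprintf("threads = %d:\n", n))
  print(system.time(data.table:::forder(x)))
}
```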
And I just tried replacing
Could you please try this? Install with remotes::install_gitlab("jangorecki/data.table@omp-overhead"), then:

library(data.table)
setDTthreads(0L)
getDTthreads()
fcopy = data.table:::fcopy
x = sample.int(1e8L)
system.time(a<-fcopy(x, 1L)) ## schedule(static) num_threads(1)
system.time(a<-fcopy(x, 2L)) ## if (1<0) schedule(static) num_threads(getDTthreads())
system.time(a<-fcopy(x, 3L)) ## schedule(dynamic) num_threads(1)
system.time(a<-fcopy(x, 4L)) ## if (1<0) schedule(dynamic) num_threads(getDTthreads())

Timings on linux:
I copied your Makevars suggestions - it still compiled with
And just to double check:

system.time(for (i in 1:10) a<-fcopy(x, 1L))
system.time(for (i in 1:10) a<-fcopy(x, 2L))
I would say that there is no extra overhead on Windows caused by the way we escape OpenMP. Any degradation of performance must be caused by another factor. You can try running those last two calls yourself. On linux:
I just replaced
It took some time to figure out how to clone your gitlab repo. On Windows:
I agree for this use case that there is no appreciable difference. Here is a case where there is a difference. The implication is that while no one will be subsetting a row at a time, the ideal API for subsetting by group is affected:

library(data.table)
mat = matrix(0L, nrow = 1000L, ncol = 100L)
DT = as.data.table(mat)
DF = as.data.frame(mat)

## -fopenmp
setDTthreads(1L)
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
##    user  system elapsed
##    0.59    0.38    1.00
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
##    user  system elapsed
##    1.12    0.02    1.19

## -fno-openmp
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
##    user  system elapsed
##    0.19    0.01    0.22
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
##    user  system elapsed
##    1.01    0.00    1.03

edit: the same with larger data:

## -fopenmp
setDTthreads(1L)
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
##    user  system elapsed
##   19.96    4.00   24.56
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
##    user  system elapsed
##   12.09    0.00   12.21

## -fno-openmp
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
##    user  system elapsed
##   14.64    0.01   14.77
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
##    user  system elapsed
##   12.10    0.00   12.26
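Since the per-row loop pays the OpenMP region entry cost on every `DT[i]` call, one possible workaround is to materialize all one-row subsets in a single grouped call, entering `[.data.table` once instead of once per row. This is only an illustrative sketch, not the fix discussed in this thread; `copy(.SD)` is used because nested `.SD` objects otherwise carry an internal lock attribute.

```r
library(data.table)

DT <- as.data.table(matrix(0L, nrow = 1000L, ncol = 100L))

# One grouped call: group by a computed row index so each group's .SD
# is the one-row subset, yielding a list of 1000 single-row data.tables.
subsets <- DT[, .(sd = list(copy(.SD))), by = .(row = seq_len(nrow(DT)))]$sd

length(subsets)        # 1000
nrow(subsets[[1L]])    # 1
```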
I see it does have an impact then. This is on linux, with nrow scaled up to 1e5 and cols reduced to 10. So it seems to be a Windows-specific overhead.

library(data.table)
mat = matrix(0L, nrow = 1e5L, ncol = 10L)
DT = as.data.table(mat)
DF = as.data.frame(mat)

## -fopenmp
setDTthreads(1L)
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
#    user  system elapsed
#  11.864   0.286  12.150
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
#    user  system elapsed
#    6.82    0.00    6.82

## -fno-openmp
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
#    user  system elapsed
#  12.148   0.008  12.157
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
#    user  system elapsed
#   7.292   0.000   7.293
Great info @ColeMiller1, many thanks for investigating here. Seems like the new throttle works on Linux but not as well on Windows. So that's something that needs further work then, agreed. Can I clear up a few sentences on this one.

Why the "especially"? This issue is just about repeated calls, and repeated calls on small data, right?

Any of the examples in https://h2oai.github.io/db-benchmark/ show where OpenMP should shine. They are single calls on large data that take more than a few seconds, and in some cases minutes, to run. And in general data.table doesn't like doing things one row at a time, which is true in R in general, and Python too. There is always work to do to make it easier for users, but it's just not recommended practice in high-level languages like R. So yes, we do want to work on one-row-at-a-time benchmarks like this one, but I wonder how high it should be on the list.
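The "don't go one row at a time" advice can be made concrete with a toy comparison of a per-group loop against a single grouped call. This is an illustrative sketch only; the data here is far too small to show any OpenMP effect, it just shows the idiom.

```r
library(data.table)

DT <- data.table(g = rep(1:3, each = 4L), v = 1:12)

# Group-at-a-time: one DT[...] call (and one potential OpenMP region
# entry) per group.
res_loop <- sapply(unique(DT$g), function(k) sum(DT[g == k, v]))

# Idiomatic: a single call, with grouping handled internally.
res_by <- DT[, .(s = sum(v)), by = g]$s

identical(res_loop, res_by)  # TRUE; both are c(10L, 26L, 42L)
```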
I agree that single-row subsetting itself is not a large concern - the original example is related to #3735. But that's where I found another Windows user had slower subsets than Linux users for many repeated calls. Why should we care about this? Operations by group can be affected, especially when we subset by group. I hope my choice of words doesn't make us overlook that there are opportunities to make improvements here. FWIW - this may be centered around where the OpenMP loop is. While it is extremely buggy, putting the parallel region elsewhere changes the behavior.
Old but related topic: http://forum.openmp.org/forum/viewtopic.php?f=3&t=1722
Just hit a work issue where it was pretty natural to do rowwise operations. The basic idea is I need to collapse some columns into a metadata column.

I tried:

And it's indeed rather slow.
@MichaelChirico thanks for the example. Using 1.12.8:

library(data.table)
setDTthreads(1L)
NN = 1e4
DT = data.table(ID = 1:NN, V1 = rnorm(NN), V2 = rnorm(NN))
system.time(DT[ , metadata := lapply(seq_len(.N), function(ii) .SD[ii]), .SDcols = c('V1', 'V2')])
#>    user  system elapsed
#>    1.86    0.12    1.98
system.time(DT[ , metadata2 := .(.(copy(.SD))), by = ID, .SDcols = c('V1', 'V2')])
#>    user  system elapsed
#>    0.26    0.02    0.28
all.equal(DT$metadata, DT$metadata2)
#> [1] TRUE

I had to use copy(.SD) in the by-group version.
@MichaelChirico I just downloaded master - does not appear to be fixed. TBH, I'm not aware of an issue related to nesting .SD directly:

DT[, metadata3 := .(.(.SD)), by = ID, .SDcols = c("V1", "V2")]
attributes(DT$metadata3[[1L]])
## $row.names
## [1] 1
## $class
## [1] "data.table" "data.frame"
## $.internal.selfref
## <pointer: 0x05772498>
## $names
## [1] "V1" "V2"
## $.data.table.locked
## [1] TRUE
all.equal(DT$metadata2[[1L]], DT$metadata3[[1L]])
## [1] "Datasets has different number of (non-excluded) attributes: target 2, current 3"
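One possible workaround for the extra .data.table.locked attribute is to remove it by reference with data.table::setattr. This is a sketch, and it is an assumption (not confirmed in this thread) that stripping the lock attribute is safe for downstream use of the nested tables.

```r
library(data.table)

DT <- data.table(ID = 1:3, V1 = rnorm(3), V2 = rnorm(3))
DT[, metadata3 := .(.(.SD)), by = ID, .SDcols = c("V1", "V2")]

# Strip the internal lock attribute from each nested table, by reference.
for (sd in DT$metadata3) setattr(sd, ".data.table.locked", NULL)

attr(DT$metadata3[[1L]], ".data.table.locked")  # NULL
```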
This can be observed on Linux as well:
@ColeMiller1 could you confirm whether #4558 resolves this issue?
@jangorecki sorry for the delay. Yes! (I even like that I do not have to add anything extra.)

library(data.table)
mat = matrix(0L, nrow = 1e5L, ncol = 10L)
DT = as.data.table(mat)
DF = as.data.frame(mat)
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
#>    user  system elapsed
#>   13.42    0.43   14.15
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
#>    user  system elapsed
#>   12.66    0.03   12.97

The data.table subset is still a bit slower than the data.frame one. But that's small overall and could be addressed by a follow-up issue. The subsetting performance was the only place I noticed the OpenMP issue. Please feel free to close this when the PR is merged. Thanks for all of your help!
OpenMP support may cause issues with Windows performance, especially with many repeated calls.

Note, I would expect using

setDTthreads(1L)

would minimize any impacts to performance, but that does not appear to be the case on Windows 10.
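A minimal repro of the report, condensed from the examples later in the thread. Timings are machine-dependent; on an affected Windows 10 build, the same loop compiled with -fno-openmp was reported faster even though setDTthreads(1L) is in effect.

```r
library(data.table)

setDTthreads(1L)   # expected to neutralize OpenMP overhead
getDTthreads()     # confirm the setting took effect: 1

DT <- as.data.table(matrix(0L, nrow = 1000L, ncol = 100L))

# Many repeated small calls - the pattern where the Windows overhead shows up.
print(system.time(for (i in seq_len(nrow(DT))) DT[i]))
```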