Windows 10 is faster with -fno-openmp than setDTthreads(1) on many repeated calls #4527
Still working on this - I tried the throttle-threads PR but it still did not help. What is an example where OpenMP should shine? Using 1.12.8:

library(data.table)
x = sample(1e7L)
setDTthreads(1L)
bench::mark(data.table:::forder(x),
            order(x),
            min_iterations = 10L)
## # A tibble: 2 x 13
##   expression               min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
##   <bch:expr>             <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>
## 1 data.table:::forder(x) 923ms  929ms      1.07    38.1MB    0.459     7     3
## 2 order(x)               354ms  355ms      2.79    38.1MB    1.19      7     3

setDTthreads(2L)
##   expression               min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
##   <bch:expr>             <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>
## 1 data.table:::forder(x) 614ms  657ms      1.50    38.1MB    0.376     8     2
## 2 order(x)               359ms  368ms      2.73    38.1MB    1.17      7     3
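To see where OpenMP should help, a quick sweep over thread counts on the same `forder` call can be run. This is a sketch using only the functions already shown above; absolute timings will vary by machine, and `data.table:::forder` is an internal function whose interface may change.

```r
library(data.table)

x <- sample(1e7L)

# Time the internal parallel sort under different thread settings.
# With input this large, more threads should reduce elapsed time
# unless per-call OpenMP overhead dominates.
for (n in c(1L, 2L, 4L)) {
  setDTthreads(n)
  cat(sprintf("threads = %d:\n", n))
  print(system.time(data.table:::forder(x)))
}
```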
And I just tried replacing
Could you please try this? Install with remotes::install_gitlab("jangorecki/data.table@omp-overhead"), then:

library(data.table)
setDTthreads(0L)
getDTthreads()
fcopy = data.table:::fcopy
x = sample.int(1e8L)
system.time(a<-fcopy(x, 1L)) ## schedule(static) num_threads(1)
system.time(a<-fcopy(x, 2L)) ## if (1<0) schedule(static) num_threads(getDTthreads())
system.time(a<-fcopy(x, 3L)) ## schedule(dynamic) num_threads(1)
system.time(a<-fcopy(x, 4L)) ## if (1<0) schedule(dynamic) num_threads(getDTthreads())

Timings on linux:
I copied your Makevars suggestions - it still compiled with
And just to double check:

system.time(for (i in 1:10) a<-fcopy(x, 1L))
system.time(for (i in 1:10) a<-fcopy(x, 2L))
I would say that there is no extra overhead on Windows caused by the way we escape OpenMP. Any degradation of performance must be caused by another factor. You can try running those last two calls yourself. On linux:
I just replaced
It took some time to figure out how to clone your gitlab repo. On Windows:
I agree for this use case that there is no appreciable difference. Here is a case where there is a difference. The implication is that while no one will be subsetting a row at a time, the ideal API for subsetting by group is affected:

library(data.table)
mat = matrix(0L, nrow = 1000L, ncol = 100L)
DT = as.data.table(mat)
DF = as.data.frame(mat)

## -fopenmp
setDTthreads(1L)
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
##    user  system elapsed
##    0.59    0.38    1.00
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
##    user  system elapsed
##    1.12    0.02    1.19

## -fno-openmp
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
##    user  system elapsed
##    0.19    0.01    0.22
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
##    user  system elapsed
##    1.01    0.00    1.03

edit: the same with larger data:

## -fopenmp
setDTthreads(1L)
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
##    user  system elapsed
##   19.96    4.00   24.56
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
##    user  system elapsed
##   12.09    0.00   12.21

## -fno-openmp
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
##    user  system elapsed
##   14.64    0.01   14.77
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
##    user  system elapsed
##   12.10    0.00   12.26
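Since the per-row loop pays the OpenMP region entry cost on every `DT[i]` call, one possible workaround is to materialize all one-row subsets in a single grouped call, entering `[.data.table` once instead of once per row. This is only an illustrative sketch, not the fix discussed in this thread; `copy(.SD)` is used because nested `.SD` objects otherwise carry an internal lock attribute.

```r
library(data.table)

DT <- as.data.table(matrix(0L, nrow = 1000L, ncol = 100L))

# One grouped call: group by a computed row index so each group's .SD
# is the one-row subset, yielding a list of 1000 single-row data.tables.
subsets <- DT[, .(sd = list(copy(.SD))), by = .(row = seq_len(nrow(DT)))]$sd

length(subsets)        # 1000
nrow(subsets[[1L]])    # 1
```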
I see it does have an impact then. This is on linux, with nrow scaled up to 1e5 and cols reduced to 10. So it seems to be a Windows-specific overhead.

library(data.table)
mat = matrix(0L, nrow = 1e5L, ncol = 10L)
DT = as.data.table(mat)
DF = as.data.frame(mat)

## -fopenmp
setDTthreads(1L)
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
#    user  system elapsed
#  11.864   0.286  12.150
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
#    user  system elapsed
#    6.82    0.00    6.82

## -fno-openmp
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
#    user  system elapsed
#  12.148   0.008  12.157
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
#    user  system elapsed
#   7.292   0.000   7.293
Great info @ColeMiller1, many thanks for investigating here. Seems like the new throttle works on Linux but not as well on Windows. So that's something that needs further work then, agreed. Can I clear up a few sentences on this one.

Why the "especially"? This issue is just about repeated calls, and repeated calls on small data, right?

Any of the examples in https://h2oai.github.io/db-benchmark/ show where OpenMP should shine. They are single calls on large data that take more than a few seconds, and in some cases minutes, to run. And in general data.table doesn't like doing things one row at a time, which is true in R in general, and Python too. There is always work to do to make it easier for users, but it's just not recommended practice in high-level languages like R. So yes, we do want to work on one-row-at-a-time benchmarks like this one, but I wonder how high it should be on the list.
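The "don't go one row at a time" advice can be made concrete with a toy comparison of a per-group loop against a single grouped call. This is an illustrative sketch only; the data here is far too small to show any OpenMP effect, it just shows the idiom.

```r
library(data.table)

DT <- data.table(g = rep(1:3, each = 4L), v = 1:12)

# Group-at-a-time: one DT[...] call (and one potential OpenMP region
# entry) per group.
res_loop <- sapply(unique(DT$g), function(k) sum(DT[g == k, v]))

# Idiomatic: a single call, with grouping handled internally.
res_by <- DT[, .(s = sum(v)), by = g]$s

identical(res_loop, res_by)  # TRUE; both are c(10L, 26L, 42L)
```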
I agree that single-row subsetting itself is not a large concern - the original example is related to #3735. But that's where I found another Windows user had slower subsets than Linux users for many repeated calls. Why should we care about this? Operations by group can be affected, especially when we subset by group. I hope my choice of words doesn't make us overlook that there are opportunities to make improvements here. FWIW - this may be centered around where the OpenMP loop is. While it is extremely buggy, putting the parallel region elsewhere changes the behavior.
Old but related topic: http://forum.openmp.org/forum/viewtopic.php?f=3&t=1722
Just hit a work issue where it was pretty natural to do rowwise operations. The basic idea is I need to collapse some columns into a metadata column.

I tried:

And it's indeed rather slow.
@MichaelChirico thanks for the example. Using 1.12.8:

library(data.table)
setDTthreads(1L)
NN = 1e4
DT = data.table(ID = 1:NN, V1 = rnorm(NN), V2 = rnorm(NN))
system.time(DT[ , metadata := lapply(seq_len(.N), function(ii) .SD[ii]), .SDcols = c('V1', 'V2')])
#>    user  system elapsed
#>    1.86    0.12    1.98
system.time(DT[ , metadata2 := .(.(copy(.SD))), by = ID, .SDcols = c('V1', 'V2')])
#>    user  system elapsed
#>    0.26    0.02    0.28
all.equal(DT$metadata, DT$metadata2)
#> [1] TRUE

I had to use copy(.SD) in the by-group version.
@MichaelChirico I just downloaded master - does not appear to be fixed. TBH, I'm not aware of an issue related to nesting .SD directly:

DT[, metadata3 := .(.(.SD)), by = ID, .SDcols = c("V1", "V2")]
attributes(DT$metadata3[[1L]])
## $row.names
## [1] 1
## $class
## [1] "data.table" "data.frame"
## $.internal.selfref
## <pointer: 0x05772498>
## $names
## [1] "V1" "V2"
## $.data.table.locked
## [1] TRUE
all.equal(DT$metadata2[[1L]], DT$metadata3[[1L]])
## [1] "Datasets has different number of (non-excluded) attributes: target 2, current 3"
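One possible workaround for the extra .data.table.locked attribute is to remove it by reference with data.table::setattr. This is a sketch, and it is an assumption (not confirmed in this thread) that stripping the lock attribute is safe for downstream use of the nested tables.

```r
library(data.table)

DT <- data.table(ID = 1:3, V1 = rnorm(3), V2 = rnorm(3))
DT[, metadata3 := .(.(.SD)), by = ID, .SDcols = c("V1", "V2")]

# Strip the internal lock attribute from each nested table, by reference.
for (sd in DT$metadata3) setattr(sd, ".data.table.locked", NULL)

attr(DT$metadata3[[1L]], ".data.table.locked")  # NULL
```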
This can be observed on Linux as well:
@ColeMiller1 could you confirm whether #4558 resolves this issue?
@jangorecki sorry for the delay. Yes! (I even like that I do not have to add anything extra.)

library(data.table)
mat = matrix(0L, nrow = 1e5L, ncol = 10L)
DT = as.data.table(mat)
DF = as.data.frame(mat)
system.time(for (i in seq_len(nrow(DT))) {DT[i]})
#>    user  system elapsed
#>   13.42    0.43   14.15
system.time(for (i in seq_len(nrow(DF))) {DF[i,]})
#>    user  system elapsed
#>   12.66    0.03   12.97

The data.table subset is still a bit slower than the data.frame one. But that's small overall and could be addressed by a follow-up issue. The subsetting performance was the only place I noticed the OpenMP issue. Please feel free to close this when the PR is merged. Thanks for all of your help!
OpenMP support may cause issues with Windows performance, especially with many repeated calls.

Note, I would expect using

setDTthreads(1L)

would minimize any impacts to performance, but that does not appear to be the case on Windows 10.
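A minimal repro of the report, condensed from the examples later in the thread. Timings are machine-dependent; on an affected Windows 10 build, the same loop compiled with -fno-openmp was reported faster even though setDTthreads(1L) is in effect.

```r
library(data.table)

setDTthreads(1L)   # expected to neutralize OpenMP overhead
getDTthreads()     # confirm the setting took effect: 1

DT <- as.data.table(matrix(0L, nrow = 1000L, ncol = 100L))

# Many repeated small calls - the pattern where the Windows overhead shows up.
print(system.time(for (i in seq_len(nrow(DT))) DT[i]))
```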