Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

var %in% vec bmerge is slow #2366

Open
franknarf1 opened this issue Sep 16, 2017 · 4 comments
Open

var %in% vec bmerge is slow #2366

franknarf1 opened this issue Sep 16, 2017 · 4 comments

Comments

@franknarf1
Copy link
Contributor

I was looking at the benchmark in ycw's answer on SO, and ran into ...

library(data.table)
set.seed(1234)
n=1e8
DT <- data.table(x = sample(c("A", "B", "C", NA), n, replace=TRUE))
setindex(DT, x)

microbenchmark(times = 10,
  ycwj = {
    DT[.("A"), on=.(x), nomatch = 0 ]
    DT[.("B"), on=.(x), nomatch = 0 ]
}, onej = {
    DT[.(c("A", "B")), on=.(x), nomatch = 0]
}, ycw = {
    DT[x == "A"]
    DT[x == "B"]
}, one = {
    DT[x %in% c("A", "B")]
})

Unit: milliseconds
 expr       min        lq      mean    median        uq      max neval
 ycwj  714.1907  720.9873  804.6031  725.2130  926.8572 1050.291    10
 onej  719.9388  724.2073  861.0850  823.0981  954.4869 1204.918    10
  ycw  882.3501  887.7069  998.9103  893.8587 1140.7628 1215.155    10
  one 2586.7012 2623.0054 2771.0655 2782.0509 2854.6969 2994.438    10

I would expect one join to be better than two, but actually it's worse (onej vs ycwj).

The bigger issues is that x %in% c("A", "B") is so slow, considering it uses the index for a bmerge just like the others do.

Maybe I'm missing something here..?

@franknarf1 franknarf1 changed the title var %in% vec join is slow var %in% vec bmerge is slow Sep 16, 2017
@MarkusBonsch
Copy link
Contributor

MarkusBonsch commented Oct 31, 2017

I have executed some tests and come to the following conclusion why DT[x %in% c("A", "B")] is so slow:

it has nothing to do with bmerge.
Instead, there is an additional expensive call to fsort to restore original order in data.table.R.

Details:
if you compare the output of
"one" (DT[x %in% c("A", "B")])
to
"two" (DT[.(c("A", "B")), on=.(x), nomatch = 0])
you notice a sligth but important difference: "two" is ordered in x while "one" is not.
The policy for subsets like "one" is that they don't change order. Since calculating the subsetting index with bmerge and an existing index implies changing the order of the data.table, this has to be reverted in case of "one" in line 470 of data.table.R : i = fsort(xo[i], internal=TRUE). This is the culprit as the code below shows:

library(data.table)
library(microbenchmark)
set.seed(1234)
n=1e8
DT <- data.table(x = sample(c("A", "B", "C", NA), n, replace=TRUE))
setindex(DT, x)

## reproduce what's happening on line 470 of data.table.R

xo <- attr(attr(DT, "index"), "__x") 
i <- which(DT$x[xo] %in% c("A", "B")) ## this is the i after bmerge and vecseq

## now comes the sorting:
microbenchmark(sorting = inew <- fsort(xo[i], internal=TRUE), times = 10)

# Unit: seconds
# expr      min       lq     mean   median       uq      max neval
# sorting 2.392226 2.396727 2.498919 2.480043 2.541278 2.710429    10

## Proof that this sorting is the difference:
print(identical(DT[xo[i]], DT[.(c("A", "B")), on=.(x), nomatch = 0]))
print(identical(DT[inew], DT[x %in% c("A", "B")]))

What can be done about it?

  1. Live with it.
  2. Drop the policy of restoring the original order. Will brake existing code in a nasty way.
  3. Implement faster fsort for integer. Currently it only supports double and falls back to forderv for integer. The speed improvement is not necessarily massive.

Cheers,
Markus

@MichaelChirico
Copy link
Member

Closely related: #1880

@MarkusBonsch
Copy link
Contributor

I have done more testting and found a possible solution:
replacing fsort(xo[i], internal = TRUE) with as.integer(fsort(as.numeric(xo[i]), internal = TRUE)) adds support for the fast fsort for numeric values until the c-code is updated to support integers. The benchmark below for different subset sizes shows that this creates overhead for small subsets up to O(1e5) but becomes very beneficial starting from O(1e6).

## testing fastest way to reorder result set
library(data.table)
n <- 1e7L
subset <- sample(1L:(2L * n), n)

print(identical(fsort(subset, internal = TRUE), sort(subset)))
print(identical(fsort(subset, internal = TRUE), as.integer(fsort(as.numeric(subset)))))
test1e2 <- microbenchmark::microbenchmark(fsort = o <- fsort(subset[1:1e2], internal = TRUE),
                                       sort  = o <- sort(subset[1:1e2]),
                                       fsortOptimized = o <- as.integer(fsort(as.numeric(subset[1:1e2]))),
                                       test  = o <- as.numeric(subset[1:1e2]), times = 10L, unit = "ms")

test1e4 <- microbenchmark::microbenchmark(fsort = o <- fsort(subset[1:1e4], internal = TRUE),
                                          sort  = o <- sort(subset[1:1e4]),
                                          fsortOptimized = o <- as.integer(fsort(as.numeric(subset[1:1e4]))),
                                          test  = o <- as.numeric(subset[1:1e4]), times = 10L, unit = "ms")

test1e5 <- microbenchmark::microbenchmark(fsort = o <- fsort(subset[1:1e5], internal = TRUE),
                                          sort  = o <- sort(subset[1:1e5]),
                                          fsortOptimized = o <- as.integer(fsort(as.numeric(subset[1:1e5]))),
                                          test  = o <- as.numeric(subset[1:1e5]), times = 10L, unit = "ms")

test1e6 <- microbenchmark::microbenchmark(fsort = o <- fsort(subset[1:1e6], internal = TRUE),
                                          sort  = o <- sort(subset[1:1e6]),
                                          fsortOptimized = o <- as.integer(fsort(as.numeric(subset[1:1e6]))),
                                          test  = o <- as.numeric(subset[1:1e6]), times = 10L, unit = "ms")

test1e7 <- microbenchmark::microbenchmark(fsort = o <- fsort(subset, internal = TRUE),
                                          sort  = o <- sort(subset),
                                          fsortOptimized = o <- as.integer(fsort(as.numeric(subset))),
                                          test  = o <- as.numeric(subset), times = 10L, unit = "ms")

print(test1e2)
#Unit: milliseconds
#expr      min       lq      mean   median       uq      max neval
#fsort 0.027817 0.034053 0.0434724 0.043416 0.052803 0.062205    10
#fsortOptimized 4.468232 4.556770 4.6455729 4.574293 4.668773 5.034856    10
print(test1e4)
#Unit: milliseconds
#expr      min       lq       mean    median        uq       max neval
#fsort 0.487048 0.534597  0.6750671 0.6576905  0.819855  0.838753    10
#fsortOptimized 8.632494 8.670641 11.5050448 8.9400835 13.061439 22.452558    10
print(test1e5)
#Unit: milliseconds
#expr       min        lq      mean     median        uq       max neval
#fsort  4.659642  4.797182  5.078745  4.9919770  5.462983  5.684060    10
#fsortOptimized 12.249606 12.338187 12.797941 12.6789210 13.036106 14.248122    10
print(test1e6)
#Unit: milliseconds
#expr      min       lq     mean   median       uq      max neval
#fsort 75.28854 76.16471 78.37525 77.90051 78.70789 84.85893    10
#fsortOptimized 48.97353 49.18598 54.32664 51.95766 56.33747 71.48857    10
print(test1e7)
#Unit: milliseconds
#expr        min         lq       mean     median         uq       max neval
#fsort 1214.03971 1230.94651 1261.11665 1247.76575 1299.27924 1327.0760    10
#fsortOptimized  316.03810  333.81516  380.56737  380.48449  426.75788  437.0680    10

@jangorecki
Copy link
Member

jangorecki commented Feb 4, 2019

I re-run @MarkusBonsch benchmark as forder is now parallel so fsort optimization trick is less relevant.

20 threads

Unit: milliseconds                                      
           expr      min       lq      mean   median       uq      max neval
          fsort 0.028527 0.029616 0.0365553 0.035491 0.038218 0.058170    10
           sort 0.060433 0.067219 0.0773722 0.073900 0.080525 0.125251    10
 fsortOptimized 3.086451 3.089331 3.1265746 3.103557 3.128582 3.320715    10
           test 0.002319 0.002864 0.0033516 0.003616 0.003644 0.004066    10
Unit: milliseconds
           expr      min       lq      mean    median       uq      max neval
          fsort 0.717504 0.718875 0.7367064 0.7281595 0.729110 0.838689    10
           sort 0.360127 0.463003 0.5196514 0.5579330 0.562491 0.621206    10
 fsortOptimized 6.724597 6.736766 7.0652790 6.8471155 6.970747 8.751183    10
           test 0.039430 0.050797 0.0497499 0.0513795 0.051937 0.055393    10
Unit: milliseconds
           expr       min        lq       mean     median        uq       max neval
          fsort  1.539908  1.704812  1.7803572  1.7438665  1.824405  2.049489    10
           sort  3.502955  3.524958  4.0049440  4.1796465  4.366027  4.450751    10
 fsortOptimized 32.466168 32.776277 34.0092139 33.7020540 34.257354 38.381815    10
           test  0.373817  0.482562  0.5702046  0.4838425  0.523677  1.031146    10
Unit: milliseconds # 1e6
           expr       min        lq      mean   median        uq       max neval
          fsort 13.933737 14.253005 15.306668 14.81691 16.204900 17.548676    10
           sort 36.267029 36.384058 37.785216 36.78868 38.768515 41.494431    10
 fsortOptimized 41.415792 42.147571 44.751146 43.97727 48.573800 49.357947    10
           test  3.473465  3.494174  4.038444  3.64550  3.814589  6.818629    10
Unit: milliseconds # 1e7
           expr       min       lq      mean    median        uq      max neval
          fsort 528.84492 535.4213 560.78308 549.29176 556.14988 655.1271    10
           sort 855.59190 858.0887 866.82243 861.99417 872.93246 893.9491    10
 fsortOptimized 149.86492 152.6287 166.48977 164.41783 171.63586 192.1721    10
           test  31.02696  31.3131  39.04847  36.66294  38.53849  56.0895    10

1 thread

Unit: milliseconds
           expr      min       lq      mean    median       uq      max neval
          fsort 0.013017 0.013698 0.0179605 0.0167955 0.018927 0.033556    10
           sort 0.038519 0.040876 0.0491537 0.0450700 0.049961 0.090209    10
 fsortOptimized 2.051662 2.056465 2.0625215 2.0577070 2.061673 2.104726    10
           test 0.001334 0.001526 0.0020999 0.0016945 0.002196 0.005210    10
Unit: milliseconds
           expr      min       lq      mean    median       uq      max neval
          fsort 0.460293 0.467769 0.4725897 0.4721560 0.477900 0.486422    10
           sort 0.368366 0.369582 0.3765206 0.3710735 0.380312 0.407713    10
 fsortOptimized 2.699074 2.703183 2.7618907 2.7389220 2.802619 2.876459    10
           test 0.035237 0.036866 0.0385392 0.0370980 0.037768 0.051653    10
Unit: milliseconds
           expr      min       lq     mean    median       uq      max neval
          fsort 3.918633 4.056194 4.128784 4.1676795 4.171484 4.248074    10
           sort 3.709687 3.933179 4.059518 4.0072045 4.139392 4.727774    10
 fsortOptimized 6.222069 6.697995 6.923315 6.9548555 7.079603 7.711000    10
           test 0.336627 0.410020 0.633216 0.7360525 0.770565 0.856267    10
Unit: milliseconds # 1e6
           expr       min        lq      mean    median        uq      max neval
          fsort 43.017098 43.029300 43.830707 43.236085 43.913679 47.12445    10
           sort 36.075789 36.206072 36.773577 36.544943 36.798483 39.15622    10
 fsortOptimized 43.844805 43.901243 44.801943 44.534576 45.238849 47.16886    10
           test  3.485052  3.536703  4.321637  3.664668  4.753521  7.47418    10
Unit: milliseconds # 1e7
           expr       min       lq     mean    median        uq       max neval
          fsort 793.86184 800.4241 807.0938 804.56981 816.53160 826.68451    10
           sort 829.67557 845.8225 848.2392 849.80098 854.19339 857.33497    10
 fsortOptimized 386.63043 409.9950 419.5962 426.37757 432.84981 433.26807    10
           test  16.29353  33.0105  39.3342  36.08008  54.40421  57.91515    10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants