X[Y, on = ...] significantly slower than X[Y] for the same join fields #1825

Closed
sz-cgt opened this issue Aug 24, 2016 · 4 comments

sz-cgt commented Aug 24, 2016

Following recommended practice, I've begun switching from X[Y] to X[Y, on = ...] so that the join conditions are explicit in my code.

After making this change to some code that uses a 40-million-row data.table, I was surprised by how much performance degraded. It appears that on = ... makes no use of the existing key, even when the list of columns exactly matches the key columns.

Here's an example:

library(data.table)
set.seed(1L)
DT = data.table(a = as.integer(runif(1e6L, 1, 10000)), b = 1:1e6L, key = 'a')
DT2 = data.table(a = sample(DT$a, 100), key = 'a')
setkey(DT2, a)
> microbenchmark::microbenchmark(key = DT[DT2], on_key = DT[DT2, on = key(DT)])
Unit: microseconds
   expr      min       lq     mean   median       uq      max neval
    key  901.234 1017.096 1271.474 1096.639 1374.944 3718.228   100
 on_key 4728.020 5489.904 5721.031 5643.663 5796.828 7635.809   100
> identical(DT[DT2], DT[DT2, on = key(DT)])
[1] TRUE

Neither changing to on = c(a = 'a') nor changing to on = key(DT2) had any effect on performance.
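For completeness, here is a sketch of the variants mentioned above, reusing DT and DT2 from the example (exact timings will vary, but all of the on= forms behaved like on = key(DT) for me):

# All on= specifications of the same join, benchmarked against the keyed form.
microbenchmark::microbenchmark(
  key       = DT[DT2],
  on_keyDT  = DT[DT2, on = key(DT)],
  on_named  = DT[DT2, on = c(a = 'a')],
  on_keyDT2 = DT[DT2, on = key(DT2)]
)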

A five-fold slowdown for semantically identical code seems a bit much. While I appreciate the flexibility that the new join system brings, I don't understand why it can't fall back to the existing keyed (merge) join when the full set of key columns is specified.
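One way to peek at what each form of the join actually does is verbose = TRUE (a sketch; the exact messages printed differ across data.table versions):

# Compare what data.table reports for the keyed join vs. the on= join.
invisible(DT[DT2, verbose = TRUE])
invisible(DT[DT2, on = key(DT), verbose = TRUE])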

In case it matters...
data.table 1.9.7 IN DEVELOPMENT built 2016-08-24 11:53:54 UTC
For help type ?data.table or https://github.com/Rdatatable/data.table/wiki
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way

sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] stringi_1.1.1 magrittr_1.5 data.table_1.9.7

loaded via a namespace (and not attached):
[1] tools_3.3.1 memoise_1.0.0 digest_0.6.10

@MichaelChirico (Member)

Probably related to #1232.

eantonya (Contributor) commented Aug 24, 2016

@sz-cgt That's not a good example. Come up with one where the timings are at least 1 second (instead of single-digit milliseconds) and then it'll be interesting.

There is (obviously) more processing done when more arguments are passed to [.data.table, so there is more overhead, which is probably what you're measuring.
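For example, one way to separate fixed argument-handling overhead in [.data.table from work that scales with the data (a sketch; X and Y here are just larger versions of DT and DT2, and the 4e7-row size is arbitrary):

# If the key/on gap is fixed per-call overhead, it should stay roughly constant
# in absolute terms as nrow(X) grows; if it grows with the data, something is
# being recomputed on every join.
library(data.table)
set.seed(1L)
X = data.table(a = as.integer(runif(4e7L, 1, 10000)), b = seq_len(4e7L), key = 'a')
Y = data.table(a = sample(X$a, 100L), key = 'a')
microbenchmark::microbenchmark(
  key = X[Y],
  on  = X[Y, on = key(X)],
  times = 10L
)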

arunsrinivasan (Member) commented Aug 24, 2016

Nice catch. When the columns to join on are the key columns of a data.table, the order vector need not be computed; similarly, when a secondary index already exists, it can be reused. Until now, though, that logic was only implemented for secondary indices, so joins on keyed columns recomputed the order vector. The first step of computing the order is checking whether the vector is already sorted; in this case that check finds it is sorted, and that check is the extra time you see in your benchmark.
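To illustrate the distinction in user-facing terms (a sketch, not the internal code path):

library(data.table)
set.seed(1L)
DT = data.table(a = as.integer(runif(1e6L, 1, 10000)), b = 1:1e6L)

setkey(DT, a)    # key: the table is physically reordered by 'a'
key(DT)          # "a"

setindex(DT, b)  # secondary index: an order vector is stored; rows don't move
indices(DT)      # "b"

# DT[i, on = 'b'] can reuse the stored index; with this fix, DT[i, on = 'a']
# likewise notices that 'a' is already the key and skips recomputing the order.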

@arunsrinivasan (Member)

Now I get:

require(data.table)
set.seed(1L)
DT = data.table(a = as.integer(runif(40e6L, 1, 10000)), b = 1:1e6L, key = 'a')
DT2 = data.table(a = sample(DT$a, 100), key = 'a')
setkey(DT2, a)

Before:

system.time(DT[DT2]) # 0.005s
system.time(DT[DT2, on=key(DT)]) # 0.170s

Now:

system.time(DT[DT2])
#    user  system elapsed 
#   0.008   0.000   0.003 
system.time(DT[DT2, on=key(DT)])
#    user  system elapsed 
#   0.010   0.000   0.004 

identical(DT[DT2], DT[DT2, on=key(DT)]) # [1] TRUE

Fix on its way.

@arunsrinivasan arunsrinivasan added this to the v1.9.8 milestone Aug 24, 2016
@arunsrinivasan arunsrinivasan self-assigned this Aug 24, 2016