Description
Following recommended practice, I've begun switching from X[Y] to X[Y, on=...] so that the join conditions are clear in my code.
After making this change to some code using a 40 million row data.table, I was surprised by how much performance degraded. It appears that on=... makes no use of the existing keyed table structure, even when the list of columns solely and completely matches the key structure.
Here's an example:
library(data.table)
set.seed(1L)
DT = data.table(a = as.integer(runif(1e6L, 1, 10000)), b = 1:1e6L, key = 'a')
# key = 'a' already keys DT2, so no separate setkey() call is needed
DT2 = data.table(a = sample(DT$a, 100), key = 'a')
> microbenchmark::microbenchmark(key = DT[DT2], on = DT[DT2, on = key(DT)])
Unit: microseconds
 expr      min       lq     mean   median       uq      max neval
  key  901.234 1017.096 1271.474 1096.639 1374.944 3718.228   100
   on 4728.020 5489.904 5721.031 5643.663 5796.828 7635.809   100
> identical(DT[DT2], DT[DT2, on = key(DT)])
[1] TRUE
Neither changing to on = c(a = 'a') nor changing to on = key(DT2) had any effect on performance.
A roughly fivefold slowdown for semantically identical code seems a bit much. While I appreciate the flexibility that the new join system brings, I don't understand why it cannot gracefully fall back to the existing merge join when the full set of key columns is specified.
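To make the requested fallback concrete, here is a minimal user-level sketch of the idea. It is not how data.table implements joins; `join_on` is a hypothetical helper name introduced only for illustration. It assumes both tables are keyed, and it uses the keyed form X[Y] only when the requested columns are exactly the key of both tables.

```r
library(data.table)

# Hypothetical helper (not part of data.table): use the keyed join X[Y]
# when `cols` is exactly the key of both tables, otherwise use on=.
join_on <- function(X, Y, cols) {
  if (identical(key(X), cols) && identical(key(Y), cols)) {
    X[Y]              # keyed join, relies on the existing sort order
  } else {
    X[Y, on = cols]   # ad hoc join on the named columns
  }
}

# Small illustration with the same shape as the example above.
DT  <- data.table(a = c(1L, 2L, 2L, 3L), b = 1:4, key = "a")
DT2 <- data.table(a = c(2L, 3L), key = "a")
res <- join_on(DT, DT2, "a")
```

The join columns stay explicit at the call site, which was the reason for moving to on=... in the first place, while the keyed path is still taken when it applies.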
In case it matters...
data.table 1.9.7 IN DEVELOPMENT built 2016-08-24 11:53:54 UTC
For help type ?data.table or https://github.com/Rdatatable/data.table/wiki
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringi_1.1.1 magrittr_1.5 data.table_1.9.7
loaded via a namespace (and not attached):
[1] tools_3.3.1 memoise_1.0.0 digest_0.6.10