-
Notifications
You must be signed in to change notification settings - Fork 977
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
X[Y, on = ...] significantly slower than X[Y] for the same join fields #1825
Comments
Probably related to #1232. |
@sz-cgt That's not a good example. Come up with one where timings are at least 1 second (instead of single digit millis) and then it'll be interesting. There is (obviously) more processing that gets done with more arguments in |
Nice catch. When columns to join on are key columns of a data.table, the order vector need not be computed. Similarly when a secondary index already exists, it could be reused. However, the logic was, till now, only implemented for secondary indices. So, on keyed joins, the order vector was computed again.. and the first step before computing order is to check if the vector is sorted.. and in this case, that step will tell that the vector is sorted.. and that's the extra time you see in your benchmark. |
Now I get: require(data.table)
set.seed(1L)
DT = data.table(a = as.integer(runif(40e6L, 1, 10000)), b = 1:1e6L, key = 'a')
DT2 = data.table(a = sample(DT$a, 100), key = 'a')
setkey(DT2, a) Before: system.time(DT[DT2]) # 0.005s
system.time(DT[DT2, on=key(DT)]) # 0.170s Now: system.time(DT[DT2])
# user system elapsed
# 0.008 0.000 0.003
system.time(DT[DT2, on=key(DT)])
# user system elapsed
# 0.010 0.000 0.004
identical(DT[DT2], DT[DT2, on=key(DT)]) # [1] TRUE Fix on its way. |
Following recommended practice, I've begun swapping from
X[Y]
toX[Y, on=...]
so that the join conditions are clear in my code.After making this change to some code using a 40 million row data.table, I was surprised by how much performance degraded. It appears that
on =...
makes no use of the existing keyed table structure, even when the list of columns solely and completely matches the key structure.Here's an example:
Neither changing to
on = c(a = 'a')
nor changing toon = key(DT2)
had any effect on performance.A five-time decrease in performance for semantically identical code seems a bit much. While I appreciate the flexibility that the new join system brings, I don't understand why it cannot fall back to the existing merge join gracefully when the full set of key columns are specified.
In case it matters...
data.table 1.9.7 IN DEVELOPMENT built 2016-08-24 11:53:54 UTC
For help type ?data.table or https://github.com/Rdatatable/data.table/wiki
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringi_1.1.1 magrittr_1.5 data.table_1.9.7
loaded via a namespace (and not attached):
[1] tools_3.3.1 memoise_1.0.0 digest_0.6.10
The text was updated successfully, but these errors were encountered: