I think there may be a documentation gap specifically with on and binary operators. (Or maybe there's a vignette somewhere that I'm missing?)
Details
Specifically, it wasn't clear to me from the documentation:
Only column names are acceptable on either side of the binary operator, not arbitrary expressions.
The left side of the binary operator must be a column from the outer data.table (x); the right side must be a column from i.
This will result in the two columns being combined into one with the name from the left side and the values from the right side. (This is shown in the examples on the help page and SO, but I didn't see it explained in words in any official place.)
All of these diverge from how SQL ON works so IMO worth pointing out to the user.
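To make the three points concrete, here is a minimal sketch with made-up tables (A, B, and the columns id, a, and lo are hypothetical, not taken from the docs). It reflects my understanding of the current behavior rather than an official statement of it.

```r
library(data.table)

# Hypothetical toy tables; the join columns deliberately have different names.
A = data.table(id = 1:4, a = c(1, 3, 5, 7))   # outer table (x)
B = data.table(lo = c(2, 6))                  # table passed as i

# Points 1 and 2: bare column names only, x column on the left, i column on the right.
res = A[B, on = .(a >= lo), nomatch = NULL]

# Point 3: the result has no `lo` column; `a` now carries the values of B$lo.
names(res)

# To keep both original columns, request them explicitly with the x. and i. prefixes.
both = A[B, on = .(a >= lo), .(id, a = x.a, lo = i.lo), nomatch = NULL]
```

With different names on each side, it is easier to see which table each value in the result came from.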
What Currently Exists
In an attempt to understand the proper usage of on, I read the on section of the data.table help page and skimmed the linked Secondary indices and auto indexing vignette, looking for more explanation of non-equi joins.
After I resolved the problem, I saw that non-equi joins are covered here: https://rdatatable.gitlab.io/data.table/articles/datatable-joins.html#non-equi-join However, the example there doesn't make the three points above very clear, because it uses columns with the same name on both sides of the join. The points are shown implicitly, though.
Example
Here is a slightly simplified version of my use case: joining on y and foo such that foo - 2 <= y <= foo.
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9, cj=1)
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2), cj=1)
# Goal - this works, but want to avoid cross join
DT[X, on = 'cj', allow.cartesian = TRUE][foo >= y & foo - 2 <= y]
# Attempt - this completely failed, unclear why from docs and error
DT[X, on = .(foo >= y, foo - 2 <= y)]
# Success
DT[
  # Add temp columns for boundaries
  ,c(.SD, .(y.min = y, y.max = y + 2))
][
  # Actual join
  X, on = .(y.min <= foo, y.max >= foo)
][,
  # Add back foo
  c(.SD, .(foo = y.min)),
][
  # Remove temp columns
  ,-c('y.max', 'y.min', 'i.cj'), with = FALSE
]
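For comparison, here is a shorter variant that seems to give the same matches, assuming it is acceptable to add a helper column to a copy of X instead of to DT. The names X2 and lo are made up for this sketch. The x. and i. prefixes in j pull the pre-join values, so nothing needs to be added back afterwards.

```r
library(data.table)

# Same tables as in the example above.
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9, cj=1)
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2), cj=1)

# Hypothetical helper column: the lower bound lives in i, so both sides of
# each operator are plain column names.
X2 = copy(X)[, lo := foo - 2]

# DT column on the left, X2 column on the right: lo <= y <= foo.
res = DT[X2, on = .(y <= foo, y >= lo),
         .(y = x.y, v = x.v, foo = i.foo),   # x.y is DT's original y
         nomatch = NULL]
```

Selecting columns in j is a design choice here: it avoids having to work out which of the merged join columns holds which value.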
Suggested Remedy
Expand the Non-equi joins section of the joins vignette to include an example with different column names.
Expand the Non-equi joins section of the joins vignette to explicitly clarify the three points above. (Maybe include a link to the SO question.)
Link from the bullet point under on in the data.table help page that describes non-equi joins to the Non-equi joins section of the joins vignette.
katrinabrock changed the title from "Doc fix? Clarify behavior of unequal joins/binary operators" to "Doc fix? Clarify behavior of unequal joins/on binary operators" on Nov 21, 2024