Doc fix? Clarify behavior of unequal joins/on binary operators #6623

Open
katrinabrock opened this issue Nov 21, 2024 · 0 comments

Summary

I think there may be a documentation gap around the on argument when it is used with binary operators. (Or maybe there's a vignette somewhere that I'm missing?)

Details

Specifically, it wasn't clear to me from the documentation:

  1. Only column names are acceptable on either side of the binary operator, not arbitrary expressions.
  2. The left side of the binary operator must be a column from the outer data.table (x); the right side must be a column from i.
  3. This will result in the two columns being combined into one with the name from the left side and the values from the right side. (This is shown in the examples on the help page and SO, but I didn't see it explained in words in any official place.)

All of these diverge from how SQL ON works, so IMO they are worth pointing out to the user.
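To make the three points concrete, here is a minimal sketch using two toy tables with deliberately different column names. The tables A and B and the columns id, lo, val and val1 are made up for illustration and are not from my use case.

```r
library(data.table)

# Hypothetical toy tables with different column names on each side
A <- data.table(id = 1:3, lo = c(1, 5, 10))  # outer table (x)
B <- data.table(val = c(4, 12))              # table passed as i

# Point 2: the left name (lo) is a column of A, the right name (val) a column of B
res <- A[B, on = .(lo <= val)]

# Point 3: the join column keeps the name `lo` but carries B's `val` values
names(res)
unique(res$lo)

# To keep both sides, request them in j with the x. and i. prefixes
both <- A[B, on = .(lo <= val), .(id, lo = x.lo, val = i.val)]

# Point 1: an expression such as val + 1 has to become a real column first
B[, val1 := val + 1]
res1 <- A[B, on = .(lo <= val1)]
```

The x. and i. prefixes in j are what I would reach for to recover the original values from either table, since the default result only keeps one column per join condition.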

What Currently Exists

In an attempt to understand the proper usage of on, I read the on section of the data.table help page and skimmed the linked "Secondary indices and auto indexing" vignette, looking for more explanation of non-equi joins.

After I resolved the problem, I did see that non-equi joins are covered here: https://rdatatable.gitlab.io/data.table/articles/datatable-joins.html#non-equi-join However, the example doesn't make the three points above terribly clear, because it uses columns with the same name on either side of the join. The points are shown there, but only implicitly.

Example

Here is a slightly simplified version of my use case: joining on y and foo such that foo - 2 <= y <= foo.

```r
library(data.table)

DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9, cj=1)
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2), cj=1)

# Goal - this works, but want to avoid cross join
DT[X, on = 'cj', allow.cartesian = TRUE][foo >= y & foo - 2 <= y]

# Attempt - this completely failed, unclear why from docs and error
DT[X, on = .(foo >= y, foo - 2 <= y)]

# Success
DT[
  # Add temp columns for boundaries
  , c(.SD, .(y.min = y, y.max = y + 2))
][
  # Actual join: both temp columns now hold the values of foo
  X, on = .(y.min <= foo, y.max >= foo)
][
  # Add back foo
  , c(.SD, .(foo = y.min))
][
  # Remove temp columns
  , -c('y.max', 'y.min', 'i.cj'), with = FALSE
]
```
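For comparison, a shorter variant that seems to give the same matches is to put the computed bound on the i side and pick the output columns in j. This is only a sketch: X2 and foo.min are names introduced here, and nomatch = NULL is used to mimic the inner-join behaviour of the cross-join-then-filter version.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 3, 6), v = 1:9, cj = 1)
X  <- data.table(x = c("c", "b"), v = 8:7, foo = c(4, 2), cj = 1)

# Reference result: cross join, then filter
ref <- DT[X, on = "cj", allow.cartesian = TRUE][foo >= y & foo - 2 <= y]

# Precompute the lower bound as a real column of i (copy() leaves X untouched)
X2 <- copy(X)[, foo.min := foo - 2]

# Left side of each operator is a column of DT, right side a column of X2.
# The x. and i. prefixes in j recover the original values from each table.
res <- DT[X2,
          on = .(y >= foo.min, y <= foo),
          .(x = x.x, y = x.y, v = x.v, i.x = i.x, i.v = i.v, foo = i.foo),
          nomatch = NULL]
```

Selecting the columns in j avoids the temporary y.min and y.max columns and the later clean-up steps, at the cost of having to list the output columns by hand.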

Suggested Remedy

  1. Expand the Non-equi joins section of the joins vignette to include an example with different column names.
  2. Expand the Non-equi joins section of the joins vignette to explicitly clarify the three points above. (Maybe include a link to the SO question.)
  3. Link from the bullet point under on in the data.table help page that describes non-equi joins to the Non-equi joins section of the joins vignette.
@katrinabrock katrinabrock changed the title Doc fix? Clarify behavior of unequal joins/binary operators Doc fix? Clarify behavior of unequal joins/on binary operators Nov 21, 2024