Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

join in [.data.table could be consistent to SQL #1615

Open
jangorecki opened this issue Mar 29, 2016 · 8 comments · May be fixed by #3093
Open

join in [.data.table could be consistent to SQL #1615

jangorecki opened this issue Mar 29, 2016 · 8 comments · May be fixed by #3093
Labels
breaking-change issues whose solution would require breaking existing behavior enhancement joins question
Milestone

Comments

@jangorecki
Copy link
Member

jangorecki commented Mar 29, 2016

Currently data.table joins are consistent with base R.
This is somehow awkward for some queries.

library(data.table)
x = data.table(a=1:3, w=letters[1:3])
y = data.table(b=3:5, z=6:4)
x[y, on=c(a="b")]
#   a  w z
#1: 3  c 6
#2: 4 NA 5
#3: 5 NA 4
x[y, .(a, b), on=c(a="b")]
#   a b
#1: 3 3
#2: 4 4
#3: 5 5

Join consistency to base R could be kept in merge.data.table method for base R merge generic, while the joins within [.data.table could be consistent to SQL - which does not impose limitation as base R. [.data.frame does not allow joins so it wouldn’t break consistency here.
Change would generally break the code which relies on invalid base R join behavior.

For reference SQL output from postgres:

#$`SELECT * FROM x RIGHT OUTER JOIN y ON x.a = y.b;`
#    a  w b z
#1:  3  c 3 6
#2: NA NA 4 5
#3: NA NA 5 4
#
#$`SELECT a, b FROM x RIGHT OUTER JOIN y ON x.a = y.b;`
#    a b
#1:  3 3
#2: NA 4
#3: NA 5

Just to link related issues: #1700, #1761, #1469

@franknarf1
Copy link
Contributor

So you want to keep both of the on columns in the result?

(I'm not so SQL-literate as to immediately know where the difference between data.table joins and SQL joins is. Maybe the title could be made more specific?)

What would you want to see for rolling joins? Maybe...

# currently
x[y, on=c(a="b"), roll=TRUE]
#    a w z
# 1: 3 c 6
# 2: 4 c 5
# 3: 5 c 4

# new behavior
#    a w b z
# 1: 3 c 3 6
# 2: 3 c 4 5
# 3: 3 c 5 4

If #1494 is done, we'll also be able to show which rows of x have been matched.

I guess the behavior also makes sense for overlap joins...? (I'm not very familiar with them.)

@jangorecki
Copy link
Member Author

I don't much bother about which columns are being returned by default, and in which order. This can be kept as is, and be easily controlled in j. The problem is the content of the columns, see a and b in above outputs dt vs sql. If the column has b content really, it should not be named a.
Your rolling join example looks correct.

arunsrinivasan added a commit that referenced this issue Apr 3, 2016
* master:
  Allow x's cols to be referred to using 'x.' prefix, addresses #1615
  adding some helpful info on dev branch checks
  fixes invalid vignette urls in manual
  Remove old vignette.
  More clarifications to secondary indices.
  Clarify that 'on' doesn't *create* secondary indices.
  Remove unwanted text in secondary indices vignette.
  Rename vignette.
  Fix size of header for FAQ  vignette.
  New vignette on secondary indices and auto indexing based subsets.
  More minor fix to vignette.
  Minor formatting fixes to vignettes.
  Fix Typos & rm. trailing whitespace in vignettes
@sz-cgt
Copy link

sz-cgt commented Apr 21, 2016

Arun, does commit 620276b also resolve #733?

@arunsrinivasan
Copy link
Member

@sz-cgt no they are two different issues.

@mbacou
Copy link

mbacou commented Jul 2, 2016

@jangorecki as another struggling user coming from SQL and STATA, SAS, I can attest that data.table's join behavior seems strange at first, but then you need to remember that in R [ means "subset by" or "index by" -- it doesn't mean "join". With that in mind, data.table's behavior starts making more sense.

Regardless of the class of A and B (works with vectors, matrices, lists, even with spatial objects like points and polygons) A[B] in R will return A "indexed by" B or "at the positions of" [B] -- so we expect A's structure to not mutate. Modifying that behavior by default would be a risky slope.

With that in mind if you want to preserve information contained in B then using B$something (with R's native classes) or the more generic i.something (as in data.table) becomes the logical approach.

@jangorecki
Copy link
Member Author

jangorecki commented Jul 3, 2016

@mbacou thanks for input.

R [ means "subset by" or "index by" -- it doesn't mean "join".

In practice it does right outer join, so confusion may comes from the fact that people usually expects left outer join. The only difference to SQL is match on NA vs NULL which is consistent to base R merge.
I don't mind leaving current behavior as is. We can use x. prefixes now so we can query expected fields from join, it wasn't possible before x.. The only thing is that user need to be aware of it, but it shouldn't be big issue as long as we keep consistency to base R merge.
If others agree we could close this issue and advocate to use x. and i. in such cases.

@ywhcuhk
Copy link

ywhcuhk commented Jul 3, 2016

It seems that [ is much more powerful than the merge.data.table because it allows inequality join and is faster. Any plan to make inequality join possible in merge.data.table ? Or simply have a separate API for data.table joins with all the powerful features from [ type join? IMHO, [ syntax has too much to carry.

@arunsrinivasan
Copy link
Member

Non-equi joins return a lot more rows usually than equi joins. And more
often than not, one wants to not return the entire result but to compute on
it directly. We find DT syntax convenient for that.

At this time there are no plans to overload DT merge syntax or implement
extra functions.

Arun

Sent from my phone.

On 3 July 2016 at 08:20:34, Wenhao Yang (notifications@github.com) wrote:

It seems that [ is much more powerful than the merge.data.table because
it allows inequality join and is faster. Any plan to make inequality join
possible in merge.data.table ? Or simply have a separate API for
data.table joins with all the powerful features from [ type join? IMHO, [
syntax has too much to carry.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#1615 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AC9VXa0EqLwOpd8zkE-WVrRgZy8AA5HPks5qR9NCgaJpZM4H7Kji
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change issues whose solution would require breaking existing behavior enhancement joins question
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants