join in `[.data.table` could be consistent to SQL #1615

jangorecki · 2016-03-29T21:44:10Z

Currently data.table joins are consistent with base R.
This is somehow awkward for some queries.

library(data.table)
x = data.table(a=1:3, w=letters[1:3])
y = data.table(b=3:5, z=6:4)
x[y, on=c(a="b")]
#   a  w z
#1: 3  c 6
#2: 4 NA 5
#3: 5 NA 4
x[y, .(a, b), on=c(a="b")]
#   a b
#1: 3 3
#2: 4 4
#3: 5 5

Join consistency to base R could be kept in merge.data.table method for base R merge generic, while the joins within [.data.table could be consistent to SQL - which does not impose limitation as base R. [.data.frame does not allow joins so it wouldn’t break consistency here.
Change would generally break the code which relies on invalid base R join behavior.

For reference SQL output from postgres:

#$`SELECT * FROM x RIGHT OUTER JOIN y ON x.a = y.b;`
#    a  w b z
#1:  3  c 3 6
#2: NA NA 4 5
#3: NA NA 5 4
#
#$`SELECT a, b FROM x RIGHT OUTER JOIN y ON x.a = y.b;`
#    a b
#1:  3 3
#2: NA 4
#3: NA 5

Just to link related issues: #1700, #1761, #1469

The text was updated successfully, but these errors were encountered:

franknarf1 · 2016-03-29T22:03:38Z

So you want to keep both of the on columns in the result?

(I'm not so SQL-literate as to immediately know where the difference between data.table joins and SQL joins is. Maybe the title could be made more specific?)

What would you want to see for rolling joins? Maybe...

# currently
x[y, on=c(a="b"), roll=TRUE]
#    a w z
# 1: 3 c 6
# 2: 4 c 5
# 3: 5 c 4

# new behavior
#    a w b z
# 1: 3 c 3 6
# 2: 3 c 4 5
# 3: 3 c 5 4

If #1494 is done, we'll also be able to show which rows of x have been matched.

I guess the behavior also makes sense for overlap joins...? (I'm not very familiar with them.)

jangorecki · 2016-03-29T22:12:48Z

I don't much bother about which columns are being returned by default, and in which order. This can be kept as is, and be easily controlled in j. The problem is the content of the columns, see a and b in above outputs dt vs sql. If the column has b content really, it should not be named a.
Your rolling join example looks correct.

* master: Allow x's cols to be referred to using 'x.' prefix, addresses #1615 adding some helpful info on dev branch checks fixes invalid vignette urls in manual Remove old vignette. More clarifications to secondary indices. Clarify that 'on' doesn't *create* secondary indices. Remove unwanted text in secondary indices vignette. Rename vignette. Fix size of header for FAQ vignette. New vignette on secondary indices and auto indexing based subsets. More minor fix to vignette. Minor formatting fixes to vignettes. Fix Typos & rm. trailing whitespace in vignettes

sz-cgt · 2016-04-21T20:13:56Z

Arun, does commit 620276b also resolve #733?

arunsrinivasan · 2016-04-21T20:18:07Z

@sz-cgt no they are two different issues.

mbacou · 2016-07-02T05:05:26Z

@jangorecki as another struggling user coming from SQL and STATA, SAS, I can attest that data.table's join behavior seems strange at first, but then you need to remember that in R [ means "subset by" or "index by" -- it doesn't mean "join". With that in mind, data.table's behavior starts making more sense.

Regardless of the class of A and B (works with vectors, matrices, lists, even with spatial objects like points and polygons) A[B] in R will return A "indexed by" B or "at the positions of" [B] -- so we expect A's structure to not mutate. Modifying that behavior by default would be a risky slope.

With that in mind if you want to preserve information contained in B then using B$something (with R's native classes) or the more generic i.something (as in data.table) becomes the logical approach.

jangorecki · 2016-07-03T11:58:44Z

@mbacou thanks for input.

R [ means "subset by" or "index by" -- it doesn't mean "join".

In practice it does right outer join, so confusion may comes from the fact that people usually expects left outer join. The only difference to SQL is match on NA vs NULL which is consistent to base R merge.
I don't mind leaving current behavior as is. We can use x. prefixes now so we can query expected fields from join, it wasn't possible before x.. The only thing is that user need to be aware of it, but it shouldn't be big issue as long as we keep consistency to base R merge.
If others agree we could close this issue and advocate to use x. and i. in such cases.

ywhcuhk · 2016-07-03T15:20:32Z

It seems that [ is much more powerful than the merge.data.table because it allows inequality join and is faster. Any plan to make inequality join possible in merge.data.table ? Or simply have a separate API for data.table joins with all the powerful features from [ type join? IMHO, [ syntax has too much to carry.

arunsrinivasan · 2016-07-03T15:50:18Z

Non-equi joins return a lot more rows usually than equi joins. And more
often than not, one wants to not return the entire result but to compute on
it directly. We find DT syntax convenient for that.

At this time there are no plans to overload DT merge syntax or implement
extra functions.

Arun

Sent from my phone.

On 3 July 2016 at 08:20:34, Wenhao Yang (notifications@github.com) wrote:

It seems that [ is much more powerful than the merge.data.table because
it allows inequality join and is faster. Any plan to make inequality join
possible in merge.data.table ? Or simply have a separate API for
data.table joins with all the powerful features from [ type join? IMHO, [
syntax has too much to carry.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#1615 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AC9VXa0EqLwOpd8zkE-WVrRgZy8AA5HPks5qR9NCgaJpZM4H7Kji
.

jangorecki added the enhancement label Mar 29, 2016

arunsrinivasan added a commit that referenced this issue Apr 2, 2016

Allow x's cols to be referred to using 'x.' prefix, addresses #1615

620276b

aryoda mentioned this issue May 4, 2016

Should be able to refer to i's .SD during a join. #935

Open

jangorecki mentioned this issue May 11, 2016

join column naming with on syntax #1700

Open

jangorecki mentioned this issue Jun 28, 2016

binary search extensions to <, <=, >, >= #1452

Closed

5 tasks

jangorecki added the question label Jul 5, 2016

jangorecki mentioned this issue Aug 11, 2016

non-equi join returning incorrect column name? #1807

Closed

jangorecki mentioned this issue Mar 5, 2017

Inconsistent and confusing naming convention for unequal join #2006

Open

franknarf1 mentioned this issue Feb 1, 2018

non-equi-join: foo is i.bar instead of x.foo if on = "foo>=bar" #2602

Closed

jangorecki mentioned this issue Feb 3, 2018

non-equi join masks column names from x and i in the j expression #2595

Closed

sritchie73 mentioned this issue Mar 26, 2018

SQL-like column return for non-equi and rolling joins #2706

Closed

10 tasks

mattdowle added this to the v1.10.6 milestone Apr 10, 2018

mattdowle modified the milestones: v1.11.0, v1.11.2 Apr 29, 2018

jangorecki modified the milestones: 1.12.0, 1.11.6 Jun 6, 2018

mattdowle modified the milestones: 1.11.6, 1.12.0 Sep 20, 2018

mattdowle linked a pull request Oct 3, 2018 that will close this issue

Both columns for rolling and non-equi joins #3093

Open

4 tasks

mattdowle removed this from the 1.12.0 milestone Jan 11, 2019

jangorecki added the joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins label Apr 10, 2020

jangorecki mentioned this issue May 16, 2021

Warn user about executing rolling join on differing column types #1913

Closed

jangorecki added this to the 2.0.0 milestone Nov 24, 2023

MichaelChirico added the breaking-change issues whose solution would require breaking existing behavior label Sep 12, 2024

myoung3 mentioned this issue Sep 13, 2024

Conditional left join | left table column data is wrong #6495

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

join in `[.data.table` could be consistent to SQL #1615

join in `[.data.table` could be consistent to SQL #1615

jangorecki commented Mar 29, 2016 •

edited

Loading

franknarf1 commented Mar 29, 2016

jangorecki commented Mar 29, 2016

sz-cgt commented Apr 21, 2016

arunsrinivasan commented Apr 21, 2016

mbacou commented Jul 2, 2016

jangorecki commented Jul 3, 2016 •

edited

Loading

ywhcuhk commented Jul 3, 2016

arunsrinivasan commented Jul 3, 2016

join in [.data.table could be consistent to SQL #1615

join in [.data.table could be consistent to SQL #1615

Comments

jangorecki commented Mar 29, 2016 • edited Loading

franknarf1 commented Mar 29, 2016

jangorecki commented Mar 29, 2016

sz-cgt commented Apr 21, 2016

arunsrinivasan commented Apr 21, 2016

mbacou commented Jul 2, 2016

jangorecki commented Jul 3, 2016 • edited Loading

ywhcuhk commented Jul 3, 2016

arunsrinivasan commented Jul 3, 2016

join in `[.data.table` could be consistent to SQL #1615

join in `[.data.table` could be consistent to SQL #1615

jangorecki commented Mar 29, 2016 •

edited

Loading

jangorecki commented Jul 3, 2016 •

edited

Loading