Implement joins using secondary keys #1130

arunsrinivasan · 2015-04-28T14:23:19Z

X[Y] gains an on = argument (as proposed by Matt).
merge will not copy anymore once implemented.

Update posts:

Update this SO post as well.
And this

The text was updated successfully, but these errors were encountered:

jangorecki · 2015-04-28T14:34:31Z

Maybe on can be further extended to handle left, right, full joins?
The possible extension which comes to my mind (kind of oracle old non-ansi join syntax):

X[Y, on=col] # default join - currently right outer, from 2.0 inner join(?)
X[Y, on=col+] # right join
X[Y, on=+col] # left join
X[Y, on=+col+] # full outer join

I'm sure we can figured out the better, this would finally closes #614

clarkdk · 2015-04-29T06:06:17Z

! or - for a non-join?

X[Y, on=!col]
X[Y, on=-col]

eantonya · 2015-04-29T18:45:27Z

Honestly, I think these suggested "on" decorators (i.e. the +,-,!) are not a good idea. As I see them right now, they violate the intuition of thinking of X[Y] as a kind of extended subset operation.

For your reference, I read X[Y] as "take the Y rows of X" and X[!Y] as "take the rows of X that are not Y". You can't keep that subset reading for e.g. a full join as far as I can tell.

Convince me that my intuition should change and provide a new reading for the suggested syntax.

stephlocke · 2015-05-06T08:50:40Z

on is useful in that it allows on the fly key setting for joins, reducing lines of code needed.

I see the benefit in that it extends @eantonya 's reading by saying "take the Y rows of X where we define Y as being unique based on columns/attributes A,B,C" (currently that last bit is implied not explicit in basic syntax).

Not sure I like the plus operator as a means of specifying behaviour - I'd prefer an explicit join parameter which could take left, right, etc

I'd also really love to see 1.9.6 on CRAN soon so I can use it in production ASAP - will adding this feature delay significantly?

clarkdk · 2015-05-06T09:59:49Z

For me, the point of this FR is 1) to be able to explicitly specify the columns used for the join, and 2) that the specified columns will automatically use/create secondary keys, if they are not the primary key.

To be fully general, the syntax should also 3) allow join columns to be given separately for X and Y, in case they don't have the same names in both tables [equivalent to by.x= and by.y= in base merge]. "by" has a different use in data.table (grouping), so there is a need for a different keyword. Hence, on=, and perhaps on.x= and on.y=, too.

Being able to specify left, right, full, outer, non-join etc. would be nice, but that is a separate FR (#614) which has not been given priority or milestone yet. I assume the "join" keyword is reserved for that FR.

MichaelChirico · 2015-07-16T19:41:01Z

Throwing my hat in the ring of support for this, would clean up this process of mine substantially.

dselivanov · 2015-07-24T10:49:46Z

@arunsrinivasan, I just wondering, is the syntax for on option already fixed?
I think, it is good idea to implement it in a little different and (at least IMO) more intuitive way - do not use c(), but use list() and .() instead (as it implemented in by argument):

DT1[DT2,  on = list(...) ]
# or 
DT1[DT2,  on = .(...) ]

jangorecki · 2015-07-24T13:22:47Z

Quite interesting idea, less typing of quotes ("), and maybe a possibility for future extension to join on expression - similar way as by argument.
I see one inconsistency with by. For example: in by you can use list to rename cols dt[,.SD,by=.(by_col = some_col)] while in on it would have a totally different meaning: dt1[dt2, on = .(dt1_col = dt2_col)]. Looking forward for Arun's opinion.

arunsrinivasan · 2015-08-10T08:07:58Z

All done.

@dselivanov good point. by= allows both character vectors and lists. It'd be nice to have the same for on= as well. Do you mind opening a separate FR?

jan-glx · 2015-08-18T15:13:43Z

Nice work, @arunsrinivasan !
Related and "created" error messages might want some tuning so that this awesome stuff gets used:

dt1=data.table(x=1:3,y=as.character(2:4),z=1:3);dt2=data.table(x=1:3,y=as.character(2:4),z=3:5)
> dt1[dt2]  # bla..bla.. we know we are just to lazy to write setkey twice every three lines
Error in `[.data.table`(dt1, dt2) : 
  When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.
> dt1[dt2,on=c("y"="y")] # nice
   x y z i.x i.z
1: 1 2 1   1   3
2: 2 3 2   2   4
3: 3 4 3   3   5
> dt1[dt2,on=c("x"="x")] # nice :)
   x y z i.y i.z
1: 1 2 1   2   3
2: 2 3 2   3   4
3: 3 4 3   4   5
> dt1[dt2,on=c("ex"="x")] # erm..
Error in forderv(x, by = rightcols) : 
  'by' value -2147483648 out of range [1,3]
> dt1[dt2,on=c("x"="ex")] # ...hm
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
  Internal error. icols[0] is NA
> dt1[dt2,on=c("y"="why")] # that's what she said
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
  x.'y' is a character column being joined to i.'NA' which is type 'NULL'. Character columns must join to factor or character columns.

reference to on in keyless join attempt error message
nice error if LHS in on columns do not exist in data table
nice error if RHS in on columns do not exist in i data.table

copyable code for above output:

dt1=data.table(x=1:3,y=as.character(2:4),z=1:3);dt2=data.table(x=1:3,y=as.character(2:4),z=3:5)
dt1[dt2]
dt1[dt2,on=c("y"="y")]
dt1[dt2,on=c("x"="x")]
dt1[dt2,on=c("ex"="x")]
dt1[dt2,on=c("x"="ex")]
dt1[dt2,on=c("y"="why")]

Please feel free to copy this where it belongs if you think it's worth it and that this was not the right place.
thanks again,
jan

arunsrinivasan added the feature request label Apr 28, 2015

arunsrinivasan added this to the v1.9.6 milestone Apr 28, 2015

arunsrinivasan added the High label Apr 28, 2015

arunsrinivasan mentioned this issue Jun 15, 2015

Joins depend on column order. Normal behaviour? #1179

Closed

mattdowle modified the milestones: v1.9.8, v1.9.6 Jun 19, 2015

arunsrinivasan modified the milestones: v1.9.6, v1.9.8 Jul 17, 2015

arunsrinivasan self-assigned this Jul 17, 2015

arunsrinivasan added a commit that referenced this issue Jul 17, 2015

Part of #1130. Joins without setting keys using 'on' argument.

f87a8aa

arunsrinivasan closed this as completed in 44b1e00 Aug 10, 2015

dselivanov mentioned this issue Aug 10, 2015

[Request] allow on= to work with both character vectors and lists #1257

Closed

arunsrinivasan added a commit that referenced this issue Aug 11, 2015

thinko on #1130. Added tests.

b4ceb5f

jangorecki mentioned this issue Nov 20, 2015

unkeyed join could use existing indexes #1439

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement joins using secondary keys #1130

Implement joins using secondary keys #1130

arunsrinivasan commented Apr 28, 2015 •

edited

Loading

jangorecki commented Apr 28, 2015

clarkdk commented Apr 29, 2015

eantonya commented Apr 29, 2015

stephlocke commented May 6, 2015

clarkdk commented May 6, 2015

MichaelChirico commented Jul 16, 2015

dselivanov commented Jul 24, 2015

jangorecki commented Jul 24, 2015

arunsrinivasan commented Aug 10, 2015

jan-glx commented Aug 18, 2015

Implement joins using secondary keys #1130

Implement joins using secondary keys #1130

Comments

arunsrinivasan commented Apr 28, 2015 • edited Loading

jangorecki commented Apr 28, 2015

clarkdk commented Apr 29, 2015

eantonya commented Apr 29, 2015

stephlocke commented May 6, 2015

clarkdk commented May 6, 2015

MichaelChirico commented Jul 16, 2015

dselivanov commented Jul 24, 2015

jangorecki commented Jul 24, 2015

arunsrinivasan commented Aug 10, 2015

jan-glx commented Aug 18, 2015

arunsrinivasan commented Apr 28, 2015 •

edited

Loading