Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement joins using secondary keys #1130

Closed
4 tasks done
arunsrinivasan opened this issue Apr 28, 2015 · 10 comments
Closed
4 tasks done

Implement joins using secondary keys #1130

arunsrinivasan opened this issue Apr 28, 2015 · 10 comments
Assignees
Milestone

Comments

@arunsrinivasan
Copy link
Member

arunsrinivasan commented Apr 28, 2015

  • X[Y] gains an on = argument (as proposed by Matt).
  • merge will not copy anymore once implemented.

Update posts:

@jangorecki
Copy link
Member

Maybe on can be further extended to handle left, right, full joins?
The possible extension which comes to my mind (kind of oracle old non-ansi join syntax):

X[Y, on=col] # default join - currently right outer, from 2.0 inner join(?)
X[Y, on=col+] # right join
X[Y, on=+col] # left join
X[Y, on=+col+] # full outer join

I'm sure we can figured out the better, this would finally closes #614

@clarkdk
Copy link

clarkdk commented Apr 29, 2015

! or - for a non-join?

X[Y, on=!col]
X[Y, on=-col]

@eantonya
Copy link
Contributor

Honestly, I think these suggested "on" decorators (i.e. the +,-,!) are not a good idea. As I see them right now, they violate the intuition of thinking of X[Y] as a kind of extended subset operation.

For your reference, I read X[Y] as "take the Y rows of X" and X[!Y] as "take the rows of X that are not Y". You can't keep that subset reading for e.g. a full join as far as I can tell.

Convince me that my intuition should change and provide a new reading for the suggested syntax.

@stephlocke
Copy link

on is useful in that it allows on the fly key setting for joins, reducing lines of code needed.

I see the benefit in that it extends @eantonya 's reading by saying "take the Y rows of X where we define Y as being unique based on columns/attributes A,B,C" (currently that last bit is implied not explicit in basic syntax).

Not sure I like the plus operator as a means of specifying behaviour - I'd prefer an explicit join parameter which could take left, right, etc

I'd also really love to see 1.9.6 on CRAN soon so I can use it in production ASAP - will adding this feature delay significantly?

@clarkdk
Copy link

clarkdk commented May 6, 2015

For me, the point of this FR is 1) to be able to explicitly specify the columns used for the join, and 2) that the specified columns will automatically use/create secondary keys, if they are not the primary key.

To be fully general, the syntax should also 3) allow join columns to be given separately for X and Y, in case they don't have the same names in both tables [equivalent to by.x= and by.y= in base merge]. "by" has a different use in data.table (grouping), so there is a need for a different keyword. Hence, on=, and perhaps on.x= and on.y=, too.

Being able to specify left, right, full, outer, non-join etc. would be nice, but that is a separate FR (#614) which has not been given priority or milestone yet. I assume the "join" keyword is reserved for that FR.

@MichaelChirico
Copy link
Member

Throwing my hat in the ring of support for this, would clean up this process of mine substantially.

@dselivanov
Copy link

@arunsrinivasan, I just wondering, is the syntax for on option already fixed?
I think, it is good idea to implement it in a little different and (at least IMO) more intuitive way - do not use c(), but use list() and .() instead (as it implemented in by argument):

DT1[DT2,  on = list(...) ]
# or 
DT1[DT2,  on = .(...) ]

@jangorecki
Copy link
Member

Quite interesting idea, less typing of quotes ("), and maybe a possibility for future extension to join on expression - similar way as by argument.
I see one inconsistency with by. For example: in by you can use list to rename cols dt[,.SD,by=.(by_col = some_col)] while in on it would have a totally different meaning: dt1[dt2, on = .(dt1_col = dt2_col)]. Looking forward for Arun's opinion.

@arunsrinivasan
Copy link
Member Author

All done.

@dselivanov good point. by= allows both character vectors and lists. It'd be nice to have the same for on= as well. Do you mind opening a separate FR?

@jan-glx
Copy link
Contributor

jan-glx commented Aug 18, 2015

Nice work, @arunsrinivasan !
Related and "created" error messages might want some tuning so that this awesome stuff gets used:

dt1=data.table(x=1:3,y=as.character(2:4),z=1:3);dt2=data.table(x=1:3,y=as.character(2:4),z=3:5)
> dt1[dt2]  # bla..bla.. we know we are just to lazy to write setkey twice every three lines
Error in `[.data.table`(dt1, dt2) : 
  When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.
> dt1[dt2,on=c("y"="y")] # nice
   x y z i.x i.z
1: 1 2 1   1   3
2: 2 3 2   2   4
3: 3 4 3   3   5
> dt1[dt2,on=c("x"="x")] # nice :)
   x y z i.y i.z
1: 1 2 1   2   3
2: 2 3 2   3   4
3: 3 4 3   4   5
> dt1[dt2,on=c("ex"="x")] # erm..
Error in forderv(x, by = rightcols) : 
  'by' value -2147483648 out of range [1,3]
> dt1[dt2,on=c("x"="ex")] # ...hm
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
  Internal error. icols[0] is NA
> dt1[dt2,on=c("y"="why")] # that's what she said
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
  x.'y' is a character column being joined to i.'NA' which is type 'NULL'. Character columns must join to factor or character columns.
  • reference to on in keyless join attempt error message
  • nice error if LHS in on columns do not exist in data table
  • nice error if RHS in on columns do not exist in i data.table

copyable code for above output:

dt1=data.table(x=1:3,y=as.character(2:4),z=1:3);dt2=data.table(x=1:3,y=as.character(2:4),z=3:5)
dt1[dt2]
dt1[dt2,on=c("y"="y")]
dt1[dt2,on=c("x"="x")]
dt1[dt2,on=c("ex"="x")]
dt1[dt2,on=c("x"="ex")]
dt1[dt2,on=c("y"="why")]

Please feel free to copy this where it belongs if you think it's worth it and that this was not the right place.
thanks again,
jan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants