Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error with data.table when joining on different variables in x and y tables #694

Closed
cderv opened this issue Oct 14, 2014 · 8 comments
Closed
Assignees
Labels
feature a feature request or enhancement
Milestone

Comments

@cderv
Copy link
Contributor

cderv commented Oct 14, 2014

Hi,

Using dplyr 0.3.0.2, I like the new features for *_join :

  • Joins (e.g. left_join(), inner_join(), semi_join(), anti_join())
    now allow you to join on different variables in x and y tables by
    supplying a named vector to by. For example, by = c("a" = "b") joins
    x.a to y.b.

However, it seems not to work the same on data.frame and data.table. For data.table, it gives me an error :
#Error: Data table joins must be on same key


Here's a reproductible example :

library(dplyr)
A<-mtcars %>% select(vs,mpg,disp) %>% sample_n(size = 3,replace = T)
B<-mtcars %>% select(vs,mpg,disp) %>% sample_n(size = 3,replace = T)
names(B)<-paste0(names(B),".B")
A_DT<-as.data.table(A)
B_DT<-as.data.table(B)

When i tried to do an inner join with data.frame, it's ok

A %>>%
  inner_join(B,by=c("vs"="vs.B"))
#  vs  mpg  disp mpg.B disp.B
#1  0 26.0 120.3  15.8    351
#2  0 15.2 304.0  15.8    351
#3  1 33.9  71.1  22.8    108
#4  0 26.0 120.3  14.7    440
#5  0 15.2 304.0  14.7    440

If I tried to do the same with data.table,I get an error.

A_DT %>>%
  anti_join(B_DT,by=c("vs"="vs.B"))
#Error: Data table joins must be on same key

I was expected the same result whether my tables are data.frame or data.table.
Apparently, new feature for joining on different variables in x and y tables are not working with data.table. I do not know if it should, but at least I report it.

If you could do something about it (probably with setkey), it would be great.

Thank you for your great work.

Christophe

@hadley
Copy link
Member

hadley commented Oct 14, 2014

I couldn't figure out how to do that (hence the error message), so if you know how to do it with data.table some sample code would be very helpful.

@cderv
Copy link
Contributor Author

cderv commented Oct 14, 2014

I went to see in you code (join-dt.R) what's the issue. I understand it is a limite of data.table::merge function.
With data.table, I often use the other way to merge table - when both data.table are keyed, DT1[DT2] are merged not on common key but on 1rst element keyed, then on second element keyed and so on.
With my example :

library(dplyr)
library(testthat)
# Sorted to be equivalent to key
A<-mtcars %>% select(vs,mpg,disp) %>% sample_n(size = 3,replace = T) %>>% arrange(vs)
B<-mtcars %>% select(vs,mpg,disp) %>% sample_n(size = 3,replace = T) %>>% arrange(vs)
names(B)<-paste0(names(B),".B")
A_DT<-as.data.table(A)
B_DT<-as.data.table(B)
#Both table are keyed on different columns
setkeyv(A_DT,"vs")
setkeyv(B_DT,"vs.B")
M1 <- A %>>%  inner_join(B,by=c("vs"="vs.B"))
M2 <- A_DT[B_DT,allow.cartesian=T]
setkey(M2,NULL) #unkeyed to be compare
expect_equal(as.data.table(M1),M2)

this way it work.
It's similar from the merge you use for semi_join and anti_join, with no use of data.table::merge.
With no.match=0, I think we have a inner_join and with the default no.match=NA, it should be a left join. To be verified though.

Some insight on stackoverflow here and here

Does it help or did you tried all this?

@hadley
Copy link
Member

hadley commented Oct 30, 2014

The challenge with setting keys is that it modifies the original data tables, which I want to avoid in dplyr.

@hadley hadley added the feature a feature request or enhancement label Oct 30, 2014
@hadley hadley added this to the 0.4 milestone Oct 30, 2014
@hadley hadley self-assigned this Oct 30, 2014
@ghost
Copy link

ghost commented Jun 14, 2015

Think the data.table guys now fixed this in devel, see Rdatatable/data.table#637. However, it might take some time until we see a new CRAN release of data.table.

@lasseklokker
Copy link

Am I correct in thinking that data.table from v1.9.6 now supports what have been the problem for this issue to be solved, as no deep copies are made anymore, or do I not fully understand the problem?

@Dripdrop12
Copy link

I can confirm that v1.9.6 is still throwing this error.

@siddhesh1991
Copy link

I am still getting this error. I have 2 data.tables and I am trying left_join on 2 different keys

a <- b %>% left_join(c, by= c('LEADID.2'='LEADID'))
Error: Data table joins must be on same key

@Giannettig
Copy link

Me too

@lock lock bot locked as resolved and limited conversation to collaborators Jun 7, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

6 participants