Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non-equi-join: foo is i.bar instead of x.foo if on = "foo>=bar" #2602

Closed
MarkusBonsch opened this issue Feb 1, 2018 · 4 comments
Closed
Labels
consistency duplicate non-equi joins rolling, overlapping, non-equi joins
Milestone

Comments

@MarkusBonsch
Copy link
Contributor

I find the here-described behaviour very confusing. It is closely related and probably the root-cause of the following issues:
#2595, #2569

library(data.table)
a <- data.table(foo = 1:5)
b <- data.table(bar = 1:2)
## the following returns unexpected stuff to me:
a[b, on = "foo>=bar"]
#    foo
#1:   1
#2:   1
#3:   1
#4:   1
#5:   1
#6:   2
#7:   2
#8:   2
#9:   2
## adding x.foo and i.bar columns explicetly reveals that the 'foo' column of the result is
## in reality the 'bar' column from i:
a[b, .(foo, x.foo, bar), on = "foo>=bar"]
#   foo x.foo bar
#1:   1     1   1
#2:   1     2   1
#3:   1     3   1
#4:   1     4   1
#5:   1     5   1
#6:   2     2   2
#7:   2     3   2
#8:   2     4   2
#9:   2     5   2
## to me, this is a clear bug. 
## I believe the default behaviour should be to keep both join 
## columns in the result since they are not identical:
a[b, on = "foo>=bar"]
#    foo bar
#1:   1  1
#2:   2  1
#3:   3  1
#4:   4  1
#5:   5  1
#6:   2  2
#7:   3  2
#8:   4  2
#9:   5  2

@MarkusBonsch MarkusBonsch added bug non-equi joins rolling, overlapping, non-equi joins labels Feb 1, 2018
@HughParsonage
Copy link
Member

HughParsonage commented Feb 1, 2018

A fix would certainly involve a breaking change.

A similar thing occurs with rolling joins. The confusion arises because with 'equi' joins only one column is needed, so the duplicate column is dropped. There it is a little less unexpected because the columns have the same name.

Consider

a[b, on = "foo==bar"]
merge(a, b, by.x = "foo", by.y = "bar")

@MarkusBonsch
Copy link
Contributor Author

@HughParsonage that is very likely true. Maybe less breaking: return only one column as before, but the real x.foo instead of i.bar. Or leave as it is and document clearly somewhere or put a warning.

@franknarf1
Copy link
Contributor

franknarf1 commented Feb 1, 2018

Or leave as it is and document clearly somewhere or put a warning.

I like this idea, raising it to an error if "strict scope" (#633) is on, so that the user has to write i.* and x.* to refer to columns in j inside a join. There could also be a "verbose" message like

Found symbol foo. Treating as i.bar; see ?data.table and vignette("datatable-joins") for details.

Also related, I think: #1615 (which illustrates that the problem is present in equi-joins too)

@MarkusBonsch
Copy link
Contributor Author

MarkusBonsch commented Feb 3, 2018

Definitely a duplicate, e.g. #2307.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consistency duplicate non-equi joins rolling, overlapping, non-equi joins
Projects
None yet
Development

No branches or pull requests

4 participants