Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent and confusing naming convention for unequal join #2006

Open
Guozhong-Gordon opened this issue Jan 30, 2017 · 2 comments · May be fixed by #3093
Open

Inconsistent and confusing naming convention for unequal join #2006

Guozhong-Gordon opened this issue Jan 30, 2017 · 2 comments · May be fixed by #3093
Labels
consistency non-equi joins rolling, overlapping, non-equi joins

Comments

@Guozhong-Gordon
Copy link

It is great to have unequal join in data.table.
However, when I start to test the functionality. I found the following issue:

DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)

X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))

DT[X, .( x,foo,v,i.v,x.v, x.y,y), on=.(x, y<=foo)]

The result includes all rows from the second table (X here).
In general, the default naming conventions for the columns in the results are:
names without prefix refers to columns and values from the first table in the join. In the above example, from the table DT.
When we pay attention to the values of v and y, we will see the following inconsistency.
Namely, when the unequal join columns having different names,
eg, y with foo, as the above. It uses the name y from the first table
and actually take the value of "foo" which is from the second table, in the above example (X).
This is rather confusing. Is this a bug or the intention?

When the unequal join for two columns with the same name. It gets even more confusing.
A second example:
DT[X, .(x,y,v,foo,DT.v=x.v, i.v), on=.(x, v>=v)]
Just as noted previously, all columns without
prefix should have the values from the first table!
For those columns that are included in the unequal join
this does not apply! The name v is from the first table,
but the value is from the second table!
This is very confusing.

As a consequence of this, the following syntax
we can get completely confused with the following, when we
do not explicitly specify which tables the v columns comes from:
DT[X, on=.(x, v>=v)]
Look at the v column, Is it the value of the second table, or the first?
DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI]
Which y is being summed?

@jangorecki
Copy link
Member

@Guozhong-Gordon Columns on which you join are taken from i. so you need to take care of expected ones using x.. This behavior is discussed in #1615.

@Guozhong-Gordon
Copy link
Author

Guozhong-Gordon commented Mar 5, 2017 via email

@mattdowle mattdowle added this to the v1.10.6 milestone Apr 10, 2018
@mattdowle mattdowle modified the milestones: v1.11.0, v1.11.2 Apr 29, 2018
@jangorecki jangorecki modified the milestones: 1.12.0, 1.11.6 Jun 6, 2018
@mattdowle mattdowle modified the milestones: 1.11.6, 1.12.0 Sep 20, 2018
@mattdowle mattdowle linked a pull request Oct 3, 2018 that will close this issue
4 tasks
@mattdowle mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 11, 2019
@jangorecki jangorecki added the non-equi joins rolling, overlapping, non-equi joins label Apr 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consistency non-equi joins rolling, overlapping, non-equi joins
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants