-
Notifications
You must be signed in to change notification settings - Fork 977
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent and confusing naming convention for unequal join #2006
Labels
Comments
@Guozhong-Gordon Columns on which you join are taken from |
Thanks for pointing to early related discussion. I know what is used as
colum name. The issue remains, confusion in the case of unequal join. It is
against the convention used in sql, also against common sense intuition.
One has to twist the brain to avoid being confused.
…On Sunday, March 5, 2017, Jan Gorecki ***@***.***> wrote:
@Guozhong-Gordon <https://github.com/Guozhong-Gordon> Columns on which
you join are taken from i. so you need to take care of expected ones
using x.. This behavior is discussed in #1615
<#1615>.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#2006 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AYQGzPRPON_4uXrT9i1fPegqOhGzkZSVks5rirG7gaJpZM4LxPfU>
.
|
10 tasks
4 tasks
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
It is great to have unequal join in data.table.
However, when I start to test the functionality. I found the following issue:
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
DT[X, .( x,foo,v,i.v,x.v, x.y,y), on=.(x, y<=foo)]
The result includes all rows from the second table (X here).
In general, the default naming conventions for the columns in the results are:
names without prefix refers to columns and values from the first table in the join. In the above example, from the table DT.
When we pay attention to the values of v and y, we will see the following inconsistency.
Namely, when the unequal join columns having different names,
eg, y with foo, as the above. It uses the name y from the first table
and actually take the value of "foo" which is from the second table, in the above example (X).
This is rather confusing. Is this a bug or the intention?
When the unequal join for two columns with the same name. It gets even more confusing.
A second example:
DT[X, .(x,y,v,foo,DT.v=x.v, i.v), on=.(x, v>=v)]
Just as noted previously, all columns without
prefix should have the values from the first table!
For those columns that are included in the unequal join
this does not apply! The name v is from the first table,
but the value is from the second table!
This is very confusing.
As a consequence of this, the following syntax
we can get completely confused with the following, when we
do not explicitly specify which tables the v columns comes from:
DT[X, on=.(x, v>=v)]
Look at the v column, Is it the value of the second table, or the first?
DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI]
Which y is being summed?
The text was updated successfully, but these errors were encountered: