-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Substrait: Fix incorrect join key fields (indices) when same table is being used more than once #6135
Conversation
Looking into failed checks. |
Failed CI checks look likely related to #6132 -- I will restart them as that issue has been resolved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good to me; I have a few test suggestions but otherwise 👍 from me
@@ -278,6 +278,18 @@ mod tests { | |||
roundtrip("SELECT a,b,c,d,e FROM datafusion.public.data;").await | |||
} | |||
|
|||
#[tokio::test] | |||
async fn roundtrip_inner_join_table_reuse() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given there are offsets it might help to also test columns other than a
(which is at offset 0 in each schema)
Perhaps separate tests for the following cases would add better coverage:
SELECT d1.b, d2.c FROM data d1 JOIN data d2 ON d1.b = d2.b
SELECT d1.b, d2.c FROM data d1 JOIN data d2 ON d1.b = d2.c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also as I recall USING
is also special so maybe a test for:
SELECT d1.b, d2.c FROM data d1 JOIN data d2 using (b)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @alamb!
I added the first test case.
For the second, since c
is type DATE
, the join condition is invalid. So I tried:
SELECT d1.b, d2.c FROM data d1 JOIN data d2 ON CAST(d1.b AS int) = d2.e
and got an error due to unsupported expression in join condition. I filed an issue to track here.
For the using
case, this requires producer support for the join constraint so I got the error Error: NotImplemented("join constraint")
. Let me know if you want this issue filed as well.
Thank you @nseekhao |
@nseekhao do you think you'll have time to add the other tests? |
"SELECT d1.b, d2.c FROM data d1 JOIN data d2 ON d1.a = d2.a", | ||
"Projection: data.b, data.c\ | ||
\n Inner Join: data.a = data.a\ | ||
\n TableScan: data projection=[a, b]\ | ||
\n TableScan: data projection=[a, c]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this case against the main branch and it can pass. I guess this is not the original query that causes the error?
BTW, if two tables have the same name, does it means they are the same table? (for this case, d1
and d2
are the same table data
). If so I think we need not distinguish "different" columns because they will eventually refer to the same column. Please let me know if I missed anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I recall correctly, this would have failed in the earlier version of DF where DuplicateQualifiedField
gets thrown in DFSchema::new_with_metadata()
. However, in the later version, this error does not get thrown anymore and only DuplicateUnqualifiedField
gets thrown here.
However, even if this error does not get thrown, and the test passes, it also does not mean that the produced Substrait plan is correct. If we join the same table to itself, the JoinRel
in the Substrait plan would expect two input relations. AFAIK, there is no notion of pointer, so we'll need two ReadRel
s created from the same table. And as far as the JoinRel
is concerned, left
and right
are two separate relations. Since Substrait uses indices as opposed to name qualifier, JoinRel
would expect the available input indices to be 0
to size(left output)-1
and size(left output)
to size(left output) + size(right output)-1
. For example, if the input table has 5 columns, it'll expect to get indices from 0
to 9
. And let's say we're trying to join first column from left
to second column from right
, that would be join condition of col_0 = col_6
. Note that without this PR, the join condition would be col_0 = col_1
since from the DF named qualifiers, the producer would find the columns from the left and right to come from the same table, but that will send the wrong message to Substrait.
Apologies for the delay, I have just returned from a short vacation |
Converting back to draft so I don't accidentally merge this as we work out the comments on this PR |
Update datafusion/substrait/src/logical_plan/producer.rs Co-authored-by: Ruihang Xia <waynestxia@gmail.com> Update datafusion/substrait/src/logical_plan/producer.rs Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
17c5ff4
to
9041892
Compare
@waynexia do you have time to help review this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @nseekhao -- I am not sure I followed the code 100% and I am not a substrait expert but the code looked good to me and the new tests are 👍
@@ -566,10 +581,33 @@ pub fn make_binary_op_scalar_func( | |||
} | |||
|
|||
/// Convert DataFusion Expr to Substrait Rex | |||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
… being used more than once (apache#6135) * Fix incorrect join key fields (indices) when same table is being used more than once * Addressed comments Update datafusion/substrait/src/logical_plan/producer.rs Co-authored-by: Ruihang Xia <waynestxia@gmail.com> Update datafusion/substrait/src/logical_plan/producer.rs Co-authored-by: Ruihang Xia <waynestxia@gmail.com> * Fixed bugs after rebase --------- Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
Which issue does this PR close?
Closes #.
This is a patch on top of PR#6059.
Rationale for this change
The PR#6059 fixed duplicate schema error. However, after inspecting the generated plans, it turned out that since we use table/col name from DF to translate to Substrait's field index, the producer has no way of distinguishing between the same column name from
left
andright
if the table name is also the same.What changes are included in this PR?
To fix the problem described, instead of constructing the join condition as one big expression then parse it, each side gets parsed separately so the producer can keep track of field index offset (based on number of columns from
left
) when it tries to parse theright
part of the join condition.Are these changes tested?
Yes.
Are there any user-facing changes?
No