Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Substrait: Fix incorrect join key fields (indices) when same table is being used more than once #6135

Merged
merged 3 commits into from
Jun 11, 2023

Conversation

nseekhao
Copy link
Contributor

Which issue does this PR close?

Closes #.
This is a patch on top of PR#6059.

Rationale for this change

The PR#6059 fixed duplicate schema error. However, after inspecting the generated plans, it turned out that since we use table/col name from DF to translate to Substrait's field index, the producer has no way of distinguishing between the same column name from left and right if the table name is also the same.

What changes are included in this PR?

To fix the problem described, instead of constructing the join condition as one big expression then parse it, each side gets parsed separately so the producer can keep track of field index offset (based on number of columns from left) when it tries to parse the right part of the join condition.

Are these changes tested?

Yes.

Are there any user-facing changes?

No

@nseekhao
Copy link
Contributor Author

@alamb and @waynexia

I found that my last PR didn't completely fix the bug. I'm not sure if this is the best way to do it, but I think the join code is cleaner and the resulting plans I checked are correct. Any suggestions are welcome.

Thanks!

@nseekhao
Copy link
Contributor Author

Looking into failed checks.

@alamb
Copy link
Contributor

alamb commented Apr 27, 2023

Failed CI checks look likely related to #6132 -- I will restart them as that issue has been resolved

@alamb alamb closed this Apr 27, 2023
@alamb alamb reopened this Apr 27, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me; I have a few test suggestions but otherwise 👍 from me

@@ -278,6 +278,18 @@ mod tests {
roundtrip("SELECT a,b,c,d,e FROM datafusion.public.data;").await
}

#[tokio::test]
async fn roundtrip_inner_join_table_reuse() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given there are offsets it might help to also test columns other than a (which is at offset 0 in each schema)

Perhaps separate tests for the following cases would add better coverage:

  • SELECT d1.b, d2.c FROM data d1 JOIN data d2 ON d1.b = d2.b
  • SELECT d1.b, d2.c FROM data d1 JOIN data d2 ON d1.b = d2.c

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also as I recall USING is also special so maybe a test for:

  • SELECT d1.b, d2.c FROM data d1 JOIN data d2 using (b)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @alamb!

I added the first test case.

For the second, since c is type DATE, the join condition is invalid. So I tried:

SELECT d1.b, d2.c FROM data d1 JOIN data d2 ON CAST(d1.b AS int) = d2.e

and got an error due to unsupported expression in join condition. I filed an issue to track here.

For the using case, this requires producer support for the join constraint so I got the error Error: NotImplemented("join constraint"). Let me know if you want this issue filed as well.

@alamb
Copy link
Contributor

alamb commented Apr 28, 2023

Thank you @nseekhao

@alamb
Copy link
Contributor

alamb commented May 3, 2023

@nseekhao do you think you'll have time to add the other tests?

datafusion/substrait/src/logical_plan/producer.rs Outdated Show resolved Hide resolved
datafusion/substrait/src/logical_plan/producer.rs Outdated Show resolved Hide resolved
Comment on lines +284 to +422
"SELECT d1.b, d2.c FROM data d1 JOIN data d2 ON d1.a = d2.a",
"Projection: data.b, data.c\
\n Inner Join: data.a = data.a\
\n TableScan: data projection=[a, b]\
\n TableScan: data projection=[a, c]",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this case against the main branch and it can pass. I guess this is not the original query that causes the error?

BTW, if two tables have the same name, does it means they are the same table? (for this case, d1 and d2 are the same table data). If so I think we need not distinguish "different" columns because they will eventually refer to the same column. Please let me know if I missed anything

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I recall correctly, this would have failed in the earlier version of DF where DuplicateQualifiedField gets thrown in DFSchema::new_with_metadata(). However, in the later version, this error does not get thrown anymore and only DuplicateUnqualifiedField gets thrown here.

However, even if this error does not get thrown, and the test passes, it also does not mean that the produced Substrait plan is correct. If we join the same table to itself, the JoinRel in the Substrait plan would expect two input relations. AFAIK, there is no notion of pointer, so we'll need two ReadRels created from the same table. And as far as the JoinRel is concerned, left and right are two separate relations. Since Substrait uses indices as opposed to name qualifier, JoinRel would expect the available input indices to be 0 to size(left output)-1 and size(left output) to size(left output) + size(right output)-1. For example, if the input table has 5 columns, it'll expect to get indices from 0 to 9. And let's say we're trying to join first column from left to second column from right, that would be join condition of col_0 = col_6. Note that without this PR, the join condition would be col_0 = col_1 since from the DF named qualifiers, the producer would find the columns from the left and right to come from the same table, but that will send the wrong message to Substrait.

@waynexia
Copy link
Member

waynexia commented May 4, 2023

Apologies for the delay, I have just returned from a short vacation

@alamb alamb self-requested a review May 4, 2023 14:02
@alamb alamb marked this pull request as draft May 4, 2023 14:02
@alamb
Copy link
Contributor

alamb commented May 4, 2023

Converting back to draft so I don't accidentally merge this as we work out the comments on this PR

@nseekhao
Copy link
Contributor Author

@nseekhao do you think you'll have time to add the other tests?

My apologies. I missed this thread before my vacation, I'll address the comments this week. Thank you for the reviews @alamb and @waynexia

@nseekhao nseekhao marked this pull request as ready for review June 6, 2023 15:52
nseekhao and others added 2 commits June 6, 2023 09:00
Update datafusion/substrait/src/logical_plan/producer.rs

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>

Update datafusion/substrait/src/logical_plan/producer.rs

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
@nseekhao nseekhao force-pushed the substrait-join-patch-table-reuse branch from 17c5ff4 to 9041892 Compare June 6, 2023 16:03
@alamb alamb requested a review from waynexia June 8, 2023 18:29
@alamb
Copy link
Contributor

alamb commented Jun 8, 2023

@waynexia do you have time to help review this PR?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nseekhao -- I am not sure I followed the code 100% and I am not a substrait expert but the code looked good to me and the new tests are 👍

@@ -566,10 +581,33 @@ pub fn make_binary_op_scalar_func(
}

/// Convert DataFusion Expr to Substrait Rex
///
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit e6265c1 into apache:main Jun 11, 2023
jayzhan211 pushed a commit to jayzhan211/datafusion that referenced this pull request Jun 12, 2023
… being used more than once (apache#6135)

* Fix incorrect join key fields (indices) when same table is being used more than once

* Addressed comments

Update datafusion/substrait/src/logical_plan/producer.rs

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>

Update datafusion/substrait/src/logical_plan/producer.rs

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>

* Fixed bugs after rebase

---------

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants