[SPARK-19766][SQL] Constant alias columns in INNER JOIN should not be folded by FoldablePropagation rule #17099
Conversation
|
ok to test |
// join is not always picked from its children, but can also be null.
// TODO(cloud-fan): It seems more reasonable to use new attributes as the output attributes
// of outer join.
case j @ Join(_, _, Inner, _) =>
We forgot to check `stop` here. Could you change this line to:
case j @ Join(_, _, Inner, _) if !stop =>
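To illustrate why the guard matters, here is a minimal sketch of a bottom-up rewrite with a `stop` flag. It assumes `stop` is set once the traversal meets an operator the rule cannot propagate through. The `Node`, `Barrier`, and `InnerJoin` types and the `checkStop` parameter are hypothetical stand-ins, not Catalyst classes.

```scala
object StopGuardSketch {
  // Toy plan nodes, invented for illustration only.
  sealed trait Node
  final case class Leaf(name: String) extends Node
  // Stands in for any operator the rule must not propagate through.
  final case class Barrier(child: Node) extends Node
  final case class InnerJoin(left: Node, right: Node, rewritten: Boolean = false) extends Node

  // Bottom-up traversal: children are visited before their parent.
  // `checkStop = false` mimics the case clause without the `if !stop` guard.
  def rewrite(plan: Node, checkStop: Boolean): Node = {
    var stop = false
    def go(node: Node): Node = node match {
      case leaf: Leaf => leaf
      case InnerJoin(l, r, _) =>
        val (newL, newR) = (go(l), go(r))
        // With the guard, the join is only rewritten while `stop` is unset.
        InnerJoin(newL, newR, rewritten = !(checkStop && stop))
      case Barrier(c) =>
        val newC = go(c)
        stop = true // everything above this node must be left alone
        Barrier(newC)
    }
    go(plan)
  }

  val plan: Node = InnerJoin(Barrier(Leaf("t1")), Leaf("t2"))
  val unguarded: Node = rewrite(plan, checkStop = false)
  val guarded: Node = rewrite(plan, checkStop = true)
}
```

In this toy model the unguarded variant still rewrites the join that sits above the `Barrier`, while the guarded variant leaves it untouched.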
Test build #73588 has finished for PR 17099 at commit
|
|
Test build #73592 has finished for PR 17099 at commit
|
SELECT tb.* FROM ta INNER JOIN tb ON ta.a = tb.a AND ta.tag = tb.tag;

-- Clean up
DROP VIEW IF EXISTS t1;
I don't think you need these `DROP VIEW` statements, since temporary views are destroyed right after this file finishes.
|
Test build #73594 has finished for PR 17099 at commit
|
|
Could you add a test case to |
.union(testRelation.select('a, Literal("b").as('tag)))
.subquery('tb)
val query = ta.join(tb, Inner,
Some("ta.a".attr === "tb.a".attr && "ta.tag".attr === "tb.tag"))
This is wrong. You are comparing the column `ta.tag` with the string constant "tb.tag".
It should be: Some("ta.a".attr === "tb.a".attr && "ta.tag".attr === "tb.tag".attr))
Then add the rule `ConstantFolding` to the test suite.
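The mistake pointed out above can be reproduced in a small toy DSL. This is only a sketch: `Expr`, `Attr`, `Lit`, `EqualTo`, and the implicit helpers below are hypothetical stand-ins written for this example, not the Catalyst DSL itself. It assumes, as the review comment implies, that a bare `String` is implicitly treated as a literal unless `.attr` is called on it.

```scala
import scala.language.implicitConversions

object ToyDsl {
  // Toy expression tree, invented for illustration only.
  sealed trait Expr {
    def ===(other: Expr): Expr = EqualTo(this, other)
  }
  final case class Attr(name: String) extends Expr
  final case class Lit(value: String) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Expr

  // A bare String silently becomes a string literal.
  implicit def stringToLit(s: String): Expr = Lit(s)

  // Calling `.attr` explicitly marks the String as a column reference.
  implicit class AttrOps(s: String) {
    def attr: Attr = Attr(s)
  }

  // Missing `.attr` on the right: compares a column with a constant.
  val wrong: Expr = "ta.tag".attr === "tb.tag"
  // Both sides are column references, which is what the join needs.
  val right: Expr = "ta.tag".attr === "tb.tag".attr
}
```

Both expressions compile, which is why this kind of slip is easy to miss in a test case.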
|
Test build #73659 has finished for PR 17099 at commit
|
|
Thanks to @gatorsmile for the help.
It's fine without adding Before fix: After fix: I just fix the test case( |
|
LGTM |
|
Test build #73669 has finished for PR 17099 at commit
|
… folded by FoldablePropagation rule
## What changes were proposed in this pull request?
This PR fixes a bug in the Optimizer phase where the constant alias columns of an `INNER JOIN` query are folded by the `FoldablePropagation` rule.
For the following query:
```
val sqlA =
"""
|create temporary view ta as
|select a, 'a' as tag from t1 union all
|select a, 'b' as tag from t2
""".stripMargin
val sqlB =
"""
|create temporary view tb as
|select a, 'a' as tag from t3 union all
|select a, 'b' as tag from t4
""".stripMargin
val sql =
"""
|select tb.* from ta inner join tb on
|ta.a = tb.a and
|ta.tag = tb.tag
""".stripMargin
```
The `tag` column is a constant alias column, and it is folded by `FoldablePropagation` like this:
```
TRACE SparkOptimizer:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
Project [a#4, tag#14] Project [a#4, tag#14]
!+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14)) +- Join Inner, ((a#0 = a#4) && (a = a))
:- Union :- Union
: :- Project [a#0, a AS tag#8] : :- Project [a#0, a AS tag#8]
: : +- LocalRelation [a#0] : : +- LocalRelation [a#0]
: +- Project [a#2, b AS tag#9] : +- Project [a#2, b AS tag#9]
: +- LocalRelation [a#2] : +- LocalRelation [a#2]
+- Union +- Union
:- Project [a#4, a AS tag#14] :- Project [a#4, a AS tag#14]
: +- LocalRelation [a#4] : +- LocalRelation [a#4]
+- Project [a#6, b AS tag#15] +- Project [a#6, b AS tag#15]
+- LocalRelation [a#6] +- LocalRelation [a#6]
```
Finally, the result of the Operator Optimizations batch is:
```
Project [a#4, tag#14] Project [a#4, tag#14]
!+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14)) +- Join Inner, (a#0 = a#4)
! :- SubqueryAlias ta, `ta` :- Union
! : +- Union : :- LocalRelation [a#0]
! : :- Project [a#0, a AS tag#8] : +- LocalRelation [a#2]
! : : +- SubqueryAlias t1, `t1` +- Union
! : : +- Project [a#0] :- LocalRelation [a#4, tag#14]
! : : +- SubqueryAlias grouping +- LocalRelation [a#6, tag#15]
! : : +- LocalRelation [a#0]
! : +- Project [a#2, b AS tag#9]
! : +- SubqueryAlias t2, `t2`
! : +- Project [a#2]
! : +- SubqueryAlias grouping
! : +- LocalRelation [a#2]
! +- SubqueryAlias tb, `tb`
! +- Union
! :- Project [a#4, a AS tag#14]
! : +- SubqueryAlias t3, `t3`
! : +- Project [a#4]
! : +- SubqueryAlias grouping
! : +- LocalRelation [a#4]
! +- Project [a#6, b AS tag#15]
! +- SubqueryAlias t4, `t4`
! +- Project [a#6]
! +- SubqueryAlias grouping
! +- LocalRelation [a#6]
```
The condition `tag#8 = tag#14` of the INNER JOIN has been removed, so the inner join returns wrong data.
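The effect of losing the `tag` condition can be seen with plain Scala collections. This is a sketch with made-up table contents (the PR does not show the data in `t1` to `t4`); `innerJoin`, `correct`, and `folded` are names invented for this example.

```scala
object InnerJoinDemo {
  type Row = (Int, String) // (a, tag)

  // Hypothetical contents: each view is a union of two tagged tables.
  val ta: Seq[Row] = Seq(1, 2).map(a => (a, "a")) ++ Seq(1, 3).map(a => (a, "b"))
  val tb: Seq[Row] = Seq(1, 2).map(a => (a, "a")) ++ Seq(1, 3).map(a => (a, "b"))

  // Models: select tb.* from ta inner join tb on <cond>
  def innerJoin(cond: (Row, Row) => Boolean): Seq[Row] =
    for (l <- ta; r <- tb if cond(l, r)) yield r

  // Intended condition: ta.a = tb.a and ta.tag = tb.tag
  val correct: Seq[Row] = innerJoin((l, r) => l._1 == r._1 && l._2 == r._2)

  // Condition left after the faulty optimization: ta.a = tb.a only
  val folded: Seq[Row] = innerJoin((l, r) => l._1 == r._1)

  def main(args: Array[String]): Unit = {
    println(s"correct: ${correct.size} rows, folded: ${folded.size} rows")
  }
}
```

With this sample data the intended join yields 4 rows, while the join without the `tag` condition yields 6, because rows with `a = 1` match across different tags.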
After fix:
```
=== Result of Batch LocalRelation ===
GlobalLimit 21 GlobalLimit 21
+- LocalLimit 21 +- LocalLimit 21
+- Project [a#4, tag#11] +- Project [a#4, tag#11]
+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11)) +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11))
! :- SubqueryAlias ta :- Union
! : +- Union : :- LocalRelation [a#0, tag#8]
! : :- Project [a#0, a AS tag#8] : +- LocalRelation [a#2, tag#9]
! : : +- SubqueryAlias t1 +- Union
! : : +- Project [a#0] :- LocalRelation [a#4, tag#11]
! : : +- SubqueryAlias grouping +- LocalRelation [a#6, tag#12]
! : : +- LocalRelation [a#0]
! : +- Project [a#2, b AS tag#9]
! : +- SubqueryAlias t2
! : +- Project [a#2]
! : +- SubqueryAlias grouping
! : +- LocalRelation [a#2]
! +- SubqueryAlias tb
! +- Union
! :- Project [a#4, a AS tag#11]
! : +- SubqueryAlias t3
! : +- Project [a#4]
! : +- SubqueryAlias grouping
! : +- LocalRelation [a#4]
! +- Project [a#6, b AS tag#12]
! +- SubqueryAlias t4
! +- Project [a#6]
! +- SubqueryAlias grouping
! +- LocalRelation [a#6]
```
## How was this patch tested?
Added sql-tests/inputs/inner-join.sql.
All tests passed.
Author: Stan Zhai <zhaishidan@haizhi.com>
Closes #17099 from stanzhai/fix-inner-join.
(cherry picked from commit 5502a9c)
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
|
Thanks! Merging to master/2.1 |
|
@stanzhai Could you submit another PR to backport it to Spark 2.0? |
|
ok |
@@ -0,0 +1,68 @@
-- Automatically generated by SQLQueryTestSuite
-- Number of queries: 13
Actually, this number is wrong. Next time, please do not manually change this file. You should run the command to generate the file. @stanzhai
Thanks!
I will pay attention to this next time.