Skip to content

Conversation

@stanzhai
Copy link
Contributor

What changes were proposed in this pull request?

This PR fixes the code in Optimizer phase where the constant alias columns of a INNER JOIN query are folded in Rule FoldablePropagation.

For the following query():

val sqlA =
  """
    |create temporary view ta as
    |select a, 'a' as tag from t1 union all
    |select a, 'b' as tag from t2
  """.stripMargin

val sqlB =
  """
    |create temporary view tb as
    |select a, 'a' as tag from t3 union all
    |select a, 'b' as tag from t4
  """.stripMargin

val sql =
  """
    |select tb.* from ta inner join tb on
    |ta.a = tb.a and
    |ta.tag = tb.tag
  """.stripMargin

The tag column is an constant alias column, it's folded by FoldablePropagation like this:

TRACE SparkOptimizer: 
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
 Project [a#4, tag#14]                              Project [a#4, tag#14]
!+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, ((a#0 = a#4) && (a = a))
    :- Union                                           :- Union
    :  :- Project [a#0, a AS tag#8]                    :  :- Project [a#0, a AS tag#8]
    :  :  +- LocalRelation [a#0]                       :  :  +- LocalRelation [a#0]
    :  +- Project [a#2, b AS tag#9]                    :  +- Project [a#2, b AS tag#9]
    :     +- LocalRelation [a#2]                       :     +- LocalRelation [a#2]
    +- Union                                           +- Union
       :- Project [a#4, a AS tag#14]                      :- Project [a#4, a AS tag#14]
       :  +- LocalRelation [a#4]                          :  +- LocalRelation [a#4]
       +- Project [a#6, b AS tag#15]                      +- Project [a#6, b AS tag#15]
          +- LocalRelation [a#6]                             +- LocalRelation [a#6]

Finally the Result of Batch Operator Optimizations is:

Project [a#4, tag#14]                              Project [a#4, tag#14]
!+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, (a#0 = a#4)
!   :- SubqueryAlias ta, `ta`                          :- Union
!   :  +- Union                                        :  :- LocalRelation [a#0]
!   :     :- Project [a#0, a AS tag#8]                 :  +- LocalRelation [a#2]
!   :     :  +- SubqueryAlias t1, `t1`                 +- Union
!   :     :     +- Project [a#0]                          :- LocalRelation [a#4, tag#14]
!   :     :        +- SubqueryAlias grouping              +- LocalRelation [a#6, tag#15]
!   :     :           +- LocalRelation [a#0]        
!   :     +- Project [a#2, b AS tag#9]              
!   :        +- SubqueryAlias t2, `t2`              
!   :           +- Project [a#2]                    
!   :              +- SubqueryAlias grouping        
!   :                 +- LocalRelation [a#2]        
!   +- SubqueryAlias tb, `tb`                       
!      +- Union                                     
!         :- Project [a#4, a AS tag#14]             
!         :  +- SubqueryAlias t3, `t3`              
!         :     +- Project [a#4]                    
!         :        +- SubqueryAlias grouping        
!         :           +- LocalRelation [a#4]        
!         +- Project [a#6, b AS tag#15]             
!            +- SubqueryAlias t4, `t4`              
!               +- Project [a#6]                    
!                  +- SubqueryAlias grouping        
!                     +- LocalRelation [a#6]    

The condition tag#8 = tag#14 of INNER JOIN has been removed. This leads to the data of inner join being wrong.

After fix:

=== Result of Batch LocalRelation ===
 GlobalLimit 21                                           GlobalLimit 21
 +- LocalLimit 21                                         +- LocalLimit 21
    +- Project [a#4, tag#11]                                 +- Project [a#4, tag#11]
       +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11))         +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11))
!         :- SubqueryAlias ta                                      :- Union
!         :  +- Union                                              :  :- LocalRelation [a#0, tag#8]
!         :     :- Project [a#0, a AS tag#8]                       :  +- LocalRelation [a#2, tag#9]
!         :     :  +- SubqueryAlias t1                             +- Union
!         :     :     +- Project [a#0]                                :- LocalRelation [a#4, tag#11]
!         :     :        +- SubqueryAlias grouping                    +- LocalRelation [a#6, tag#12]
!         :     :           +- LocalRelation [a#0]        
!         :     +- Project [a#2, b AS tag#9]              
!         :        +- SubqueryAlias t2                    
!         :           +- Project [a#2]                    
!         :              +- SubqueryAlias grouping        
!         :                 +- LocalRelation [a#2]        
!         +- SubqueryAlias tb                             
!            +- Union                                     
!               :- Project [a#4, a AS tag#11]             
!               :  +- SubqueryAlias t3                    
!               :     +- Project [a#4]                    
!               :        +- SubqueryAlias grouping        
!               :           +- LocalRelation [a#4]        
!               +- Project [a#6, b AS tag#12]             
!                  +- SubqueryAlias t4                    
!                     +- Project [a#6]                    
!                        +- SubqueryAlias grouping        
!                           +- LocalRelation [a#6]  

How was this patch tested?

add sql-tests/inputs/inner-join.sql
All tests passed.

@stanzhai stanzhai changed the title Constant alias columns in INNER JOIN should not be folded by FoldablePropagation rule [SPARK-19766][SQL] Constant alias columns in INNER JOIN should not be folded by FoldablePropagation rule Feb 28, 2017
@stanzhai
Copy link
Contributor Author

@hvanhovell

@hvanhovell
Copy link
Contributor

ok to test

// join is not always picked from its children, but can also be null.
// TODO(cloud-fan): It seems more reasonable to use new attributes as the output attributes
// of outer join.
case j @ Join(_, _, Inner, _) =>
Copy link
Contributor

@hvanhovell hvanhovell Feb 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We forgot to check stop here. Can you just change this line into:

case j @ Join(_, _, Inner, _) if !stop =>

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73588 has finished for PR 17099 at commit 4463648.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73592 has finished for PR 17099 at commit fc819f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SELECT tb.* FROM ta INNER JOIN tb ON ta.a = tb.a AND ta.tag = tb.tag;

-- Clean up
DROP VIEW IF EXISTS t1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think you need these drop views, since TEMPORARY VIEW are destroyed immediately after this file.

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73594 has finished for PR 17099 at commit c197b20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Could you add a test case to FoldablePropagationSuite? Thanks!

.union(testRelation.select('a, Literal("b").as('tag)))
.subquery('tb)
val query = ta.join(tb, Inner,
Some("ta.a".attr === "tb.a".attr && "ta.tag".attr === "tb.tag"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong. What you are doing is to compare the column ta.tag with a string constant "tb.tag"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> Some("ta.a".attr === "tb.a".attr && "ta.tag".attr === "tb.tag".attr))

Then add the rule ConstantFolding into the test suite.

@SparkQA
Copy link

SparkQA commented Mar 1, 2017

Test build #73659 has finished for PR 17099 at commit 15fae50.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@stanzhai
Copy link
Contributor Author

stanzhai commented Mar 1, 2017

Thanks for @gatorsmile 's help.

ConstantFolding will affect other test cases in FoldablePropagationSuite.

It's fine without adding ConstantFolding.

Before fix:

[info]   !'Join Inner, ((a#0 = a#0) && (1 = 1))        'Join Inner, (('tb.a = 'ta.a) && ('tb.tag = 'ta.tag))
[info]   !:- Union                                     :- 'SubqueryAlias ta
[info]   !:  :- Project [a#0, 1 AS tag#0]              :  +- 'Union
[info]   !:  :  +- LocalRelation <empty>, [a#0, b#0]   :     :- 'Project ['a, 1 AS tag#0]
[info]   !:  +- Project [a#0, 2 AS tag#0]              :     :  +- LocalRelation <empty>, [a#0, b#0]
[info]   !:     +- LocalRelation <empty>, [a#0, b#0]   :     +- 'Project ['a, 2 AS tag#0]
[info]   !+- Union                                     :        +- LocalRelation <empty>, [a#0, b#0]
[info]   !   :- Project [a#0, 1 AS tag#0]              +- 'SubqueryAlias tb
[info]   !   :  +- LocalRelation <empty>, [a#0, b#0]      +- 'Union
[info]   !   +- Project [a#0, 2 AS tag#0]                    :- 'Project ['a, 1 AS tag#0]
[info]   !      +- LocalRelation <empty>, [a#0, b#0]         :  +- LocalRelation <empty>, [a#0, b#0]
[info]   !                                                   +- 'Project ['a, 2 AS tag#0]
[info]   !                                                      +- LocalRelation <empty>, [a#0, b#0] (PlanTest.scala:99)

After fix:

[info]   !'Join Inner, ((a#0 = a#0) && (tag#0 = tag#0))   'Join Inner, (('tb.a = 'ta.a) && ('tb.tag = 'ta.tag))
[info]   !:- Union                                        :- 'SubqueryAlias ta
[info]   !:  :- Project [a#0, 1 AS tag#0]                 :  +- 'Union
[info]   !:  :  +- LocalRelation <empty>, [a#0, b#0]      :     :- 'Project ['a, 1 AS tag#0]
[info]   !:  +- Project [a#0, 2 AS tag#0]                 :     :  +- LocalRelation <empty>, [a#0, b#0]
[info]   !:     +- LocalRelation <empty>, [a#0, b#0]      :     +- 'Project ['a, 2 AS tag#0]
[info]   !+- Union                                        :        +- LocalRelation <empty>, [a#0, b#0]
[info]   !   :- Project [a#0, 1 AS tag#0]                 +- 'SubqueryAlias tb
[info]   !   :  +- LocalRelation <empty>, [a#0, b#0]         +- 'Union
[info]   !   +- Project [a#0, 2 AS tag#0]                       :- 'Project ['a, 1 AS tag#0]
[info]   !      +- LocalRelation <empty>, [a#0, b#0]            :  +- LocalRelation <empty>, [a#0, b#0]
[info]   !                                                      +- 'Project ['a, 2 AS tag#0]
[info]   !                                                         +- LocalRelation <empty>, [a#0, b#0] (PlanTest.scala:99)

I just fix the test case("tb.tag" -> "tb.tag".attr).

@gatorsmile
Copy link
Member

LGTM

@SparkQA
Copy link

SparkQA commented Mar 1, 2017

Test build #73669 has finished for PR 17099 at commit df11cc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Mar 1, 2017
… folded by FoldablePropagation rule

## What changes were proposed in this pull request?
This PR fixes the code in Optimizer phase where the constant alias columns of a `INNER JOIN` query are folded in Rule `FoldablePropagation`.

For the following query():

```
val sqlA =
  """
    |create temporary view ta as
    |select a, 'a' as tag from t1 union all
    |select a, 'b' as tag from t2
  """.stripMargin

val sqlB =
  """
    |create temporary view tb as
    |select a, 'a' as tag from t3 union all
    |select a, 'b' as tag from t4
  """.stripMargin

val sql =
  """
    |select tb.* from ta inner join tb on
    |ta.a = tb.a and
    |ta.tag = tb.tag
  """.stripMargin
```

The tag column is an constant alias column, it's folded by `FoldablePropagation` like this:

```
TRACE SparkOptimizer:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
 Project [a#4, tag#14]                              Project [a#4, tag#14]
!+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, ((a#0 = a#4) && (a = a))
    :- Union                                           :- Union
    :  :- Project [a#0, a AS tag#8]                    :  :- Project [a#0, a AS tag#8]
    :  :  +- LocalRelation [a#0]                       :  :  +- LocalRelation [a#0]
    :  +- Project [a#2, b AS tag#9]                    :  +- Project [a#2, b AS tag#9]
    :     +- LocalRelation [a#2]                       :     +- LocalRelation [a#2]
    +- Union                                           +- Union
       :- Project [a#4, a AS tag#14]                      :- Project [a#4, a AS tag#14]
       :  +- LocalRelation [a#4]                          :  +- LocalRelation [a#4]
       +- Project [a#6, b AS tag#15]                      +- Project [a#6, b AS tag#15]
          +- LocalRelation [a#6]                             +- LocalRelation [a#6]
```

Finally the Result of Batch Operator Optimizations is:

```
Project [a#4, tag#14]                              Project [a#4, tag#14]
!+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, (a#0 = a#4)
!   :- SubqueryAlias ta, `ta`                          :- Union
!   :  +- Union                                        :  :- LocalRelation [a#0]
!   :     :- Project [a#0, a AS tag#8]                 :  +- LocalRelation [a#2]
!   :     :  +- SubqueryAlias t1, `t1`                 +- Union
!   :     :     +- Project [a#0]                          :- LocalRelation [a#4, tag#14]
!   :     :        +- SubqueryAlias grouping              +- LocalRelation [a#6, tag#15]
!   :     :           +- LocalRelation [a#0]
!   :     +- Project [a#2, b AS tag#9]
!   :        +- SubqueryAlias t2, `t2`
!   :           +- Project [a#2]
!   :              +- SubqueryAlias grouping
!   :                 +- LocalRelation [a#2]
!   +- SubqueryAlias tb, `tb`
!      +- Union
!         :- Project [a#4, a AS tag#14]
!         :  +- SubqueryAlias t3, `t3`
!         :     +- Project [a#4]
!         :        +- SubqueryAlias grouping
!         :           +- LocalRelation [a#4]
!         +- Project [a#6, b AS tag#15]
!            +- SubqueryAlias t4, `t4`
!               +- Project [a#6]
!                  +- SubqueryAlias grouping
!                     +- LocalRelation [a#6]
```

The condition `tag#8 = tag#14` of INNER JOIN has been removed. This leads to the data of inner join being wrong.

After fix:

```
=== Result of Batch LocalRelation ===
 GlobalLimit 21                                           GlobalLimit 21
 +- LocalLimit 21                                         +- LocalLimit 21
    +- Project [a#4, tag#11]                                 +- Project [a#4, tag#11]
       +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11))         +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11))
!         :- SubqueryAlias ta                                      :- Union
!         :  +- Union                                              :  :- LocalRelation [a#0, tag#8]
!         :     :- Project [a#0, a AS tag#8]                       :  +- LocalRelation [a#2, tag#9]
!         :     :  +- SubqueryAlias t1                             +- Union
!         :     :     +- Project [a#0]                                :- LocalRelation [a#4, tag#11]
!         :     :        +- SubqueryAlias grouping                    +- LocalRelation [a#6, tag#12]
!         :     :           +- LocalRelation [a#0]
!         :     +- Project [a#2, b AS tag#9]
!         :        +- SubqueryAlias t2
!         :           +- Project [a#2]
!         :              +- SubqueryAlias grouping
!         :                 +- LocalRelation [a#2]
!         +- SubqueryAlias tb
!            +- Union
!               :- Project [a#4, a AS tag#11]
!               :  +- SubqueryAlias t3
!               :     +- Project [a#4]
!               :        +- SubqueryAlias grouping
!               :           +- LocalRelation [a#4]
!               +- Project [a#6, b AS tag#12]
!                  +- SubqueryAlias t4
!                     +- Project [a#6]
!                        +- SubqueryAlias grouping
!                           +- LocalRelation [a#6]
```

## How was this patch tested?

add sql-tests/inputs/inner-join.sql
All tests passed.

Author: Stan Zhai <zhaishidan@haizhi.com>

Closes #17099 from stanzhai/fix-inner-join.

(cherry picked from commit 5502a9c)
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
@asfgit asfgit closed this in 5502a9c Mar 1, 2017
@gatorsmile
Copy link
Member

Thanks! Merging to master/2.1

@gatorsmile
Copy link
Member

@stanzhai Could you submit another PR to backport it to Spark 2.0?

@stanzhai
Copy link
Contributor Author

stanzhai commented Mar 2, 2017

ok

@@ -0,0 +1,68 @@
-- Automatically generated by SQLQueryTestSuite
-- Number of queries: 13
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this number is wrong. Next time, please do not manually change this file. You should run the command to generate the file. @stanzhai

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
I will pay attention to this next time.

asfgit pushed a commit that referenced this pull request Mar 2, 2017
…hould not be folded by FoldablePropagation rule

This PR fix for branch-2.0

Refer #17099

gatorsmile

Author: Stan Zhai <zhaishidan@haizhi.com>

Closes #17131 from stanzhai/fix-inner-join-2.0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants