[SPARK-17296][SQL] Simplify parser join processing. #14867

hvanhovell · 2016-08-29T20:34:50Z

What changes were proposed in this pull request?

Join processing in the parser relies on the fact that the grammar produces right-nested trees. For instance, the parse tree for select * from a join b join c is expected to produce a tree similar to JOIN(a, JOIN(b, c)). However, there are cases in which this invariant is violated, like:

SELECT COUNT(1)
FROM test T1 
     CROSS JOIN test T2
     JOIN test T3
      ON T3.col = T1.col
     JOIN test T4
      ON T4.col = T1.col

In this case the parser returns a tree in which Joins are located on both the left and the right sides of the parent join node.

This PR introduces a different grammar rule that does not make this assumption. The new rule takes a relation and searches for zero or more joined relations. As a bonus, processing is much easier.
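To illustrate the "relation followed by zero or more joined relations" shape described above, here is a minimal sketch of how such a list could be folded into a join tree. The `Plan`, `Table`, `Join`, and `JoinRelation` types and the `buildJoins` helper are hypothetical stand-ins invented for this example; they are not Spark's `LogicalPlan` classes or the actual `AstBuilder` code.

```scala
// Hypothetical, simplified plan nodes; not Spark's actual LogicalPlan classes.
sealed trait Plan
final case class Table(name: String) extends Plan
final case class Join(
    left: Plan,
    right: Plan,
    joinType: String,
    condition: Option[String]) extends Plan

// One parsed joined relation: a join type, the right-hand relation,
// and an optional ON condition (kept as a plain string here).
final case class JoinRelation(
    joinType: String,
    right: Plan,
    condition: Option[String])

object JoinFolding {
  // Start from the first relation and fold each joined relation onto the
  // plan built so far. With no joined relations the base plan is returned.
  def buildJoins(first: Plan, joins: Seq[JoinRelation]): Plan =
    joins.foldLeft(first) { (left, j) =>
      Join(left, j.right, j.joinType, j.condition)
    }

  // The query from the description, expressed as a base relation plus
  // three joined relations.
  val plan: Plan = buildJoins(
    Table("T1"),
    Seq(
      JoinRelation("cross", Table("T2"), None),
      JoinRelation("inner", Table("T3"), Some("T3.col = T1.col")),
      JoinRelation("inner", Table("T4"), Some("T4.col = T1.col"))))
}
```

Under these assumptions every join node has the previously built plan on its left and a single relation on its right, so the builder never has to inspect how the parser happened to nest the joins.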

How was this patch tested?

Covered by existing tests; I have also added a regression test to the plan parser suite.

SparkQA · 2016-08-29T22:52:20Z

Test build #64596 has finished for PR 14867 at commit f9cb0d2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2016-09-02T09:43:06Z

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

+joinRelation
+    : joinType JOIN right=relationPrimary joinCriteria?
+    | NATURAL joinType JOIN right=relationPrimary
    ;


I think NATURAL CROSS JOIN is invalid, so perhaps we should not include CROSS in joinType?

You have a point there. Let me update that.

I had to move the code around. I have added a check in the AstBuilder (the Spark side of the parser) to catch this.
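As a rough sketch of what a builder-side check for this case might look like: the `NaturalJoinCheck` object, its `check` method, and the error message below are all hypothetical and chosen only for illustration. The thread does not show the actual check or how it reports the error.

```scala
object NaturalJoinCheck {
  // Hypothetical helper: returns Left(message) for NATURAL CROSS JOIN and
  // Right(normalized join type) for every other combination.
  def check(isNatural: Boolean, joinType: String): Either[String, String] = {
    val normalized = joinType.trim.toLowerCase
    if (isNatural && normalized == "cross") {
      Left("NATURAL CROSS JOIN is not supported")
    } else {
      Right(normalized)
    }
  }
}
```

Doing the validation after parsing, rather than in the grammar, keeps a single `joinType` rule and leaves the rejection of the invalid combination to one explicit check.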

SparkQA · 2016-09-02T11:51:14Z

Test build #64849 has finished for PR 14867 at commit e20afbd.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-02T14:01:37Z

Test build #64853 has finished for PR 14867 at commit b09d506.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-09-02T16:58:51Z

cc @srinathshankar

srinathshankar · 2016-09-02T21:57:11Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala

+
+    // SPARK-17296
+    assertEqual(
+      "select * from t1 cross join t2 join t3 on t3.id = t1.id join t4 on t4.id = t1.id",


How is something like
SELECT * FROM T1 INNER JOIN T2 INNER JOIN T3 ON col3 = col2 ON col3 = col1;
supposed to parse?
Without your change it returns the following error:
org.apache.spark.sql.AnalysisException: cannot resolve 'col3' given input columns: [col1, col2]; line 1 pos 63
which I don't understand. The following parses though:
SELECT * FROM T1 INNER JOIN T2 INNER JOIN T3 ON col1 = col2 ON col2 = col1
and returns a result

To clarify, it looks like your patch will disallow both queries at the parser level. Could you add a test that enforces this?

Good catch. I have added a test.

srinathshankar · 2016-09-02T22:41:43Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala

+        .select(star()))
+
+    // Test multiple on clauses.
+    intercept("select * from t1 inner join t2 inner join t3 on col3 = col2 on col3 = col1")


As discussed, let's also add a test somewhere for
SELECT * FROM T1 INNER JOIN (T2 INNER JOIN T3 ON col3 = col2) ON col3 = col1
SELECT * FROM T1 INNER JOIN (T2 INNER JOIN T3) ON col3 = col2
SELECT * FROM T1 INNER JOIN (T2 INNER JOIN T3 ON col3 = col2)

This looks good to me.

SparkQA · 2016-09-03T00:54:02Z

Test build #64880 has finished for PR 14867 at commit 3b13cd7.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-09-03T01:10:47Z

Test build #64882 has finished for PR 14867 at commit fca4489.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-05T01:49:51Z

Test build #64926 has finished for PR 14867 at commit c30e665.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srinathshankar · 2016-09-06T16:57:33Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala

+        .select(star()))
+
+    // Implicit joins.
+    assertEqual(


Great, LGTM

hvanhovell · 2016-09-06T22:43:12Z

Merging to master/2.0. Thanks for the review!

## What changes were proposed in this pull request?
This PR backports #14867 to branch-2.0. It fixes a number of join ordering bugs.

## How was this patch tested?
Added tests to `PlanParserSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14984 from hvanhovell/SPARK-17296-branch-2.0.

hvanhovell added 2 commits August 29, 2016 22:12

Simplify join processing.

b25e2db

Merge remote-tracking branch 'apache-github/master' into SPARK-17296

f9cb0d2

# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

jiangxb1987 reviewed Sep 2, 2016

hvanhovell added 3 commits September 2, 2016 12:01

Update rule to prevent NATURAL CROSS JOIN...

91eafcc

Merge remote-tracking branch 'apache-github/master' into SPARK-17296

ad1e56b

Add more tests.

e20afbd

Allow a cross join with a condition.

b09d506

hvanhovell changed the title ~~[SPARK-17296][SQL] Simplify join parser join processing.~~ [SPARK-17296][SQL] Simplif parser join processing. Sep 2, 2016

hvanhovell changed the title ~~[SPARK-17296][SQL] Simplif parser join processing.~~ [SPARK-17296][SQL] Simplify parser join processing. Sep 2, 2016

srinathshankar reviewed Sep 2, 2016

Add tests for multiple on clauses.

3b13cd7

srinathshankar reviewed Sep 2, 2016

hvanhovell added 2 commits September 3, 2016 01:03

Merge remote-tracking branch 'apache-github/master' into SPARK-17296

8edb1e4

# Conflicts: # sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

Add parenthesis tests

fca4489

Properly support , separated joins

c30e665

srinathshankar reviewed Sep 6, 2016

asfgit closed this in 4f769b9 Sep 6, 2016

hvanhovell mentioned this pull request Sep 6, 2016

[SPARK-17296][SQL] Simplify parser join processing [BACKPORT 2.0] #14984

Closed

4 participants