[SPARK-17296][SQL] Simplify parser join processing. #14867
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
Test build #64596 has finished for PR 14867 at commit
    joinRelation
        : joinType JOIN right=relationPrimary joinCriteria?
        | NATURAL joinType JOIN right=relationPrimary
        ;
I think NATURAL CROSS JOIN is invalid, so perhaps we should not include CROSS in joinType?
You have a point there. Let me update that.
I had to move the code around. I have added a check in the AstBuilder (the Spark side of the parser) to catch this.
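For readers following along, a check of this kind could look roughly like the sketch below. It is only an illustration under assumptions: the `JoinType` hierarchy, `SketchParseException`, and `checkNatural` are hypothetical stand-ins and are not the code in this PR.

```scala
object NaturalJoinCheckSketch {
  // Hypothetical join types; the real parser has its own hierarchy.
  sealed trait JoinType
  case object Inner extends JoinType
  case object Cross extends JoinType
  case object LeftOuter extends JoinType

  // Hypothetical stand-in for the parser's error type.
  final class SketchParseException(message: String) extends Exception(message)

  // The grammar still accepts `NATURAL <joinType> JOIN`, so the builder
  // rejects the one combination that makes no sense: NATURAL CROSS JOIN.
  def checkNatural(isNatural: Boolean, joinType: JoinType): JoinType = {
    if (isNatural && joinType == Cross) {
      throw new SketchParseException("NATURAL CROSS JOIN is not supported")
    }
    joinType
  }
}
```

Keeping the grammar rule permissive and validating in the builder means the rule stays small, at the cost of the error surfacing one step later than a pure grammar failure.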
Test build #64849 has finished for PR 14867 at commit
Test build #64853 has finished for PR 14867 at commit
    // SPARK-17296
    assertEqual(
      "select * from t1 cross join t2 join t3 on t3.id = t1.id join t4 on t4.id = t1.id",
How is something like
SELECT * FROM T1 INNER JOIN T2 INNER JOIN T3 ON col3 = col2 ON col3 = col1;
supposed to parse?
Without your change it returns the following error:
org.apache.spark.sql.AnalysisException: cannot resolve 'col3' given input columns: [col1, col2]; line 1 pos 63
which I don't understand. The following parses, though:
SELECT * FROM T1 INNER JOIN T2 INNER JOIN T3 ON col1 = col2 ON col2 = col1
and returns a result.
To clarify, it looks like your patch will disallow both queries at the parser level. Could you add a test that enforces this?
Good catch. I have added a test.
    .select(star()))

    // Test multiple on clauses.
    intercept("select * from t1 inner join t2 inner join t3 on col3 = col2 on col3 = col1")
As discussed, let's also add a test somewhere for
SELECT * FROM T1 INNER JOIN (T2 INNER JOIN T3 ON col3 = col2) ON col3 = col1
SELECT * FROM T1 INNER JOIN (T2 INNER JOIN T3) ON col3 = col2
SELECT * FROM T1 INNER JOIN (T2 INNER JOIN T3 ON col3 = col2)
This looks good to me.
# Conflicts: # sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
Test build #64880 has finished for PR 14867 at commit
Test build #64882 has finished for PR 14867 at commit
Test build #64926 has finished for PR 14867 at commit
    .select(star()))

    // Implicit joins.
    assertEqual(
Great, LGTM
Merging to master/2.0. Thanks for the review!
## What changes were proposed in this pull request?
This PR backports #14867 to branch-2.0. It fixes a number of join ordering bugs.

## How was this patch tested?
Added tests to `PlanParserSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14984 from hvanhovell/SPARK-17296-branch-2.0.
What changes were proposed in this pull request?
Join processing in the parser relies on the fact that the grammar produces right-nested trees. For instance, the parse tree for `select * from a join b join c` is expected to produce a tree similar to `JOIN(a, JOIN(b, c))`. However, there are cases in which this invariant is violated. In such a case the parser returns a tree in which joins are located on both the left and the right sides of the parent join node.
This PR introduces a different grammar rule which does not make this assumption. The new rule takes a relation and searches for zero or more joined relations. As a bonus, processing is much easier.
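To illustrate why a "relation followed by zero or more joined relations" shape is easier to process, here is a minimal sketch of how such a flat list might be folded into a plan. The `Plan`, `Table`, `Join`, and `JoinRelation` types and the `buildRelation` helper are hypothetical and exist only for this example; they are not the classes used in Catalyst or the code in this PR.

```scala
object JoinFoldSketch {
  // Hypothetical, simplified plan nodes for illustration only.
  sealed trait Plan
  final case class Table(name: String) extends Plan
  final case class Join(
      left: Plan,
      right: Plan,
      joinType: String,
      condition: Option[String]) extends Plan

  // One entry per `joinRelation` matched after the primary relation.
  final case class JoinRelation(
      joinType: String,
      right: Plan,
      condition: Option[String])

  // Start from the primary relation and attach each joined relation in
  // source order. No assumption about the nesting of the parse tree is
  // needed, because the joins arrive as a flat sequence.
  def buildRelation(primary: Plan, joins: Seq[JoinRelation]): Plan =
    joins.foldLeft(primary) { (left, j) =>
      Join(left, j.right, j.joinType, j.condition)
    }
}
```

With this shape, a query such as the regression test's `t1 cross join t2 join t3 on ... join t4 on ...` would be passed in as `Table("t1")` plus three `JoinRelation` entries, and each join becomes the left input of the next one.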
How was this patch tested?
Existing tests, plus a regression test that I have added to the plan parser suite.