[SPARK-17982][SQL] SQLBuilder should wrap the generated SQL with parenthesis for LIMIT #15546
Conversation
|
Wrong JIRA number. : ) |
|
Ooops. Thank you for fixing me! @gatorsmile |
|
Test build #67166 has finished for PR 15546 at commit
|
|
Test build #67165 has finished for PR 15546 at commit
|
|
Test build #67172 has finished for PR 15546 at commit
|
shouldn't we create a test in the sql generation test there?
|
Thank you for the review, @rxin. The case is a view with column names and a limit: `CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2` |
|
but you can construct a query with limit to make sure the output has ( ) no? |
|
Sorry, but what do you mean by |
|
Ah. I see. |
Without this PR, this test case fails because it generates the following query. Note the third line, where `FROM SELECT` appears without parentheses.
SELECT `gen_attr_0` AS `id`
FROM (SELECT `gen_attr_0`
FROM SELECT `gen_attr_0`
FROM (SELECT `id` AS `gen_attr_0`, `name` AS `gen_attr_1`
FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2) AS tbl
This should be an absolute path. It seems to have been changed accidentally by the following PR.
Until now, `SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "hive/test-only *LogicalPlanToSQLSuite"` did not update the golden files.
Hi, @rxin .
IMO, this bug in `LogicalPlanToSQLSuite` needs to be fixed soon. If this PR needs more review, should I open a separate PR containing only that fix?
|
@rxin . I moved the test case and fixed a bug in |
|
Test build #67178 has finished for PR 15546 at commit
|
|
The current running build has one testcase failure, |
|
Retest this please. |
|
Test build #67185 has finished for PR 15546 at commit
|
|
Test build #67190 has finished for PR 15546 at commit
|
is it always safe to just add limit to any plan?
Yes, mostly. During testing, I realized that the previous code was designed to append the `LIMIT` string without parentheses, which handles most cases. Also, Spark does not allow double parentheses.
- ORDER BY: Limit(_, Sort)
- GROUP BY: Limit(_, Aggr)
...
`Project` is the only case observed so far, in `CREATE VIEW` or `SELECT * FROM (SELECT ... LIMIT ..)`.
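To make the parenthesization rule discussed here concrete, the following is a minimal sketch on a toy plan model. It is not Spark's `SQLBuilder`: the `ToyPlan` node types and `ToySqlBuilder.toSql` are hypothetical names invented for illustration, and the real Catalyst plans carry far more detail. The sketch only shows the idea of wrapping a limited query in `()` unless the limit sits directly on an aliased subquery.

```scala
// Toy plan nodes for illustration only; these are NOT Spark's Catalyst classes.
sealed trait ToyPlan
final case class ToyTable(name: String) extends ToyPlan
final case class ToyProject(cols: Seq[String], child: ToyPlan) extends ToyPlan
final case class ToyAlias(alias: String, child: ToyPlan) extends ToyPlan
final case class ToyLimit(n: Int, child: ToyPlan) extends ToyPlan

object ToySqlBuilder {
  // Hypothetical generator: turns a toy plan into a SQL string.
  def toSql(plan: ToyPlan): String = plan match {
    case ToyTable(name) =>
      name
    case ToyProject(cols, child) =>
      val columns = cols.mkString(", ")
      s"SELECT $columns FROM ${toSql(child)}"
    case ToyAlias(alias, child) =>
      s"(${toSql(child)}) AS $alias"
    // Limit directly on an aliased subquery: the child is already a
    // parenthesized FROM item, so adding another pair of () would nest them.
    case ToyLimit(n, child: ToyAlias) =>
      s"${toSql(child)} LIMIT $n"
    // Any other child (for example a projection) is a full SELECT, so it is
    // wrapped in () to stay valid when it appears after FROM.
    case ToyLimit(n, child) =>
      s"(${toSql(child)} LIMIT $n)"
  }

  def main(args: Array[String]): Unit = {
    val inner = ToyAlias(
      "gen_subquery_0",
      ToyProject(Seq("id AS gen_attr_0"), ToyTable("tbl")))

    // Shape of case 1: Limit over Project.
    println(toSql(ToyProject(Seq("gen_attr_0"),
      ToyLimit(2, ToyProject(Seq("gen_attr_0"), inner)))))

    // Shape of case 2: Limit directly over the aliased subquery.
    println(toSql(ToyProject(Seq("gen_attr_0"), ToyLimit(2, inner))))
  }
}
```

In this toy model the first plan yields a parenthesized inner `SELECT ... LIMIT 2`, so no bare `FROM SELECT` appears, while the second yields `(...) AS gen_subquery_0 LIMIT 2` without a second pair of parentheses.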
I'll change the title of PR.
Limit with Project in ()
|
@rxin .
|
|
@dongjoon-hyun any update on this? |
|
Sorry for the delay, @rxin. I'm still working on it and will update in a few days. |
|
@dongjoon-hyun we are going to cut the release branch today (or tomorrow) so it would be good to get this in asap. |
|
Oh, 2.1-rc1 today? I see. |
|
Not rc, but the branch cut. |
|
Ah, I see. It's feature freeze. Thank you for informing me. |
|
Test build #67968 has finished for PR 15546 at commit
|
|
@dongjoon-hyun I have a general comment about the SQL generation fixes. Not sure how you fix the issue. Normally, when we did it, we need to see the analyzed SQL plan. In this case, it becomes a little complex, we called Could you try to show reviewers the failed plans with the added |
|
Thank you for review, @gatorsmile . I'll add the failed plans, too. |
|
Hi, @gatorsmile . The PR is updated according to the advice.
|
|
Test build #68239 has finished for PR 15546 at commit
|
|
Hi, @gatorsmile . |
…nthesis for LIMIT
|
While updating the PR, I rebased and squashed to resolve a conflict. |
|
Test build #68458 has finished for PR 15546 at commit
|
|
The three failures seem to be unrelated. |
|
Retest this please. |
|
Will review this PR tonight. Thanks! |
|
Thank you, @gatorsmile ! |
|
Test build #68480 has finished for PR 15546 at commit
|
|
Test build #68484 has finished for PR 15546 at commit
|
|
Could you review again, @gatorsmile ? |
|
LGTM |
|
retest this please |
|
Thank you, @gatorsmile . |
|
Test build #68528 has finished for PR 15546 at commit
|
…nthesis for LIMIT
## What changes were proposed in this pull request?
Currently, `SQLBuilder` handles `LIMIT` by always appending `LIMIT` to the end of the generated sub-SQL. This causes `RuntimeException`s like the following. This PR always wraps the generated SQL in parentheses, except when `SubqueryAlias` is used together with `LIMIT`.
**Before**
``` scala
scala> sql("CREATE TABLE tbl(id INT)")
scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
```
**After**
``` scala
scala> sql("CREATE TABLE tbl(id INT)")
scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
scala> sql("SELECT id2 FROM v1")
res4: org.apache.spark.sql.DataFrame = [id2: int]
```
**Fixed cases in this PR**
The following two cases show the detailed query plans that produced problematic SQL.
1. `SELECT * FROM (SELECT id FROM tbl LIMIT 2)`
Note the **FROM SELECT** part of the generated SQL below. Without '()' around the limit, this fails.
```scala
# Original logical plan:
Project [id#1]
+- GlobalLimit 2
   +- LocalLimit 2
      +- Project [id#1]
         +- MetastoreRelation default, tbl
# Canonicalized logical plan:
Project [gen_attr_0#1 AS id#4]
+- SubqueryAlias tbl
   +- Project [gen_attr_0#1]
      +- GlobalLimit 2
         +- LocalLimit 2
            +- Project [gen_attr_0#1]
               +- SubqueryAlias gen_subquery_0
                  +- Project [id#1 AS gen_attr_0#1]
                     +- SQLTable default, tbl, [id#1]
# Generated SQL:
SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM SELECT `gen_attr_0` FROM (SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2) AS tbl
```
2. `SELECT * FROM (SELECT id FROM tbl TABLESAMPLE (2 ROWS))`
Note **((~~~) AS gen_subquery_0 LIMIT 2)** in the generated SQL below. Using '()' for a limit on `SubqueryAlias` fails.
```scala
# Original logical plan:
Project [id#1]
+- Project [id#1]
   +- GlobalLimit 2
      +- LocalLimit 2
         +- MetastoreRelation default, tbl
# Canonicalized logical plan:
Project [gen_attr_0#1 AS id#4]
+- SubqueryAlias tbl
   +- Project [gen_attr_0#1]
      +- GlobalLimit 2
         +- LocalLimit 2
            +- SubqueryAlias gen_subquery_0
               +- Project [id#1 AS gen_attr_0#1]
                  +- SQLTable default, tbl, [id#1]
# Generated SQL:
SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM ((SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2)) AS tbl
```
## How was this patch tested?
Passes the Jenkins tests with a newly added test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #15546 from dongjoon-hyun/SPARK-17982.
(cherry picked from commit d42bb7c)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
|
Thanks! Merging to master/2.1. |
|
Could you please backport it to 2.0? It sounds like the JIRA opener needs it in the Spark 2.0 branch. |
|
Thank you, @gatorsmile . |
|
Thank you for review, @rxin and @cloud-fan , too. |