Skip to content

Conversation

@dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Oct 19, 2016

What changes were proposed in this pull request?

Currently, SQLBuilder handles LIMIT by always adding LIMIT at the end of the generated subSQL. It makes RuntimeExceptions like the following. This PR adds a parenthesis always except SubqueryAlias is used together with LIMIT.

Before

scala> sql("CREATE TABLE tbl(id INT)")
scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...

After

scala> sql("CREATE TABLE tbl(id INT)")
scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
scala> sql("SELECT id2 FROM v1")
res4: org.apache.spark.sql.DataFrame = [id2: int]

Fixed cases in this PR

The following two cases are the detail query plans having problematic SQL generations.

  1. SELECT * FROM (SELECT id FROM tbl LIMIT 2)

    Please note that FROM SELECT part of the generated SQL in the below. When we don't use '()' for limit, this fails.

# Original logical plan:
Project [id#1]
+- GlobalLimit 2
   +- LocalLimit 2
      +- Project [id#1]
         +- MetastoreRelation default, tbl

# Canonicalized logical plan:
Project [gen_attr_0#1 AS id#4]
+- SubqueryAlias tbl
   +- Project [gen_attr_0#1]
      +- GlobalLimit 2
         +- LocalLimit 2
            +- Project [gen_attr_0#1]
               +- SubqueryAlias gen_subquery_0
                  +- Project [id#1 AS gen_attr_0#1]
                     +- SQLTable default, tbl, [id#1]

# Generated SQL:
SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM SELECT `gen_attr_0` FROM (SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2) AS tbl
  1. SELECT * FROM (SELECT id FROM tbl TABLESAMPLE (2 ROWS))

    Please note that ((~~~) AS gen_subquery_0 LIMIT 2) in the below. When we use '()' for limit on SubqueryAlias, this fails.

# Original logical plan:
Project [id#1]
+- Project [id#1]
   +- GlobalLimit 2
      +- LocalLimit 2
         +- MetastoreRelation default, tbl

# Canonicalized logical plan:
Project [gen_attr_0#1 AS id#4]
+- SubqueryAlias tbl
   +- Project [gen_attr_0#1]
      +- GlobalLimit 2
         +- LocalLimit 2
            +- SubqueryAlias gen_subquery_0
               +- Project [id#1 AS gen_attr_0#1]
                  +- SQLTable default, tbl, [id#1]

# Generated SQL:
SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM ((SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2)) AS tbl

How was this patch tested?

Pass the Jenkins test with a newly added test case.

@gatorsmile
Copy link
Member

Wrong JIRA number. : )

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-17892][SQL] SQLBuilder should wrap the generated SQL with parenthesis for LIMIT [SPARK-17982][SQL] SQLBuilder should wrap the generated SQL with parenthesis for LIMIT Oct 19, 2016
@dongjoon-hyun
Copy link
Member Author

Ooops. Thank you for fixing me! @gatorsmile

@SparkQA
Copy link

SparkQA commented Oct 19, 2016

Test build #67166 has finished for PR 15546 at commit 2af55c8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 19, 2016

Test build #67165 has finished for PR 15546 at commit 85c0686.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 19, 2016

Test build #67172 has finished for PR 15546 at commit 105c36a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we create a test in the sql generation test there?

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Oct 19, 2016

Thank you for review, @rxin .
Here, CREATE VIEW is not supported by SQLGeneration test suite.

CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2

The case is a view with column names and limit.

@rxin
Copy link
Contributor

rxin commented Oct 19, 2016

but you can construct a query with limit to make sure the output has ( ) no?

@dongjoon-hyun
Copy link
Member Author

Sorry, but what do you mean by the output has ( ) no??

@dongjoon-hyun
Copy link
Member Author

Ah. I see. ( ) means parenthesis, literally. I'll replace CREATE VIEW test case with SQL Generation suite. BTW, at the last commit, the PR is changed to add parenthesis if the child of Limit is Project.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this PR, this test case fails because it generate the following query. Note that the third line.

SELECT `gen_attr_0` AS `id`
FROM (SELECT `gen_attr_0`
             FROM SELECT `gen_attr_0`
                         FROM (SELECT `id` AS `gen_attr_0`, `name` AS `gen_attr_1`
                                     FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2) AS tbl

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be absolute path. It seems to be changed by accidentally by the following PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to now, SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "hive/test-only *LogicalPlanToSQLSuite" didn't update the golden files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @rxin .
IMO, this bug on LogicalPlanToSQLSuite is needed to be fixed soon. If this PR is needed to be reviewed more, should I make another PR containing that only?

@dongjoon-hyun
Copy link
Member Author

@rxin . I moved the test case and fixed a bug in LogicalPlanToSQLSuite.

@SparkQA
Copy link

SparkQA commented Oct 19, 2016

Test build #67178 has finished for PR 15546 at commit 637499b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

The current running build has one testcase failure, when schema inference is turned on, should read partition data. It seems to be irrelevant.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Oct 19, 2016

Test build #67185 has finished for PR 15546 at commit 05d9475.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 19, 2016

Test build #67190 has finished for PR 15546 at commit 05d9475.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it always safe to just add limit to any plan?

Copy link
Member Author

@dongjoon-hyun dongjoon-hyun Oct 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Mostly. During testing, I realized that the previous code is designed to add "LIMIT" string without parenthesis to handle the most cases. And, Spark do not allow double parenthesis.

  • ORDER BY: Limit(_, Sort)
  • GROUP BY: Limit(_, Aggr)
    ...

Project is the only observed cases in CREAE VIEW or SELECT * FROM (SELECT ... LIMIT ..).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change the title of PR.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-17982][SQL] SQLBuilder should wrap the generated SQL with parenthesis for LIMIT [SPARK-17982][SQL] SQLBuilder should wrap the generated SQL for Limit with Project in () Oct 20, 2016
@dongjoon-hyun
Copy link
Member Author

@rxin .
I'll add more testcases for this and figure out the correct answer for your question.

is it always safe to just add limit to any plan?

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-17982][SQL] SQLBuilder should wrap the generated SQL for Limit with Project in () [SPARK-17982][SQL][WIP] SQLBuilder should wrap the generated SQL for Limit with Project in () Oct 20, 2016
@rxin
Copy link
Contributor

rxin commented Nov 1, 2016

@dongjoon-hyun any update on this?

@dongjoon-hyun
Copy link
Member Author

Sorry for delay, @rxin . I'm still working on, and will update in a few days.

@rxin
Copy link
Contributor

rxin commented Nov 1, 2016

@dongjoon-hyun we are going to cut the release branch today (or tomorrow) so it would be good to get this in asap.

@dongjoon-hyun
Copy link
Member Author

Oh, 2.1-rc1 today? I see.

@rxin
Copy link
Contributor

rxin commented Nov 1, 2016

Not rc, but the branch cut.

@dongjoon-hyun
Copy link
Member Author

Ah, I see. It's feature freeze. Thank you for informing me.

@SparkQA
Copy link

SparkQA commented Nov 2, 2016

Test build #67968 has finished for PR 15546 at commit 813e41f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

@dongjoon-hyun I have a general comment about the SQL generation fixes. Not sure how you fix the issue. Normally, when we did it, we need to see the analyzed SQL plan. In this case, it becomes a little complex, we called addSubqueryIfNeeded in toSQL.

Could you try to show reviewers the failed plans with the added SubqueryAlias? Then, we can discuss whether the fix coverage and test case coverage.

@dongjoon-hyun
Copy link
Member Author

Thank you for review, @gatorsmile . I'll add the failed plans, too.

@dongjoon-hyun
Copy link
Member Author

Hi, @gatorsmile . The PR is updated according to the advice.

  • Remove the goldenSQLPath from this PR.
  • Update the PR description about the target fixed cases and the detailed plans, which includes the SubqueryAlias cases.

@SparkQA
Copy link

SparkQA commented Nov 6, 2016

Test build #68239 has finished for PR 15546 at commit a81434f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @gatorsmile .
Could you review again?

@dongjoon-hyun
Copy link
Member Author

During updating the PR, I rebased and squashed to resolve conflict.

@SparkQA
Copy link

SparkQA commented Nov 10, 2016

Test build #68458 has finished for PR 15546 at commit e76a675.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

The three failures seems to be irrelevant.

[info] - randomized aggregation test - [typed, with partial + safe] - with grouping keys - with non-empty input *** FAILED *** (1 second, 419 milliseconds)

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@gatorsmile
Copy link
Member

Will review this PR tonight. Thanks!

@dongjoon-hyun
Copy link
Member Author

Thank you, @gatorsmile !

@SparkQA
Copy link

SparkQA commented Nov 10, 2016

Test build #68480 has finished for PR 15546 at commit e76a675.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 10, 2016

Test build #68484 has finished for PR 15546 at commit 604414e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Could you review again, @gatorsmile ?

@gatorsmile
Copy link
Member

LGTM

@gatorsmile
Copy link
Member

retest this please

@dongjoon-hyun
Copy link
Member Author

Thank you, @gatorsmile .

@SparkQA
Copy link

SparkQA commented Nov 11, 2016

Test build #68528 has finished for PR 15546 at commit 604414e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Nov 11, 2016
…nthesis for LIMIT

## What changes were proposed in this pull request?

Currently, `SQLBuilder` handles `LIMIT` by always adding `LIMIT` at the end of the generated subSQL. It makes `RuntimeException`s like the following. This PR adds a parenthesis always except `SubqueryAlias` is used together with `LIMIT`.

**Before**

``` scala
scala> sql("CREATE TABLE tbl(id INT)")
scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
```

**After**

``` scala
scala> sql("CREATE TABLE tbl(id INT)")
scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
scala> sql("SELECT id2 FROM v1")
res4: org.apache.spark.sql.DataFrame = [id2: int]
```

**Fixed cases in this PR**

The following two cases are the detail query plans having problematic SQL generations.

1. `SELECT * FROM (SELECT id FROM tbl LIMIT 2)`

    Please note that **FROM SELECT** part of the generated SQL in the below. When we don't use '()' for limit, this fails.

```scala
# Original logical plan:
Project [id#1]
+- GlobalLimit 2
   +- LocalLimit 2
      +- Project [id#1]
         +- MetastoreRelation default, tbl

# Canonicalized logical plan:
Project [gen_attr_0#1 AS id#4]
+- SubqueryAlias tbl
   +- Project [gen_attr_0#1]
      +- GlobalLimit 2
         +- LocalLimit 2
            +- Project [gen_attr_0#1]
               +- SubqueryAlias gen_subquery_0
                  +- Project [id#1 AS gen_attr_0#1]
                     +- SQLTable default, tbl, [id#1]

# Generated SQL:
SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM SELECT `gen_attr_0` FROM (SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2) AS tbl
```

2. `SELECT * FROM (SELECT id FROM tbl TABLESAMPLE (2 ROWS))`

    Please note that **((~~~) AS gen_subquery_0 LIMIT 2)** in the below. When we use '()' for limit on `SubqueryAlias`, this fails.

```scala
# Original logical plan:
Project [id#1]
+- Project [id#1]
   +- GlobalLimit 2
      +- LocalLimit 2
         +- MetastoreRelation default, tbl

# Canonicalized logical plan:
Project [gen_attr_0#1 AS id#4]
+- SubqueryAlias tbl
   +- Project [gen_attr_0#1]
      +- GlobalLimit 2
         +- LocalLimit 2
            +- SubqueryAlias gen_subquery_0
               +- Project [id#1 AS gen_attr_0#1]
                  +- SQLTable default, tbl, [id#1]

# Generated SQL:
SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM ((SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2)) AS tbl
```

## How was this patch tested?

Pass the Jenkins test with a newly added test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15546 from dongjoon-hyun/SPARK-17982.

(cherry picked from commit d42bb7c)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@gatorsmile
Copy link
Member

Thanks! Merging to master/2.1.

@asfgit asfgit closed this in d42bb7c Nov 11, 2016
@gatorsmile
Copy link
Member

Could you please backport it to 2.0? It sounds the JIRA opener needs it in Spark 2.0 branch.

@dongjoon-hyun
Copy link
Member Author

Thank you, @gatorsmile .
I'll make a backport for 2.0, too.

@dongjoon-hyun
Copy link
Member Author

Thank you for review, @rxin and @cloud-fan , too.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…nthesis for LIMIT

## What changes were proposed in this pull request?

Currently, `SQLBuilder` handles `LIMIT` by always adding `LIMIT` at the end of the generated subSQL. It makes `RuntimeException`s like the following. This PR adds a parenthesis always except `SubqueryAlias` is used together with `LIMIT`.

**Before**

``` scala
scala> sql("CREATE TABLE tbl(id INT)")
scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
```

**After**

``` scala
scala> sql("CREATE TABLE tbl(id INT)")
scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
scala> sql("SELECT id2 FROM v1")
res4: org.apache.spark.sql.DataFrame = [id2: int]
```

**Fixed cases in this PR**

The following two cases are the detail query plans having problematic SQL generations.

1. `SELECT * FROM (SELECT id FROM tbl LIMIT 2)`

    Please note that **FROM SELECT** part of the generated SQL in the below. When we don't use '()' for limit, this fails.

```scala
# Original logical plan:
Project [id#1]
+- GlobalLimit 2
   +- LocalLimit 2
      +- Project [id#1]
         +- MetastoreRelation default, tbl

# Canonicalized logical plan:
Project [gen_attr_0#1 AS id#4]
+- SubqueryAlias tbl
   +- Project [gen_attr_0#1]
      +- GlobalLimit 2
         +- LocalLimit 2
            +- Project [gen_attr_0#1]
               +- SubqueryAlias gen_subquery_0
                  +- Project [id#1 AS gen_attr_0#1]
                     +- SQLTable default, tbl, [id#1]

# Generated SQL:
SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM SELECT `gen_attr_0` FROM (SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2) AS tbl
```

2. `SELECT * FROM (SELECT id FROM tbl TABLESAMPLE (2 ROWS))`

    Please note that **((~~~) AS gen_subquery_0 LIMIT 2)** in the below. When we use '()' for limit on `SubqueryAlias`, this fails.

```scala
# Original logical plan:
Project [id#1]
+- Project [id#1]
   +- GlobalLimit 2
      +- LocalLimit 2
         +- MetastoreRelation default, tbl

# Canonicalized logical plan:
Project [gen_attr_0#1 AS id#4]
+- SubqueryAlias tbl
   +- Project [gen_attr_0#1]
      +- GlobalLimit 2
         +- LocalLimit 2
            +- SubqueryAlias gen_subquery_0
               +- Project [id#1 AS gen_attr_0#1]
                  +- SQLTable default, tbl, [id#1]

# Generated SQL:
SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM ((SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2)) AS tbl
```

## How was this patch tested?

Pass the Jenkins test with a newly added test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#15546 from dongjoon-hyun/SPARK-17982.
@dongjoon-hyun dongjoon-hyun deleted the SPARK-17982 branch January 7, 2019 07:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants