Skip to content

Conversation

@dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Jul 19, 2016

What changes were proposed in this pull request?

Currently, the generated SQLs have not-stable IDs for generated attributes.
The stable generated SQL will give more benefit for understanding or testing the queries.
This PR provides stable SQL generation by the followings.

  • Provide unique ids for generated subqueries, gen_subquery_xxx.
  • Provide unique and stable ids for generated attributes, gen_attr_xxx.

Before

scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS gen_subquery_0

After

scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0

How was this patch tested?

Pass the existing Jenkins tests.

@SparkQA
Copy link

SparkQA commented Jul 19, 2016

Test build #62517 has finished for PR 14257 at commit 2699f9e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 24, 2016

Test build #62761 has finished for PR 14257 at commit 12180db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 24, 2016

Test build #62772 has finished for PR 14257 at commit a1776ef.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-16621][SQL][WIP] Use a stable ID generation method for attributes in SQLBuilder [SPARK-16621][SQL] Use a stable ID generation method for attributes in SQLBuilder Jul 25, 2016
@SparkQA
Copy link

SparkQA commented Jul 25, 2016

Test build #62796 has finished for PR 14257 at commit 06fb99c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SQLBuilder(

@dongjoon-hyun
Copy link
Member Author

Hi, @rxin .
Could you review this stable SQL generation PR?

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-16621][SQL] Use a stable ID generation method for attributes in SQLBuilder [SPARK-16621][SQL] Generate stable SQLs in SQLBuilder Jul 25, 2016
@rxin
Copy link
Contributor

rxin commented Jul 25, 2016

I'm less sure about the new format -- it leads to tons of whitespaces for very long SQL queries.

@dongjoon-hyun
Copy link
Member Author

No problem. We can simply remove object SQLBuilder and modify two lines.
May I remove those?

@dongjoon-hyun
Copy link
Member Author

It's just for helping review processes.

@dongjoon-hyun
Copy link
Member Author

I removed the new format stuff. I'll update this PR, soon.

@dongjoon-hyun
Copy link
Member Author

Now, the PR is updated with new golden SQL files.

@rxin
Copy link
Contributor

rxin commented Jul 25, 2016

cc @liancheng

@SparkQA
Copy link

SparkQA commented Jul 25, 2016

Test build #62829 has finished for PR 14257 at commit 6b3adbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SQLBuilder(

select * from t1 b where exists (select * from t1 a)
--------------------------------------------------------------------------------
SELECT `gen_attr` AS `a` FROM (SELECT `gen_attr` FROM (SELECT `a` AS `gen_attr` FROM `default`.`t1`) AS gen_subquery_0 WHERE EXISTS(SELECT `gen_attr` AS `a` FROM ((SELECT `gen_attr` FROM (SELECT `a` AS `gen_attr` FROM `default`.`t1`) AS gen_subquery_0) AS gen_subquery_1) AS gen_subquery_1)) AS b
SELECT `gen_attr_0` AS `a` FROM (SELECT `gen_attr_0` FROM (SELECT `a` AS `gen_attr_0` FROM `default`.`t1`) AS gen_subquery_0 WHERE EXISTS(SELECT `gen_attr_1` AS `a` FROM ((SELECT `gen_attr_1` FROM (SELECT `a` AS `gen_attr_1` FROM `default`.`t1`) AS gen_subquery_2) AS gen_subquery_1) AS gen_subquery_1)) AS b
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this query, I reformatted like the following. Here, gen_subquery_xxxs are generated uniquely. But, gen_subquery_1 is repeated due to the added nested subquery alias. Please note that it's not a repetition by duplicated ID and happens as a direct double nesting. I think it's okay.

SELECT `gen_attr_0` AS `a`
FROM (SELECT `gen_attr_0`
      FROM (SELECT `a` AS `gen_attr_0`
            FROM `default`.`t1`
           ) AS gen_subquery_0
      WHERE EXISTS(SELECT `gen_attr_1` AS `a`
                   FROM (  (SELECT `gen_attr_1`
                            FROM (SELECT `a` AS `gen_attr_1`
                                  FROM `default`.`t1`
                                 ) AS gen_subquery_2
                           ) AS gen_subquery_1
                        ) AS gen_subquery_1
                   )
     ) AS b

@SparkQA
Copy link

SparkQA commented Jul 26, 2016

Test build #62862 has finished for PR 14257 at commit 16b0f49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* supported by this builder (yet).
*/
class SQLBuilder(logicalPlan: LogicalPlan) extends Logging {
class SQLBuilder(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's mark this constructor as private and add a new public constructor without nextSubqueryId, nextGenAttrId, or exprIdMap since these arguments are only used within SQLBuilder.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, thank you, @liancheng !

@liancheng
Copy link
Contributor

LGTM except for one minor comment. Thanks!

@SparkQA
Copy link

SparkQA commented Jul 26, 2016

Test build #62893 has finished for PR 14257 at commit 29e953a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

It was Java internal error.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (sharedRuntime.cpp:834), pid=49858, tid=140477527770880

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Jul 26, 2016

Test build #62896 has finished for PR 14257 at commit 29e953a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @liancheng and @rxin .
It's not ready for review again.
Thank you always!

@liancheng
Copy link
Contributor

LGTM, merging to master. Thanks!

@asfgit asfgit closed this in 5b8e848 Jul 27, 2016
@rxin
Copy link
Contributor

rxin commented Jul 27, 2016

I'm also going to merge this in branch-2.0 to avoid conflicts in fixing bugs in this area.

asfgit pushed a commit that referenced this pull request Jul 27, 2016
Currently, the generated SQLs have not-stable IDs for generated attributes.
The stable generated SQL will give more benefit for understanding or testing the queries.
This PR provides stable SQL generation by the followings.

 - Provide unique ids for generated subqueries, `gen_subquery_xxx`.
 - Provide unique and stable ids for generated attributes, `gen_attr_xxx`.

**Before**
```scala
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS gen_subquery_0
```

**After**
```scala
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
```

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14257 from dongjoon-hyun/SPARK-16621.

(cherry picked from commit 5b8e848)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@dongjoon-hyun
Copy link
Member Author

Thank you, @liancheng and @rxin !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-16621 branch August 14, 2016 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants