Conversation

@dongjoon-hyun (Member) commented Jul 17, 2016

What changes were proposed in this pull request?

This PR improves LogicalPlanToSQLSuite to check the generated SQL directly by structure. So far, LogicalPlanToSQLSuite relies on checkHiveQl to ensure successful SQL generation and answer equality. However, that does not guarantee that the generated SQL stays the same or will not change unnoticed.

How was this patch tested?

Pass the Jenkins tests. This is a test-suite-only change.

@rxin (Contributor) commented Jul 17, 2016

Hm, the problem with this approach is that we'd need to spend a lot of time updating test cases whenever we change SQL generation slightly. I think in order to do this, we should put the generated SQL in files and then have a way to regenerate all the SQL queries in bulk.
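As a rough illustration of this idea, a minimal golden-file helper might look like the sketch below. The names here (GoldenFileSketch, checkSQL, the GENERATE_GOLDEN_FILES flag, and the answer directory) are placeholders for illustration, not the code this PR ends up with.

// Minimal sketch of a golden-file check: compare generated SQL against a
// stored answer file, and rewrite the file in bulk when a flag is set.
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object GoldenFileSketch {
  // Single switch that refreshes every answer file in one test run.
  private val regenerate = sys.env.get("GENERATE_GOLDEN_FILES") == Some("1")
  private val answerDir = new File("src/test/resources/sqlgen")

  def checkSQL(generatedSQL: String, answerName: String): Unit = {
    val answerFile = new File(answerDir, s"$answerName.sql")
    if (regenerate) {
      // Overwrite the golden file so all answers can be regenerated in bulk.
      Files.write(answerFile.toPath, generatedSQL.getBytes(StandardCharsets.UTF_8))
    } else {
      val expected = new String(Files.readAllBytes(answerFile.toPath), StandardCharsets.UTF_8)
      assert(generatedSQL.trim == expected.trim,
        s"Generated SQL does not match golden file $answerName.sql")
    }
  }
}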

@dongjoon-hyun (Member, author)

Oh, thank you for the fast review!
Yep, I will update it like that.
The purpose of this PR is to have a stronger LogicalPlanToSQLSuite.

@SparkQA commented Jul 17, 2016

Test build #62420 has finished for PR 14235 at commit a3bb306.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

It sounds like the latest changes do not resolve Reynold's concern: how do we regenerate all the expected SQL queries in bulk?

You know, for native view support we are not generating optimal SQL statements. You can see the generated SQL is verbose and not readable. However, these SQL statements can be optimized by the Catalyst optimizer at runtime. Thus, I have the same concern: this approach is not maintainable.


private def checkHiveQl(hiveQl: String): Unit = {
// Used for generating new query answer files by saving
private val saveQuery = false
@dongjoon-hyun (Member, author) commented:

Hi, @gatorsmile.
This is my answer to that. :)

@dongjoon-hyun (Member, author)

First of all, you can update the whole query set with one flag, so maintainability is no longer a difficult issue.

I made this PR because SQL generation is currently fragile, as you said yesterday.

We need to prevent unintentional and accidental changes to it before both the Hint work and SPARK-16576.

IMO, this is a correctness issue we should resolve. I hope this PR protects Spark from me. :)

@gatorsmile (Member)

Comparing SQL statement strings is horrible to me.

I have a different approach to verify correctness: how about comparing the optimized plans, which can tolerate slight changes that do not affect correctness?
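To make that alternative concrete, a plan-level check could be sketched roughly as follows. This assumes a SparkSession named spark is available, and uses sameResult as a stand-in for whatever normalization a real suite would need; it is not the code proposed in this PR.

// Sketch: instead of comparing SQL strings, run the original query and the
// regenerated SQL through the optimizer and check that the plans agree.
import org.apache.spark.sql.SparkSession

def checkByOptimizedPlan(spark: SparkSession, originalSQL: String, generatedSQL: String): Unit = {
  val originalPlan = spark.sql(originalSQL).queryExecution.optimizedPlan
  val regeneratedPlan = spark.sql(generatedSQL).queryExecution.optimizedPlan
  // sameResult ignores cosmetic differences such as expression IDs, so harmless
  // changes to SQL generation would not fail this check.
  assert(originalPlan.sameResult(regeneratedPlan),
    "Optimized plans differ between the original and the regenerated query")
}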

@dongjoon-hyun (Member, author) commented Jul 17, 2016

:) Sorry, but I don't think so.

  • At every level, we need to prove correctness. The tolerance you want is the result of unpredictable removals by the Optimizer.
  • Also, this is LogicalPlanToSQLSuite. SQL statement comparison is not a good approach in general, but it is the only correct one for this module.

@gatorsmile (Member)

I can understand the advantage of comparing SQL statement strings, but the cost is higher than the benefit, especially for the small group of key developers who maintain the Spark code every day.

Some commercial RDBMS vendors have a whole team for developing and maintaining SQL generation. However, I am not sure the Spark community can afford that. You know, SQL generation is only used for native view support so far.

@dongjoon-hyun (Member, author)

I'm not sure exactly what cost you mean.
It's a flag, isn't it? Also, if someone changes SQLBuilder, it should be tested against the generated SQL, not the query execution result.
For example, BROADCAST hints (written as SQL comments) will be tested here after this PR is merged.

@gatorsmile (Member)

BROADCAST hints are special; we need to check those statements.

My concern is that we might spend a lot of time on changes that affect the SQL statement strings but do not change the correctness of the SQL builder.

@SparkQA commented Jul 17, 2016

Test build #62423 has finished for PR 14235 at commit 9f20a63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, author)

First, we will add more HINTs. I made a general HINT syntax for that. We are opening the door together.

Second, when you add a new test case here, the additional cost is just one second to generate the file.

Third, when you develop a PR, this will save a lot of time. I'm already using it to save my time. At the same time, we need to verify correctness at every level. If you use the optimized result or the execution result, that is a naive test. You know that.

@dongjoon-hyun (Member, author) commented Jul 17, 2016

If I'm missing some cost you're concerned about, could you give me some examples? I'm still not sure what you mean by cost. If I really understood the situation you're concerned about, I might change my mind.

But in the worst case, when you add a new test case, you can omit the secondary file option; then it works like before.

@rxin (Contributor) commented Jul 17, 2016

@dongjoon-hyun can you update the classdoc for the test suite to explain how to generate the expected results?

@rxin (Contributor) commented Jul 17, 2016

Also, personally I find it difficult to look at the result without looking at the original SQL. Can we update the result file to use the following format?

original query
------------------------------------------------------------------------------------
generated query
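A small sketch of how an answer file in that two-part format could be produced is shown below; the function name is illustrative, and the 80-dash separator simply mirrors the separator used in the golden files added later in this PR.

// Build the contents of a golden file: the original query, a separator line,
// then the generated query, so reviewers can read both side by side.
def goldenFileContents(originalSQL: String, generatedSQL: String): String = {
  val separator = "-" * 80
  s"$originalSQL\n$separator\n$generatedSQL\n"
}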

@dongjoon-hyun (Member, author)

Oh, sure. I'll update the classdoc and change the file format.

@rxin (Contributor) commented Jul 17, 2016

Checking the generated SQL is a good thing to do - we just need to make sure it is maintainable (i.e. have tooling to automatically regenerate results).

Once we are done with this, we should also do something similar for the parser -- I think the unit tests for the parsers don't currently check the plan; they only check whether the parser can parse the input.

@gatorsmile (Member)

For example, if we change the generation of alias names, most of the test cases will fail.

@gatorsmile (Member)

Yeah, we need to do it for the parser too. That part is very independent; changes to the other parts will not affect it.

@gatorsmile (Member) commented Jul 17, 2016

Automating SQL statement file generation can simplify the work, but reviewers have to be careful when reviewing the statement changes. Those statements can be very complex and hard to read.

@rxin (Contributor) commented Jul 17, 2016

Since the purpose of the function here is to generate SQL, I'd say it's important to actually show the generated SQL in reviews.

That said, if we can also get something even more automated (e.g. asserting that the optimized plans are equal) in addition to this, it'd be even more robust!

@dongjoon-hyun (Member, author)

Now the answer files contain both the original and the generated queries, and a description of how to regenerate the answer sets has been added to the LogicalPlanToSQLSuite classdoc.

@SparkQA commented Jul 17, 2016

Test build #62426 has finished for PR 14235 at commit 4165581.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -0,0 +1,3 @@
SELECT COUNT(value) FROM parquet_t1 GROUP BY key HAVING MAX(key) > 0
--------------------------------------------------------------------------------
SELECT `gen_attr` AS `count(value)` FROM (SELECT `gen_attr` FROM (SELECT count(`gen_attr`) AS `gen_attr`, max(`gen_attr`) AS `gen_attr` FROM (SELECT `key` AS `gen_attr`, `value` AS `gen_attr` FROM `default`.`parquet_t1`) AS gen_subquery_0 GROUP BY `gen_attr` HAVING (`gen_attr` > CAST(0 AS BIGINT))) AS gen_subquery_1) AS gen_subquery_2
\ No newline at end of file
(Contributor) commented:

Can you add a newline to the end? This little red thing on GitHub is annoying.

@dongjoon-hyun (Member, author) replied:

Oh, I see. Sure.

@dongjoon-hyun (Member, author)

Now, the remaining issue is using getResource to save the golden files. I left a comment about that.

@rxin (Contributor) commented Jul 18, 2016

Can you also remove "[TEST]" from the title? TEST isn't a module.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite to check generated SQL directly [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to check generated SQL directly Jul 18, 2016
@SparkQA commented Jul 18, 2016

Test build #62448 has finished for PR 14235 at commit 38f52ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 18, 2016

Test build #62450 has finished for PR 14235 at commit a9a1b00.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import testImplicits._

// Used for generating new query answer files by saving
private val saveQuery = System.getenv("SPARK_GENERATE_GOLDEN_FILES") != null
@rxin (Contributor) commented:

Hm, if it is 0 we probably shouldn't be running it, so I'd test the value against "1":

Option(System.getenv("SPARK_GENERATE_GOLDEN_FILES")) == Some("1")

(Contributor) commented:

Also rename saveQuery to regenerateGoldenFiles.

@rxin (Contributor) commented Jul 18, 2016

Looks pretty good now. Just a couple of minor comments.

@dongjoon-hyun (Member, author) commented Jul 18, 2016

Thank you, @rxin. By the way, only the following test fails, and it has failed twice in a row with timeout errors.

HiveSparkSubmitSuite.SPARK-8020: set sql conf in spark conf *** FAILED *** (5 minutes, 0 seconds)

I looked into the cases, but I still have no idea about the failure. I hope Jenkins passes this time.

@SparkQA commented Jul 18, 2016

Test build #62469 has finished for PR 14235 at commit efaa4d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, author)

Thank you for guiding me in this PR, @rxin.

@liancheng (Contributor)

LGTM.

One thing is that I feel, most of the time, the SQL comparison assertion will fail due to reasonable internal changes that affect SQL generation in harmless ways, and can be fixed by simply regenerating the golden files. That said, shall we add instructions about how to regenerate the golden files when the assertion fails?

@dongjoon-hyun (Member, author)

Thank you for the review, @liancheng. Sure. Currently it is only documented in the classdoc. I think you are suggesting having that in some HTML page or wiki (Confluence). Did I understand your advice correctly?

/**
 * A test suite for LogicalPlan-to-SQL conversion.
 *
 * Each query has a golden generated SQL file in test/resources/sqlgen. The test suite also has
 * built-in functionality to automatically generate these golden files.
 *
 * To re-generate golden files, run:
 *    SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "hive/test-only *LogicalPlanToSQLSuite"
 */

@dongjoon-hyun (Member, author)

Hi, @rxin and @liancheng.
I will update this PR one more time. Please wait a moment.
I can use stable identifiers for gen_attr, too.

@SparkQA commented Jul 18, 2016

Test build #62486 has finished for PR 14235 at commit 244d013.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 18, 2016

Test build #62487 has finished for PR 14235 at commit ee5b747.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, author)

Hmm, HiveCompatibilitySuite has some dependency on gen_attr_, so I reverted the last attempt.
I will try that in another PR, as planned before.

@dongjoon-hyun (Member, author)

Jenkins was restarted, but the current last commit is efaa4d0, which already has a passing Jenkins test. Could you review and merge this first if possible, @rxin?

Test build #62469 has finished for PR 14235 at commit efaa4d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 18, 2016

Test build #62493 has finished for PR 14235 at commit efaa4d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jul 19, 2016

Thanks - merging in master / 2.0.

I'm also merging this into 2.0 since it is a test-only change and will reduce merge conflicts.

@asfgit asfgit closed this in ea78edb Jul 19, 2016
asfgit pushed a commit that referenced this pull request Jul 19, 2016
[SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to check generated SQL directly

## What changes were proposed in this pull request?

This PR improves `LogicalPlanToSQLSuite` to check the generated SQL directly by **structure**. So far, `LogicalPlanToSQLSuite` relies on  `checkHiveQl` to ensure the **successful SQL generation** and **answer equality**. However, it does not guarantee the generated SQL is the same or will not be changed unnoticeably.

## How was this patch tested?

Pass the Jenkins. This is only a testsuite change.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14235 from dongjoon-hyun/SPARK-16590.

(cherry picked from commit ea78edb)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@dongjoon-hyun (Member, author)

Oh, thank you for merging, @rxin! Also, thank you for the review, @gatorsmile and @liancheng.

@rxin (Contributor) commented Jul 19, 2016

@dongjoon-hyun can you also look into having stable identifiers for gen_attr? Right now the golden files look really weird because gen_attr is used more than once.

@dongjoon-hyun (Member, author)

Sure. I've been looking into that. It's on my list.
I'll file a JIRA issue and proceed.
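One possible way to get stable identifiers, sketched here purely for illustration (the follow-up JIRA and its eventual implementation are not part of this thread), is to number each expression ID the first time it is seen, so every alias becomes gen_attr_0, gen_attr_1, ... instead of a repeated gen_attr.

// Sketch: map each expression ID to a stable, per-query index so that the
// generated aliases are distinguishable and deterministic.
import scala.collection.mutable

class StableAliasGenerator {
  private val indexByExprId = mutable.LinkedHashMap.empty[Long, Int]

  def aliasFor(exprId: Long): String = {
    // The first time an ID is seen it gets the next index; afterwards the
    // same ID always maps to the same alias.
    val index = indexByExprId.getOrElseUpdate(exprId, indexByExprId.size)
    s"gen_attr_$index"
  }
}

// In a REPL:
//   val gen = new StableAliasGenerator
//   gen.aliasFor(42L)  // "gen_attr_0"
//   gen.aliasFor(7L)   // "gen_attr_1"
//   gen.aliasFor(42L)  // "gen_attr_0" again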


// Used for generating new query answer files by saving
private val regenerateGoldenFiles =
Option(System.getenv("SPARK_GENERATE_GOLDEN_FILES")).contains("1")
@rxin (Contributor) commented:

Why did you use contains here? This is super confusing, and it also broke the Scala 2.10 build.

I think I asked to do the comparison with Some("1"). In most cases it is a very bad idea to use collection-oriented methods on Options, because they make the code more confusing.
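For reference, the two forms under discussion look like this as they would sit inside the suite class (Option.contains only exists from Scala 2.11 on, which is why the direct comparison was requested):

// Broke the Scala 2.10 build: Option.contains was only added in Scala 2.11.
// private val regenerateGoldenFiles =
//   Option(System.getenv("SPARK_GENERATE_GOLDEN_FILES")).contains("1")

// The requested form, which compiles on both Scala 2.10 and 2.11:
private val regenerateGoldenFiles =
  Option(System.getenv("SPARK_GENERATE_GOLDEN_FILES")) == Some("1")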

(Contributor) commented:

I fixed it here c4524f5

@dongjoon-hyun (Member, author) replied:

Yep. I have nothing to say. My bad. Sorry about this. :(

@dongjoon-hyun dongjoon-hyun deleted the SPARK-16590 branch August 14, 2016 09:41