Conversation

@dongjoon-hyun (Member) commented Jul 17, 2016

What changes were proposed in this pull request?

This PR improves LogicalPlanToSQLSuite to check the generated SQL directly by structure. So far, LogicalPlanToSQLSuite relies on checkHiveQl to ensure successful SQL generation and answer equality. However, that does not guarantee that the generated SQL stays the same or will not change unnoticed.

How was this patch tested?

Pass the Jenkins tests. This is a test-suite-only change.

@rxin (Contributor) commented Jul 17, 2016

Hm, the problem with this approach is that we'd need to spend a lot of time updating test cases whenever we change SQL generation slightly. I think in order to do this, we should put the generated SQL in files and then have a way to regenerate all the SQL queries in bulk.
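As a rough illustration of this idea, a minimal golden-file helper might look like the sketch below. The names here (GoldenFileSketch, checkSQL, the GENERATE_GOLDEN_FILES flag, and the answer directory) are placeholders for illustration, not the code this PR ends up with.

// Minimal sketch of a golden-file check: compare generated SQL against a
// stored answer file, and rewrite the file in bulk when a flag is set.
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object GoldenFileSketch {
  // Single switch that refreshes every answer file in one test run.
  private val regenerate = sys.env.get("GENERATE_GOLDEN_FILES") == Some("1")
  private val answerDir = new File("src/test/resources/sqlgen")

  def checkSQL(generatedSQL: String, answerName: String): Unit = {
    val answerFile = new File(answerDir, s"$answerName.sql")
    if (regenerate) {
      // Overwrite the golden file so all answers can be regenerated in bulk.
      Files.write(answerFile.toPath, generatedSQL.getBytes(StandardCharsets.UTF_8))
    } else {
      val expected = new String(Files.readAllBytes(answerFile.toPath), StandardCharsets.UTF_8)
      assert(generatedSQL.trim == expected.trim,
        s"Generated SQL does not match golden file $answerName.sql")
    }
  }
}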

@dongjoon-hyun (Member, author)

Oh, thank you for the fast review!
Yep, I will update it like that.
The purpose of this PR is to have a stronger LogicalPlanToSQLSuite.

@SparkQA commented Jul 17, 2016

Test build #62420 has finished for PR 14235 at commit a3bb306.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

It sounds like the latest changes do not resolve Reynold's concern: how do we regenerate all the expected SQL queries in bulk?

You know, for native view support we are not generating optimal SQL statements. You can see the generated SQL is verbose and not readable. However, these SQL statements can be optimized by the Catalyst optimizer at runtime. Thus, I have the same concern: this approach is not maintainable.


private def checkHiveQl(hiveQl: String): Unit = {
// Used for generating new query answer files by saving
private val saveQuery = false
@dongjoon-hyun (Member, author) commented:

Hi, @gatorsmile.
This is my answer to that. :)

@dongjoon-hyun (Member, author)

First of all, you can update the whole query set with one flag, so maintainability is no longer a difficult issue.

I made this PR because SQL generation is currently fragile, as you said yesterday.

We need to prevent unintentional and accidental changes to it before both the Hint work and SPARK-16576.

IMO, this is a correctness issue we should resolve. I hope this PR protects Spark from me. :)

@gatorsmile (Member)

Comparing SQL statement strings is horrible to me.

I have a different approach to verify correctness: how about comparing the optimized plans, which can tolerate slight changes that do not affect correctness?
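To make that alternative concrete, a plan-level check could be sketched roughly as follows. This assumes a SparkSession named spark is available, and uses sameResult as a stand-in for whatever normalization a real suite would need; it is not the code proposed in this PR.

// Sketch: instead of comparing SQL strings, run the original query and the
// regenerated SQL through the optimizer and check that the plans agree.
import org.apache.spark.sql.SparkSession

def checkByOptimizedPlan(spark: SparkSession, originalSQL: String, generatedSQL: String): Unit = {
  val originalPlan = spark.sql(originalSQL).queryExecution.optimizedPlan
  val regeneratedPlan = spark.sql(generatedSQL).queryExecution.optimizedPlan
  // sameResult ignores cosmetic differences such as expression IDs, so harmless
  // changes to SQL generation would not fail this check.
  assert(originalPlan.sameResult(regeneratedPlan),
    "Optimized plans differ between the original and the regenerated query")
}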

@dongjoon-hyun (Member, author) commented Jul 17, 2016

:) Sorry, but I don't think so.

  • At every level, we need to prove correctness. The tolerance you want is the result of unpredictable removals by the Optimizer.
  • Also, this is LogicalPlanToSQLSuite. SQL statement comparison is not a good approach in general, but it is the only correct one for this module.

@gatorsmile (Member)

I can understand the advantage of comparing SQL statement strings, but the cost is higher than the benefit, especially for the small group of key developers who maintain the Spark code every day.

Some commercial RDBMS vendors have a whole team for developing and maintaining SQL generation. However, I am not sure the Spark community can afford that. You know, SQL generation is only used for native view support so far.

@dongjoon-hyun (Member, author)

I'm not sure exactly what cost you mean.
It's a flag, isn't it? Also, if someone changes SQLBuilder, it should be tested against the generated SQL, not the query execution result.
For example, BROADCAST hints (written as SQL comments) will be tested here after this PR is merged.

@gatorsmile (Member)

BROADCAST hints are special; we need to check those statements.

My concern is that we might spend a lot of time on changes that affect the SQL statement strings but do not change the correctness of the SQL builder.

@SparkQA commented Jul 17, 2016

Test build #62423 has finished for PR 14235 at commit 9f20a63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, author)

First, we will add more HINTs. I made a general HINT syntax for that. We are opening the door together.

Second, when you add a new test case here, the additional cost is just one second to generate the file.

Third, when you develop a PR, this will save a lot of time. I'm already using it to save my time. At the same time, we need to verify correctness at every level. If you use the optimized result or the execution result, that is a naive test. You know that.

@dongjoon-hyun (Member, author) commented Jul 17, 2016

If I'm missing some cost you're concerned about, could you give me some examples? I'm still not sure what you mean by cost. If I really understood the situation you're concerned about, I might change my mind.

But in the worst case, when you add a new test case, you can omit the secondary file option; then it works like before.

@rxin (Contributor) commented Jul 17, 2016

@dongjoon-hyun can you update the classdoc for the test suite to explain how to generate the expected results?

@rxin (Contributor) commented Jul 17, 2016

Also, personally I find it difficult to look at the result without looking at the original SQL. Can we update the result file to use the following format?

original query
------------------------------------------------------------------------------------
generated query
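A small sketch of how an answer file in that two-part format could be produced is shown below; the function name is illustrative, and the 80-dash separator simply mirrors the separator used in the golden files added later in this PR.

// Build the contents of a golden file: the original query, a separator line,
// then the generated query, so reviewers can read both side by side.
def goldenFileContents(originalSQL: String, generatedSQL: String): String = {
  val separator = "-" * 80
  s"$originalSQL\n$separator\n$generatedSQL\n"
}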

@dongjoon-hyun (Member, author)

Oh, sure. I'll update the classdoc and change the file format.

@rxin (Contributor) commented Jul 17, 2016

Checking the generated SQL is a good thing to do - we just need to make sure it is maintainable (i.e. have tooling to automatically regenerate results).

Once we are done with this, we should also do something similar for the parser -- I think the unit tests for the parsers don't currently check the plan; they only check whether the parser can parse the input.

@gatorsmile (Member)

For example, if we change the generation of alias names, most of the test cases will fail.

@gatorsmile (Member)

Yeah, we need to do it for the parser too. That part is very independent; changes to the other parts will not affect it.

@gatorsmile (Member) commented Jul 17, 2016

Automating SQL statement file generation can simplify the work, but reviewers have to be careful when reviewing the statement changes. Those statements can be very complex and hard to read.

@rxin (Contributor) commented Jul 17, 2016

Since the purpose of the function here is to generate SQL, I'd say it's important to actually show the generated SQL in reviews.

That said, if we can also get something even more automated (e.g. asserting that the optimized plans are equal) in addition to this, it'd be even more robust!

@dongjoon-hyun (Member, author)

Now the answer files contain both the original and the generated queries, and a description of how to regenerate the answer sets has been added to the LogicalPlanToSQLSuite classdoc.

@SparkQA commented Jul 17, 2016

Test build #62426 has finished for PR 14235 at commit 4165581.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -0,0 +1,3 @@
SELECT COUNT(value) FROM parquet_t1 GROUP BY key HAVING MAX(key) > 0
--------------------------------------------------------------------------------
SELECT `gen_attr` AS `count(value)` FROM (SELECT `gen_attr` FROM (SELECT count(`gen_attr`) AS `gen_attr`, max(`gen_attr`) AS `gen_attr` FROM (SELECT `key` AS `gen_attr`, `value` AS `gen_attr` FROM `default`.`parquet_t1`) AS gen_subquery_0 GROUP BY `gen_attr` HAVING (`gen_attr` > CAST(0 AS BIGINT))) AS gen_subquery_1) AS gen_subquery_2
\ No newline at end of file
(Contributor) commented:

Can you add a newline to the end? This little red thing on GitHub is annoying.

@dongjoon-hyun (Member, author) replied:

Oh, I see. Sure.

@dongjoon-hyun (Member, author)

Now, the remaining issue is using getResource to save the golden files. I left a comment about that.

@rxin (Contributor) commented Jul 18, 2016

Can you also remove "[TEST]" from the title? TEST isn't a module.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-16590][SQL][TEST] Improve LogicalPlanToSQLSuite to check generated SQL directly [SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to check generated SQL directly Jul 18, 2016
@SparkQA commented Jul 18, 2016

Test build #62448 has finished for PR 14235 at commit 38f52ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 18, 2016

Test build #62450 has finished for PR 14235 at commit a9a1b00.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import testImplicits._

// Used for generating new query answer files by saving
private val saveQuery = System.getenv("SPARK_GENERATE_GOLDEN_FILES") != null
@rxin (Contributor) commented:

Hm, if it is 0 we probably shouldn't be running it, so I'd test the value against "1":

Option(System.getenv("SPARK_GENERATE_GOLDEN_FILES")) == Some("1")

(Contributor) commented:

Also rename saveQuery to regenerateGoldenFiles.

@rxin (Contributor) commented Jul 18, 2016

Looks pretty good now. Just a couple of minor comments.

@dongjoon-hyun (Member, author) commented Jul 18, 2016

Thank you, @rxin. By the way, only the following test fails, and it has failed twice in a row with timeout errors.

HiveSparkSubmitSuite.SPARK-8020: set sql conf in spark conf *** FAILED *** (5 minutes, 0 seconds)

I looked into the cases, but I still have no idea about the failure. I hope Jenkins passes this time.

@SparkQA commented Jul 18, 2016

Test build #62469 has finished for PR 14235 at commit efaa4d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, author)

Thank you for guiding me in this PR, @rxin.

@liancheng (Contributor)

LGTM.

One thing is that I feel, most of the time, the SQL comparison assertion will fail due to reasonable internal changes that affect SQL generation in harmless ways, and can be fixed by simply regenerating the golden files. That said, shall we add instructions about how to regenerate the golden files when the assertion fails?

@dongjoon-hyun (Member, author)

Thank you for the review, @liancheng. Sure. Currently it is only documented in the classdoc. I think you are suggesting having that in some HTML page or wiki (Confluence). Did I understand your advice correctly?

/**
 * A test suite for LogicalPlan-to-SQL conversion.
 *
 * Each query has a golden generated SQL file in test/resources/sqlgen. The test suite also has
 * built-in functionality to automatically generate these golden files.
 *
 * To re-generate golden files, run:
 *    SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "hive/test-only *LogicalPlanToSQLSuite"
 */

@dongjoon-hyun (Member, author)

Hi, @rxin and @liancheng.
I will update this PR one more time. Please wait a moment.
I can use stable identifiers for gen_attr, too.

@SparkQA commented Jul 18, 2016

Test build #62486 has finished for PR 14235 at commit 244d013.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 18, 2016

Test build #62487 has finished for PR 14235 at commit ee5b747.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, author)

Hmm, HiveCompatibilitySuite has some dependency on gen_attr_, so I reverted the last attempt.
I will try that in another PR, as planned before.

@dongjoon-hyun (Member, author)

Jenkins was restarted, but the current last commit is efaa4d0, which already has a passing Jenkins test. Could you review and merge this first if possible, @rxin?

Test build #62469 has finished for PR 14235 at commit efaa4d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 18, 2016

Test build #62493 has finished for PR 14235 at commit efaa4d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Jul 19, 2016

Thanks - merging in master / 2.0.

I'm also merging this into 2.0 since it is a test-only change and will reduce merge conflicts.

@asfgit asfgit closed this in ea78edb Jul 19, 2016
asfgit pushed a commit that referenced this pull request Jul 19, 2016
[SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to check generated SQL directly

## What changes were proposed in this pull request?

This PR improves `LogicalPlanToSQLSuite` to check the generated SQL directly by **structure**. So far, `LogicalPlanToSQLSuite` relies on  `checkHiveQl` to ensure the **successful SQL generation** and **answer equality**. However, it does not guarantee the generated SQL is the same or will not be changed unnoticeably.

## How was this patch tested?

Pass the Jenkins. This is only a testsuite change.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14235 from dongjoon-hyun/SPARK-16590.

(cherry picked from commit ea78edb)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@dongjoon-hyun (Member, author)

Oh, thank you for merging, @rxin! Also, thank you for the review, @gatorsmile and @liancheng.

@rxin (Contributor) commented Jul 19, 2016

@dongjoon-hyun can you also look into having stable identifiers for gen_attr? Right now the golden files look really weird because gen_attr is used more than once.

@dongjoon-hyun (Member, author)

Sure. I've been looking into that. It's on my list.
I'll file a JIRA issue and proceed.
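One possible way to get stable identifiers, sketched here purely for illustration (the follow-up JIRA and its eventual implementation are not part of this thread), is to number each expression ID the first time it is seen, so every alias becomes gen_attr_0, gen_attr_1, ... instead of a repeated gen_attr.

// Sketch: map each expression ID to a stable, per-query index so that the
// generated aliases are distinguishable and deterministic.
import scala.collection.mutable

class StableAliasGenerator {
  private val indexByExprId = mutable.LinkedHashMap.empty[Long, Int]

  def aliasFor(exprId: Long): String = {
    // The first time an ID is seen it gets the next index; afterwards the
    // same ID always maps to the same alias.
    val index = indexByExprId.getOrElseUpdate(exprId, indexByExprId.size)
    s"gen_attr_$index"
  }
}

// In a REPL:
//   val gen = new StableAliasGenerator
//   gen.aliasFor(42L)  // "gen_attr_0"
//   gen.aliasFor(7L)   // "gen_attr_1"
//   gen.aliasFor(42L)  // "gen_attr_0" again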


// Used for generating new query answer files by saving
private val regenerateGoldenFiles =
Option(System.getenv("SPARK_GENERATE_GOLDEN_FILES")).contains("1")
@rxin (Contributor) commented:

Why did you use contains here? This is super confusing, and it also broke the Scala 2.10 build.

I think I asked to do the comparison with Some("1"). In most cases it is a very bad idea to use collection-oriented methods on Options, because they make the code more confusing.
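For reference, the two forms under discussion look like this as they would sit inside the suite class (Option.contains only exists from Scala 2.11 on, which is why the direct comparison was requested):

// Broke the Scala 2.10 build: Option.contains was only added in Scala 2.11.
// private val regenerateGoldenFiles =
//   Option(System.getenv("SPARK_GENERATE_GOLDEN_FILES")).contains("1")

// The requested form, which compiles on both Scala 2.10 and 2.11:
private val regenerateGoldenFiles =
  Option(System.getenv("SPARK_GENERATE_GOLDEN_FILES")) == Some("1")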

(Contributor) commented:

I fixed it here c4524f5

@dongjoon-hyun (Member, author) replied:

Yep. I have nothing to say. My bad. Sorry about this. :(

@dongjoon-hyun dongjoon-hyun deleted the SPARK-16590 branch August 14, 2016 09:41