-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-16590][SQL] Improve LogicalPlanToSQLSuite to check generated SQL directly #14235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
hm the problem with this approach is that we'd need to spend a lot of time to update test cases whenever we change sql generation slightly. I think in order to do this, we should put the generated sql in files and then have a way to regenerate all the sql queries in bulk. |
|
Oh, thank you for fast review! |
|
Test build #62420 has finished for PR 14235 at commit
|
|
It sounds like the latest changes are unable to resolve Reynold's concern: how to regenerate all the expected SQL queries in bulk? You know, for native view support, we are not generating optimal SQL statements. You can see the generated SQL is verbose and not readable. However, these SQL statements can be optimized by Catalyst optimizer at runtime. Thus, I have the same concern. This approach is not maintainable |
|
|
||
| private def checkHiveQl(hiveQl: String): Unit = { | ||
| // Used for generating new query answer files by saving | ||
| private val saveQuery = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @gatorsmile .
This is my answer to that. :)
|
First of all, you can update the whole query set by one flag. So, maintainability is no more difficult issue. I made this PR because We need to prevent unintentional and accidental changes on that before both Hint or SPARK-16576. IMO, this is a correctness issue we should resolve. I hope this PR protects Spark from me. :) |
|
Comparing the SQL statement strings is horrible to me. I have a different approach to verify the correctness. How about compare the optimized plans, which can tolerate more slight changes that do not affect the correctness? |
|
:) Sorry, but I don't think so.
|
|
I can understand the advantage of comparing SQL statement strings, but the cost is higher than the benefit especially for a small group of key developers who are maintaining the Spark code everyday. In some commercial RDBMS, they have a whole team for developing and maintaining SQL generation. However, I am not sure if the Spark community can afford it. You know, SQL generation is only used for native view support so far. |
|
I'm not sure what cost exactly do you mean. |
|
My concern is we might spend a lot of times when making changes that impact the SQL statement strings but it does not change the correctness of SQL builder. |
|
Test build #62423 has finished for PR 14235 at commit
|
|
First, we will make more HINTS. I made a general HINT syntax for that. We are opening the door together. Second, when you add your new testcase here, the additional cost is just one second to generate the file. Third, when you develops your PR, this will save much time. I'm already using this to save my time. At this time, we need to verify correctness at every level. If you use optimized result or the execution result, that is a naive test. You know that. |
|
If I miss some cost you concern, could you give me some examples? I'm still not sure what you mean by cost. If I really understand what is the real situation you concern, I might change my thought. But, at the worst case, when you add a new testcase, you can omit the secondary file option. Then, it works like before. |
|
@dongjoon-hyun can you update the classdoc for the test suite to explain how to generate the expected results? |
|
Also personally I find it difficult to look at the result without looking at the original SQL. Can we update the result file to show the following format? |
|
Oh, sure. I'll update the classdoc and change the file format. |
|
Checking the generated SQL is a good thing to do - we just need to make sure it is maintainable (i.e. have tooling to automatically regenerate results). Once we are done with this, we should also do similar things for the parser -- I think the unit tests for parsers don't currently check for the plan, but only checks whether the parser can parse the input. |
|
For example, if we change generation of alias names, most of test cases will fail. |
|
Yeah, we need to do it for |
|
Automating SQL statement file generation can simplify the work, but the reviewers have to be careful when reviewing the statement changes. Those statements could be very complex and hard to read. |
|
Since the purpose of the function here is to generate SQL, I'd say it's important to actually show the generated SQL in reviews. That said, if we can also get something even more automated (e.g. asserting that the optimized plan equals) in addition to this, then it'd be even more robust! |
…ration answerset.
|
Now, the answer files have both original and generated queries and the description of regenerating answer sets is added to |
|
Test build #62426 has finished for PR 14235 at commit
|
| @@ -0,0 +1,3 @@ | |||
| SELECT COUNT(value) FROM parquet_t1 GROUP BY key HAVING MAX(key) > 0 | |||
| -------------------------------------------------------------------------------- | |||
| SELECT `gen_attr` AS `count(value)` FROM (SELECT `gen_attr` FROM (SELECT count(`gen_attr`) AS `gen_attr`, max(`gen_attr`) AS `gen_attr` FROM (SELECT `key` AS `gen_attr`, `value` AS `gen_attr` FROM `default`.`parquet_t1`) AS gen_subquery_0 GROUP BY `gen_attr` HAVING (`gen_attr` > CAST(0 AS BIGINT))) AS gen_subquery_1) AS gen_subquery_2 No newline at end of file | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a new line to the end? this little red thing on github is annoying
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. Sure.
|
Now, the remaining issue is using |
|
Can you also remove "[TEST]" from the title? TEST isn't a module. |
|
Test build #62448 has finished for PR 14235 at commit
|
|
Test build #62450 has finished for PR 14235 at commit
|
| import testImplicits._ | ||
|
|
||
| // Used for generating new query answer files by saving | ||
| private val saveQuery = System.getenv("SPARK_GENERATE_GOLDEN_FILES") != null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm if it is 0 we probably shouldn't be running it, so i'd test the value against 1
Option(System.getenv("SPARK_GENERATE_GOLDEN_FILES")) == Some("1")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also rename saveQuery to regenerateGoldenFiles
|
Looks pretty good now. Just couple minor comments. |
|
Thank you, @rxin . By the way, only the following test occurs two times sequentially. It's timeout errors. I looked into the cases, but still have no idea about that. I wish the Jenkins pass at this time. |
|
Test build #62469 has finished for PR 14235 at commit
|
|
Thank you for guiding me in this PR, @rxin . |
|
LGTM. One thing is that I feel most of the times the SQL comparison assertion may fail due to reasonable internal changes that somehow affect SQL generation in no harmful ways, and can be fixed by simply regenerating the golden files. That said, shall we add instructions about how to regenerate the golden files when the the assertion fails? |
|
Thank you for review, @liancheng . Sure. Currently, it is only documented in class doc. I think you are suggesting to have that in some HTML or Wiki(Confluence). Did I understand your advice correctly? /**
* A test suite for LogicalPlan-to-SQL conversion.
*
* Each query has a golden generated SQL file in test/resources/sqlgen. The test suite also has
* built-in functionality to automatically generate these golden files.
*
* To re-generate golden files, run:
* SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "hive/test-only *LogicalPlanToSQLSuite"
*/ |
|
Hi, @rxin and @liancheng . |
|
Test build #62486 has finished for PR 14235 at commit
|
|
Test build #62487 has finished for PR 14235 at commit
|
|
Hmm, HiveCompatibilitySuite has some dependency on |
|
Jenkins is restarted, but the current last commit is efaa4d0 having the passed Jenkins test. Could you review and merge this first if possible, @rxin ? |
|
Test build #62493 has finished for PR 14235 at commit
|
|
Thanks - merging in master / 2.0. I'm also merging this in 2.0 since it is a test only change and will reduce merge conflicts. |
…QL directly ## What changes were proposed in this pull request? This PR improves `LogicalPlanToSQLSuite` to check the generated SQL directly by **structure**. So far, `LogicalPlanToSQLSuite` relies on `checkHiveQl` to ensure the **successful SQL generation** and **answer equality**. However, it does not guarantee the generated SQL is the same or will not be changed unnoticeably. ## How was this patch tested? Pass the Jenkins. This is only a testsuite change. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14235 from dongjoon-hyun/SPARK-16590. (cherry picked from commit ea78edb) Signed-off-by: Reynold Xin <rxin@databricks.com>
|
Oh, thank you for merging, @rxin ! Also, thank you for review, @gatorsmile and @liancheng . |
|
@dongjoon-hyun can you also look into having stable identifiers for gen_attr? Right now the golden files look really weird because gen_attr is used more than once. |
|
Sure. I've been looking that. It's on my list. |
|
|
||
| // Used for generating new query answer files by saving | ||
| private val regenerateGoldenFiles = | ||
| Option(System.getenv("SPARK_GENERATE_GOLDEN_FILES")).contains("1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you use contains here? This is super confusing and also broke 2.10.
I think I asked to do comparison with Some("1"). In most cases it is a very bad idea to use collection-oriented methods on Options, because they make the code more confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed it here c4524f5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. I have nothing to say. My bad. Sorry about this. :(
What changes were proposed in this pull request?
This PR improves
LogicalPlanToSQLSuiteto check the generated SQL directly by structure. So far,LogicalPlanToSQLSuiterelies oncheckHiveQlto ensure the successful SQL generation and answer equality. However, it does not guarantee the generated SQL is the same or will not be changed unnoticeably.How was this patch tested?
Pass the Jenkins. This is only a testsuite change.