Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-22787] [TEST] [SQL] Add a TPC-H query suite #19982

Closed
wants to merge 5 commits into from

Conversation

gatorsmile
Copy link
Member

What changes were proposed in this pull request?

Add a test suite to ensure all the TPC-H queries can be successfully analyzed, optimized and compiled without hitting the max iteration threshold.

How was this patch tested?

N/A

private def readToUnsafeMem(
conf: Broadcast[SerializableConfiguration],
requiredSchema: StructType,
wholeTextMode: Boolean): (PartitionedFile) => Iterator[UnsafeRow] = {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a style fix of a PR I merged today.

@gatorsmile
Copy link
Member Author

cc @cloud-fan @maropu @kiszk

@SparkQA
Copy link

SparkQA commented Dec 14, 2017

Test build #84927 has finished for PR 19982 at commit 9d52ff0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-- using default substitutions

select
l_returnflag,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use space instead of tab?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the generated SQL. I plan to keep them unchanged.

@SparkQA
Copy link

SparkQA commented Dec 14, 2017

Test build #84928 has finished for PR 19982 at commit fac5fb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10", "q11",
"q12", "q13", "q14", "q15", "q16", "q17", "q18", "q19", "q20", "q21", "q22")

private def checkGeneratedCode(plan: SparkPlan): Unit = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making a base trait to remove duplicate code between this and TPCDSQuerySuite?

@maropu
Copy link
Member

maropu commented Dec 14, 2017

LGTM for one minor comment.
BTW, how about also adding star schema benchmark (many join cases) to check compilation?

@gatorsmile
Copy link
Member Author

@maropu
Copy link
Member

maropu commented Dec 14, 2017

yea.

@maropu
Copy link
Member

maropu commented Dec 14, 2017

Probably, we could get sql queries from here https://github.com/electrum/ssb-dbgen (I don't look into yet though...) Probably, we need to get sql queries from the paper (or, https://github.com/eyalroz/ssb-tools/tree/master/benchmark_queries)

@kiszk
Copy link
Member

kiszk commented Dec 15, 2017

LGTM

/**
* Drop all the tables
*/
protected override def afterAll(): Unit = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As same as TPCDSQuerySuite, we can use BenchmarkQueryTest.afterAll.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea we don't need to overwrite it here.

@viirya
Copy link
Member

viirya commented Dec 15, 2017

LGTM with one minor comment.

protected override def afterAll(): Unit = {
try {
// For debugging dump some statistics about how much time was spent in various optimizer rules
logWarning(RuleExecutor.dumpTimeSpent())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why warning for debug uses?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is that a bad idea to dump the time for each query?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do use logWarning, the messages will not be shown in the test log.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just to give the overall picture how long each rule takes.

I plan to submit another PR to track which rule takes an effect for a specific query and also record the time cost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, ok

@SparkQA
Copy link

SparkQA commented Dec 15, 2017

Test build #84936 has finished for PR 19982 at commit 9dc4457.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BenchmarkQueryTest extends QueryTest with SharedSQLContext with BeforeAndAfterAll
  • class TPCDSQuerySuite extends BenchmarkQueryTest
  • class TPCHQuerySuite extends BenchmarkQueryTest

@gatorsmile
Copy link
Member Author

cc @maropu Feel free to submit a PR for adding SSB

@maropu
Copy link
Member

maropu commented Dec 15, 2017

yea, sure. I have much bandwidth now :)

@SparkQA
Copy link

SparkQA commented Dec 15, 2017

Test build #84940 has finished for PR 19982 at commit 2671416.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

Thanks! Merged to master.

@asfgit asfgit closed this in 3fea5c4 Dec 15, 2017
@maropu
Copy link
Member

maropu commented Dec 15, 2017

#19990

@maropu
Copy link
Member

maropu commented Dec 23, 2017

@gatorsmile Any progress on this? #19982 (comment)
After I thought your comment, I came up with collecting metrics for each rule like;
master...maropu:MetricSpike
This conflicts with your activity, or this is not acceptable? welcome any comment.

@gatorsmile
Copy link
Member Author

@maropu Thanks for your contribution. It looks over engineering. We do not need such complicated solutions for this simple use case. We just need to record them in the log. We are also proposing new APIs for our logs. @jiangxb1987 is working on the design.

@maropu
Copy link
Member

maropu commented Dec 24, 2017

ok, I just look forward to the proposal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants