Skip to content

Conversation

@yucai
Copy link
Contributor

@yucai yucai commented Oct 27, 2018

What changes were proposed in this pull request?

Refactor BuiltInDataSourceWriteBenchmark, DataSourceWriteBenchmark and AvroWriteBenchmark to use main method.

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.BuiltInDataSourceWriteBenchmark"

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "avro/test:runMain org.apache.spark.sql.execution.benchmark.AvroWriteBenchmark"

How was this patch tested?

manual tests

@yucai
Copy link
Contributor Author

yucai commented Oct 27, 2018

@gengliangwang @wangyum @dongjoon-hyun help review.

@SparkQA
Copy link

SparkQA commented Oct 27, 2018

Test build #98133 has finished for PR 22861 at commit 81fe383.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait DataSourceWriteBenchmark extends SqlBasedBenchmark

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yucai . We (@yucai and @gengliangwang and me) know what is this for, but please don't piggy-back (hide) something like this. This kind of change requires another explicit PR in general.

@yucai
Copy link
Contributor Author

yucai commented Oct 29, 2018

@dongjoon-hyun Originally, I want to do two things in this PR.

  1. Make mainArgs correctly set in BenchmarkBase.
  2. Include an example to use mainArgs: refactor DataSourceWriteBenchmark and BuiltInDataSourceWriteBenchmark to use main method.

But, refactor DataSourceWriteBenchmark will lead to refactor AvroWriteBenchmark.

Any suggestion how to split PR?

@dongjoon-hyun
Copy link
Member

At least, the following is worth for a separate PR because it's orthogonal Refactor ... to use main method.

1. Make mainArgs correctly set in BenchmarkBase.

One PR had better have one theme. Putting different themes into one PR together is not a good practice.

Please start with the minimal one. If the committer asks some example, then add that later. That's better.

@gengliangwang
Copy link
Member

Personally I am against accessing the main args in such way. It looks a bit ugly.
But if we have to move everything to BenchmarkBase, then maybe this is the way.

@yucai
Copy link
Contributor Author

yucai commented Oct 29, 2018

Current implementation misses main args, but some suite would need it anyway.
Let's can discuss #22872.

@yucai
Copy link
Contributor Author

yucai commented Oct 29, 2018

@dongjoon-hyun I used #22872 to make main args accessible for BenchmarkBase's subclass, this PR is mainly for refactoring DataSourceWriteBenchmark and BuiltInDataSourceWriteBenchmark. But it will lead to AvroWriteBenchmark refactor, so I put them in one PR, kindly let me know if you have better idea.

* 2. with sbt: build/sbt "avro/test:runMain <this class>"
* To run this benchmark:
* {{{
* 1. without sbt: bin/spark-submit --class <this class>
Copy link
Member

@wangyum wangyum Oct 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add avro:

bin/spark-submit --class <this class> --jars <spark core test jar>,<spark catalyst test jar>,<spark sql test jar> <spark avro test jar>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hint an exception when run:

bin/spark-submit --class org.apache.spark.sql.execution.benchmark.AvroWriteBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar ./external/avro/target/spark-avro_2.11-3.0.0-SNAPSHOT-tests.jar
Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: Avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:647)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:94)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:93)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:313)
	at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
	at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
......

Copy link
Contributor Author

@yucai yucai Oct 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangyum Good catch! I think it needs <spark avro jar>, added.

@SparkQA
Copy link

SparkQA commented Oct 29, 2018

Test build #98207 has finished for PR 22861 at commit 59f167d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait UnevaluableAggregate extends DeclarativeAggregate
  • abstract class UnevaluableBooleanAggBase(arg: Expression)
  • case class EveryAgg(arg: Expression) extends UnevaluableBooleanAggBase(arg)
  • case class AnyAgg(arg: Expression) extends UnevaluableBooleanAggBase(arg)
  • case class SomeAgg(arg: Expression) extends UnevaluableBooleanAggBase(arg)
  • case class UnresolvedCatalystToExternalMap(

@SparkQA
Copy link

SparkQA commented Oct 29, 2018

Test build #98215 has finished for PR 22861 at commit 74f6ef5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 30, 2018

Test build #98247 has finished for PR 22861 at commit 93ca3be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 30, 2018

Test build #98245 has finished for PR 22861 at commit f06b2ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yucai
Copy link
Contributor Author

yucai commented Oct 30, 2018

@dongjoon-hyun Tests have been passed.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yucai . Almost looks good. Please update the PR description according to the PR content and merge yucai#6 .

  • Fix
- SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.AvroWriteBenchmark"
+ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "avro/test:runMain org.apache.spark.sql.execution.benchmark.AvroWriteBenchmark"
  • Remove
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.BuiltInDataSourceWriteBenchmark Parquet ORC"

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Oct 31, 2018

Thank you, @yucai , @gengliangwang and @wangyum .
Merged to master. (My PR is just for test output update. So there is no need to wait Jenkins)

@asfgit asfgit closed this in f8484e4 Oct 31, 2018
@SparkQA
Copy link

SparkQA commented Oct 31, 2018

Test build #98310 has finished for PR 22861 at commit 3ca0cf2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…Benchmark, DataSourceWriteBenchmark and AvroWriteBenchmark to use main method

## What changes were proposed in this pull request?

Refactor BuiltInDataSourceWriteBenchmark, DataSourceWriteBenchmark and AvroWriteBenchmark to use main method.

```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.BuiltInDataSourceWriteBenchmark"

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "avro/test:runMain org.apache.spark.sql.execution.benchmark.AvroWriteBenchmark"
```
## How was this patch tested?

manual tests

Closes apache#22861 from yucai/BuiltInDataSourceWriteBenchmark.

Lead-authored-by: yucai <yyu1@ebay.com>
Co-authored-by: Yucai Yu <yucai.yu@foxmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants