[SPARK-31364][SQL][TESTS] Benchmark Parquet Nested Field Predicate Pushdown #28319
Conversation
ok to test

Thank you for your first contribution, @JiJiTang.
```scala
 * Results will be written to "benchmarks/ParquetNestedPredicatePushDownBenchmark-results.txt".
 * }}}
 */
object ParquetNestedPredicatePushDownBenchmark extends SqlBasedBenchmark {
```
BTW, you need to switch to a JDK8 HOME in your environment and run the above command once more. That will additionally generate ParquetNestedPredicatePushDownBenchmark-results.txt.
Thanks a lot @dongjoon-hyun. I will run the benchmark with JDK8 and commit the report.
@dongjoon-hyun, JDK8 benchmark results pushed.
Thanks.
cc @dbtsai, @holdenk, @gatorsmile
Add jdk8 benchmark result
ok to test

Test build #121702 has finished for PR 28319 at commit
```scala
    name: String, filterFn: DataFrame => DataFrame): Unit = {
  val loadDF = spark.read.parquet(inputPath)
  benchmark.addCase(name) {
    _ =>
```
nit: move `_ =>` to the previous line.
```scala
private def addCase(benchmark: Benchmark, inputPath: String,
                    enableNestedPD: Boolean,
                    name: String, filterFn: DataFrame => DataFrame): Unit = {
```
We use 2-space indentation in our codebase.
```scala
private def addCase(
  benchmark: Benchmark, inputPath: String,
  enableNestedPD: Boolean,
  name: String, filterFn: DataFrame => DataFrame): Unit = {
```

nit: you might call `filterFn` `withFilter`.
Actually, 4-space indentation for method parameters :)
haha, my bad~ was sleepy... lol
```scala
df.write.mode(SaveMode.Overwrite).parquet(tempDir.getCanonicalPath)
val benchmark = new Benchmark(name, N, NUMBER_OF_ITER, output = output)
addCase(benchmark, outputPath, enableNestedPD = false,
  "NestedFieldsPredicatePushDownDisabled", filterFn)
```
Maybe just call it "With nested predicate pushdown" and "Without nested predicate pushdown" for readability?
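The pattern under discussion (register one named case per pushdown setting, each applying a filter function, and time them) can be sketched outside Spark. The following is a hypothetical Python analogue, not Spark's `Benchmark` API; the case names follow the reviewer's suggestion:

```python
import time

def run_cases(cases, data, iterations=3):
    """Time each named case over `iterations` runs; keep the best time."""
    results = {}
    for name, filter_fn in cases:
        best = float("inf")
        for _ in range(iterations):
            start = time.perf_counter()
            filter_fn(data)          # the work being measured
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results

# Two cases, one per pushdown setting (the filter bodies here are placeholders).
cases = [
    ("With nested predicate pushdown", lambda rows: [r for r in rows if r < 0]),
    ("Without nested predicate pushdown", lambda rows: [r for r in rows if r < 0]),
]
timings = run_cases(cases, list(range(100_000)))
```

Keeping the best of several iterations, as above, reduces warm-up noise — the same reason the real benchmark runs `NUMBER_OF_ITER` iterations per case.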
```scala
 */
def runLoadNoRowGroupWhenPredicatePushedDown(): Unit = {
  // no row group will be loaded when predicate pushed down
  val filterFn: DataFrame => DataFrame = df => df.filter("nested.x < 0")
```
nit: you might call `filterFn` `withFilter`. Typically, the type is not required for non-public variables.
```scala
val filterFn: DataFrame => DataFrame = { df =>
  df.filter("nested.x >= 0").filter(s"nested.x <= $N")
}
createAndRunBenchmark("LoadAllRowGroupsWhenPredicatePushedDown", filterFn)
```
"All row groups can not be skipped"
```scala
def runLoadSomeRowGroupWhenPredicatePushedDown(): Unit = {
  // only a row group will be loaded when predicate pushed down
  val filterFn: DataFrame => DataFrame = df => df.filter("nested.x = 100")
  createAndRunBenchmark("LoadSomeRowGroupsWhenPredicatePushedDown", filterFn)
```
"Some row groups can be skipped with nested predicate pushdown"
```scala
def runLoadNoRowGroupWhenPredicatePushedDown(): Unit = {
  // no row group will be loaded when predicate pushed down
  val filterFn: DataFrame => DataFrame = df => df.filter("nested.x < 0")
  createAndRunBenchmark("LoadNoRowGroupsWhenPredicatePushedDown", filterFn)
```
"all row groups can be skipped with nested predicate pushdown"
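The three benchmark scenarios map directly onto row-group min/max pruning. A small Python sketch (illustrative values and names, not the Parquet reader API) shows how a reader decides which row groups to load for each predicate:

```python
def groups_to_load(row_groups, lo, hi):
    """Keep only groups whose [min, max] value range overlaps [lo, hi]."""
    return [g for g in row_groups if not (max(g) < lo or min(g) > hi)]

# Four row groups covering values 0..999, as min/max statistics would record
# (the real benchmark uses N = 100 * 1024 * 1024; 1000 keeps the sketch small).
groups = [list(range(i, i + 250)) for i in range(0, 1000, 250)]

# Scenario "no row groups loaded": nested.x < 0 — every group's min is >= 0.
assert groups_to_load(groups, float("-inf"), -1) == []
# Scenario "some row groups loaded": nested.x = 100 — one group contains 100.
assert len(groups_to_load(groups, 100, 100)) == 1
# Scenario "all row groups loaded": 0 <= nested.x <= N — nothing can be skipped,
# which is the worst case used to check for pushdown overhead.
assert len(groups_to_load(groups, 0, 1000)) == 4
```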
dbtsai left a comment:
Thanks for the benchmark.
Looks like nested predicate pushdown doesn't add overhead in the worst case, when no row group can be skipped. cc @HyukjinKwon @cloud-fan @gatorsmile and @rdblue
On this synthetic data, the performance improvement is very impressive; in some of our prod jobs, we see 10x gains.
LGTM on this PR except for some minor comments.
cc @MaxGekk
Thanks a lot @dbtsai, @cloud-fan. I have pushed a commit to fix the coding style and rename the benchmark cases. BTW, do we have an IDE formatter config somewhere so that we can apply the formatting before pushing commits?
```scala
private def createAndRunBenchmark(name: String, withFilter: DataFrame => DataFrame): Unit = {
  withTempPath {
    tempDir =>
```
nit: move this to the previous line.
```scala
withTempPath {
  tempDir =>
    val outputPath = tempDir.getCanonicalPath
    df.write.mode(SaveMode.Overwrite).parquet(tempDir.getCanonicalPath)
```
`df.write.mode(SaveMode.Overwrite).parquet(outputPath)`?
Updated, @dbtsai. Also pushed the scalafmt-formatted source file.
@dbtsai, @cloud-fan, just found that there's …
Rename benchmark and apply scalafmt
```scala
private val N = 100 * 1024 * 1024
private val NUMBER_OF_ITER = 10

override def getSparkSession: SparkSession = {
```
Why do you override it? What's the problem with default settings?
@MaxGekk , thanks a lot, this was copied from FilterPushdownBenchmark.scala. Just realised the default setting also sets the master to local[1]...
override method removed.
```scala
}

private val df: DataFrame = spark
  .range(1, N, 1, 4)
```
What's the reason for creating 4 partitions (and 4 files) if you have only 1 CPU?
Hi @MaxGekk, the 4 partitions are there to make sure multiple row groups are created for the small benchmark Parquet dataset (I didn't change the Parquet row-group block size). Multiple partitions with 1 CPU simulate a production scenario where we have many partitions spread across a limited number of executors with a limited number of cores; with nested predicates pushed down, we get a big performance gain since we don't need to read all the row groups. Since the dataset in this benchmark is small, with multiple CPUs the partitions would be read in parallel when nested predicate pushdown is disabled, and we would not be able to see a clear performance gain in terms of job execution time.
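The point about 4 partitions on a single core can be made concrete with a small sketch (pure Python, hypothetical helper names): under a serial scan, every file that min/max pruning skips removes its entire scan cost from the total, which is exactly why the gain is visible with 1 CPU but would be hidden by parallel reads:

```python
def split_into_partitions(n, num_partitions):
    """Mimic spark.range(1, n, 1, num_partitions): contiguous value ranges."""
    step = (n - 1) // num_partitions
    return [list(range(1 + i * step, 1 + (i + 1) * step))
            for i in range(num_partitions)]

def serial_scan(partitions, keep_partition):
    """Rows actually read with 1 CPU when skippable partitions are dropped."""
    rows_read = 0
    for part in partitions:           # files are scanned one after another
        if keep_partition(part):      # min/max check, as in row-group pruning
            rows_read += len(part)
    return rows_read

parts = split_into_partitions(101, 4)                      # 4 files, 25 values each
full = serial_scan(parts, lambda p: True)                  # no pruning: 100 rows
pruned = serial_scan(parts, lambda p: min(p) <= 30 <= max(p))  # 1 file: 25 rows
```

With pruning, the serial scan touches a quarter of the rows; with parallel readers, the three skipped files would have been read concurrently anyway, so wall-clock time would barely change.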
LGTM.
remove override of spark session creation
Merged into master / branch-3.0. Thanks!
[SPARK-31364][SQL][TESTS] Benchmark Parquet Nested Field Predicate Pushdown

### What changes were proposed in this pull request?
This PR aims to add a benchmark suite for nested predicate pushdown with Parquet files. Performance comparison: nested predicate pushdown disabled vs. enabled, with the following query scenarios:
1. When the predicate is pushed down, the Parquet reader is able to filter out all the row groups without loading them.
2. When the predicate is pushed down, the Parquet reader only loads one of the row groups.
3. When the predicate is pushed down, the Parquet reader can't filter out any row group, in order to see whether we introduce too much overhead when enabling nested predicate pushdown.

### Why are the changes needed?
No benchmark exists today for evaluating nested field predicate pushdown performance.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Benchmark runs and reported results.

Closes #28319 from JiJiTang/SPARK-31364.
Authored-by: Jian Tang <jian_tang@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
(cherry picked from commit 6a57616)
Signed-off-by: DB Tsai <d_tsai@apple.com>
@JiJiTang, do you have an Apache JIRA account so I can assign this ticket to you?
Test build #121778 has finished for PR 28319 at commit

Test build #121780 has finished for PR 28319 at commit

Test build #121783 has finished for PR 28319 at commit

@dbtsai, thanks a lot. Here is my JIRA account: https://issues.apache.org/jira/secure/ViewProfile.jspa?name=jijitang

Thanks, @JiJiTang.

Thanks a lot, @dongjoon-hyun.