Conversation

@JiJiTang (Contributor) commented Apr 23, 2020

What changes were proposed in this pull request?

This PR aims to add a benchmark suite for nested predicate pushdown with Parquet files.

Performance comparison: nested predicate pushdown disabled vs. enabled, with the following query scenarios:

  1. When the predicate is pushed down, the Parquet reader is able to filter out all the row groups without loading them.

  2. When the predicate is pushed down, the Parquet reader loads only one of the row groups.

  3. When the predicate is pushed down, the Parquet reader can't filter out any row group; this checks whether enabling nested predicate pushdown introduces too much overhead.
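The three scenarios boil down to how many row groups survive a min/max statistics check. As a minimal, Spark-free sketch (the `Stats` type, the group layout, and the skipping rule here are simplified stand-ins for what the Parquet reader actually does):

```scala
// Simplified stand-in for a Parquet row group's min/max column statistics.
case class Stats(min: Long, max: Long)

object PushdownScenarios {
  // A row group must be loaded only when its stats say a row could match.
  def loaded(groups: Seq[Stats], canMatch: Stats => Boolean): Int =
    groups.count(canMatch)

  def main(args: Array[String]): Unit = {
    val n = 100L * 1024 * 1024
    // Four row groups with disjoint, sorted ranges, as a sorted range
    // written in 4 partitions would produce (an assumption about layout).
    val step = n / 4
    val groups = (0L until 4L).map(i => Stats(1L.max(i * step), (i + 1) * step - 1))

    // Scenario 1: `nested.x < 0` -- every group's min is >= 1, all skipped.
    println(loaded(groups, g => g.min < 0L))                      // 0
    // Scenario 2: `nested.x = 100` -- only the first group contains 100.
    println(loaded(groups, g => g.min <= 100L && 100L <= g.max))  // 1
    // Scenario 3: `nested.x between 0 and n` -- no group can be skipped.
    println(loaded(groups, g => g.max >= 0L && g.min <= n))       // 4
  }
}
```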

Why are the changes needed?

No benchmark exists today for evaluating the performance of nested-field predicate pushdown.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Benchmark runs and the reported results.

@dongjoon-hyun (Member)

ok to test

@dongjoon-hyun (Member)

Thank you for your first contribution, @JiJiTang.

* Results will be written to "benchmarks/ParquetNestedPredicatePushDownBenchmark-results.txt".
* }}}
*/
object ParquetNestedPredicatePushDownBenchmark extends SqlBasedBenchmark {
@dongjoon-hyun (Member) commented Apr 23, 2020:

BTW, you need to switch to a JDK 8 home in your environment and run the above command once more. That will additionally generate ParquetNestedPredicatePushDownBenchmark-results.txt.

@JiJiTang (Contributor, PR author):

Thanks a lot @dongjoon-hyun. I will run the benchmark with JDK8 and commit the report.

@JiJiTang (Contributor, PR author):

@dongjoon-hyun , jdk8 benchmark results pushed.

Member:

Thanks.

@dongjoon-hyun (Member)

cc @dbtsai , @holdenk , @gatorsmile

@dongjoon-hyun (Member)

ok to test

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-31364][SQL]Benchmark Parquet Nested Field Predicate Pushdown [SPARK-31364][SQL][TESTS] Benchmark Parquet Nested Field Predicate Pushdown Apr 23, 2020
@SparkQA commented Apr 24, 2020

Test build #121702 has finished for PR 28319 at commit 7a6df2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

name: String, filterFn: DataFrame => DataFrame): Unit = {
val loadDF = spark.read.parquet(inputPath)
benchmark.addCase(name) {
_ =>
Member:

nit: move _ => to the previous line


private def addCase(benchmark: Benchmark, inputPath: String,
enableNestedPD: Boolean,
name: String, filterFn: DataFrame => DataFrame): Unit = {
Member:

We use 2-space indentation in our codebase.

private def addCase(
  benchmark: Benchmark, inputPath: String,
  enableNestedPD: Boolean,
  name: String, filterFn: DataFrame => DataFrame): Unit = {

Member:

nit: you might rename filterFn to withFilter.

Contributor:

actually, 4 spaces indentation for method parameters :)

Member:

haha, my bad~ was sleepy... lol

df.write.mode(SaveMode.Overwrite).parquet(tempDir.getCanonicalPath)
val benchmark = new Benchmark(name, N, NUMBER_OF_ITER, output = output)
addCase(benchmark, outputPath, enableNestedPD = false,
"NestedFieldsPredicatePushDownDisabled", filterFn)
@dbtsai (Member) commented Apr 24, 2020:

Maybe just call them "With nested predicate pushdown" and "Without nested predicate pushdown" for readability?

*/
def runLoadNoRowGroupWhenPredicatePushedDown(): Unit = {
// no row group will be loaded when predicate pushed down
val filterFn: DataFrame => DataFrame = df => df.filter("nested.x < 0")
Member:

nit: you might rename filterFn to withFilter. Typically, the type annotation is not required for a non-public variable.

val filterFn: DataFrame => DataFrame = { df =>
  df.filter("nested.x >= 0").filter(s"nested.x <= $N")
}
createAndRunBenchmark("LoadAllRowGroupsWhenPredicatePushedDown", filterFn)
Member:

"All row groups can not be skipped"

def runLoadSomeRowGroupWhenPredicatePushedDown(): Unit = {
// only a row group will be loaded when predicate pushed down
val filterFn: DataFrame => DataFrame = df => df.filter("nested.x = 100")
createAndRunBenchmark("LoadSomeRowGroupsWhenPredicatePushedDown", filterFn)
Member:

"Some row groups can be skipped with nested predicate pushdown"

def runLoadNoRowGroupWhenPredicatePushedDown(): Unit = {
// no row group will be loaded when predicate pushed down
val filterFn: DataFrame => DataFrame = df => df.filter("nested.x < 0")
createAndRunBenchmark("LoadNoRowGroupsWhenPredicatePushedDown", filterFn)
Member:

"all row groups can be skipped with nested predicate pushdown"

@dbtsai (Member) left a review comment:

Thanks for the benchmark.

It looks like nested predicate pushdown doesn't add overhead even in the worst case, when no row group can be skipped. cc @HyukjinKwon @cloud-fan @gatorsmile and @rdblue

On this synthetic data, the performance improvement is very impressive; in some of our prod jobs, we see 10x gains.

LGTM on this PR, except for some minor comments.

@cloud-fan (Contributor)

cc @MaxGekk

@JiJiTang (Contributor, PR author)

Thanks a lot @dbtsai, @cloud-fan. I have pushed a commit to fix the coding style and rename the benchmarks. BTW, do we have an IDE formatter config somewhere, so that we can apply the formatting before pushing commits?


private def createAndRunBenchmark(name: String, withFilter: DataFrame => DataFrame): Unit = {
withTempPath {
tempDir =>
Member:

move this to previous line

withTempPath {
tempDir =>
val outputPath = tempDir.getCanonicalPath
df.write.mode(SaveMode.Overwrite).parquet(tempDir.getCanonicalPath)
Member:

df.write.mode(SaveMode.Overwrite).parquet(outputPath)?

@JiJiTang (Contributor, PR author):

Updated, @dbtsai. I also pushed the scalafmt-formatted source file.

@JiJiTang (Contributor, PR author)

@dbtsai, @cloud-fan, I just found that there's ./dev/scalafmt, which can be used to format newly added code. I will use it to format the code.

Rename benchmark and apply scalafmt
private val N = 100 * 1024 * 1024
private val NUMBER_OF_ITER = 10

override def getSparkSession: SparkSession = {
@MaxGekk (Member):

Why do you override it? What's the problem with the default settings?

@JiJiTang (Contributor, PR author):

@MaxGekk, thanks a lot; this was copied from FilterPushdownBenchmark.scala. I just realised the default settings also set the master to local[1]...

@JiJiTang (Contributor, PR author):

The override method has been removed.

}

private val df: DataFrame = spark
.range(1, N, 1, 4)
@MaxGekk (Member):

What's the reason for creating 4 partitions (and 4 files) if you have only 1 CPU?

@JiJiTang (Contributor, PR author) commented Apr 24, 2020:

Hi @MaxGekk, the 4 partitions are there to make sure multiple row groups are created for the small benchmark Parquet dataset (I didn't change the Parquet row-group block size). Multiple partitions with 1 CPU simulate a production scenario where a lot of partitions are spread across a limited number of executors with a limited number of cores; with nested predicates pushed down we can see a big performance gain, since we don't need to read all the row groups. In this benchmark the dataset is small, so with multiple CPUs the partitions would be read in parallel even when nested predicate pushdown is disabled, and we would not be able to see a clear performance gain in job execution time.
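The effect described above can be sketched in plain Scala. The slice-boundary formula below approximates how spark.range(1, N, 1, 4) splits its range, and the "one file per slice" framing is an assumption, since real row-group boundaries also depend on the Parquet block size:

```scala
object PartitionRangeSketch {
  def main(args: Array[String]): Unit = {
    val n = 100L * 1024 * 1024
    val numSlices = 4
    // spark.range(1, n, 1, 4) splits [1, n) into 4 contiguous slices,
    // so each of the 4 output files covers a disjoint value range.
    val slices = (0 until numSlices).map { i =>
      val start = 1L + i * (n - 1) / numSlices
      val end   = 1L + (i + 1) * (n - 1) / numSlices // exclusive
      (start, end)
    }
    // A point filter such as `nested.x = 100` can only match one slice,
    // so with pushdown the other three files are never read. With a
    // single core those reads would otherwise happen sequentially, which
    // is why the skipped I/O shows up directly in job execution time.
    val matching = slices.count { case (s, e) => s <= 100L && 100L < e }
    println(s"$matching of $numSlices files can contain nested.x = 100")
  }
}
```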

@dbtsai (Member) commented Apr 24, 2020

LGTM.

remove override of spark session creation
@dbtsai dbtsai closed this in 6a57616 Apr 24, 2020
@dbtsai (Member) commented Apr 24, 2020

Merged into master / branch-3.0. Thanks!

dbtsai pushed a commit that referenced this pull request Apr 24, 2020
…shdown

Closes #28319 from JiJiTang/SPARK-31364.

Authored-by: Jian Tang <jian_tang@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
(cherry picked from commit 6a57616)
Signed-off-by: DB Tsai <d_tsai@apple.com>
@dbtsai (Member) commented Apr 24, 2020

@JiJiTang, do you have an Apache JIRA account, so I can assign this ticket to you?

@SparkQA commented Apr 24, 2020

Test build #121778 has finished for PR 28319 at commit c72d0ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 25, 2020

Test build #121780 has finished for PR 28319 at commit e0703a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 25, 2020

Test build #121783 has finished for PR 28319 at commit f275021.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JiJiTang (Contributor, PR author)

@dbtsai, thanks a lot. Here is my JIRA account: https://issues.apache.org/jira/secure/ViewProfile.jspa?name=jijitang

@dongjoon-hyun (Member)

Thanks, @JiJiTang .
I added you to the Apache Spark contributor group and assigned SPARK-31364 to you.

@JiJiTang (Contributor, PR author)

Thanks a lot @dongjoon-hyun
