
perf: Add benchmarks for Spark Scan + Comet Exec #863

Merged (5 commits into apache:main, Aug 23, 2024)

Conversation

@andygrove (Member) commented Aug 22, 2024

Which issue does this PR close?

Related to #798

Rationale for this change

Add benchmarks to show the performance of disabling native scan and converting Spark columns to Comet (currently performed by converting to rows first, but we plan to implement a more efficient conversion in #798).
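
For illustration, here is a minimal sketch of what the new "Spark Scan + Comet Exec" case might look like, modeled on the existing cases in this benchmark. Only CometConf.COMET_ENABLED, cometSpark, queryString, and the benchmark.addCase pattern appear in this PR's diff; COMET_EXEC_ENABLED and COMET_NATIVE_SCAN_ENABLED are assumed names for the flags that enable Comet operators and disable the native scan:

// Hedged sketch (not the exact code from this PR): keep Spark's Parquet scan
// but hand the resulting data to Comet operators for execution.
// COMET_EXEC_ENABLED and COMET_NATIVE_SCAN_ENABLED are assumed config names.
benchmark.addCase(s"$name$nameSuffix: Spark Scan + Comet Exec") { _ =>
  withSQLConf(
      CometConf.COMET_ENABLED.key -> "true", // turn Comet on
      CometConf.COMET_EXEC_ENABLED.key -> "true", // assumed: enable Comet operators
      CometConf.COMET_NATIVE_SCAN_ENABLED.key -> "false") { // assumed: fall back to Spark's scan
    cometSpark.sql(queryString).noop()
  }
}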

What changes are included in this PR?

  • Remove SPARK_GENERATE_BENCHMARK_FILES=1 from documentation because we have no such feature (there is no other reference to this env var in our codebase as far as I can tell)
  • Add new combination to existing benchmarks

How are these changes tested?

AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
q3: Spark Scan + Spark Exec                         159            171          14        182.8           5.5       1.0X
q3: Comet Scan + Spark Exec                         182            201          13        159.2           6.3       0.9X
q3: Comet Scan + Comet Exec                         216            240          17        134.2           7.5       0.7X
q3: Spark Scan + Comet Exec                         450            458           9         64.4          15.5       0.4X

@andygrove (Member, Author) commented:

@mbutrovich @parthchandra could you review?

@@ -28,7 +28,7 @@ import org.apache.comet.CometConf
 /**
  * Benchmark to measure Comet execution performance. To run this benchmark:
  * {{{
- * SPARK_GENERATE_BENCHMARK_FILES=1 make benchmark-org.apache.spark.sql.benchmark.CometAggregateBenchmark
+ * make benchmark-org.apache.spark.sql.benchmark.CometAggregateBenchmark
@mbutrovich (Contributor) commented Aug 22, 2024:

This looks like it was just an old config that wasn't doing anything anymore? Just wondering about the change.

@andygrove (Member, Author) replied:

Yes, that seems to be the case.

@parthchandra (Contributor) replied:

This env variable is used in Spark's BenchmarkBase, which is the parent of this benchmark (and others). Its effect is to write the output to a file under spark/benchmarks.
In Spark, for any PR that has a performance impact, the contributor can run the benchmarks in their GitHub environment and include the results as part of the PR. This lets reviewers verify the performance against previously run benchmarks. Since the benchmarks run on GitHub, everyone ends up running them on the same hardware and environment, so the results are actually comparable.
Note that this is also the reason why, once a benchmark case has been given a name, it should not be changed. The benchmark case has a name solely to allow this comparison.
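
As a rough illustration of the mechanism described above (a simplified sketch, not Spark's actual BenchmarkBase code, and the results file name here is made up), the env var essentially decides whether benchmark output goes only to stdout or also to a results file under the module's benchmarks directory:

// Simplified sketch in the spirit of Spark's BenchmarkBase env-var gate.
// The real implementation differs in details such as file naming.
import java.io.{File, FileOutputStream, OutputStream}

val output: Option[OutputStream] =
  if (sys.env.get("SPARK_GENERATE_BENCHMARK_FILES").contains("1")) {
    val dir = new File("benchmarks")
    if (!dir.exists()) dir.mkdirs()
    // e.g. spark/benchmarks/CometExecBenchmark-results.txt (illustrative name)
    Some(new FileOutputStream(new File(dir, "CometExecBenchmark-results.txt")))
  } else {
    None // without the env var, results are only printed to stdout
  }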

@parthchandra (Contributor) added:

At some point, like Spark, we will enable this for Comet so that updated benchmark results are included as part of each PR, and then we can track changes to the benchmark results over time.

@andygrove (Member, Author) replied:

Thanks @parthchandra. I had only searched within our repo trying to find where it was used. I have reverted the removal of these references.

      withSQLConf(CometConf.COMET_ENABLED.key -> "true") {
        cometSpark.sql(queryString).noop()
      }
    }
-   benchmark.addCase(s"$name$nameSuffix: Comet (Scan, Exec)") { _ =>
+   benchmark.addCase(s"$name$nameSuffix: Comet Scan + Comet Exec") { _ =>
@mbutrovich (Contributor) commented:

I'm not sure this is the PR to blow up the diff with a refactor, but we may want to consider standardizing this terminology ("Comet (Scan, Exec)" vs. "Comet Scan + Comet Exec") to make it easier to search the code base for test/benchmark cases or to do downstream processing of the benchmark output.
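
As a hypothetical example of the downstream processing mentioned above (not code that exists in this repo; the file path is illustrative), consistent case labels let a script pull one configuration's result rows out of a generated results file by exact string match, which is also why renaming a case breaks comparisons across commits:

// Hypothetical post-processing: print all result rows for one case label.
// The label must match the string passed to benchmark.addCase exactly.
import scala.io.Source

val caseLabel = "Comet Scan + Comet Exec"
val source = Source.fromFile("benchmarks/CometExecBenchmark-results.txt")
try {
  source.getLines().filter(_.contains(caseLabel)).foreach(println)
} finally {
  source.close()
}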

@andygrove (Member, Author) replied:

Sure, I can switch back to naming that is consistent with the existing tests for this PR.

@parthchandra (Contributor) replied:

As long as the name remains consistent across commits once we start recording the benchmark results, it shouldn't matter.
Here's an example from Spark btw.
But if it helps to standardize, sure.

@codecov-commenter commented:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 34.01%. Comparing base (9d8730d) to head (e9cf4ae).
Report is 7 commits behind head on main.

Additional details and impacted files
@@              Coverage Diff              @@
##               main     #863       +/-   ##
=============================================
- Coverage     55.16%   34.01%   -21.15%     
- Complexity      857      860        +3     
=============================================
  Files           109      112        +3     
  Lines         10542    42918    +32376     
  Branches       2010     9477     +7467     
=============================================
+ Hits           5815    14599     +8784     
- Misses         3714    25327    +21613     
- Partials       1013     2992     +1979     


@parthchandra (Contributor) left a review comment:

Other than reintroducing the env variable that was removed from the usage comments, LGTM.

@huaxingao (Contributor) left a review comment:

LGTM. Thanks for the PR @andygrove

@andygrove merged commit cff7697 into apache:main on Aug 23, 2024 (74 checks passed).
@andygrove deleted the convert-from-parquet-bench branch on August 23, 2024 at 16:28.
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request on Sep 7, 2024:
* Add benchmarks for Spark Scan + Comet Exec

* address feedback

* address feedback

* revert removing env var from usage examples

* fix

(cherry picked from commit cff7697)