perf: Add benchmarks for Spark Scan + Comet Exec #863
Conversation
@mbutrovich @parthchandra could you review?
@@ -28,7 +28,7 @@ import org.apache.comet.CometConf
 /**
  * Benchmark to measure Comet execution performance. To run this benchmark:
  * {{{
- * SPARK_GENERATE_BENCHMARK_FILES=1 make benchmark-org.apache.spark.sql.benchmark.CometAggregateBenchmark
+ * make benchmark-org.apache.spark.sql.benchmark.CometAggregateBenchmark
This looks like it was just an old config that wasn't doing anything anymore? Just wondering about the change.
Yes, that seems to be the case.
This env variable is used in Spark's BenchmarkBase, which is the parent of this (and other) benchmarks. Its effect is to write the benchmark output to a file under the spark/benchmarks directory.
In Spark, for any PR that has a performance impact, the contributor can run the benchmarks in their GitHub environment and include the results as part of the PR. This enables reviewers to verify the performance against previously run benchmarks. Since the benchmarks run on GitHub, all contributors run them on the same hardware and environment, so the results are actually comparable.
Note that this is also the reason why, once a benchmark case has been given a name, it should not be changed. The benchmark case has a name solely to allow this comparison.
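For reference, here is a simplified sketch of how that env var is handled, based on my reading of Spark's BenchmarkBase (an approximation, not the actual Spark source; names and details may differ):

```scala
import java.io.{File, FileOutputStream, OutputStream}

// Simplified sketch of org.apache.spark.benchmark.BenchmarkBase's file handling.
abstract class BenchmarkBaseSketch {
  // Where benchmark results are written; None means stdout.
  var output: Option[OutputStream] = None

  def runBenchmarkSuite(args: Array[String]): Unit

  def main(args: Array[String]): Unit = {
    // With SPARK_GENERATE_BENCHMARK_FILES=1, results go to
    // benchmarks/<BenchmarkClass>-results.txt so they can be committed with a PR.
    if (System.getenv("SPARK_GENERATE_BENCHMARK_FILES") == "1") {
      val resultFile =
        new File(s"benchmarks/${this.getClass.getSimpleName.stripSuffix("$")}-results.txt")
      Option(resultFile.getParentFile).foreach(_.mkdirs())
      output = Some(new FileOutputStream(resultFile))
    }
    runBenchmarkSuite(args)
    output.foreach(_.close())
  }
}
```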
At some point, like Spark, we will enable this for Comet so that updates to benchmark results are committed as part of the PR, and we can then track changes to the benchmark results over time.
Thanks @parthchandra. I had only searched within our repo trying to find where it was used. I have reverted removing these references.
      withSQLConf(CometConf.COMET_ENABLED.key -> "true") {
        cometSpark.sql(queryString).noop()
      }
    }
-   benchmark.addCase(s"$name$nameSuffix: Comet (Scan, Exec)") { _ =>
+   benchmark.addCase(s"$name$nameSuffix: Comet Scan + Comet Exec") { _ =>
I'm not sure this is the PR to blow up the diff with a refactor, but we may consider standardizing this terminology ("Comet (Scan, Exec)" vs. "Comet Scan + Comet Exec") for searching the code base for test/benchmark cases or for downstream processing of the benchmark output.
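For illustration only, a hypothetical way to centralize these labels so they stay greppable and consistent (these names are my assumptions, not code from this PR):

```scala
// Hypothetical helper: keep benchmark case labels in one place so test code,
// grep, and downstream processing of benchmark output all agree on the names.
object CometBenchmarkNames {
  val Spark = "Spark"
  val CometScan = "Comet Scan"
  val CometScanExec = "Comet Scan + Comet Exec"
  val SparkScanCometExec = "Spark Scan + Comet Exec"
}

// Usage in a benchmark:
// benchmark.addCase(s"$name$nameSuffix: ${CometBenchmarkNames.CometScanExec}") { _ => ... }
```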
Sure, I can switch back to naming that is consistent with the existing tests for this PR.
As long as the name remains consistent across commits once we start recording the benchmark results, it shouldn't matter.
Here's an example from Spark btw.
But if it helps to standardize, sure.
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@ Coverage Diff @@
##               main     #863       +/-   ##
=============================================
- Coverage     55.16%   34.01%   -21.15%
- Complexity      857      860        +3
=============================================
  Files           109      112        +3
  Lines         10542    42918    +32376
  Branches       2010     9477     +7467
=============================================
+ Hits           5815    14599     +8784
- Misses         3714    25327    +21613
- Partials       1013     2992     +1979

☔ View full report in Codecov by Sentry.
Other than reintroducing the env variable removed from the usage comments, lgtm
LGTM. Thanks for the PR @andygrove
* Add benchmarks for Spark Scan + Comet Exec
* address feedback
* address feedback
* revert removing env var from usage examples
* fix

(cherry picked from commit cff7697)
Which issue does this PR close?
Related to #798
Rationale for this change
Add benchmarks to show the performance of disabling native scan and converting Spark columns to Comet (currently performed by converting to rows first; we plan to implement a more efficient conversion in #798).
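As a rough sketch of what the new benchmark case exercises (assuming the standard CometConf flags; the exact config combination in the PR may differ), the case body would look something like:

```scala
// Sketch (assumption): measure Spark's native scan feeding Comet execution by
// disabling Comet's scan while keeping Comet exec enabled.
withSQLConf(
  CometConf.COMET_ENABLED.key -> "true",       // enable Comet integration
  CometConf.COMET_SCAN_ENABLED.key -> "false", // fall back to Spark's Parquet scan
  CometConf.COMET_EXEC_ENABLED.key -> "true"   // run supported operators in Comet
) {
  cometSpark.sql(queryString).noop()
}
```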
What changes are included in this PR?
Removes SPARK_GENERATE_BENCHMARK_FILES=1 from the documentation because we have no such feature (there is no other reference to this env var in our codebase as far as I can tell).
How are these changes tested?