
perf: Add benchmarks for Spark Scan + Comet Exec #863

Merged (5 commits into apache:main, Aug 23, 2024)

Conversation

@andygrove (Member) commented Aug 22, 2024

Which issue does this PR close?

Related to #798

Rationale for this change

Add benchmarks to show the performance of disabling native scan and converting Spark columns to Comet (currently performed by converting to rows first, but we plan to implement a more efficient conversion in #798).
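
For illustration, here is a minimal sketch of what the new "Spark Scan + Comet Exec" case might look like, modeled on the existing cases in this benchmark. Only CometConf.COMET_ENABLED, cometSpark, queryString, and the benchmark.addCase pattern appear in this PR's diff; COMET_EXEC_ENABLED and COMET_NATIVE_SCAN_ENABLED are assumed names for the flags that enable Comet operators and disable the native scan:

// Hedged sketch (not the exact code from this PR): keep Spark's Parquet scan
// but hand the resulting data to Comet operators for execution.
// COMET_EXEC_ENABLED and COMET_NATIVE_SCAN_ENABLED are assumed config names.
benchmark.addCase(s"$name$nameSuffix: Spark Scan + Comet Exec") { _ =>
  withSQLConf(
      CometConf.COMET_ENABLED.key -> "true", // turn Comet on
      CometConf.COMET_EXEC_ENABLED.key -> "true", // assumed: enable Comet operators
      CometConf.COMET_NATIVE_SCAN_ENABLED.key -> "false") { // assumed: fall back to Spark's scan
    cometSpark.sql(queryString).noop()
  }
}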

What changes are included in this PR?

  • Remove SPARK_GENERATE_BENCHMARK_FILES=1 from documentation because we have no such feature (there is no other reference to this env var in our codebase as far as I can tell)
  • Add new combination to existing benchmarks

How are these changes tested?

AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
q3: Spark Scan + Spark Exec                         159            171          14        182.8           5.5       1.0X
q3: Comet Scan + Spark Exec                         182            201          13        159.2           6.3       0.9X
q3: Comet Scan + Comet Exec                         216            240          17        134.2           7.5       0.7X
q3: Spark Scan + Comet Exec                         450            458           9         64.4          15.5       0.4X

@andygrove (Member, Author) commented:

@mbutrovich @parthchandra could you review?

@@ -28,7 +28,7 @@ import org.apache.comet.CometConf
 /**
  * Benchmark to measure Comet execution performance. To run this benchmark:
  * {{{
- * SPARK_GENERATE_BENCHMARK_FILES=1 make benchmark-org.apache.spark.sql.benchmark.CometAggregateBenchmark
+ * make benchmark-org.apache.spark.sql.benchmark.CometAggregateBenchmark
@mbutrovich (Contributor) commented Aug 22, 2024:

This looks like it was just an old config that wasn't doing anything anymore? Just wondering about the change.

@andygrove (Member, Author) replied:

Yes, that seems to be the case.

@parthchandra (Contributor) replied:

This env variable is used in Spark's BenchmarkBase, which is the parent of this benchmark (and others). Its effect is to write the output to a file under spark/benchmarks.
In Spark, for any PR that has a performance impact, the contributor can run the benchmarks in their GitHub environment and include the results as part of the PR. This lets reviewers verify the performance against previously run benchmarks. Since the benchmarks run on GitHub, everyone ends up running them on the same hardware and environment, so the results are actually comparable.
Note that this is also the reason why, once a benchmark case has been given a name, it should not be changed. The benchmark case has a name solely to allow this comparison.
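
As a rough illustration of the mechanism described above (a simplified sketch, not Spark's actual BenchmarkBase code, and the results file name here is made up), the env var essentially decides whether benchmark output goes only to stdout or also to a results file under the module's benchmarks directory:

// Simplified sketch in the spirit of Spark's BenchmarkBase env-var gate.
// The real implementation differs in details such as file naming.
import java.io.{File, FileOutputStream, OutputStream}

val output: Option[OutputStream] =
  if (sys.env.get("SPARK_GENERATE_BENCHMARK_FILES").contains("1")) {
    val dir = new File("benchmarks")
    if (!dir.exists()) dir.mkdirs()
    // e.g. spark/benchmarks/CometExecBenchmark-results.txt (illustrative name)
    Some(new FileOutputStream(new File(dir, "CometExecBenchmark-results.txt")))
  } else {
    None // without the env var, results are only printed to stdout
  }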

@parthchandra (Contributor) added:

At some point, like Spark, we will enable this for Comet so that updated benchmark results are included as part of each PR, and then we can track changes to the benchmark results over time.

@andygrove (Member, Author) replied:

Thanks @parthchandra. I had only searched within our repo trying to find where it was used. I have reverted the removal of these references.

      withSQLConf(CometConf.COMET_ENABLED.key -> "true") {
        cometSpark.sql(queryString).noop()
      }
    }
-   benchmark.addCase(s"$name$nameSuffix: Comet (Scan, Exec)") { _ =>
+   benchmark.addCase(s"$name$nameSuffix: Comet Scan + Comet Exec") { _ =>
@mbutrovich (Contributor) commented:

I'm not sure this is the PR to blow up the diff with a refactor, but we may want to consider standardizing this terminology ("Comet (Scan, Exec)" vs. "Comet Scan + Comet Exec") to make it easier to search the code base for test/benchmark cases or to do downstream processing of the benchmark output.
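
As a hypothetical example of the downstream processing mentioned above (not code that exists in this repo; the file path is illustrative), consistent case labels let a script pull one configuration's result rows out of a generated results file by exact string match, which is also why renaming a case breaks comparisons across commits:

// Hypothetical post-processing: print all result rows for one case label.
// The label must match the string passed to benchmark.addCase exactly.
import scala.io.Source

val caseLabel = "Comet Scan + Comet Exec"
val source = Source.fromFile("benchmarks/CometExecBenchmark-results.txt")
try {
  source.getLines().filter(_.contains(caseLabel)).foreach(println)
} finally {
  source.close()
}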

@andygrove (Member, Author) replied:

Sure, I can switch back to naming that is consistent with the existing tests for this PR.

@parthchandra (Contributor) replied:

As long as the name remains consistent across commits once we start recording the benchmark results, it shouldn't matter.
Here's an example from Spark btw.
But if it helps to standardize, sure.

@codecov-commenter commented:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 34.01%. Comparing base (9d8730d) to head (e9cf4ae).
Report is 7 commits behind head on main.

Additional details and impacted files
@@              Coverage Diff              @@
##               main     #863       +/-   ##
=============================================
- Coverage     55.16%   34.01%   -21.15%     
- Complexity      857      860        +3     
=============================================
  Files           109      112        +3     
  Lines         10542    42918    +32376     
  Branches       2010     9477     +7467     
=============================================
+ Hits           5815    14599     +8784     
- Misses         3714    25327    +21613     
- Partials       1013     2992     +1979     


@parthchandra (Contributor) left a review comment:

Other than reintroducing the env variable that was removed from the usage comments, LGTM.

@huaxingao (Contributor) left a review comment:

LGTM. Thanks for the PR @andygrove

@andygrove merged commit cff7697 into apache:main on Aug 23, 2024 (74 checks passed).
@andygrove deleted the convert-from-parquet-bench branch on August 23, 2024 at 16:28.
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request on Sep 7, 2024:
* Add benchmarks for Spark Scan + Comet Exec

* address feedback

* address feedback

* revert removing env var from usage examples

* fix

(cherry picked from commit cff7697)