[SPARK-49490][SQL] Add benchmarks for initCap #48501

mrk-andreev · 2024-10-16T16:59:48Z

What changes were proposed in this pull request?

Add benchmarks for all codepaths of initCap, namely, paths that call:

execBinaryICU
execBinary
execLowercase
execICU

Why are the changes needed?

Requested in Jira ticket SPARK-49490.

Does this PR introduce any user-facing change?

No

How was this patch tested?

The benchmark was tested locally by performing a manual run.

Was this patch authored or co-authored using generative AI tooling?

No

mrk-andreev · 2024-10-16T17:02:23Z

Results of a local run: InitCapBenchmark-local.txt

Sample

Running benchmark: InitCap evaluation [wc=1000, wl=16, capitalized=false]
  Running case: execICU
  Stopped after 8978 iterations, 2000 ms
  Running case: execBinaryICU
  Stopped after 6235 iterations, 2000 ms
  Running case: execBinary
  Stopped after 28374 iterations, 2000 ms
  Running case: execLowercase
  Stopped after 8839 iterations, 2000 ms

OpenJDK 64-Bit Server VM 17.0.2+8-86 on Linux 5.15.0-122-generic
Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
InitCap evaluation [wc=1000, wl=16, capitalized=false]:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------------------------------
execICU                                                             0              0           0     432768.3           0.0       1.0X
execBinaryICU                                                       0              0           0     285450.1           0.0       0.7X
execBinary                                                          0              0           0    1494256.8           0.0       3.5X
execLowercase                                                       0              0           0     415082.4           0.0       1.0X
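
The Relative column can be read as each case's rate divided by the rate of the first case. A minimal Scala sketch, assuming that definition and one-decimal rounding (the object and helper names are illustrative and not part of Spark), reproduces the values in the sample table:

```scala
import java.util.Locale

object RelativeColumn {
  // Rate(M/s) values copied from the sample table above.
  val rates: Seq[(String, Double)] = Seq(
    "execICU" -> 432768.3,
    "execBinaryICU" -> 285450.1,
    "execBinary" -> 1494256.8,
    "execLowercase" -> 415082.4
  )

  // Assumption: Relative = case rate / baseline rate, shown with one decimal.
  def relative(rate: Double, baseline: Double): String =
    "%.1fX".formatLocal(Locale.ROOT, rate / baseline)

  // The first case (execICU) serves as the baseline.
  val baseline: Double = rates.head._2

  val table: Seq[(String, String)] =
    rates.map { case (name, rate) => name -> relative(rate, baseline) }
}
```

With these inputs, `RelativeColumn.table` pairs the four cases with 1.0X, 0.7X, 3.5X and 1.0X, matching the table.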

Open questions

Should we place the benchmark code in the same package, 'unsafe,' or at the 'SQL level'? If it's in 'unsafe,' should we extract the shared code for benchmarks into a shared library?
The benchmark output expects each measurement to be at least 1 ms, but this isn't the case here. Should we align the rounding to the first non-zero digit after the decimal point?
How detailed do we expect the benchmarks to be? Do we want several axes of variation, or should we stick to a single set of default-like parameters?
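
To illustrate the rounding question above: the sample log reports 8978 iterations in 2000 ms for execICU, which is roughly 0.22 ms per iteration and therefore prints as 0 in a whole-millisecond column. The following is a minimal sketch of "round to the first non-zero digit"; `SubMillisecondFormat`, `avgMs` and `formatFirstNonZero` are hypothetical names, not existing Spark APIs.

```scala
import java.util.Locale

object SubMillisecondFormat {
  // Average time per iteration, derived from a
  // "Stopped after N iterations, T ms" log line.
  def avgMs(totalMs: Double, iterations: Long): Double = totalMs / iterations

  // Hypothetical formatter: values of 1 ms or more keep whole milliseconds;
  // smaller positive values get just enough decimals to reach the first
  // non-zero digit, capped at maxDecimals.
  def formatFirstNonZero(ms: Double, maxDecimals: Int = 6): String = {
    val decimals =
      if (ms <= 0.0 || ms >= 1.0) 0
      else math.min(maxDecimals, math.ceil(-math.log10(ms)).toInt)
    s"%.${decimals}f".formatLocal(Locale.ROOT, ms)
  }
}
```

For example, `formatFirstNonZero(avgMs(2000, 8978))` yields "0.2", and the execBinary figures (28374 iterations in 2000 ms) yield "0.07".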

HyukjinKwon · 2024-10-17T00:30:13Z

Can we include the benchmark result files too? See also "Testing with GitHub Actions workflow" at https://spark.apache.org/developer-tools.html

MaxGekk

Should we place the benchmark code in the same package, 'unsafe,' or at the 'SQL level'

Let's place the benchmark at the SQL level for now.

mrk-andreev · 2024-10-19T10:30:13Z

Let's place the benchmark at the SQL level for now.

Done

Can we include the benchmark result files too?

Done

MaxGekk

Could you generate benchmark results for JDK 21 too?

MaxGekk · 2024-10-20T17:03:40Z

sql/core/benchmarks/InitCapBenchmark-results.txt

Let's bump the number of iterations so that Best/Avg Time show non-zero values.

I adjusted the word count for my Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz but ran into issues with local evaluation, so I ran the benchmark remotely on an Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, where performance was noticeably lower.

MaxGekk · 2024-10-22T12:31:37Z

@uros-db @mihailom-db @viktorluc-db Could you review this PR, please?

uros-db

we already have benchmarks for collations

please see: CollationBenchmark

MaxGekk · 2024-10-26T21:00:10Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InitCapBenchmark.scala

Please fix the indentation here (see https://github.com/databricks/scala-style-guide?tab=readme-ov-file#spacing-and-indentation), or better, place the parameters on the same line.

MaxGekk · 2024-10-26T21:05:23Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InitCapBenchmark.scala

Could you benchmark more collations, see

spark/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala

Line 27 in 9909817

Seq("UTF8_BINARY", "UTF8_LCASE", "UNICODE", "UNICODE_CI")

Extended with

  for (collationName <- List("he_ISR", "UNICODE", "UNICODE_CI")) {
    val collationId = CollationFactory.collationNameToId(collationName)
    assert(CollationFactory.fetchCollation(collationId).collator != null)
    val caseName = s"execICU[collationName=${collationName}]"
    benchmark.addCase(caseName)(_ => InitCap.execICU(text, collationId))
  }

The primary requirement for collationId in InitCap.execICU is that CollationFactory.fetchCollation(collationId).collator must not be null; otherwise, the function will throw an NPE.

Updated

InitCapBenchmark-results.txt

InitCapBenchmark-jdk21-results.txt

MaxGekk · 2024-11-07T19:41:36Z

@mrk-andreev Could you integrate your benchmark into CollationBenchmark, please, as @uros-db pointed out in #48501 (review)? Otherwise we might forget to re-run your benchmark while benchmarking collation-related code.

mrk-andreev · 2024-11-12T21:24:04Z

@mrk-andreev Could you integrate your benchmark into CollationBenchmark, please, as @uros-db pointed out in #48501 (review)? Otherwise we might forget to re-run your benchmark while benchmarking collation-related code.

@MaxGekk , done.

MaxGekk

also cc @stevomitric who is working on the same benchmarks in #48804

MaxGekk · 2024-11-13T09:04:21Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala

Could you fix the indentation here?

Fixed. My bad

MaxGekk · 2024-11-13T15:49:26Z

sql/core/benchmarks/CollationBenchmark-jdk21-results.txt

Could you increase the number of iterations to get a non-zero Stdev and make the benchmark more reliable?

Sorry. Fixed.

I re-evaluated just initCap related benchmarks.

As we can see, this benchmark is sensitive to CPU clock speed. During my latest measurements on an Intel(R) Xeon(R) Platinum 8252C CPU @ 3.80GHz (AWS m5zn.xlarge), the stdev for some of these measurements, as well as for others, dropped to zero or one.

I suggest adding more decimal places to the results in a separate PR.

stevomitric · 2024-11-15T11:02:52Z

sql/core/benchmarks/CollationBenchmark-jdk21-results.txt

We recently merged a fix for these benchmarks here #48804, so this regression is outdated.

Could you please sync with master and re-run the benchmarks so that outdated results are not committed?

mrk-andreev · 2024-11-16T16:42:27Z

cc: @MaxGekk

Related work

This is not related to my code changes but rather to the benchmarks we are modifying. It might be worth starting a separate thread in the dev mailing list or creating an additional ticket in Jira, which I would be happy to handle.

Blackhole

I would like to point out that the current implementation of org.apache.spark.benchmark.Benchmark::addCase does not use any form of blackhole (such as Blackhole in JMH), which could lead to dead-code elimination. However, I have not observed this issue in the existing tests, most likely because the complexity and side effects of the benchmarked code prevent such elimination.

Would it be a good idea to consider adding this as a feature in the future?

Context

org.apache.spark.benchmark.Benchmark::addCase

  def addCase(name: String, numIters: Int = 0)(f: Int => Unit): Unit = {
    addTimerCase(name, numIters) { timer =>
      timer.startTiming()
      f(timer.iteration)
      timer.stopTiming()
    }
  }
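
As a rough illustration of the blackhole idea discussed above: `addCase` takes `f: Int => Unit`, so any value computed by a case is discarded. The sketch below is one simple way to keep results observable, assuming a volatile write is enough to stop the JIT from treating the computation as dead code. `BlackholeSketch`, `consume` and `timeCase` are hypothetical names; this is neither JMH's Blackhole nor an existing Spark API, and JMH's implementation is considerably more careful.

```scala
object BlackholeSketch {
  // Hypothetical sink: a write to a volatile field cannot be proven unused
  // by the JIT, so the value passed in must actually be computed.
  @volatile private var sink: Any = null

  def consume(value: Any): Unit = { sink = value }

  // Hypothetical timing loop: runs f for the given number of iterations,
  // feeds every result into the sink, and returns elapsed nanoseconds.
  def timeCase[T](iterations: Int)(f: Int => T): Long = {
    val start = System.nanoTime()
    var i = 0
    while (i < iterations) {
      consume(f(i))
      i += 1
    }
    System.nanoTime() - start
  }
}
```

A call such as `BlackholeSketch.timeCase(1000)(_ => "hello world".toUpperCase)` would time the conversion while keeping each result reachable. Passing primitives through `Any` boxes them, which adds overhead of its own, so a real implementation would likely need specialised overloads.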

Async-profiler

I suggest adding Async Profiler, a low-overhead sampling profiler, to all benchmark runs. This will help us identify the causes of performance degradation.

Would it also be worth considering adding this as a feature in the future?

mrk-andreev · 2024-11-19T21:14:38Z

Hi @MaxGekk, @stevomitric,

Does this PR need any additional changes? Are there any blockers we should address? Let me know how I can help to move it forward!

MaxGekk

LGTM except for a few minor comments.

MaxGekk · 2024-11-20T10:38:56Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala

nit: the enclosing braces are redundant:

Suggested change

s"collation unit benchmarks - initCap using impl ${implName}",

s"collation unit benchmarks - initCap using $implName",

MaxGekk · 2024-11-20T10:40:58Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala

nit:

Suggested change

benchmark.addCase(s"$collationType") { _ =>

benchmark.addCase(collationType) { _ =>
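
The two nits above rely on standard Scala string interpolation behaviour, which this small sketch checks. The values assigned to `implName` and `collationType` are placeholders chosen for illustration, and the sketch keeps the word "impl" in both strings so that only the braces differ.

```scala
object InterpolationNits {
  // Placeholder values, assumed for illustration only.
  val implName: String = "execICU"
  val collationType: String = "UNICODE_CI"

  // Braces around a plain identifier are optional: both forms give the same string.
  val withBraces: String = s"collation unit benchmarks - initCap using impl ${implName}"
  val withoutBraces: String = s"collation unit benchmarks - initCap using impl $implName"

  // Interpolating a String on its own just reproduces that String.
  val wrapped: String = s"$collationType"

  // Braces are still needed when identifier characters follow directly,
  // otherwise the interpolator would look for a name like implNameCase.
  val needsBraces: String = s"${implName}Case"
}
```

Here `withBraces == withoutBraces` and `wrapped == collationType` both hold, which is why the shorter forms were suggested.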

MaxGekk · 2024-11-20T10:43:03Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala

It is a collation id, and types should begin with an uppercase letter.

Thank you. Fixed

MaxGekk · 2024-11-21T08:13:45Z

+1, LGTM. Merging to master.
Thank you, @mrk-andreev, and @stevomitric @uros-db for the review.

github-actions bot added the SQL label Oct 16, 2024

MaxGekk reviewed Oct 18, 2024

mrk-andreev force-pushed the SPARK-49490 branch from ab71500 to b0e0cf0 Compare October 19, 2024 10:28

MaxGekk reviewed Oct 20, 2024

mrk-andreev force-pushed the SPARK-49490 branch from b0e0cf0 to 15fbcb5 Compare October 21, 2024 16:20

uros-db suggested changes Oct 22, 2024

MaxGekk requested changes Oct 26, 2024

mrk-andreev force-pushed the SPARK-49490 branch 2 times, most recently from 31632c9 to 6b1d79e Compare November 3, 2024 16:44

mrk-andreev force-pushed the SPARK-49490 branch 2 times, most recently from 5bf2fba to 6c336c2 Compare November 12, 2024 21:20

MaxGekk reviewed Nov 13, 2024

mrk-andreev force-pushed the SPARK-49490 branch from 6c336c2 to c755b68 Compare November 14, 2024 23:37

stevomitric reviewed Nov 15, 2024

mrk-andreev force-pushed the SPARK-49490 branch from c755b68 to f98d97d Compare November 15, 2024 20:04

MaxGekk reviewed Nov 20, 2024

stevomitric approved these changes Nov 20, 2024

[SPARK-49490][SQL] Add benchmarks for initCap

d0702d1

mrk-andreev force-pushed the SPARK-49490 branch from f98d97d to d0702d1 Compare November 20, 2024 21:50

MaxGekk approved these changes Nov 21, 2024

MaxGekk closed this in 95faa02 Nov 21, 2024
