-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-49490][SQL] Add benchmarks for initCap #48501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Results of local run InitCapBenchmark-local.txt SampleOpen questions
|
|
Can we include the benchmark result files too? See also "Testing with GitHub Actions workflow" at https://spark.apache.org/developer-tools.html |
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we place the benchmark code in the same package, 'unsafe,' or at the 'SQL level'
Let's place the backmark at the SQL level so far.
ab71500 to
b0e0cf0
Compare
Done
Done |
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you generate benchmark results for jdk 21 too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's bump number of iterations to see seconds in Best/Avg Time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adjusted the word count for my Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, but encountered issues with local evaluation. This led to a remote evaluation on an Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, where the performance was noticeably less impressive.
b0e0cf0 to
15fbcb5
Compare
|
@uros-db @mihailom-db @viktorluc-db Could you review this PR, please. |
uros-db
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have benchmarks for collations
please see: CollationBenchmark
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please, fix indentation here, see https://github.com/databricks/scala-style-guide?tab=readme-ov-file#spacing-and-indentation or it is better to place the parameters on the same line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you benchmark more collations, see
spark/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala
Line 27 in 9909817
| Seq("UTF8_BINARY", "UTF8_LCASE", "UNICODE", "UNICODE_CI") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extended with
for (collationName <- List("he_ISR", "UNICODE", "UNICODE_CI")) {
val collationId = CollationFactory.collationNameToId(collationName)
assert(CollationFactory.fetchCollation(collationId).collator != null)
val caseName = s"execICU[collationName=${collationName}]"
benchmark.addCase(caseName)(_ => InitCap.execICU(text, collationId))
}The primary requirement for collationId in InitCap.execICU is that CollationFactory.fetchCollation(collationId).collator must not be null; otherwise, the function will throw an NPE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
InitCapBenchmark-results.txtInitCapBenchmark-jdk21-results.txt
31632c9 to
6b1d79e
Compare
|
@mrk-andreev Could you intergrate your benchmark into |
5bf2fba to
6c336c2
Compare
@MaxGekk , done. |
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also cc @stevomitric who is working on the same benchmarks in #48804
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you fix indentations here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. My bad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you increase the number of iterations to have non-zero StdDev, and make the benchmark more reliable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. Fixed.
I re-evaluated just initCap related benchmarks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we can see, this benchmark is sensitive to CPU clock speed. During my latest measurements on an Intel(R) Xeon(R) Platinum 8252C CPU @ 3.80GHz (AWS m5zn.xlarge), the stdev for some measurements - along with others - dropped to zero or one.
I suggest adding more decimal places to the results in a separate PR.
6c336c2 to
c755b68
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We recently merged a fix for these benchmarks here #48804, so this regression is outdated.
Could you please sync with master and re-run the benchmarks to not commit the outdated results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
c755b68 to
f98d97d
Compare
|
cc: @MaxGekk Related workThis is not related to my code changes but rather to the benchmarks we are modifying. It might be worth starting a separate thread in the dev mailing list or creating an additional ticket in Jira, which I would be happy to handle. BlackholeI would like to point out that the current implementation of org.apache.spark.benchmark.Benchmark::addCase does not use any form of Blackhole (Blackhole in JMH), which could lead to dead-code elimination. However, I have not observed this issue in the existing tests. This is likely due to the complexity and side effects of the code being benchmarked, which prevents such elimination. Would it be a good idea to consider adding this as a feature in the future? Context
Async-profilerI suggest adding Async Profiler, a low-overhead sampling profiler, to all benchmark runs. This will help us identify the causes of performance degradation. Would it also be worth considering adding this as a feature in the future? |
|
Hi @MaxGekk, @stevomitric, Does this PR need any additional changes? Are there any blockers we should address? Let me know how I can help to move it forward! |
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except of a few minor comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the enclosing braces are redundant:
| s"collation unit benchmarks - initCap using impl ${implName}", | |
| s"collation unit benchmarks - initCap using $implName", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
| benchmark.addCase(s"$collationType") { _ => | |
| benchmark.addCase(collationType) { _ => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a collation id, and types should begin from an upper case letter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. Fixed
f98d97d to
d0702d1
Compare
|
+1, LGTM. Merging to master. |
What changes were proposed in this pull request?
Add benchmarks for all codepaths of initCap, namely, paths that call:
Why are the changes needed?
Requested by jira ticket SPARK-49490.
Does this PR introduce any user-facing change?
No
How was this patch tested?
The benchmark was tested locally by performing a manual run.
Was this patch authored or co-authored using generative AI tooling?
No