@dongjoon-hyun (Member) commented Aug 13, 2025

What changes were proposed in this pull request?

This PR aims to regenerate benchmark results.

Why are the changes needed?

We have 3 goals.

  1. Bring the results up to date with the code, because the previous results were generated 7 months ago on 2025-01-30.
  2. Add the missing Java 21 result file, RecursiveCTEBenchmark-jdk21-results.txt.
  3. Replace the incorrectly generated results with results generated by GitHub Actions.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual review.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun
Member Author

Thank you, @HyukjinKwon.

Compression 10000 times at level 1 with buffer pool 580 581 0 0.0 58038.0 1.1X
Compression 10000 times at level 2 with buffer pool 612 615 3 0.0 61246.1 1.1X
Compression 10000 times at level 3 with buffer pool 721 734 11 0.0 72106.4 0.9X
Compression 10000 times at level 1 without buffer pool 265 267 1 0.0 26513.9 1.0X

Member Author

This becomes much faster.

Compression 4 times at level 1 with buffer pool 2568 2571 4 0.0 642087500.8 1.0X
Compression 4 times at level 2 with buffer pool 4211 4212 1 0.0 1052833529.0 0.6X
Compression 4 times at level 3 with buffer pool 6290 6291 2 0.0 1572505716.0 0.4X
Compression 4 times at level 1 without buffer pool 2764 2764 1 0.0 690899114.5 1.0X

Member Author

Although it's a known issue, this is still slower than Java 17.

arrayOfAnyAsObject 6 6 0 1611.8 0.6 1.0X
arrayOfAnyAsSeq 174 175 1 57.5 17.4 0.0X
arrayOfInt 393 395 1 25.4 39.3 0.0X
arrayOfIntAsObject 419 419 1 23.9 41.9 0.0X

Member Author

The relative order of arrayOfInt and arrayOfIntAsObject is reversed.

Common Codecs 4821 4894 64 0.2 4820.6 1.0X
Java 2565 2572 10 0.4 2564.8 1.9X
Spark 3811 3812 1 0.3 3810.7 1.3X
Spark Binary 2758 2759 1 0.4 2757.9 1.7X

Member Author

The relative order of Java and Spark Binary is reversed. Java is now the faster of the two.
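
For anyone who wants to sanity-check observations like this one, here is a minimal Python sketch that parses result rows and recomputes the last column. It assumes the six trailing columns are best time (ms), average time (ms), standard deviation (ms), rate (M/s), per-row time (ns), and Relative, and that Relative is the first row's time divided by each row's time. The names `BenchmarkRow`, `parse_row`, and `recompute_relative` are hypothetical helpers, not part of Spark.

```python
from typing import NamedTuple


class BenchmarkRow(NamedTuple):
    name: str
    best_ms: float
    avg_ms: float
    stdev_ms: float
    rate_m_per_s: float
    per_row_ns: float
    relative: float


def parse_row(line: str) -> BenchmarkRow:
    # Case names may contain spaces and numbers, so split from the right:
    # the last six whitespace-separated tokens are taken as the numeric columns.
    name, *cols = line.strip().rsplit(None, 6)
    best, avg, stdev, rate, per_row, rel = cols
    return BenchmarkRow(
        name, float(best), float(avg), float(stdev),
        float(rate), float(per_row), float(rel.rstrip("X")),
    )


def recompute_relative(rows):
    # Assumption: Relative = baseline time / row time, with the first row as baseline.
    base = rows[0].per_row_ns
    return {r.name: round(base / r.per_row_ns, 1) for r in rows}


# The four rows quoted in the hunk above.
LINES = [
    "Common Codecs 4821 4894 64 0.2 4820.6 1.0X",
    "Java 2565 2572 10 0.4 2564.8 1.9X",
    "Spark 3811 3812 1 0.3 3810.7 1.3X",
    "Spark Binary 2758 2759 1 0.4 2757.9 1.7X",
]

if __name__ == "__main__":
    rows = [parse_row(line) for line in LINES]
    for name, rel in recompute_relative(rows).items():
        print(f"{name}: {rel}X")
```

Under these assumptions, the recomputed values agree with the Relative column of the quoted rows, and the per-row times show Java ahead of Spark Binary.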

Without bloom filter, blocksize: 16777216 790 794 5 126.6 7.9 1.0X
With bloom filter, blocksize: 16777216 792 798 9 126.3 7.9 1.0X
Without bloom filter, blocksize: 16777216 827 835 10 120.9 8.3 1.0X
With bloom filter, blocksize: 16777216 536 542 5 186.5 5.4 1.5X

Member Author

This becomes faster and is consistent with the Java 21 result.

Without bloom filter, blocksize: 33554432 427 430 3 234.2 4.3 1.0X
With bloom filter, blocksize: 33554432 508 520 12 196.9 5.1 0.8X
Without bloom filter, blocksize: 33554432 507 520 10 197.1 5.1 1.0X
With bloom filter, blocksize: 33554432 444 465 32 225.3 4.4 1.1X

Member Author

This becomes faster and is now the expected result.

UTF-16 52085 52137 74 0.2 5208.5 0.6X
UTF-8 30150 30156 9 0.3 3015.0 1.1X
UTF-32 56295 56403 153 0.2 5629.5 1.0X
UTF-16 50644 50653 13 0.2 5064.4 1.1X

Member Author

UTF-32 suddenly becomes the slowest of all. Previously, UTF-16 was the slowest in both Java 17 and 21.

columnar deserialization + columnar-to-row 179 220 36 5.6 178.7 1.0X
row-based deserialization 171 219 70 5.9 170.5 1.0X
columnar deserialization + columnar-to-row 222 257 41 4.5 222.3 1.0X
row-based deserialization 140 178 63 7.2 139.8 1.6X

Member Author

Row-based deserialization becomes much faster in relative terms.

@dongjoon-hyun
Member Author

I'm going to merge this as a new baseline for further work on the above observations.

@dongjoon-hyun deleted the SPARK-53266 branch August 13, 2025 05:44