@dongjoon-hyun (Member) commented Feb 17, 2023

What changes were proposed in this pull request?

This PR regenerates the benchmark results on the master branch to serve as a baseline for Spark 3.5.0 and as a point of comparison with the Apache Spark 3.4.0 branch.

Why are the changes needed?

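These are reference values with minor changes. The JDK patch versions and the Linux kernel version of the benchmark environment changed as follows: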

- OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+ OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1031-azure

- OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
+ OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure

- OpenJDK 64-Bit Server VM 17.0.5+8 on Linux 5.15.0-1023-azure
+ OpenJDK 64-Bit Server VM 17.0.6+10 on Linux 5.15.0-1031-azure

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual review.

Compression 10000 times at level 1 without buffer pool    605    812    220    0.0    60521.0    1.0X
Compression 10000 times at level 2 without buffer pool    665    678     20    0.0    66512.5    0.9X
Compression 10000 times at level 3 without buffer pool    890    903     20    0.0    88961.3    0.7X
Compression 10000 times at level 1 with buffer pool       829    839     11    0.0    82940.2    0.7X
Member Author

I'll take a look at this after this PR.

Member Author

Java 8 and Java 17 don't have this regression.

Use HashSet    4    4    0     226.9    4.4    1.0X
Use EnumSet    1    1    0     737.3    1.4    3.2X
Use HashSet    0    1    0    2440.2    0.4    1.0X
Use EnumSet    1    1    0     884.8    1.1    0.4X
Member Author

We need to investigate this reversed ratio.
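
For reference, the Relative column appears to be the baseline (first) row's time divided by the current row's time, so a value below 1.0X means the case is slower than its baseline. A minimal sketch of that arithmetic, using the Per Row(ns) values from the HashSet/EnumSet rows above. The function name `relative` is hypothetical, and the formula is an assumption about how the column is derived, not a copy of Spark's benchmark code. Because the table values are rounded, the results only approximate the printed ratios.

```python
def relative(baseline_ns: float, case_ns: float) -> float:
    """Speed of a case relative to its baseline row (assumed: baseline / case)."""
    return baseline_ns / case_ns


# Per Row(ns) values copied from the HashSet/EnumSet rows above.
first_pair = relative(4.4, 1.4)   # EnumSet faster than HashSet: ratio above 1
second_pair = relative(0.4, 1.1)  # EnumSet slower than HashSet: ratio below 1

print(f"{first_pair:.1f}X, {second_pair:.1f}X")
```

The second pair is the "reversed" one: HashSet's per-row time dropped sharply while EnumSet's stayed near 1 ns, which pushes the ratio below 1.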

Member Author (@dongjoon-hyun), Feb 18, 2023

HashSet seems to have improved in this case, "contains use empty Set". The other cases look to be in a reasonable range.

Use HashSet    5    5    0     209.4    4.8    1.0X
Use EnumSet    2    2    0     459.8    2.2    2.2X
Use HashSet    1    1    1    1972.0    0.5    1.0X
Use EnumSet    2    2    0     444.0    2.3    0.2X
Member Author

ditto.

interpreted version         4933    4935     2    108.8    9.2    1.0X
codegen version             5135    5141     9    104.6    9.6    1.0X
codegen version 64-bit      5071    5079    10    105.9    9.4    1.0X
codegen HiveHash version    4326    4326     0    124.1    8.1    1.1X
Member Author

Now, this is the fastest.

To non-nullable StructTypes using performant method    5520    5639     168    0.0    Infinity    1.0X
To nullable StructTypes using performant method        2657    2708      72    0.0    Infinity    2.1X
To non-nullable StructTypes using performant method    3126    3150      34    0.0    Infinity    1.0X
To nullable StructTypes using performant method        3136    4768    2309    0.0    Infinity    1.0X
Member Author (@dongjoon-hyun), Feb 18, 2023

This looks like a regression in Java 8. We need to take a look at this later.

Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
TPCDS Snappy:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
q3                        718            759          41         4.1         241.8       1.0X
q3                        996           1035          55         3.0         335.3       1.0X
Member Author

Maybe, slower?
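
To put a number on the question above, here is a small sketch that computes the relative change in Best Time(ms) between the two q3 rows. It assumes the first row (718 ms) is the previous result and the second (996 ms) is the regenerated one; the diff markers are not preserved in this view, so that ordering is an assumption. `percent_change` is a hypothetical helper, not part of Spark.

```python
def percent_change(old_ms: float, new_ms: float) -> float:
    """Relative change from old to new, in percent (positive means slower)."""
    return (new_ms - old_ms) / old_ms * 100.0


# Best Time(ms) for q3, copied from the two rows above.
# Assumption: 718 is the earlier run, 996 is the regenerated run.
q3_change = percent_change(718, 996)

print(f"q3 best time change: {q3_change:+.1f}%")
```

A single run on shared CI hardware is noisy, so a figure like this is a prompt for further investigation rather than proof of a regression.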

radix sort one byte             197     197     0    127.0     7.9    61.5X
radix sort two bytes            371     372     0     67.4    14.8    32.6X
radix sort eight bytes         1391    1397     8     18.0    55.7     8.7X
radix sort key prefix array    1914    1951    52     13.1    76.6     6.3X
Member Author

In this benchmark, all Java 17 results are faster than Java 8.
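
Comparing results across JDKs like this is easier once each result row is split into fields. A minimal sketch of a row parser, assuming the column order shown in the TPCDS header earlier (Best Time, Avg Time, Stdev, Rate, Per Row, Relative). `BenchmarkRow` and `parse_row` are hypothetical names introduced for illustration; they are not part of Spark.

```python
from typing import NamedTuple


class BenchmarkRow(NamedTuple):
    name: str
    best_ms: float
    avg_ms: float
    stdev_ms: float
    rate_m_per_s: float
    per_row_ns: float
    relative: float


def parse_row(line: str) -> BenchmarkRow:
    # The case name may contain spaces and digits, so split from the right:
    # the last six whitespace-separated fields are the numeric columns.
    name, *cols = line.rsplit(None, 6)
    best, avg, stdev, rate, per_row, rel = cols
    return BenchmarkRow(
        name,
        float(best),
        float(avg),
        float(stdev),
        float(rate),
        float(per_row),          # float() also accepts "Infinity"
        float(rel.rstrip("X")),  # "61.5X" becomes 61.5
    )


row = parse_row("radix sort one byte 197 197 0 127.0 7.9 61.5X")
print(row.name, row.best_ms, row.relative)
```

With rows from two result files parsed this way, a Java 8 versus Java 17 comparison reduces to matching rows by `name` and comparing `best_ms`.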

SQL ORC MR 1654 1661 9 9.5 105.2 6.3X

OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
SQL CSV 13143 13363 311 1.2 835.6 1.0X
Member Author

CSV seems to have become 30% slower.

Member

Hmm, it's significant.

@dongjoon-hyun
Member Author

When you have some time, could you review this, @viirya? I want to merge this so that we can proceed with further investigation.

@dongjoon-hyun
Member Author

Thank you so much, as always, for your help, @viirya!
Merged to master for Apache Spark 3.5.

@dongjoon-hyun deleted the SPARK-42483 branch February 18, 2023 06:52