
Conversation

@dongjoon-hyun (Member) commented Jan 10, 2024

What changes were proposed in this pull request?

This PR aims to use zstd as the default ORC compression.

Note that Apache ORC v2.0 also uses zstd as the default compression via ORC-1577.

The following presentation covers the usage of ZStandard:

  • The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro

Why are the changes needed?

In general, ZStandard produces smaller files.

$ aws s3 ls s3://dongjoon/orc2/tpcds-sf-10-orc-snappy/ --recursive --summarize --human-readable | tail -n1
   Total Size: 2.8 GiB

$ aws s3 ls s3://dongjoon/orc2/tpcds-sf-10-orc-zstd/ --recursive --summarize --human-readable | tail -n1
   Total Size: 2.4 GiB
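As a hedged illustration of how such datasets are produced (the bucket path and `df` are placeholders, not the actual benchmark setup), the same data can be written once per codec and the resulting sizes compared:

```scala
// Sketch only: assumes an existing SparkSession `spark` and a DataFrame `df`.
// The s3a path below is hypothetical.
df.write.option("compression", "snappy").orc("s3a://my-bucket/tpcds-orc-snappy/")
df.write.option("compression", "zstd").orc("s3a://my-bucket/tpcds-orc-zstd/")
```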

As a result, performance on cloud storage is also generally better.

$ JDK_JAVA_OPTIONS='-Dspark.sql.sources.default=orc' \
build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location s3a://dongjoon/orc2/tpcds-sf-1-orc-snappy"
...
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q1
[info]   Stopped after 2 iterations, 5712 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q1                                                 2708           2856         210          0.2        5869.3       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q2
[info]   Stopped after 2 iterations, 7006 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q2                                                 3424           3503         113          0.7        1533.9       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q3
[info]   Stopped after 2 iterations, 6577 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q3                                                 3146           3289         202          0.9        1059.0       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q4
[info]   Stopped after 2 iterations, 36228 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q4                                                17592          18114         738          0.3        3375.5       1.0X
...
$ JDK_JAVA_OPTIONS='-Dspark.sql.sources.default=orc' \
build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location s3a://dongjoon/orc2/tpcds-sf-1-orc-zstd"
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q1
[info]   Stopped after 2 iterations, 5235 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q1                                                 2496           2618         172          0.2        5409.7       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q2
[info]   Stopped after 2 iterations, 6765 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q2                                                 3338           3383          63          0.7        1495.6       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q3
[info]   Stopped after 2 iterations, 5882 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q3                                                 2820           2941         172          1.1         949.1       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q4
[info]   Stopped after 2 iterations, 32925 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q4                                                16315          16463         208          0.3        3130.5       1.0X
...

Does this PR introduce any user-facing change?

Yes, the default ORC compression codec is changed from `snappy` to `zstd`.
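Users who want to keep the previous behavior can pin the codec explicitly. A minimal sketch, assuming an existing `spark` session and a DataFrame `df`:

```scala
// Restore the previous default session-wide.
spark.conf.set("spark.sql.orc.compression.codec", "snappy")

// Or pin it per write, which takes precedence over the session config.
df.write.option("compression", "snappy").orc("/path/to/output")
```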

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-46648][SQL] Use zstd as the default value of spark.sql.orc.compression.codec [SPARK-46648][SQL] Use zstd as the default ORC compression Jan 10, 2024
@dongjoon-hyun (Member Author)

Thank you, @HyukjinKwon . The PR description is updated.

@yaooqinn (Member)

Running benchmark: TPCDS Snappy

The benchmark's name needs to be updated.

@dongjoon-hyun (Member Author) commented Jan 10, 2024

Yes, it's a hard-coded one.

val benchmark = new Benchmark(s"TPCDS Snappy", numRows, 2, output = output)

Since it's orthogonal to this configuration change PR, I'll handle it in a new TESTS JIRA, @yaooqinn .
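One possible way to make the label follow the configured codec (a hypothetical sketch, not the actual follow-up fix) would be to read it from the session config:

```scala
// Hypothetical sketch: derive the benchmark label from the session config
// instead of the hard-coded "Snappy".
val codec = spark.conf.get("spark.sql.orc.compression.codec")
val benchmark = new Benchmark(s"TPCDS $codec", numRows, 2, output = output)
```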

@yaooqinn (Member) left a comment

Thank you @dongjoon-hyun, LGTM.

@dongjoon-hyun (Member Author)

Thank you, @yaooqinn . Here is the PR to address your comment.

@beliefer (Contributor) left a comment

Late LGTM.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-46648 branch January 10, 2024 07:10
dongjoon-hyun added a commit that referenced this pull request Jan 18, 2024
…benchmarks

### What changes were proposed in this pull request?

This PR aims to use the default ORC compression in data source benchmarks.

### Why are the changes needed?

Apache ORC 2.0 and Apache Spark 4.0 will use ZStandard as the default ORC compression codec.
- apache/orc#1733
- #44654

`OrcReadBenchmark` was switched to use ZStandard for comparison.
- #44761

And, this PR aims to change the remaining three data source benchmarks.
```
$ git grep OrcCompressionCodec | grep Benchmark
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:      OrcCompressionCodec.SNAPPY.lowerCaseName())
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:      OrcCompressionCodec.SNAPPY.lowerCaseName()).orc(dir)
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:      .setIfMissing("orc.compression", OrcCompressionCodec.SNAPPY.lowerCaseName())
```
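The change in each benchmark amounts to dropping the explicit Snappy option so the session default applies. Roughly, as a sketch (using the write path shown in the grep output as an example):

```scala
// Before: codec pinned to Snappy in the benchmark setup.
df.write.option("compression", OrcCompressionCodec.SNAPPY.lowerCaseName()).orc(dir)

// After: no explicit option, so the spark.sql.orc.compression.codec
// default (zstd as of Spark 4.0) is used.
df.write.orc(dir)
```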

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44777 from dongjoon-hyun/SPARK-46752.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
…benchmarks

This PR aims to use the default ORC compression in data source benchmarks.

Apache ORC 2.0 and Apache Spark 4.0 will use ZStandard as the default ORC compression codec.
- apache/orc#1733
- apache#44654

`OrcReadBenchmark` was switched to use ZStandard for comparision.
- apache#44761

And, this PR aims to change the remaining three data source benchmarks.
```
$ git grep OrcCompressionCodec | grep Benchmark
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:      OrcCompressionCodec.SNAPPY.lowerCaseName())
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:      OrcCompressionCodec.SNAPPY.lowerCaseName()).orc(dir)
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:      .setIfMissing("orc.compression", OrcCompressionCodec.SNAPPY.lowerCaseName())
```

No.

Manual review.

No.

Closes apache#44777 from dongjoon-hyun/SPARK-46752.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>