
Conversation

@dongjoon-hyun (Member) commented Jan 10, 2024

What changes were proposed in this pull request?

This PR aims to use zstd as the default ORC compression.

Note that Apache ORC v2.0 also uses zstd as the default compression via ORC-1577.

The following presentation covers the usage of ZStandard:

  • The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro

Why are the changes needed?

In general, ZStandard produces smaller files.

$ aws s3 ls s3://dongjoon/orc2/tpcds-sf-10-orc-snappy/ --recursive --summarize --human-readable | tail -n1
   Total Size: 2.8 GiB

$ aws s3 ls s3://dongjoon/orc2/tpcds-sf-10-orc-zstd/ --recursive --summarize --human-readable | tail -n1
   Total Size: 2.4 GiB
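As a hedged illustration of how such datasets are produced (the bucket path and `df` are placeholders, not the actual benchmark setup), the same data can be written once per codec and the resulting sizes compared:

```scala
// Sketch only: assumes an existing SparkSession `spark` and a DataFrame `df`.
// The s3a path below is hypothetical.
df.write.option("compression", "snappy").orc("s3a://my-bucket/tpcds-orc-snappy/")
df.write.option("compression", "zstd").orc("s3a://my-bucket/tpcds-orc-zstd/")
```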

As a result, performance on cloud storage is also generally better.

$ JDK_JAVA_OPTIONS='-Dspark.sql.sources.default=orc' \
build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location s3a://dongjoon/orc2/tpcds-sf-1-orc-snappy"
...
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q1
[info]   Stopped after 2 iterations, 5712 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q1                                                 2708           2856         210          0.2        5869.3       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q2
[info]   Stopped after 2 iterations, 7006 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q2                                                 3424           3503         113          0.7        1533.9       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q3
[info]   Stopped after 2 iterations, 6577 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q3                                                 3146           3289         202          0.9        1059.0       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q4
[info]   Stopped after 2 iterations, 36228 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q4                                                17592          18114         738          0.3        3375.5       1.0X
...
$ JDK_JAVA_OPTIONS='-Dspark.sql.sources.default=orc' \
build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location s3a://dongjoon/orc2/tpcds-sf-1-orc-zstd"
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q1
[info]   Stopped after 2 iterations, 5235 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q1                                                 2496           2618         172          0.2        5409.7       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q2
[info]   Stopped after 2 iterations, 6765 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q2                                                 3338           3383          63          0.7        1495.6       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q3
[info]   Stopped after 2 iterations, 5882 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q3                                                 2820           2941         172          1.1         949.1       1.0X
[info] Running benchmark: TPCDS Snappy
[info]   Running case: q4
[info]   Stopped after 2 iterations, 32925 ms
[info] OpenJDK 64-Bit Server VM 17.0.9+9-LTS on Mac OS X 14.3
[info] Apple M1 Max
[info] TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q4                                                16315          16463         208          0.3        3130.5       1.0X
...

Does this PR introduce any user-facing change?

Yes, the default ORC compression codec is changed from `snappy` to `zstd`.
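Users who want to keep the previous behavior can pin the codec explicitly. A minimal sketch, assuming an existing `spark` session and a DataFrame `df`:

```scala
// Restore the previous default session-wide.
spark.conf.set("spark.sql.orc.compression.codec", "snappy")

// Or pin it per write, which takes precedence over the session config.
df.write.option("compression", "snappy").orc("/path/to/output")
```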

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-46648][SQL] Use zstd as the default value of spark.sql.orc.compression.codec [SPARK-46648][SQL] Use zstd as the default ORC compression Jan 10, 2024
@dongjoon-hyun (Member Author)

Thank you, @HyukjinKwon . The PR description is updated.

@yaooqinn (Member)

Running benchmark: TPCDS Snappy

The benchmark's name needs to be updated.

@dongjoon-hyun (Member Author) commented Jan 10, 2024

Yes, it's a hard-coded one.

val benchmark = new Benchmark(s"TPCDS Snappy", numRows, 2, output = output)

Since it's orthogonal to this configuration change PR, I'll handle it in a new TESTS JIRA, @yaooqinn .
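One possible way to make the label follow the configured codec (a hypothetical sketch, not the actual follow-up fix) would be to read it from the session config:

```scala
// Hypothetical sketch: derive the benchmark label from the session config
// instead of the hard-coded "Snappy".
val codec = spark.conf.get("spark.sql.orc.compression.codec")
val benchmark = new Benchmark(s"TPCDS $codec", numRows, 2, output = output)
```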

@yaooqinn (Member) left a comment

Thank you @dongjoon-hyun, LGTM.

@dongjoon-hyun (Member Author)

Thank you, @yaooqinn . Here is the PR to address your comment.

@beliefer (Contributor) left a comment

Late LGTM.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-46648 branch January 10, 2024 07:10
dongjoon-hyun added a commit that referenced this pull request Jan 18, 2024
…benchmarks

### What changes were proposed in this pull request?

This PR aims to use the default ORC compression in data source benchmarks.

### Why are the changes needed?

Apache ORC 2.0 and Apache Spark 4.0 will use ZStandard as the default ORC compression codec.
- apache/orc#1733
- #44654

`OrcReadBenchmark` was switched to use ZStandard for comparison.
- #44761

And, this PR aims to change the remaining three data source benchmarks.
```
$ git grep OrcCompressionCodec | grep Benchmark
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:      OrcCompressionCodec.SNAPPY.lowerCaseName())
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:      OrcCompressionCodec.SNAPPY.lowerCaseName()).orc(dir)
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:      .setIfMissing("orc.compression", OrcCompressionCodec.SNAPPY.lowerCaseName())
```
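The change in each benchmark amounts to dropping the explicit Snappy option so the session default applies. Roughly, as a sketch (using the write path shown in the grep output as an example):

```scala
// Before: codec pinned to Snappy in the benchmark setup.
df.write.option("compression", OrcCompressionCodec.SNAPPY.lowerCaseName()).orc(dir)

// After: no explicit option, so the spark.sql.orc.compression.codec
// default (zstd as of Spark 4.0) is used.
df.write.orc(dir)
```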

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44777 from dongjoon-hyun/SPARK-46752.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
…benchmarks

This PR aims to use the default ORC compression in data source benchmarks.

Apache ORC 2.0 and Apache Spark 4.0 will use ZStandard as the default ORC compression codec.
- apache/orc#1733
- apache#44654

`OrcReadBenchmark` was switched to use ZStandard for comparision.
- apache#44761

And, this PR aims to change the remaining three data source benchmarks.
```
$ git grep OrcCompressionCodec | grep Benchmark
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:      OrcCompressionCodec.SNAPPY.lowerCaseName())
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:      OrcCompressionCodec.SNAPPY.lowerCaseName()).orc(dir)
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:      .setIfMissing("orc.compression", OrcCompressionCodec.SNAPPY.lowerCaseName())
```

No.

Manual review.

No.

Closes apache#44777 from dongjoon-hyun/SPARK-46752.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>