Description
Currently, with Gzip as the default, we prioritize compression/storage cost at the expense of:
- Compute on the write path: about 30% of the compute burned during bulk-insert in local benchmarks on the Amazon Reviews dataset is spent in Gzip (see below).
- Compute on the read path, and therefore query latencies: queries scanning large datasets are likely to be compression-/CPU-bound (Gzip throughput is 3-4x lower than Snappy or Zstd, [example|https://stackoverflow.com/a/56410326/3520840]).
P.S. Spark switched its default compression algorithm to Snappy [a while ago|https://github.com/apache/spark/pull/12256].
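The ratio-vs-CPU trade-off described above can be illustrated with a minimal, self-contained sketch. This is not the Hudi benchmark from the issue: it uses the Python stdlib `zlib`, with level 9 standing in for a Gzip-like "best ratio, most CPU" setting and level 1 standing in for a throughput-oriented codec such as Snappy or Zstd; the payload is synthetic.

```python
import time
import zlib

# Synthetic, repetitive payload (~2 MB) standing in for columnar table data.
payload = b"amazon-reviews-style sample row, repeated to build up input; " * 30000

def bench(level):
    """Compress the payload at the given zlib level; return (seconds, size)."""
    start = time.perf_counter()
    blob = zlib.compress(payload, level)
    return time.perf_counter() - start, len(blob)

slow_t, slow_size = bench(9)  # Gzip-like: best ratio, most CPU burned
fast_t, fast_size = bench(1)  # stand-in for a throughput-oriented codec

print(f"level 9: {slow_t * 1e3:.1f} ms -> {slow_size} bytes")
print(f"level 1: {fast_t * 1e3:.1f} ms -> {fast_size} bytes")
```

On typical inputs the low-effort setting finishes markedly faster at a modestly worse ratio, which is the same shape of trade-off the write-path numbers above point at.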
EDIT
We should actually evaluate putting in [Zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] instead of Snappy. It has compression ratios comparable to Gzip while delivering much better performance:
!image-2021-12-03-13-13-02-892.png!
[https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/]
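As a hedged sketch of what the proposed change would look like for a user: Hudi exposes a writer config, `hoodie.parquet.compression.codec` (default `gzip`), that selects the Parquet base-file codec. The table name, record-key field, and path below are hypothetical, and actually producing Zstd-compressed files requires a runtime whose Parquet build supports Zstd (e.g. Spark 3).

```python
# Hypothetical writer options for a Hudi table; only the codec key is the
# point here -- the other values are illustrative placeholders.
hudi_options = {
    "hoodie.table.name": "amazon_reviews",                   # hypothetical
    "hoodie.datasource.write.recordkey.field": "review_id",  # hypothetical
    "hoodie.parquet.compression.codec": "zstd",              # gzip -> zstd
}

# Typical Spark DataFrame usage (needs a Spark 3 / Parquet-with-Zstd runtime):
# df.write.format("hudi").options(**hudi_options).mode("append").save(path)
print(hudi_options["hoodie.parquet.compression.codec"])
```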
JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-2928
- Type: Improvement
- Epic: https://issues.apache.org/jira/browse/HUDI-3249
- Attachment(s):
- Screen Shot 2021-12-03 at 12.36.13 PM.png (alexey.kudinkin, 03/Dec/21 21:03): https://issues.apache.org/jira/secure/attachment/13036992/Screen+Shot+2021-12-03+at+12.36.13+PM.png
- Screen Shot 2021-12-06 at 11.49.05 AM.png (alexey.kudinkin, 06/Dec/21 19:49): https://issues.apache.org/jira/secure/attachment/13037052/Screen+Shot+2021-12-06+at+11.49.05+AM.png
- image-2021-12-03-13-13-02-892.png (alexey.kudinkin, 03/Dec/21 21:13): https://issues.apache.org/jira/secure/attachment/13036993/image-2021-12-03-13-13-02-892.png
Comments
03/Dec/21 21:03 - alexey.kudinkin: !Screen Shot 2021-12-03 at 12.36.13 PM.png!
06/Dec/21 19:50 - alexey.kudinkin: Running a benchmark on a small subset of the Amazon Reviews dataset, we see a considerable improvement in bulk-insert times: bulk-insert was up to 40% faster, with a very similar storage footprint.
!Screen Shot 2021-12-06 at 11.49.05 AM.png|width=935,height=644!
14/Dec/21 01:06 - alexey.kudinkin: Unfortunately, switching to Zstd might require a little more grinding than initially anticipated:
The current Parquet version (1.10.1, handed down by Spark 2.4.4) only supports ZstdCompressionCodec as provided by "hadoop-common", which in turn requires Hadoop to be built with native-library support (including compression codecs, etc.), and that is only available on Linux/*nix.
Therefore, if we're planning on supporting Spark 2.x, we have the following options:
- Implement our own version of ZstdCompressionCodec, leveraging either [zstd-jni|https://github.com/luben/zstd-jni] (used by Spark internally) or airlift/aircompressor (which claims to be faster than the JNI implementation).
- Make Zstd the default only for Spark 3 environments.
12/Jan/22 00:45 - alexey.kudinkin: Unfortunately, we won't be able to support Zstd without a herculean effort of hacking around the Parquet implementation, as it's not modularized well enough to support outside extensions.
The only sensible path at this point seems to be waiting for the Spark upgrade to Parquet 1.12.
03/Feb/22 17:29 - alexey.kudinkin: Uber's example of leveraging Zstd in lieu of Gzip