
Evaluate rebasing Hudi's default compression from Gzip to Zstd #14938


Description


Currently, with Gzip as the default, we prioritize compression ratio/storage cost at the expense of:

  • Compute (on the *write-path*): roughly 30% of the compute burned during bulk-insert in local benchmarks on the Amazon Reviews dataset is spent in Gzip (see below)
  • Compute (on the *read-path*), as well as query latencies: queries scanning large datasets are likely to be compression-/CPU-bound, since Gzip throughput is 3-4x lower than Snappy's or Zstd's ([example](https://stackoverflow.com/a/56410326/3520840))

P.S. Spark switched its default compression algorithm to Snappy [a while ago](https://github.com/apache/spark/pull/12256).
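
For reference, the codec is already overridable per write via Hudi's existing `hoodie.parquet.compression.codec` config. Below is a minimal sketch of a bulk-insert with an explicit codec; the table name, record-key field, and path are hypothetical, and whether a given codec value resolves depends on the underlying Parquet/Hadoop stack:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class CodecOverrideExample {

    // Bulk-insert with an explicit Parquet codec instead of the Gzip default.
    static void bulkInsert(Dataset<Row> df, String codec, String basePath) {
        df.write()
          .format("hudi")
          .option("hoodie.table.name", "amazon_reviews")                   // hypothetical table
          .option("hoodie.datasource.write.recordkey.field", "review_id")  // hypothetical key
          .option("hoodie.datasource.write.operation", "bulk_insert")
          .option("hoodie.parquet.compression.codec", codec)               // e.g. "gzip", "snappy", "zstd"
          .mode(SaveMode.Append)
          .save(basePath);
    }
}
```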

 

EDIT

We should actually evaluate putting in [zstd](https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/) instead of Snappy: it has compression ratios comparable to Gzip's while bringing much better performance:

(image: benchmark chart of Zstd vs. Gzip compression ratio and speed, from the Facebook Engineering post below)

https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/

Comments

alexey.kudinkin, 03/Dec/21 21:03:

(attachment: Screen Shot 2021-12-03 at 12.36.13 PM.png)


alexey.kudinkin, 06/Dec/21 19:50:

Running a benchmark on a small subset of the Amazon Reviews dataset, we see a considerable improvement in bulk-insert times: bulk-insert was up to 40% faster, with a very similar storage footprint (a sketch of how such a comparison could be scripted follows below).

(attachment: Screen Shot 2021-12-06 at 11.49.05 AM.png)
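
A minimal sketch of how such a side-by-side run could be scripted, reusing the hypothetical `bulkInsert` helper from the description above (paths are illustrative):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class CodecBenchmark {

    // Compare wall-clock bulk-insert time per codec; the storage footprint
    // can then be compared with e.g. `hadoop fs -du -s <path>` on the outputs.
    static void run(Dataset<Row> reviews, String baseDir) {
        for (String codec : new String[] {"gzip", "zstd"}) {
            String path = baseDir + "/" + codec;
            long start = System.nanoTime();
            CodecOverrideExample.bulkInsert(reviews, codec, path);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("%s: bulk-insert took %d ms%n", codec, elapsedMs);
        }
    }
}
```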


alexey.kudinkin, 14/Dec/21 01:06:

Unfortunately, switching to Zstd might require a little more grinding than initially anticipated:

The current Parquet version (1.10.1, handed down by Spark 2.4.4) only supports the ZstdCompressionCodec provided by "hadoop-common", which in turn requires Hadoop to be built with native-library support (including compression codecs, etc.). That support only covers Linux/*nix.

Therefore, if we're planning on supporting Spark 2.x, we have the following options (a rough sketch of option 1 follows this list):

1. Implement our own version of ZstdCompressionCodec, leveraging either [zstd-jni](https://github.com/luben/zstd-jni) (used by Spark internally) or airlift's aircompressor (which claims to be faster than the JNI implementation).

2. Make zstd the default setting only in Spark 3 environments.
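
Very roughly, option 1 could look like the sketch below: a Hadoop CompressionCodec backed by zstd-jni's ZstdOutputStream/ZstdInputStream. The class name is hypothetical, and only the stream-based half of the interface is filled in; Parquet's CodecFactory also expects pooled Compressor/Decompressor instances, which is precisely the extra grinding mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import com.github.luben.zstd.ZstdInputStream;
import com.github.luben.zstd.ZstdOutputStream;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.Decompressor;

// Hypothetical codec: delegates framing to zstd-jni's streaming API.
public class JniZstdCodec implements CompressionCodec {

    @Override
    public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
        final ZstdOutputStream zstd = new ZstdOutputStream(out);
        return new CompressionOutputStream(out) {
            @Override public void write(int b) throws IOException { zstd.write(b); }
            @Override public void write(byte[] b, int off, int len) throws IOException { zstd.write(b, off, len); }
            @Override public void finish() throws IOException { zstd.flush(); } // approximate; a real codec must end the Zstd frame here
            @Override public void resetState() { throw new UnsupportedOperationException("not covered in this sketch"); }
        };
    }

    @Override
    public CompressionInputStream createInputStream(InputStream in) throws IOException {
        final ZstdInputStream zstd = new ZstdInputStream(in);
        return new CompressionInputStream(in) {
            @Override public int read() throws IOException { return zstd.read(); }
            @Override public int read(byte[] b, int off, int len) throws IOException { return zstd.read(b, off, len); }
            @Override public void resetState() { throw new UnsupportedOperationException("not covered in this sketch"); }
        };
    }

    // Parquet's CodecFactory pools Compressor/Decompressor instances via the
    // methods below; implementing them (or bypassing CodecFactory) is the
    // non-trivial part of this option.
    @Override public Class<? extends Compressor> getCompressorType() { throw new UnsupportedOperationException(); }
    @Override public Compressor createCompressor() { throw new UnsupportedOperationException(); }
    @Override public CompressionOutputStream createOutputStream(OutputStream out, Compressor c) { throw new UnsupportedOperationException(); }
    @Override public Class<? extends Decompressor> getDecompressorType() { throw new UnsupportedOperationException(); }
    @Override public Decompressor createDecompressor() { throw new UnsupportedOperationException(); }
    @Override public CompressionInputStream createInputStream(InputStream in, Decompressor d) { throw new UnsupportedOperationException(); }

    @Override public String getDefaultExtension() { return ".zstd"; }
}
```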


alexey.kudinkin, 12/Jan/22 00:45:

Unfortunately, we won't be able to support Zstd without a herculean effort of hacking around the Parquet implementation, as it is unfortunately not modularized well enough to support outside extensions.

The only sensible way forward at this point seems to be waiting for the Spark/Parquet upgrade to Parquet 1.12.
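
For illustration, once on a Spark 3.x stack that ships Parquet 1.12+, zstd should be selectable through standard configuration, with no Hadoop native libraries involved (app name and output path are illustrative):

```java
import org.apache.spark.sql.SparkSession;

public class Spark3ZstdCheck {

    public static void main(String[] args) {
        // On Spark 3.2+ (which bundles Parquet 1.12+), "zstd" is a supported
        // value for the built-in Parquet codec setting.
        SparkSession spark = SparkSession.builder()
            .appName("zstd-default-check")
            .master("local[*]")  // for local testing only
            .config("spark.sql.parquet.compression.codec", "zstd")
            .getOrCreate();

        spark.range(1000).write().mode("overwrite").parquet("/tmp/zstd-check");
        spark.stop();
    }
}
```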


alexey.kudinkin, 03/Feb/22 17:29:

Uber's example of leveraging Zstd in lieu of Gzip:

https://eng.uber.com/cost-efficiency-big-data/
