Description
Currently, with Gzip as the default, we prioritize compression/storage cost at the expense of:
- Compute on the write path: about 30% of the compute burned during bulk-insert in local benchmarks on the Amazon Reviews dataset is spent in Gzip (see below).
- Compute on the read path, and therefore query latencies: queries scanning large datasets are likely to be compression-/CPU-bound (Gzip throughput is 3-4x lower than Snappy or Zstd, [example|https://stackoverflow.com/a/56410326/3520840]).
P.S. Spark switched its default compression algorithm to Snappy [a while ago|https://github.com/apache/spark/pull/12256].
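The ratio-vs-CPU trade-off described above can be illustrated with a minimal, self-contained sketch. This is not the Hudi benchmark from the issue: it uses the Python stdlib `zlib`, with level 9 standing in for a Gzip-like "best ratio, most CPU" setting and level 1 standing in for a throughput-oriented codec such as Snappy or Zstd; the payload is synthetic.

```python
import time
import zlib

# Synthetic, repetitive payload (~2 MB) standing in for columnar table data.
payload = b"amazon-reviews-style sample row, repeated to build up input; " * 30000

def bench(level):
    """Compress the payload at the given zlib level; return (seconds, size)."""
    start = time.perf_counter()
    blob = zlib.compress(payload, level)
    return time.perf_counter() - start, len(blob)

slow_t, slow_size = bench(9)  # Gzip-like: best ratio, most CPU burned
fast_t, fast_size = bench(1)  # stand-in for a throughput-oriented codec

print(f"level 9: {slow_t * 1e3:.1f} ms -> {slow_size} bytes")
print(f"level 1: {fast_t * 1e3:.1f} ms -> {fast_size} bytes")
```

On typical inputs the low-effort setting finishes markedly faster at a modestly worse ratio, which is the same shape of trade-off the write-path numbers above point at.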
EDIT
We should actually evaluate putting in [Zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/] instead of Snappy. It has compression ratios comparable to Gzip while delivering much better performance:
!image-2021-12-03-13-13-02-892.png!
[https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/]
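As a hedged sketch of what the proposed change would look like for a user: Hudi exposes a writer config, `hoodie.parquet.compression.codec` (default `gzip`), that selects the Parquet base-file codec. The table name, record-key field, and path below are hypothetical, and actually producing Zstd-compressed files requires a runtime whose Parquet build supports Zstd (e.g. Spark 3).

```python
# Hypothetical writer options for a Hudi table; only the codec key is the
# point here -- the other values are illustrative placeholders.
hudi_options = {
    "hoodie.table.name": "amazon_reviews",                   # hypothetical
    "hoodie.datasource.write.recordkey.field": "review_id",  # hypothetical
    "hoodie.parquet.compression.codec": "zstd",              # gzip -> zstd
}

# Typical Spark DataFrame usage (needs a Spark 3 / Parquet-with-Zstd runtime):
# df.write.format("hudi").options(**hudi_options).mode("append").save(path)
print(hudi_options["hoodie.parquet.compression.codec"])
```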
JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-2928
- Type: Improvement
- Epic: https://issues.apache.org/jira/browse/HUDI-3249
- Attachment(s):
- Screen Shot 2021-12-03 at 12.36.13 PM.png (alexey.kudinkin, 03/Dec/21 21:03): https://issues.apache.org/jira/secure/attachment/13036992/Screen+Shot+2021-12-03+at+12.36.13+PM.png
- Screen Shot 2021-12-06 at 11.49.05 AM.png (alexey.kudinkin, 06/Dec/21 19:49): https://issues.apache.org/jira/secure/attachment/13037052/Screen+Shot+2021-12-06+at+11.49.05+AM.png
- image-2021-12-03-13-13-02-892.png (alexey.kudinkin, 03/Dec/21 21:13): https://issues.apache.org/jira/secure/attachment/13036993/image-2021-12-03-13-13-02-892.png
Comments
03/Dec/21 21:03 - alexey.kudinkin: !Screen Shot 2021-12-03 at 12.36.13 PM.png!
06/Dec/21 19:50 - alexey.kudinkin: Running a benchmark on a small subset of the Amazon Reviews dataset, we see a considerable improvement in bulk-insert times: bulk-insert was up to 40% faster, with a very similar storage footprint.
!Screen Shot 2021-12-06 at 11.49.05 AM.png|width=935,height=644!
14/Dec/21 01:06 - alexey.kudinkin: Unfortunately, switching to Zstd might require a little more grinding than initially anticipated:
The current Parquet version (1.10.1, handed down by Spark 2.4.4) only supports ZstdCompressionCodec as provided by "hadoop-common", which in turn requires Hadoop to be built with native-library support (including compression codecs, etc.), and that is only available on Linux/*nix.
Therefore, if we're planning on supporting Spark 2.x, we have the following options:
- Implement our own version of ZstdCompressionCodec, leveraging either [zstd-jni|https://github.com/luben/zstd-jni] (used by Spark internally) or airlift/aircompressor (which claims to be faster than the JNI implementation).
- Make Zstd the default only for Spark 3 environments.
12/Jan/22 00:45 - alexey.kudinkin: Unfortunately, we won't be able to support Zstd without a herculean effort of hacking around the Parquet implementation, as it's not modularized well enough to support outside extensions.
The only sensible path at this point seems to be waiting for the Spark upgrade to Parquet 1.12.
03/Feb/22 17:29 - alexey.kudinkin: Uber's example of leveraging Zstd in lieu of Gzip