
Tuning knobs to tradeoff CPU and compression in parquet #8358

@alamb

Moving a PR discussion to a ticket so it doesn't get lost.

At a high level, @JigaoLuo and @mapleFU observed that writing parquet involves a tradeoff between file size and encode/decode speed (typically, the smaller the file, the slower it is to decode).

Currently the arrow-rs parquet writer allows some configuration here:

  1. Setting the block compression to something like zstd decreases file size, but increases encode/decode time

However, there are other settings that we might consider:

  1. Automatically turning off block compression if the size savings aren't "good enough" (as described in Parquet: Do not compress v2 data page when compress is bad quality #8257)
  2. Adjusting the RLE lengths as suggested by @jhorstmann in Parquet LevelEncoder is much too eager to write short rle runs #7739
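To make the second item concrete, here is a rough sketch of why short RLE runs can cost more than they save. It models byte costs in Parquet's RLE/bit-packed hybrid encoding as I understand the format (a varint header per run, one byte-padded value for an RLE run, `bit_width` bytes per group of 8 values for a bit-packed run). The function names are hypothetical and this is not the arrow-rs `LevelEncoder` logic, only a cost model under those assumptions.

```rust
/// Bytes needed to store `v` as an unsigned LEB128 varint.
fn varint_len(mut v: u64) -> usize {
    let mut n = 1;
    while v >= 0x80 {
        v >>= 7;
        n += 1;
    }
    n
}

/// Assumed size of one RLE run: header `run_len << 1`, then the repeated
/// value padded to whole bytes.
fn rle_run_bytes(run_len: u64, bit_width: u32) -> usize {
    varint_len(run_len << 1) + (bit_width as usize + 7) / 8
}

/// Assumed size of one bit-packed run: header `(groups << 1) | 1`, then
/// `bit_width` bytes for each group of 8 values.
fn bit_packed_bytes(num_values: u64, bit_width: u32) -> usize {
    let groups = (num_values + 7) / 8;
    varint_len((groups << 1) | 1) + groups as usize * bit_width as usize
}

fn main() {
    let bw = 1; // e.g. definition levels of a nullable, non-nested column

    // 8 mixed values, then 8 identical values, then 8 mixed values.
    // Eager: the short repeat becomes its own RLE run, splitting the
    // bit-packed data into two runs with two headers.
    let eager = bit_packed_bytes(8, bw) + rle_run_bytes(8, bw) + bit_packed_bytes(8, bw);
    // Lazy: keep all 24 values in a single bit-packed run.
    let lazy = bit_packed_bytes(24, bw);
    println!("short repeat: eager RLE = {eager} bytes, one bit-packed run = {lazy} bytes");

    // A long repeat is where RLE clearly pays off.
    println!(
        "1000 repeats: RLE = {} bytes, bit-packed = {} bytes",
        rle_run_bytes(1000, bw),
        bit_packed_bytes(1000, bw)
    );
}
```

Under this model, a minimum run length before switching to RLE would be the tunable knob: below it, the extra run headers (and the extra branch in the decoder) outweigh the savings.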

This ticket tracks potentially improving the ability of users to tune this tradeoff.

Hello everyone,

I just came across this PR and noticed that most of the discussion is happening here, so I’d like to continue the conversation in this thread rather than on the issue page.

I believe the direction of this PR aligns well with a previous issue we discussed in XiangpengHao/liquid-cache#227. I’ve been working on my own parquet-rewrite tool that touches on similar ideas, particularly the score metric: a kind of breakeven point for deciding whether compression should be applied. The goal of this tool is to help the reader avoid compression that adds overhead without delivering a meaningful size reduction, ultimately improving read performance.

Setting this score is tricky and empirical. For now, I’ve set it at 10%, mainly to catch cases where compression offers no size benefit at all. Here is an example of this case (at the level of a full column):

[image]
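For readers who want the breakeven idea in code, here is a minimal sketch of such a check. It assumes the score is the percentage of bytes saved by compression and that data is left uncompressed when the savings fall below the threshold. The function name and the integer-percent representation are hypothetical, not taken from the rewrite tool or from arrow-rs.

```rust
/// Hypothetical breakeven check: keep the compressed bytes only when
/// compression saves at least `min_savings_percent` of the original size.
/// Integer arithmetic avoids floating-point rounding at the boundary.
fn should_keep_compressed(
    uncompressed_len: u64,
    compressed_len: u64,
    min_savings_percent: u64,
) -> bool {
    if uncompressed_len == 0 {
        return false; // nothing to save on an empty buffer
    }
    // Compression can expand data; treat that as zero savings.
    let saved = uncompressed_len.saturating_sub(compressed_len);
    // saved / uncompressed >= pct / 100, rearranged to avoid division.
    (saved as u128) * 100 >= (uncompressed_len as u128) * (min_savings_percent as u128)
}

fn main() {
    const MIN_SAVINGS_PERCENT: u64 = 10; // the 10% threshold mentioned above

    // (uncompressed bytes, compressed bytes) for a few made-up columns.
    let columns = [
        (1_000_000u64, 990_000u64), // about 1% saved
        (1_000_000, 400_000),       // 60% saved
        (1_000_000, 1_020_000),     // compression expanded the data
    ];

    for (raw, compressed) in columns {
        let keep = should_keep_compressed(raw, compressed, MIN_SAVINGS_PERCENT);
        println!("raw={raw} compressed={compressed} keep_compressed={keep}");
    }
}
```

The same check could in principle be applied per page or per column chunk; which granularity works best is exactly the empirical question raised here.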

As a side note, I’ve also made some patches to Xiangpeng’s viewer tool, which I use to inspect my generated Parquet files. This has been instrumental in iterating on my reader implementation.

Originally posted by @JigaoLuo in #8257 (comment)

Labels: parquet (Changes to the parquet crate)
