Tuning knobs to trade off CPU and compression in parquet #8358

Moving a PR discussion to a ticket so it doesn't get lost.

At a high level, the observation that @JigaoLuo and @mapleFU made is that writing parquet involves a tradeoff between file size and encode/decode speed (typically, the smaller the file, the slower it is to decode).

Currently the arrow-rs parquet writer allows some configuration here:
1. Setting the block compression to something like `zstd` decreases file size, but increases encode/decode time

However there are other settings that we might consider:
1. Automatically turning off block compression if the size savings aren't "good enough" (as described in https://github.com/apache/arrow-rs/pull/8257)
2. Adjusting the RLE lengths as suggested by @jhorstmann in https://github.com/apache/arrow-rs/issues/7739

This ticket tracks potential improvements to users' ability to tune this tradeoff.

Hello everyone,

I just came across this PR and noticed that most of the discussion is happening here, so I’d like to continue the conversation in this thread rather than on the issue page.

I believe the direction of this PR aligns well with a previous issue we discussed in https://github.com/XiangpengHao/liquid-cache/issues/227. I’ve been working on my own `parquet-rewrite` tool that touches on similar ideas, particularly with the **score** metric—a kind of breakeven point to decide whether compression should be applied. The goal of this tool is to help the reader skip unnecessary compression that adds overhead without delivering meaningful size reduction, ultimately improving the reader's reading performance.

Setting this **score** is quite tricky and empirical. For now, I’ve set it at 10%, mainly to catch cases where compression offers no size benefit at all. Here is an example of this case (at the level of a full column):

<img width="2415" height="660" alt="image" src="https://github.com/user-attachments/assets/0ab7438c-6516-46b0-bc17-e9c8b9b14273" />
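
To make the breakeven idea concrete, here is a minimal sketch of such a check. It assumes the **score** can be read as a minimum percentage of bytes saved. The function `keep_compressed` and its parameters are hypothetical names for illustration; they are not part of arrow-rs or of the `parquet-rewrite` tool.

```rust
/// Hypothetical helper (not an arrow-rs API): decide whether a compressed
/// buffer is worth keeping, given a minimum size reduction in percent.
///
/// Assumption: the "score" is interpreted as the minimum percentage of
/// bytes that compression must save, e.g. 10 for the 10% mentioned above.
pub fn keep_compressed(
    uncompressed_len: u64,
    compressed_len: u64,
    min_savings_percent: u64,
) -> bool {
    // An empty buffer has nothing to save, so store it uncompressed.
    if uncompressed_len == 0 {
        return false;
    }
    // saturating_sub yields 0 when compression made the data larger.
    let saved = uncompressed_len.saturating_sub(compressed_len);
    // Integer form of saved / uncompressed_len >= min_savings_percent / 100,
    // widened to u128 so the multiplications cannot overflow.
    (saved as u128) * 100 >= (uncompressed_len as u128) * (min_savings_percent as u128)
}
```

With a threshold of 10, `keep_compressed(1000, 400, 10)` returns `true` (60% saved), while `keep_compressed(1000, 980, 10)` returns `false` (2% saved), so the second buffer would be stored uncompressed and the reader would skip decompression for it. Integer arithmetic is used so that a buffer sitting exactly at the threshold is not rejected by floating-point rounding.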


---

As a side note, I’ve also made some patches to Xiangpeng’s viewer tool, which I use to inspect my generated Parquet files. This has been instrumental in iterating on my reader implementation.

_Originally posted by @JigaoLuo in https://github.com/apache/arrow-rs/issues/8257#issuecomment-3266966160_
            
