Description
Moving a PR discussion to a ticket so it doesn't get lost.
At a high level, the observation that @JigaoLuo and @mapleFU made is that there is a tradeoff when writing Parquet between file size and encode/decode speed (typically, the smaller the file, the slower it is to decode).
Currently the arrow-rs parquet writer allows some configuration here:
- Setting the block compression to something like zstd decreases file size, but increases encode/decode time
However there are other settings that we might consider:
- Automatically turning off block compression if the size savings aren't "good enough" (as described in Parquet: Do not compress v2 data page when compress is bad quality #8257)
- Adjusting the RLE lengths as suggested by @jhorstmann in Parquet LevelEncoder is much too eager to write short rle runs #7739
This ticket tracks potential improvements to users' ability to tune this tradeoff.
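To illustrate why short RLE runs can cost space, here is a rough sketch of the byte sizes implied by the Parquet RLE/bit-packing hybrid encoding, which is the encoding used for definition and repetition levels. It is not the arrow-rs LevelEncoder and does not reflect whatever threshold #7739 proposes; the function names are hypothetical and exist only for this example.

```rust
/// Number of bytes needed to store `v` as a ULEB128 varint.
fn varint_len(mut v: u64) -> usize {
    let mut n = 1;
    while v >= 0x80 {
        v >>= 7;
        n += 1;
    }
    n
}

/// Size of one RLE run in the hybrid encoding:
/// header varint(run_len << 1) followed by the repeated value,
/// padded up to a whole number of bytes.
fn rle_run_bytes(run_len: u64, bit_width: u32) -> usize {
    varint_len(run_len << 1) + ((bit_width as usize + 7) / 8)
}

/// Size of one bit-packed run in the hybrid encoding:
/// header varint((groups << 1) | 1) followed by `groups` groups of
/// 8 values, each group taking `bit_width` bytes.
/// `num_values` is rounded up to a multiple of 8.
fn bit_packed_run_bytes(num_values: u64, bit_width: u32) -> usize {
    let groups = (num_values + 7) / 8;
    varint_len((groups << 1) | 1) + (groups as usize) * (bit_width as usize)
}
```

For 24 one-bit levels, a single bit-packed run costs `bit_packed_run_bytes(24, 1)` = 4 bytes. Splitting the same values into a bit-packed run of 8, an RLE run of 8, and another bit-packed run of 8 costs 2 + 2 + 2 = 6 bytes, because each run pays for its own header. A long run is a different story: `rle_run_bytes(1000, 1)` is 3 bytes, against 127 bytes for bit-packing the same 1000 values.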
Hello everyone,
I just came across this PR and noticed that most of the discussion is happening here, so I’d like to continue the conversation in this thread rather than on the issue page.
I believe the direction of this PR aligns well with a previous issue we discussed in XiangpengHao/liquid-cache#227. I’ve been working on my own parquet-rewrite tool that touches on similar ideas, particularly with the score metric: a kind of breakeven point to decide whether compression should be applied. The goal of this tool is to help the reader skip unnecessary compression that adds decode overhead without delivering meaningful size reduction, ultimately improving read performance.
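To make the breakeven idea concrete, here is a minimal sketch of such a check. It assumes the score is interpreted as a minimum percentage of bytes saved. The function name and signature are hypothetical and are not taken from the parquet-rewrite tool or from arrow-rs.

```rust
/// Decide whether to keep the compressed form of a buffer.
///
/// Returns true only if compression made the data strictly smaller AND
/// saved at least `min_savings_percent` percent of the uncompressed size.
/// Integer arithmetic avoids floating-point rounding at the boundary.
fn keep_compressed(
    uncompressed_len: usize,
    compressed_len: usize,
    min_savings_percent: u32,
) -> bool {
    // Covers empty input and the case where compression grew the data.
    if compressed_len >= uncompressed_len {
        return false;
    }
    let saved = (uncompressed_len - compressed_len) as u128;
    // saved / uncompressed >= percent / 100, rearranged to avoid division.
    saved * 100 >= (uncompressed_len as u128) * (min_savings_percent as u128)
}
```

With a threshold of 10 percent, `keep_compressed(1000, 950, 10)` is false (only 5% saved), while `keep_compressed(1000, 500, 10)` is true. A writer or rewrite tool could apply a check like this per page or per column chunk and store the data uncompressed when it fails.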
Setting this score is quite tricky and empirical. For now, I’ve set it at 10%, mainly to catch cases where compression offers no size benefit at all. Here is an example of this case (at the level of a full column):

As a side note, I’ve also made some patches to Xiangpeng’s viewer tool, which I use to inspect my generated Parquet files. This has been instrumental in iterating on my reader implementation.
Originally posted by @JigaoLuo in #8257 (comment)