Added data_pagesize_limit to write parquet pages #1303
Conversation
Read/write benchmark gists: https://gist.github.com/sundy-li/4984ec7cfeade556d60306a3a218ec8a We use TPC-H's lineitem table; its original file was generated by Dremio and is well paged (each column is split into many pages).
We use parquet_write.rs to generate the output file from the input file, and parquet_read.rs to test read performance in parallel. Before (a single large page per row group):
As you can see, reading is 5x slower with 16-way parallelism. After this PR:
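The only difference between the two runs is the writer configuration. A minimal sketch, assuming the post-PR shape of arrow2's `WriteOptions` (the 100 KB value matches the default discussed below; Snappy matches the file described in this thread):

```rust
use arrow2::io::parquet::write::{CompressionOptions, Version, WriteOptions};

// "Before": data_pagesize_limit: None keeps the old single-page behavior.
// "After":  cap each data page at roughly 100 KB.
let options = WriteOptions {
    write_statistics: true,
    compression: CompressionOptions::Snappy,
    version: Version::V2,
    data_pagesize_limit: Some(100 * 1024),
};
```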
Out of curiosity, what is the difference in file size? Usually smaller pages => larger files.
It's a large file: 255 MB, Snappy-encoded.
Codecov Report: Base: 83.12% // Head: 83.14% // Increases project coverage by +0.02%.
Additional details and impacted files:
@@ Coverage Diff @@
## main #1303 +/- ##
==========================================
+ Coverage 83.12% 83.14% +0.02%
==========================================
Files 369 369
Lines 40180 40234 +54
==========================================
+ Hits 33399 33452 +53
- Misses 6781 6782 +1
There is a similar limit in arrow-rs. Depending on your use case, you may also find limiting the row count useful, as some data compresses so well that it fits on quite small pages: https://docs.rs/parquet/27.0.0/parquet/file/properties/struct.WriterProperties.html#method.data_page_row_count_limit
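For comparison, a hedged sketch of the arrow-rs knobs (builder method names as of the linked parquet 27.0.0 docs; the row-count value is illustrative):

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Both limits are best-effort: they are checked between values, so a
    // single very large value can still exceed them.
    let _props = WriterProperties::builder()
        .set_data_pagesize_limit(100 * 1024)   // ~100 KB per data page
        .set_data_page_row_count_limit(20_000) // also split after N rows
        .build();
}
```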
Thank you @alamb. I couldn't come up with a better name, so I kept it compatible with arrow-rs. The row-count limit is less commonly used than the byte limit; I'll try to add it in a follow-up PR.

I found this best practice for optimizing parquet files: https://docs.dremio.com/software/data-formats/parquet-files/ The default page size of ~100 KB seems very reasonable. (I tested sizes from 8192 bytes to 1 MB today and found the best performance at this default.)

Some users benchmarked Dremio/DataFusion/Databend on TPC-H queries and reported to me that Databend was 2-3x slower at reading parquet files during loading. I looked into it for a couple of days and found this bottleneck. With this PR, Databend now performs on par with DuckDB/DataFusion on TPC-H Q1.
Thanks a lot @sundy-li for the PR, and everyone else for the ideas, suggestions and reviews 🙇
arrow2 writes a single page per column chunk by default, which hurts read performance significantly (up to 4x slower than the original file). This PR introduces an option named `data_pagesize_limit` to split large arrays into smaller pages; a usage sketch follows below.

Fixes #1291
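A fuller sketch in the spirit of arrow2's parquet_write example, assuming the post-PR `WriteOptions` shape (file name and column are placeholders):

```rust
use arrow2::{
    array::{Array, Int32Array},
    chunk::Chunk,
    datatypes::{DataType, Field, Schema},
    io::parquet::write::{
        transverse, CompressionOptions, Encoding, FileWriter, RowGroupIterator,
        Version, WriteOptions,
    },
};

fn main() -> arrow2::error::Result<()> {
    let schema = Schema::from(vec![Field::new("c1", DataType::Int32, false)]);
    let chunk = Chunk::new(vec![Int32Array::from_slice([1, 2, 3]).boxed()]);

    let options = WriteOptions {
        write_statistics: true,
        compression: CompressionOptions::Snappy,
        version: Version::V2,
        // The new knob: split large arrays into pages of at most ~100 KB.
        // `None` keeps the previous single-page-per-chunk behavior.
        data_pagesize_limit: Some(100 * 1024),
    };

    // One encoding per (nested) leaf of each field.
    let encodings: Vec<Vec<Encoding>> = schema
        .fields
        .iter()
        .map(|f| transverse(&f.data_type, |_| Encoding::Plain))
        .collect();

    let row_groups =
        RowGroupIterator::try_new(vec![Ok(chunk)].into_iter(), &schema, options, encodings)?;

    let file = std::fs::File::create("out.parquet")?;
    let mut writer = FileWriter::try_new(file, schema, options)?;
    for group in row_groups {
        writer.write(group?)?;
    }
    let _size = writer.end(None)?;
    Ok(())
}
```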