Improved performance of reading utf8 required from parquet (-15%) #670
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main     #670      +/-   ##
==========================================
+ Coverage   69.59%   69.67%   +0.07%
==========================================
  Files         299      301       +2
  Lines       16746    16803      +57
==========================================
+ Hits        11655    11707      +52
- Misses       5091     5096       +5
Continue to review full report at Codecov.
Very cool! Will it be more efficient when reading? It can avoid …

The main reason is that if the row_group contains a lot of data_pages, it will continue to be very slow when the …
In general a column chunk should have a small number of pages (usually 2-3 for a dict-encoded chunk and 1 for a non-dict-encoded one). More pages means more fragmentation and thus less compression. Since many parquet writers do not write page-level statistics, and many readers do not support filter pushdown at the page level, more fragmentation does not even help in skipping pages.
I think that this is the fastest possible under the constraints of the parquet format, since we can't tell the total size of a data page prior to reading its header (but I would love to be proven wrong and learn something new ^_^)
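A minimal sketch of why that is: each data page's byte length lives only in its (Thrift-encoded) header, so a reader has to decode header after header, strictly in order. `PageHeader` and `read_page_header` below are simplified stand-ins for illustration, not parquet2's actual API:

```rust
use std::io::{Read, Seek, SeekFrom};

// Hypothetical, simplified page header; in the real format this is a
// Thrift-encoded `PageHeader` that carries `compressed_page_size`.
struct PageHeader {
    compressed_page_size: u64,
}

// Placeholder for decoding the Thrift-encoded header (not shown here).
fn read_page_header<R: Read>(_reader: &mut R) -> std::io::Result<PageHeader> {
    unimplemented!("decode the Thrift-encoded PageHeader here")
}

// Walk all pages of a column chunk. Because a page's size is only known
// after its header has been decoded, pages must be visited sequentially:
// there is no index that lets us jump to the n-th page directly.
fn visit_pages<R: Read + Seek>(reader: &mut R, chunk_size: u64) -> std::io::Result<usize> {
    let start = reader.stream_position()?;
    let mut pages = 0;
    while reader.stream_position()? - start < chunk_size {
        let header = read_page_header(reader)?;
        // Skip the page body; a real reader would decompress and decode it.
        reader.seek(SeekFrom::Current(header.compressed_page_size as i64))?;
        pages += 1;
    }
    Ok(pages)
}
```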
Thank you for your reply.
I think that the following holds: pages may contain a dictionary page (we only know once we read the page), and individual pages may be encoded differently (e.g. mixing encodings within the same column chunk). For reference, I took the equation above by looking at the writer, here and here.
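As a rough illustration of size-based splitting, if a writer cuts a new data page whenever the accumulated encoded bytes exceed a configured page size, the page count comes out approximately as below. This is a hedged sketch of that splitting rule, not the arrow2 writer's actual logic, and `estimated_pages` is a hypothetical helper:

```rust
// Size-based splitting: a new data page starts once the accumulated
// encoded bytes exceed `page_size`, so the count is roughly
// ceil(total_size / page_size), plus one dictionary page if present.
// Hypothetical helper for illustration only.
fn estimated_pages(total_size: usize, page_size: usize, has_dict: bool) -> usize {
    let data_pages = (total_size + page_size - 1) / page_size; // ceiling division
    data_pages + usize::from(has_dict)
}

fn main() {
    // e.g. 10 MiB of encoded values, a 1 MiB page size, and a dictionary:
    assert_eq!(estimated_pages(10 << 20, 1 << 20, true), 11);
}
```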
Thanks for your reply. I'll study it again.
Closes #666
Two optimizations:

* …
* avoid `get_length` (`get_length` is not inlined and causes some perf problems; see the sketch below)

Thanks @ldn9638 for the idea!
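A hedged sketch of the `get_length` point: in a hot decoding loop, a per-value call that the compiler does not inline pays call overhead on every item, while decoding the 4-byte little-endian length prefix directly in the loop does not. The buffer layout below matches parquet's plain BYTE_ARRAY encoding, but the function names are illustrative, not parquet2's API:

```rust
// Stand-in for a non-inlined length getter; the attribute forces the
// pessimistic case the description above refers to.
#[inline(never)]
fn get_length(buf: &[u8]) -> usize {
    u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize
}

// Per-value call: one opaque function call per string.
// Both functions assume a well-formed plain-encoded BYTE_ARRAY buffer.
fn total_bytes_via_getter(mut buf: &[u8]) -> usize {
    let mut total = 0;
    while buf.len() >= 4 {
        let len = get_length(buf);
        total += len;
        buf = &buf[4 + len..];
    }
    total
}

// Same work with the length decode written inline, which lets the
// optimizer keep the loop tight.
fn total_bytes_inlined(mut buf: &[u8]) -> usize {
    let mut total = 0;
    while buf.len() >= 4 {
        let len = u32::from_le_bytes(buf[..4].try_into().unwrap()) as usize;
        total += len;
        buf = &buf[4 + len..];
    }
    total
}
```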