Reduce memory usage in Parquet->Arrow decimal column chunk conversion #751
Conversation
This PR reduces memory usage, both in total memory used and in the number of allocations, in the Parquet->Arrow conversion of Decimal chunks. There are two optimizations:

1. Instead of using `slice::concat` to expand buffers to 16 bytes, a stack-allocated 16-byte buffer is used. This removes one heap allocation per value.
2. Data is expanded from the encoded Parquet fixed-size binary pages into a byte buffer, which is then converted to a buffer of `i128`s. To reduce the size of the intermediate byte buffer, this conversion is now done page-by-page.
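A minimal sketch of optimization #1, assuming the Parquet values are big-endian two's-complement fixed-size binary (the function name and exact shape here are illustrative, not the PR's actual code): sign-extending into a stack-allocated `[u8; 16]` avoids the per-value heap allocation that a `concat` of the slice with a padding slice would incur.

```rust
// Hypothetical sketch: sign-extend a big-endian two's-complement slice
// (length <= 16) into an i128 using a stack buffer, instead of
// allocating a padded Vec per value via `slice::concat`.
fn be_bytes_to_i128(bytes: &[u8]) -> i128 {
    assert!(!bytes.is_empty() && bytes.len() <= 16);
    // Fill with the sign byte: 0xFF for negative values, 0x00 otherwise.
    let sign = if bytes[0] & 0x80 != 0 { 0xFFu8 } else { 0x00u8 };
    let mut buf = [sign; 16];
    // Copy the value into the low-order end of the 16-byte buffer.
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    i128::from_be_bytes(buf)
}

fn main() {
    assert_eq!(be_bytes_to_i128(&[0x01, 0x00]), 256);
    assert_eq!(be_bytes_to_i128(&[0xFF, 0xFF]), -1);
    println!("ok");
}
```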
For reasons I don't fully understand, optimization #2 was yielding wrong results, so I've backed it out in the second commit and replaced it with a `FromIterator` call.
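The `FromIterator` replacement might look roughly like the following sketch (names and signatures are assumptions for illustration, not the PR's code): each fixed-size value is converted lazily and the `i128` buffer is collected in a single pass, so no intermediate expanded byte buffer is materialized.

```rust
// Hypothetical sketch: decode one page of big-endian fixed-size binary
// decimals (each `size` bytes wide) directly into i128s via an iterator,
// skipping the intermediate expanded byte buffer entirely.
fn decode_page(page: &[u8], size: usize) -> Vec<i128> {
    page.chunks_exact(size)
        .map(|chunk| {
            // Sign-extend into a stack buffer, as in optimization #1.
            let sign = if chunk[0] & 0x80 != 0 { 0xFFu8 } else { 0x00u8 };
            let mut buf = [sign; 16];
            buf[16 - chunk.len()..].copy_from_slice(chunk);
            i128::from_be_bytes(buf)
        })
        .collect()
}

fn main() {
    // Two 2-byte big-endian values: 256 and -1.
    let page = [0x01, 0x00, 0xFF, 0xFF];
    assert_eq!(decode_page(&page, 2), vec![256, -1]);
    println!("ok");
}
```

Because `chunks_exact` yields an exact-size iterator, `collect` can reserve the output `Vec` up front, so the whole page converts with a single allocation.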
Codecov Report
@@ Coverage Diff @@
## main #751 +/- ##
==========================================
+ Coverage 70.80% 71.00% +0.19%
==========================================
Files 313 313
Lines 16930 16912 -18
==========================================
+ Hits 11988 12008 +20
+ Misses 4942 4904 -38
Continue to review full report at Codecov.
Thanks! Clippy missing but otherwise ready to ship
Noticed an extra optimization we can do here
Co-authored-by: Jorge Leitao <jorgecarleitao@gmail.com>
@jorgecarleitao nice, good call. I had to insert a |
Woops, we are missing an iterator over the values of a |
Thanks again, very clean :)