Added support to read parquet row groups in chunks #789
This now reproduces all the existing functionality apart from:
I found a bug in |
Codecov Report
@@ Coverage Diff @@
## main #789 +/- ##
==========================================
- Coverage 71.29% 71.19% -0.10%
==========================================
Files 321 326 +5
Lines 16834 17471 +637
==========================================
+ Hits 12001 12438 +437
- Misses 4833 5033 +200
|
@houqp, this is ready for a spin. There is a regression (20-40%) when reading a whole binary/utf8 column chunk at once (i.e. no chunk size). This is related to some tricks in pre-computed capacities of binary/utf8 that benefit from reading the whole column (we can recover this behavior). I deactivated structs of structs and lists of lists for now as I need to dig a bit into the dremel. The failing tests are just examples that I need to update. ^^ |
This PR allows reading of parquet columns in chunks, thereby allowing decompressing and deserializing pages on demand to reduce the memory footprint.
This is a draft as there is still some work to do, which I will continue:
- use `with_capacity` and `reserve` to avoid unneeded reallocations (performance should improve when the chunk size is `None`; there should be minimal difference when chunked)
Design
This PR follows the design of this crate:
- `unsafe`-free
The overall design of the changes enables pages to be deserialized to arrays of a different length, based on a new parameter `chunk_size: usize`.
Broadly, this PR does the following (for a single row group):
- […] (`Vec<u8>`)
- `Vec<u8> -> Iterator<CompressedDataPage>`
- `Iterator<CompressedDataPage> -> Iterator<&DataPage>` (decompression)
- `Iterator<&DataPage> -> Iterator<Arc<dyn Array>>` (deserialization)
All these iterators are CPU-bound. On the last step, we track where we are in the `DataPage` (through `PageState`, see below) and the temporary mutable array, and either:
- […]
- […] `chunk_size`, at which point we freeze the mutable array and return it
In more detail:
- […] `read_exact`, instead of page by page. This was a mistake in the previous design, since we should not mix reading columns with deserializing pages
- `&'a DataPage` is mapped to a `PageState<'a>` based on its parquet physical type and encoding. This is the input state of a page and allows us to "suspend" an iteration over a page, thereby allowing a page to "extend" an array without being completely consumed
- […] `Decoder`, knows how to initialize mutable arrays and how to extend them from `&mut PageState<'a>`, advancing the page state accordingly. This is mostly a DRY trait
Closes #768