Generify ColumnReaderImpl and RecordReader #1040
Labels: parquet (Changes to the parquet crate)
Comments
tustvold added the enhancement label (Any new improvement worthy of an entry in the changelog) on Dec 13, 2021
tustvold added 3 commits to tustvold/arrow-rs that referenced this issue on Dec 13, 2021
@tustvold could you provide some examples of how the new API would look and how it could be used?
tustvold added 2 commits to tustvold/arrow-rs that referenced this issue on Dec 21, 2021
alamb pushed a commit that referenced this issue on Jan 11, 2022:
* Simplify record reader
* Generify ColumnReaderImpl and RecordReader (#1040)
* Tweak count_records predicate
* Pre-allocate bitmask
* fix: TypedBuffer::split update len
* Simplify GenericRecordReader
* Move column decoders into module
* Remove `RecordBuffer::create` method
* Remove `TypedBuffer<i16>::count_records`
* Pass null count to `ColumnValueDecoder::read`
* Pull null padding out of column reader
* Review feedback
* Format
* License headers
* Further doc tweaks
* Further docs
* Restrict ScalarBuffer types
tustvold added a commit to tustvold/arrow-rs that referenced this issue on Jan 12, 2022
tustvold added 2 commits to tustvold/arrow-rs that referenced this issue on Jan 14, 2022
UTF-8 Validation (apache#786)
alamb pushed a commit that referenced this issue on Jan 18, 2022:
* Optimized ByteArrayReader (#1040) UTF-8 Validation (#786)
* Fix arrow_array_reader benchmark
* Allow running subset of arrow_array_reader benchmarks
* Faster UTF-8 validation
* Tweak null handling
* Add license
* Refine `ValuesBuffer::pad_nulls`
* Tweak error handling
* Use page null count if available
* Doc comments
* Test DELTA_BYTE_ARRAY encoding
* Support legacy Encoding::PLAIN_DICTIONARY
* Add OffsetBuffer unit tests Review feedback
* More tests
* Fix lint
* Review feedback
alamb added the parquet label (Changes to the parquet crate) and removed the enhancement label (Any new improvement worthy of an entry in the changelog) on Jan 20, 2022
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently `RecordReader` and `ColumnReaderImpl` have a hard-coded assumption that they are decoding to a contiguous array of values, or i16 levels. This complicates implementing #1037, #171 and potential future decode-related optimisations, e.g. decoding directly to StringArray, or evaluating predicates directly.

Describe the solution you'd like
Create new `GenericColumnReader` and `GenericRecordReader` types, which `RecordReader` and `ColumnReaderImpl` are type aliases of. This preserves API compatibility whilst allowing the introduction of new type parameters. As these types need to be able to influence the buffer types, they aren't object-safe and therefore need to be generics and not simply trait objects.

All decode and buffering would be provided by these generic types, allowing them to be swapped out. This would leave `ColumnReaderImpl` responsible for muxing the parquet file, i.e. extracting pages from the `PageReader` and feeding them to the decoders. `RecordReader` would be responsible for delimiting semantic records, as it is today.

Describe alternatives you've considered
We could duplicate the logic in `ColumnReaderImpl` and `RecordReader` into different reader implementations, but this seems unfortunate.

Additional context
There is likely non-trivial overlap with #384 and #200, which sought to introduce generics at a different level. Unfortunately that approach is still coupled with the notion of contiguous value arrays, and I couldn't see a way to achieve the particular flexibility desired.
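
To make the type-alias idea above a bit more concrete, here is a minimal sketch of how a generic reader could let the decoder choose the buffer type while the old name stays usable. This is not the parquet crate's actual API: the traits `ValuesBuffer` and `ValueDecoder`, the `PlainI32Decoder` and `ByteArrayDecoder` types, the simplified `OffsetBuffer`, the `read_all` method, and the use of a plain iterator of byte vectors in place of a `PageReader` are all assumptions made for illustration.

```rust
/// Hypothetical: an output buffer that a decoder appends values to.
pub trait ValuesBuffer: Default {
    /// Number of values currently held.
    fn num_values(&self) -> usize;
}

/// The contiguous-array case that the readers assume today.
impl ValuesBuffer for Vec<i32> {
    fn num_values(&self) -> usize {
        self.len()
    }
}

/// Offsets plus raw bytes, similar in spirit to a string array layout.
#[derive(Debug, Default, PartialEq)]
pub struct OffsetBuffer {
    pub offsets: Vec<usize>,
    pub values: Vec<u8>,
}

impl ValuesBuffer for OffsetBuffer {
    fn num_values(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }
}

/// Hypothetical: decodes one page into the buffer type it chooses.
/// The associated type is why this is used as a generic parameter
/// rather than as a trait object.
pub trait ValueDecoder {
    type Buffer: ValuesBuffer;
    /// Appends the values found in `page` to `out`, returning how many were read.
    fn read(&mut self, page: &[u8], out: &mut Self::Buffer) -> usize;
}

/// Decodes little-endian i32 values into a contiguous Vec.
pub struct PlainI32Decoder;

impl ValueDecoder for PlainI32Decoder {
    type Buffer = Vec<i32>;
    fn read(&mut self, page: &[u8], out: &mut Vec<i32>) -> usize {
        let before = out.len();
        out.extend(
            page.chunks_exact(4)
                .map(|b| i32::from_le_bytes([b[0], b[1], b[2], b[3]])),
        );
        out.len() - before
    }
}

/// Decodes length-prefixed byte arrays straight into an OffsetBuffer.
pub struct ByteArrayDecoder;

impl ValueDecoder for ByteArrayDecoder {
    type Buffer = OffsetBuffer;
    fn read(&mut self, mut page: &[u8], out: &mut OffsetBuffer) -> usize {
        if out.offsets.is_empty() {
            out.offsets.push(0);
        }
        let mut read = 0;
        while page.len() >= 4 {
            let len = u32::from_le_bytes([page[0], page[1], page[2], page[3]]) as usize;
            let end = 4 + len;
            if page.len() < end {
                break; // truncated value: a real reader would return an error
            }
            out.values.extend_from_slice(&page[4..end]);
            out.offsets.push(out.values.len());
            page = &page[end..];
            read += 1;
        }
        read
    }
}

/// Pulls pages from `pages` and feeds them to whichever decoder it was given.
pub struct GenericColumnReader<P, D> {
    pages: P,
    decoder: D,
}

impl<P, D> GenericColumnReader<P, D>
where
    P: Iterator<Item = Vec<u8>>,
    D: ValueDecoder,
{
    pub fn new(pages: P, decoder: D) -> Self {
        Self { pages, decoder }
    }

    /// Decodes every remaining page into a single buffer.
    pub fn read_all(&mut self) -> D::Buffer {
        let mut out = D::Buffer::default();
        for page in self.pages.by_ref() {
            self.decoder.read(&page, &mut out);
        }
        out
    }
}

/// The existing name becomes an alias with the decoder fixed, so code
/// written against it keeps compiling.
pub type ColumnReaderImpl<P> = GenericColumnReader<P, PlainI32Decoder>;
```

In this sketch, `ColumnReaderImpl::new(pages, PlainI32Decoder).read_all()` still produces a `Vec<i32>`, while `GenericColumnReader::new(pages, ByteArrayDecoder).read_all()` produces an `OffsetBuffer` from the same page-handling code, which is the kind of swap the proposal is after.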