-
Notifications
You must be signed in to change notification settings - Fork 875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose page-level arrow reader API (#4298) #4307
Conversation
/// | ||
/// Note: this is a low-level interface see [`ParquetRecordBatchReader::try_new`] for a | ||
/// higher-level interface for reading parquet data from a file | ||
pub fn try_new_with_row_groups( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lets you construct a ParquetRecordBatchReader
from an arbitrary source of pages
/// | ||
/// Note: this is a low-level API, most users will want to make use of the higher-level | ||
/// [`parquet_to_arrow_schema`] for decoding metadata from a parquet file. | ||
pub fn parquet_to_arrow_field_levels( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the alternative to making parquet_to_array_schema_and_fields
public, and does not expose the details of what ParquetField
is.
} | ||
|
||
/// Extracts the arrow metadata | ||
pub(crate) fn parquet_to_array_schema_and_fields( | ||
pub(crate) fn parquet_to_arrow_schema_and_fields( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realised the name was a typo, so fixed it 😄
@sundy-li could you perhaps take a look and see if this would suit your use-case |
row_groups: &dyn RowGroups, | ||
batch_size: usize, | ||
selection: Option<RowSelection>, | ||
) -> Result<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to include a ProjectionMask
as this can be passed to parquet_to_arrow_field_levels
instead
Almost LGTM, how about exposing |
They're fairly tightly coupled with how ParquetRecordBatchStream does IO, something you indicated you wished to handle differently. What do you think of implementing |
Yes, we would have our own |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parquet/src/arrow/schema/mod.rs
Outdated
Ok((schema, field_levels.levels)) | ||
} | ||
|
||
/// Stores the parquet level information for a set of arrow [`Fields`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why the comment refers to parquet level information
when the actual structure contains the an optional mapping to a ParquetField
🤔
Maybe the docs are out of date? Or perhaps mean to highlight the fact that the level information is stored in ParquetField
?
/// Stores the parquet level information for a set of arrow [`Fields`] | ||
#[derive(Debug, Clone)] | ||
pub struct FieldLevels { | ||
pub(crate) fields: Fields, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way to get access to fields
and levels
outside of the parquet crate? Should there be?
Maybe they need accessors 🤔 Or maybe the point is to pass it opaquely to try_new_with_row_groups
? If so perhaps that is worth a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention of this container is to hide the internal notion of what is necessary to decode a field. Will clarify
parquet/src/arrow/schema/complex.rs
Outdated
@@ -35,6 +35,7 @@ fn get_repetition(t: &Type) -> Repetition { | |||
} | |||
|
|||
/// Representation of a parquet file, in terms of arrow schema elements |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When reading this PR I see this comment and wonder "is it really a parquet file" or is it more like a "parquet field"? Or a "possibly nested field" 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworded, I agree the use of file here is very misleading
/// | ||
/// This is useful for determining what byte ranges to fetch from underlying storage | ||
/// | ||
/// Note: this method does not make any effort to combine consecutive ranges, nor coalesce |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Sure, I can help later after this pr is merged. |
@@ -34,7 +34,7 @@ fn get_repetition(t: &Type) -> Repetition { | |||
} | |||
} | |||
|
|||
/// Representation of a parquet file, in terms of arrow schema elements | |||
/// Representation of a parquet schema element, in terms of arrow schema elements |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Which issue does this PR close?
Closes #4298
Rationale for this change
This makes it possible to use the arrow conversion plumbing with custom logic for fetching the sourcing the underlying data pages.
What changes are included in this PR?
Are there any user-facing changes?
The only changes are to experimental APIs, so no breaking changes