Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose page-level arrow reader API (#4298) #4307

Merged
merged 4 commits into from
May 31, 2023

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #4298

Rationale for this change

This makes it possible to use the arrow conversion plumbing with custom logic for fetching the sourcing the underlying data pages.

What changes are included in this PR?

Are there any user-facing changes?

The only changes are to experimental APIs, so no breaking changes

@github-actions github-actions bot added the parquet Changes to the parquet crate label May 29, 2023
///
/// Note: this is a low-level interface see [`ParquetRecordBatchReader::try_new`] for a
/// higher-level interface for reading parquet data from a file
pub fn try_new_with_row_groups(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lets you construct a ParquetRecordBatchReader from an arbitrary source of pages

///
/// Note: this is a low-level API, most users will want to make use of the higher-level
/// [`parquet_to_arrow_schema`] for decoding metadata from a parquet file.
pub fn parquet_to_arrow_field_levels(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the alternative to making parquet_to_array_schema_and_fields public, and does not expose the details of what ParquetField is.

}

/// Extracts the arrow metadata
pub(crate) fn parquet_to_array_schema_and_fields(
pub(crate) fn parquet_to_arrow_schema_and_fields(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realised the name was a typo, so fixed it 😄

@tustvold
Copy link
Contributor Author

@sundy-li could you perhaps take a look and see if this would suit your use-case

row_groups: &dyn RowGroups,
batch_size: usize,
selection: Option<RowSelection>,
) -> Result<Self> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to include a ProjectionMask as this can be passed to parquet_to_arrow_field_levels instead

@sundy-li
Copy link
Contributor

sundy-li commented May 30, 2023

Almost LGTM, how about exposing InMemoryRowGroup and scan_ranges of RowSelector ? Otherwise, we need a similar struct to work.

@tustvold
Copy link
Contributor Author

They're fairly tightly coupled with how ParquetRecordBatchStream does IO, something you indicated you wished to handle differently. What do you think of implementing RowGroups for your custom in-memory representation?

@sundy-li
Copy link
Contributor

sundy-li commented May 30, 2023

What do you think of implementing RowGroups for your custom in-memory representation?

Yes, we would have our own RowGroups implementation but we also want to access the function of RowSelector to inject the page index, so how about making scan_ranges be public.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me -- thanks @sundy-li and @tustvold

I also suggest someone write up a tested example of using these new APIs so they don't get inadvertently broken during some future refactor

Ok((schema, field_levels.levels))
}

/// Stores the parquet level information for a set of arrow [`Fields`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why the comment refers to parquet level information when the actual structure contains the an optional mapping to a ParquetField 🤔

Maybe the docs are out of date? Or perhaps mean to highlight the fact that the level information is stored in ParquetField?

/// Stores the parquet level information for a set of arrow [`Fields`]
#[derive(Debug, Clone)]
pub struct FieldLevels {
pub(crate) fields: Fields,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to get access to fields and levels outside of the parquet crate? Should there be?

Maybe they need accessors 🤔 Or maybe the point is to pass it opaquely to try_new_with_row_groups? If so perhaps that is worth a comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention of this container is to hide the internal notion of what is necessary to decode a field. Will clarify

@@ -35,6 +35,7 @@ fn get_repetition(t: &Type) -> Repetition {
}

/// Representation of a parquet file, in terms of arrow schema elements
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When reading this PR I see this comment and wonder "is it really a parquet file" or is it more like a "parquet field"? Or a "possibly nested field" 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded, I agree the use of file here is very misleading

///
/// This is useful for determining what byte ranges to fetch from underlying storage
///
/// Note: this method does not make any effort to combine consecutive ranges, nor coalesce
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@sundy-li
Copy link
Contributor

I also suggest someone write up a tested example of using these new APIs so they don't get inadvertently broken during some future refactor

Sure, I can help later after this pr is merged.

@tustvold tustvold merged commit bbb47a6 into apache:master May 31, 2023
@@ -34,7 +34,7 @@ fn get_repetition(t: &Type) -> Repetition {
}
}

/// Representation of a parquet file, in terms of arrow schema elements
/// Representation of a parquet schema element, in terms of arrow schema elements
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

export function parquet_to_array_schema_and_fields
3 participants