Page index partial missing behavior #8892
Draft
Page index behavior: handle partial / missing indexes without panics
Goal
Ensure Parquet page indexes behave correctly when some or all columns lack indexes by:
- Treating `None` at the leaf of `ParquetColumnIndex` and `ParquetOffsetIndex` as “no index for this column chunk”
Which issue does this PR close?
Closes #8818.
Rationale for this change
With the earlier PRs:
- Page indexes are modeled as `Vec<Vec<Option<...>>>`
- A missing index for a column chunk is represented as `None`
However, several call sites and higher-level APIs still effectively assumed that page indexes were present for all requested columns once page index reading was enabled. This could lead to:
- `unwrap()` / `as_ref().unwrap()` panics when a specific column chunk had no index
This PR focuses on making the runtime behavior robust for these scenarios and documenting it via tests.
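A minimal sketch of the non-panicking lookup this implies, using stand-in types. `OffsetIndex`, `OffsetIndexes`, `offset_index_for`, and `page_count` are hypothetical names chosen for illustration, not the crate's API; only the `Vec<Vec<Option<...>>>` shape comes from the description above.

```rust
/// Hypothetical stand-in for the crate's `OffsetIndexMetaData`.
#[derive(Debug, Clone, PartialEq)]
struct OffsetIndex {
    /// First row of each page (illustrative field, not the real layout).
    first_row_indexes: Vec<i64>,
}

/// Indexed as `[row_group][column]`; a `None` leaf means
/// "no index for this column chunk".
type OffsetIndexes = Vec<Vec<Option<OffsetIndex>>>;

/// Looks up the offset index for one column chunk without panicking.
/// Returns `None` when the page index is absent altogether, when either
/// position is out of range, or when the leaf itself is `None`.
fn offset_index_for(
    indexes: Option<&OffsetIndexes>,
    row_group: usize,
    column: usize,
) -> Option<&OffsetIndex> {
    indexes?.get(row_group)?.get(column)?.as_ref()
}

/// Number of pages described by the index, or `None` if there is no index
/// to consult. Callers choose their own fallback instead of unwrapping.
fn page_count(
    indexes: Option<&OffsetIndexes>,
    row_group: usize,
    column: usize,
) -> Option<usize> {
    offset_index_for(indexes, row_group, column)
        .map(|index| index.first_row_indexes.len())
}
```

Given one row group whose first column has a two-page index and whose second column has none, `page_count(Some(&indexes), 0, 0)` is `Some(2)`, while the second column, an out-of-range row group, and an absent top-level index all yield `None`.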
What changes are included in this PR?
Safer handling of per-column `Option` page indexes
Update page-iterator, selection, and in-memory row-group code paths to:
- Treat `None` in `ParquetOffsetIndex[row_group][column]` or `ParquetColumnIndex[row_group][column]` as “no page index for this column chunk”
Where `expect(...)` is still used, it is reserved for true internal invariants.
Row selection and read-plan behavior
`RowSelection` and related selection helpers are updated to:
- Work with `&[Option<OffsetIndexMetaData>]` where appropriate
- Account for `None` entries
- Respect `RowSelectionPolicy` while avoiding panics for missing indexes
In-memory row group and async/arrow readers
`InMemoryRowGroup` now:
- Uses `Option<&[Option<OffsetIndexMetaData>]>` for offset indexes
- Guards access to `offset_index[idx]`
Async and synchronous readers are also updated.
`PageIndexPolicy` behavior
With per-column `Option` modeling, `parse_offset_index` now:
- Returns `Vec<Vec<Option<OffsetIndexMetaData>>>` so missing offset indexes are represented as `None`
- Continues to account for `Required` policies
The practical effect is:
- Missing indexes surface as `None` in a defined, non-panicking way
Tests for partial and missing page indexes
Tests (`parquet/tests/arrow_reader/statistics.rs`) that cover:
- `None` at the page index leaf
- `None` for top-level page indexes
Are these changes tested?
Yes.
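The two scenarios listed above could be exercised along the lines of the sketch below. It is not the PR's actual test code: `OffsetIndex` and `pages_per_column` are hypothetical stand-ins, and only the `Option<&[Option<...>]>` shape is taken from the description.

```rust
/// Hypothetical stand-in for the crate's `OffsetIndexMetaData`.
#[derive(Debug, Clone, PartialEq)]
struct OffsetIndex {
    page_count: usize,
}

/// For each of `num_columns` columns in one row group, reports how many
/// pages its offset index describes. `None` means "no index for this
/// column chunk": the leaf is `None`, the slice is shorter than expected,
/// or the row group has no page index at all.
fn pages_per_column(
    offset_index: Option<&[Option<OffsetIndex>]>,
    num_columns: usize,
) -> Vec<Option<usize>> {
    (0..num_columns)
        .map(|idx| {
            offset_index
                .and_then(|columns| columns.get(idx))
                .and_then(|leaf| leaf.as_ref())
                .map(|index| index.page_count)
        })
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Partial index: the second column chunk has no offset index.
    #[test]
    fn none_at_page_index_leaf() {
        let columns = vec![Some(OffsetIndex { page_count: 3 }), None];
        let result = pages_per_column(Some(columns.as_slice()), 2);
        assert_eq!(result, vec![Some(3), None]);
    }

    /// Missing index: no page index was loaded for the row group.
    #[test]
    fn none_for_top_level_page_index() {
        assert_eq!(pages_per_column(None, 2), vec![None, None]);
    }
}
```

Both cases return an ordinary value rather than panicking, which is the property such tests are meant to pin down.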