Page index partial missing behavior #8892

joe-ucp · 2025-11-20T19:01:39Z

Page index behavior: handle partial / missing indexes without panics

Goal

Ensure Parquet page indexes behave correctly when some or all columns lack indexes by:

Handling None at the leaf of ParquetColumnIndex and ParquetOffsetIndex as “no index for this column chunk”
Avoiding panics in normal usage when indexes are legitimately missing
Adding tests that lock in the expected behavior for partial and fully missing page indexes

Which issue does this PR close?

Closes #8818.

Rationale for this change

With the earlier PRs:

Page index types now model per-column optionality as Vec<Vec<Option<...>>>
Column index semantics are normalized so missing indexes use None

However, several call sites and higher-level APIs still effectively assumed that page indexes were present for all requested columns once page index reading was enabled. This could lead to:

unwrap() / as_ref().unwrap() panics when a specific column chunk had no index
Confusing behavior when some columns had page indexes and others did not
No explicit tests covering these partial- and no-index cases

This PR focuses on making the runtime behavior robust for these scenarios and documenting it via tests.

What changes are included in this PR?

Safer handling of per-column `Option` page indexes

Update page-iterator, selection, and in-memory row-group code paths to:
- Treat None in ParquetOffsetIndex[row_group][column] or ParquetColumnIndex[row_group][column] as “no page index for this column chunk”
- Fall back to non-indexed behavior (e.g., scanning full columns or skipping page-level pruning) instead of panicking
Where expect(...) is still used, it is reserved for true internal invariants such as:
- “page index should be present when page index reading is explicitly enabled and this code path is only constructed under that assumption”
- These invariants are now spelled out in error messages to make assumptions explicit.

Row selection and read-plan behavior

RowSelection and related selection helpers are updated to:
- Take &[Option<OffsetIndexMetaData>] where appropriate
- Skip page-level pruning for columns whose offset index is None
- Respect the current RowSelectionPolicy while avoiding panics for missing indexes

In-memory row group and async/arrow readers

InMemoryRowGroup now:
- Uses Option<&[Option<OffsetIndexMetaData>]> for offset indexes
- Uses page indexes where present to compute sparse read ranges
- Falls back to reading the entire column when no offset index is present, instead of unconditionally indexing into offset_index[idx]
Async and synchronous readers are updated to:
- Assume offset/column indexes may be missing on a per-column basis
- Still assert the invariants they depend on when page index usage is explicitly requested (e.g., in tests that are specifically about index behavior)

`PageIndexPolicy` behavior

With per-column Option modeling, parse_offset_index now:
- Builds a Vec<Vec<Option<OffsetIndexMetaData>>> so missing offset indexes are represented as None
- Avoids invalidating all page indexes when some columns are missing indexes under non-Required policies
The practical effect is:
- For optional page index use, partial presence no longer forces “invalidate everything”
- Call sites must and do handle per-column None in a defined, non-panicking way

Tests for partial and missing page indexes

Add focused tests (e.g., in parquet/tests/arrow_reader/statistics.rs) that cover:
1. All columns have page indexes (baseline behavior)
2. Some columns have column indexes while others do not:
  - Verify that metadata reading and page-index-based logic do not panic
  - Verify that the “missing” columns are represented as None at the page index leaf
3. No page indexes at all (page indexes disabled):
  - Verify that metadata reports None for top-level page indexes
  - Verify that reading still works without panics and falls back to non-indexed behavior
These tests document expected behavior and guard against future regressions.

Are these changes tested?

Yes.

All existing parquet tests continue to pass.
New and updated tests are included to explicitly cover:
- Partial column index presence

Goal: Introduce Vec<Vec<Option<...>>> for column & offset indexes and wire it up just enough so everything compiles and tests pass, without changing semantics more than necessary.

Normalize column index semantics: make “no column index for this column chunk” be expressed as None (not Some(NONE)), update consumers, and adjust tests accordingly.

etseidl · 2025-11-20T19:12:43Z

depends on #8894 and #8893

alamb · 2025-11-23T14:49:58Z

depends on #8894 and #8893

marking as draft to make sure the review queue makes sense

joe-ucp added 3 commits November 20, 2025 13:27

page-index-types

5dbdb49

Goal: Introduce Vec<Vec<Option<...>>> for column & offset indexes and wire it up just enough so everything compiles and tests pass, without changing semantics more than necessary.

page-index-column-semantics

79f0495

Normalize column index semantics: make “no column index for this column chunk” be expressed as None (not Some(NONE)), update consumers, and adjust tests accordingly.

page-index-partial-missing

95c109f

github-actions bot added the parquet Changes to the parquet crate label Nov 20, 2025

etseidl added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Nov 20, 2025

alamb marked this pull request as draft November 23, 2025 14:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Page index partial missing behavior #8892

Page index partial missing behavior #8892

Uh oh!

joe-ucp commented Nov 20, 2025

Uh oh!

etseidl commented Nov 20, 2025

Uh oh!

alamb commented Nov 23, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Page index partial missing behavior #8892

Are you sure you want to change the base?

Page index partial missing behavior #8892

Uh oh!

Conversation

joe-ucp commented Nov 20, 2025

Page index behavior: handle partial / missing indexes without panics

Goal

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Safer handling of per-column Option page indexes

Row selection and read-plan behavior

In-memory row group and async/arrow readers

PageIndexPolicy behavior

Tests for partial and missing page indexes

Are these changes tested?

Uh oh!

etseidl commented Nov 20, 2025

Uh oh!

alamb commented Nov 23, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Safer handling of per-column `Option` page indexes

`PageIndexPolicy` behavior