Skip to content

Conversation

@joe-ucp
Copy link

@joe-ucp joe-ucp commented Nov 20, 2025

Page index behavior: handle partial / missing indexes without panics


Goal

Ensure Parquet page indexes behave correctly when some or all columns lack indexes by:

  • Handling None at the leaf of ParquetColumnIndex and ParquetOffsetIndex as “no index for this column chunk”
  • Avoiding panics in normal usage when indexes are legitimately missing
  • Adding tests that lock in the expected behavior for partial and fully missing page indexes

Which issue does this PR close?

Closes #8818.


Rationale for this change

With the earlier PRs:

  • Page index types now model per-column optionality as Vec<Vec<Option<...>>>
  • Column index semantics are normalized so missing indexes use None

However, several call sites and higher-level APIs still effectively assumed that page indexes were present for all requested columns once page index reading was enabled. This could lead to:

  • unwrap() / as_ref().unwrap() panics when a specific column chunk had no index
  • Confusing behavior when some columns had page indexes and others did not
  • No explicit tests covering these partial- and no-index cases

This PR focuses on making the runtime behavior robust for these scenarios and documenting it via tests.


What changes are included in this PR?

Safer handling of per-column Option page indexes

  • Update page-iterator, selection, and in-memory row-group code paths to:

    • Treat None in ParquetOffsetIndex[row_group][column] or ParquetColumnIndex[row_group][column] as “no page index for this column chunk”
    • Fall back to non-indexed behavior (e.g., scanning full columns or skipping page-level pruning) instead of panicking
  • Where expect(...) is still used, it is reserved for true internal invariants such as:

    • “page index should be present when page index reading is explicitly enabled and this code path is only constructed under that assumption”
    • These invariants are now spelled out in error messages to make assumptions explicit.

Row selection and read-plan behavior

  • RowSelection and related selection helpers are updated to:
    • Take &[Option<OffsetIndexMetaData>] where appropriate
    • Skip page-level pruning for columns whose offset index is None
    • Respect the current RowSelectionPolicy while avoiding panics for missing indexes

In-memory row group and async/arrow readers

  • InMemoryRowGroup now:

    • Uses Option<&[Option<OffsetIndexMetaData>]> for offset indexes
    • Uses page indexes where present to compute sparse read ranges
    • Falls back to reading the entire column when no offset index is present, instead of unconditionally indexing into offset_index[idx]
  • Async and synchronous readers are updated to:

    • Assume offset/column indexes may be missing on a per-column basis
    • Still assert the invariants they depend on when page index usage is explicitly requested (e.g., in tests that are specifically about index behavior)

PageIndexPolicy behavior

  • With per-column Option modeling, parse_offset_index now:

    • Builds a Vec<Vec<Option<OffsetIndexMetaData>>> so missing offset indexes are represented as None
    • Avoids invalidating all page indexes when some columns are missing indexes under non-Required policies
  • The practical effect is:

    • For optional page index use, partial presence no longer forces “invalidate everything”
    • Call sites must and do handle per-column None in a defined, non-panicking way

Tests for partial and missing page indexes

  • Add focused tests (e.g., in parquet/tests/arrow_reader/statistics.rs) that cover:
    1. All columns have page indexes (baseline behavior)
    2. Some columns have column indexes while others do not:
      • Verify that metadata reading and page-index-based logic do not panic
      • Verify that the “missing” columns are represented as None at the page index leaf
    3. No page indexes at all (page indexes disabled):
      • Verify that metadata reports None for top-level page indexes
      • Verify that reading still works without panics and falls back to non-indexed behavior
  • These tests document expected behavior and guard against future regressions.

Are these changes tested?

Yes.

  • All existing parquet tests continue to pass.
  • New and updated tests are included to explicitly cover:
    • Partial column index presence

Goal: Introduce Vec<Vec<Option<...>>> for column & offset indexes and wire it up just enough so everything compiles and tests pass, without changing semantics more than necessary.
Normalize column index semantics: make “no column index for this column chunk” be expressed as None (not Some(NONE)), update consumers, and adjust tests accordingly.
@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 20, 2025
@etseidl etseidl added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Nov 20, 2025
@etseidl
Copy link
Contributor

etseidl commented Nov 20, 2025

depends on #8894 and #8893

@alamb
Copy link
Contributor

alamb commented Nov 23, 2025

depends on #8894 and #8893

marking as draft to make sure the review queue makes sense

@alamb alamb marked this pull request as draft November 23, 2025 14:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Explore changing the form of the Parquet page indexes in ParquetMetaData

3 participants