Skip to content

Conversation

@joe-ucp
Copy link

@joe-ucp joe-ucp commented Nov 20, 2025

Page index column semantics: use None instead of ColumnIndexMetaData::NONE


Goal

Normalize column page index semantics so that missing column indexes are represented as None at the leaf of ParquetColumnIndex, treating ColumnIndexMetaData::NONE as a legacy/special marker instead of the general “no index” representation.


Which issue does this PR close?

Part of #8818.


Rationale for this change

After introducing:

pub type ParquetColumnIndex = Vec<Vec<Option<ColumnIndexMetaData>>>;

there were still code paths that used Some(ColumnIndexMetaData::NONE) to represent “no column index for this column chunk”. That left us with two parallel representations for “missing”:

  • None
  • Some(ColumnIndexMetaData::NONE)

This is confusing for callers and undermines the advantages of modeling optionality with Option. The more idiomatic and type-safe approach is:

  • Use None to express “no index for this (row_group, column)”
  • Use Some(ColumnIndexMetaData::...) only when an actual index is present
  • Treat ColumnIndexMetaData::NONE as a legacy/special-case sentinel, not the general “missing index” marker

This PR updates parsing, writing, and consumers to align with that model.


What changes are included in this PR?

  • Parser semantics for column indexes

    • parse_column_index now sets:
      • ParquetColumnIndex[row_group][column] = None when there is no column index range for that column chunk
      • Some(ColumnIndexMetaData::...) when column index data is present
    • This replaces the previous behavior that could return Some(ColumnIndexMetaData::NONE) in these cases.
  • Writer / metadata plumbing

    • Ensure writer paths and internal builders only produce Some(ColumnIndexMetaData::...) when a column index actually exists.
    • When no column index is produced, the leaf is set to None instead of Some(NONE).
  • ColumnIndexMetaData::NONE documentation and usage

    • Document ColumnIndexMetaData::NONE as a legacy marker and clarify it must NOT be used to represent missing column indexes in ParquetColumnIndex; None on the outer Option should be used instead.
    • Update consumers (e.g., in arrow_reader/statistics.rs) to:
      • Treat None as “no column index for this column chunk”
      • Treat Some(ci) as “column index present”
      • Only special-case ColumnIndexMetaData::NONE where it has specific legacy meaning for page-level stats, not as the general “no index” signal.
  • Consumers and statistics logic

    • Update iterator macros and helpers to work with Iterator<Item = (usize, &'a Option<ColumnIndexMetaData>)> instead of the non-optional type.
    • For statistics:
      • None → fill with len None entries (no stats for that chunk)
      • Some(ColumnIndexMetaData::...) → derive min/max/null_count as before
    • This ensures:
      • Column-index-based pruning is skipped when None is encountered
      • The code no longer relies on ColumnIndexMetaData::NONE as the primary “missing” signal
  • Tests

    • Update tests that previously expected Some(ColumnIndexMetaData::NONE) to now expect None for missing column indexes.
    • Preserve existing expectations where ColumnIndexMetaData::NONE has specific meaning, but not as the default representation of “no index”.

Are these changes tested?

Yes.

  • Existing parquet tests have been updated where necessary and are passing.
  • Page-index-related tests (including statistics and reader tests) are updated to assert the new semantics and are passing.

Example commands:

cargo test --package parquet --lib
cargo test --package parquet --test arrow_reader -- test_page_index* # or equivalent page-index tests


Are there any user-facing changes?

Yes, in terms of how missing column indexes are surfaced:

  • Callers that inspect ParquetColumnIndex should now treat:
    • None at the leaf as “no column index for this column chunk”
    • Some(ColumnIndexMetaData::...) as “column index present”
  • Any downstream code relying on Some(ColumnIndexMetaData::NONE) as a missing-index marker will need to be updated to check for None instead.

The on-disk Parquet encoding remains unchanged.

Most observable differences should be:

  • More consistent representation of missing column indexes
  • Fewer surprises around mixed None / Some(NONE) representations
  • No new panics introduced by this normalization

Goal: Introduce Vec<Vec<Option<...>>> for column & offset indexes and wire it up just enough so everything compiles and tests pass, without changing semantics more than necessary.
Normalize column index semantics: make “no column index for this column chunk” be expressed as None (not Some(NONE)), update consumers, and adjust tests accordingly.
@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 20, 2025
@etseidl etseidl added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Nov 20, 2025
@etseidl
Copy link
Contributor

etseidl commented Nov 20, 2025

depends on #8894

@etseidl etseidl marked this pull request as draft November 24, 2025 01:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants