Page index column semantics #8893
Draft
Page index column semantics: use `None` instead of `ColumnIndexMetaData::NONE`
Goal
Normalize column page index semantics so that missing column indexes are represented as `None` at the leaf of `ParquetColumnIndex`, treating `ColumnIndexMetaData::NONE` as a legacy/special marker instead of the general “no index” representation.
Which issue does this PR close?
Part of #8818.
Rationale for this change
After introducing:
there were still code paths that used `Some(ColumnIndexMetaData::NONE)` to represent “no column index for this column chunk”. That left us with two parallel representations for “missing”:
- `None`
- `Some(ColumnIndexMetaData::NONE)`
This is confusing for callers and undermines the advantages of modeling optionality with `Option`. The more idiomatic and type-safe approach is:
- `None` to express “no index for this (row_group, column)”
- `Some(ColumnIndexMetaData::...)` only when an actual index is present
- `ColumnIndexMetaData::NONE` as a legacy/special-case sentinel, not the general “missing index” marker
This PR updates parsing, writing, and consumers to align with that model.
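The model described above can be sketched in a few lines. The types below are simplified stand-ins written for illustration, not the real `parquet` crate definitions: the `INT32` payload, the nested-`Vec` shape of `ParquetColumnIndex`, and the helper names `normalize` and `normalize_all` are all assumptions.

```rust
// Simplified stand-ins for illustration only; NOT the real parquet crate types.
#[allow(non_camel_case_types, dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ColumnIndexMetaData {
    /// Legacy sentinel; in the new model it must not encode "missing".
    NONE,
    /// Hypothetical variant carrying per-page minimum values.
    INT32(Vec<i32>),
}

/// Assumed shape: indexed as [row_group][column].
/// `None` at the leaf means "no column index for this column chunk".
type ParquetColumnIndex = Vec<Vec<Option<ColumnIndexMetaData>>>;

/// Collapse the legacy `Some(NONE)` spelling into `None`.
fn normalize(leaf: Option<ColumnIndexMetaData>) -> Option<ColumnIndexMetaData> {
    match leaf {
        Some(ColumnIndexMetaData::NONE) => None,
        other => other,
    }
}

/// Apply `normalize` to every (row_group, column) leaf.
fn normalize_all(index: ParquetColumnIndex) -> ParquetColumnIndex {
    index
        .into_iter()
        .map(|row_group| row_group.into_iter().map(normalize).collect())
        .collect()
}
```

With these stand-ins, a leaf holding `Some(ColumnIndexMetaData::NONE)` comes back as `None`, while a leaf with real index data passes through unchanged, so callers only ever need to check the outer `Option`.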
What changes are included in this PR?
Parser semantics for column indexes
`parse_column_index` now sets:
- `ParquetColumnIndex[row_group][column] = None` when there is no column index range for that column chunk
- `Some(ColumnIndexMetaData::...)` when column index data is present
`Some(ColumnIndexMetaData::NONE)` is no longer produced in these cases.
Writer / metadata plumbing
- `Some(ColumnIndexMetaData::...)` when a column index actually exists.
- `None` instead of `Some(NONE)`.
`ColumnIndexMetaData::NONE` documentation and usage
- Document `ColumnIndexMetaData::NONE` as a legacy marker and clarify it must NOT be used to represent missing column indexes in `ParquetColumnIndex`; `None` on the outer `Option` should be used instead.
- Update call sites (`arrow_reader/statistics.rs`) to:
  - treat `None` as “no column index for this column chunk”
  - treat `Some(ci)` as “column index present”
  - handle `ColumnIndexMetaData::NONE` only where it has specific legacy meaning for page-level stats, not as the general “no index” signal.
Consumers and statistics logic
- `Iterator<Item = (usize, &'a Option<ColumnIndexMetaData>)>` instead of the non-optional type.
- `None` → fill with `len` `None` entries (no stats for that chunk)
- `Some(ColumnIndexMetaData::...)` → derive min/max/null_count as before
- `None` is encountered
- `ColumnIndexMetaData::NONE` as the primary “missing” signal
Tests
- Updated tests that previously expected `Some(ColumnIndexMetaData::NONE)` to now expect `None` for missing column indexes.
- `ColumnIndexMetaData::NONE` has specific meaning, but not as the default representation of “no index”.
Are these changes tested?
Yes.
Example commands:
cargo test --package parquet --lib
cargo test --package parquet --test arrow_reader -- test_page_index # substring filter; or equivalent page-index tests
Are there any user-facing changes?
Yes, in terms of how missing column indexes are surfaced:
- Code that consumes `ParquetColumnIndex` should now treat:
  - `None` at the leaf as “no column index for this column chunk”
  - `Some(ColumnIndexMetaData::...)` as “column index present”
- Code that relied on `Some(ColumnIndexMetaData::NONE)` as a missing-index marker will need to be updated to check for `None` instead.
The on-disk Parquet encoding remains unchanged.
Most observable differences should be:
- `None` / `Some(NONE)` representations
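For callers, the migration mostly amounts to matching on the outer `Option`. The sketch below illustrates the statistics behaviour described under “Consumers and statistics logic”, using simplified stand-in types rather than the real `parquet` crate definitions. The names `page_mins` and `has_column_index` and the `INT32 { mins }` payload are hypothetical, and treating a stray `Some(NONE)` the same as `None` is a defensive assumption, not something this PR specifies.

```rust
// Simplified stand-ins for illustration only; NOT the real parquet crate types.
#[allow(non_camel_case_types, dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ColumnIndexMetaData {
    /// Legacy sentinel.
    NONE,
    /// Hypothetical variant: one optional minimum per page.
    INT32 { mins: Vec<Option<i32>> },
}

/// Transitional check: a chunk has an index only if the leaf is `Some`
/// and is not the legacy sentinel (assumption: tolerate old `Some(NONE)`).
fn has_column_index(leaf: &Option<ColumnIndexMetaData>) -> bool {
    !matches!(leaf, None | Some(ColumnIndexMetaData::NONE))
}

/// Collect per-page minimums across column chunks.
/// Each item is `(len, index)`, where `len` is the number of entries
/// to emit when the chunk has no column index.
fn page_mins<'a, I>(chunks: I) -> Vec<Option<i32>>
where
    I: Iterator<Item = (usize, &'a Option<ColumnIndexMetaData>)>,
{
    let mut out = Vec::new();
    for (len, index) in chunks {
        match index {
            // No index for this chunk: fill with `len` `None` entries.
            None | Some(ColumnIndexMetaData::NONE) => {
                out.extend(std::iter::repeat(None).take(len));
            }
            // Index present: derive the statistics from it.
            Some(ColumnIndexMetaData::INT32 { mins }) => {
                out.extend(mins.iter().copied());
            }
        }
    }
    out
}
```

Code that previously compared against `Some(ColumnIndexMetaData::NONE)` would, under this sketch, switch to a check like `has_column_index` or a direct `is_none()` on the leaf.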