Skip to content

19848: Support reverse page ordering in sort pushdown phase 1#191

Open
martin-augment wants to merge 6 commits intomainfrom
pr-19848-2026-01-16-14-19-29
Open

19848: Support reverse page ordering in sort pushdown phase 1#191
martin-augment wants to merge 6 commits intomainfrom
pr-19848-2026-01-16-14-19-29

Conversation

@martin-augment
Copy link
Owner

19848: To review by AI

- Add reverse_pages flag to ParquetSource and ParquetOpener
- Wire reverse_pages through try_reverse_output() alongside reverse_row_groups
- Extend try_reverse_output() to set both flags when optimizing descending sorts
- Add comprehensive test coverage for reverse_pages functionality
- Align with existing reverse_row_groups infrastructure pattern

Phase 1 establishes the foundation for page-level reverse ordering. Actual
page-level reversal implementation will be completed in Phase 2 when arrow-rs
provides page-level API support.

Tests:
- All 27 reverse-related tests pass
- Added 4 new tests for reverse_pages flag
- Verified backward compatibility

Closes apache#19486
This attribute suppresses the dead code warning for Phase 1 implementation.
The method will be used in Phase 2 when actual page reversal is implemented.
All test snapshots now reflect that reverse_pages flag is set
alongside reverse_row_groups when optimizing descending sorts.
All EXPLAIN plan outputs now reflect that reverse_pages flag is set
alongside reverse_row_groups when optimizing descending sorts.

Updated test files:
- sort_pushdown.slt (18 instances)
- topk.slt (1 instance)
- create_external_table.slt (1 instance)
@coderabbitai
Copy link

coderabbitai bot commented Jan 16, 2026

Walkthrough

The changes introduce a new reverse_pages boolean feature flag across the Parquet data source infrastructure to support page-level reverse scanning. This includes adding a reverse_pages field to ParquetOpener and ParquetSource, introducing builder-style setter methods, and threading the flag through the Parquet configuration path. The try_reverse_output method is updated to apply reverse_pages alongside reverse_row_groups for sort pushdown optimization. Test expectations in physical plan outputs are updated to reflect reverse_pages=true alongside existing reverse_row_groups=true assertions. A test case is added to verify that combining IF NOT EXISTS with REPLACE in CREATE EXTERNAL TABLE is not permitted.

✨ Finishing touches
  • 📝 Generate docstrings

🧹 Recent nitpick comments
datafusion/sqllogictest/test_files/create_external_table.slt (1)

301-306: New test case appears unrelated to the PR's main objective.

This test validates that IF NOT EXISTS cannot coexist with REPLACE in CREATE EXTERNAL TABLE statements. While the test itself is valid, it seems unrelated to the reverse_pages feature being introduced in this PR.

Consider moving this test to a separate commit or PR for cleaner change tracking, unless there's a specific reason it's bundled here.


📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33c4ab7 and 38f95e8.

📒 Files selected for processing (6)
  • datafusion/core/tests/physical_optimizer/pushdown_sort.rs
  • datafusion/datasource-parquet/src/opener.rs
  • datafusion/datasource-parquet/src/source.rs
  • datafusion/sqllogictest/test_files/create_external_table.slt
  • datafusion/sqllogictest/test_files/sort_pushdown.slt
  • datafusion/sqllogictest/test_files/topk.slt
🧰 Additional context used
🧬 Code graph analysis (2)
datafusion/datasource-parquet/src/opener.rs (1)
datafusion/datasource-parquet/src/source.rs (1)
  • with_reverse_pages (487-490)
datafusion/datasource-parquet/src/source.rs (1)
datafusion/datasource-parquet/src/opener.rs (4)
  • with_reverse_pages (1154-1157)
  • new (775-786)
  • new (848-859)
  • new (1072-1096)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: Analyze (rust)
🔇 Additional comments (15)
datafusion/sqllogictest/test_files/topk.slt (1)

343-343: LGTM!

The test expectation correctly includes both reverse_row_groups=true and reverse_pages=true in the DataSourceExec output, consistent with the Phase 1 changes where try_reverse_output() sets both flags for descending sort optimizations.

datafusion/core/tests/physical_optimizer/pushdown_sort.rs (1)

21-25: LGTM!

Documentation correctly updated to reflect that Phase 1 enables both reverse_row_groups=true and reverse_pages=true flags. The test expectations throughout the file consistently include both flags in the DataSourceExec output, properly verifying the sort pushdown optimization behavior.

datafusion/sqllogictest/test_files/create_external_table.slt (1)

267-267: LGTM!

The test expectation correctly updated to include both reverse_row_groups=true and reverse_pages=true in the physical plan output for the descending sort query.

datafusion/sqllogictest/test_files/sort_pushdown.slt (2)

37-46: LGTM!

Test 1.1 correctly updated with comments and expectations reflecting both reverse_row_groups=true and reverse_pages=true flags for descending sort optimization.


103-158: LGTM!

Test 1.6 comments and expectations properly updated. The comment at line 104 accurately describes that when reverse scanning is enabled, both reverse_row_groups=true and reverse_pages=true are set, and the RowSelection is properly reversed.

datafusion/datasource-parquet/src/opener.rs (4)

123-127: LGTM!

The reverse_pages field is properly added as Phase 1 infrastructure with:

  • Clear documentation explaining its purpose
  • #[expect(dead_code)] attribute to suppress warnings until Phase 2 implementation
  • Consistent visibility (pub) with other fields in the struct

The field establishes the foundation for page-level reverse ordering without affecting current runtime behavior.


1066-1067: LGTM!

Test builder correctly initialized with reverse_pages: false default, consistent with the existing reverse_row_groups field initialization pattern.

Also applies to: 1093-1094


1152-1158: LGTM!

The with_reverse_pages builder method is properly implemented with #[expect(dead_code)] to suppress warnings until Phase 2. This follows the same pattern as the main struct field and provides test infrastructure for future use.


1209-1209: LGTM!

The reverse_pages field is correctly propagated from the builder to the ParquetOpener instance in the build() method, maintaining consistency with other field assignments.

datafusion/datasource-parquet/src/source.rs (6)

291-327: LGTM — reverse_pages field and default initialization look consistent.
Clear docs and a safe default (false) match existing reverse_row_groups semantics.


482-500: LGTM — builder + test accessor are straightforward.
The builder pattern mirrors existing flags, and the cfg(test) accessor keeps production API surface tight.


587-589: LGTM — reverse_pages is correctly propagated into ParquetOpener.
This keeps the config path consistent with reverse_row_groups.


647-649: LGTM — plan formatting now reports reverse_pages when enabled.
This improves plan visibility without changing default output.


829-834: LGTM — try_reverse_output correctly flips both reverse flags.
This is consistent with the phase‑1 “inexact” semantics and keeps the optimization behavior explicit.


947-1001: LGTM — reverse_pages tests cover default, setter, clone, and independence.
Good coverage for the new flag’s basic behavior.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link

claude bot commented Jan 16, 2026

Code Review: Support reverse page ordering in sort pushdown phase 1

Summary

This PR adds infrastructure for reverse page ordering in Parquet sort pushdown optimization (Phase 1). The implementation adds a reverse_pages flag that mirrors the existing reverse_row_groups flag and wires it through the ParquetSource and ParquetOpener.


✅ Strengths

1. Well-structured Phase 1 approach

The PR clearly separates infrastructure setup from actual implementation, which is a sensible engineering approach. The code includes clear documentation noting that actual page-level reversal will be implemented when arrow-rs provides the necessary API support.

2. Comprehensive test coverage

  • Added 4 new unit tests covering default value, setter functionality, clone preservation, and independence from reverse_row_groups
  • Updated 27 existing test snapshots to reflect the new flag
  • All tests follow existing patterns and are well-structured

3. Consistent with existing patterns

The implementation follows the same pattern as reverse_row_groups:

  • Similar builder methods
  • Consistent display formatting in query plans
  • Parallel structure in both ParquetSource and ParquetOpener

4. Good documentation

Comments clearly explain the phase 1 nature and future plans:

/// Note: This is infrastructure for phase 1; actual page reversal may be implemented
/// in future phases depending on arrow-rs capabilities.

🔍 Issues & Concerns

1. Potential logic issue: Setting flags independently ⚠️ Medium Priority

Location: datafusion/datasource-parquet/src/source.rs:831-834

The test test_reverse_pages_independent_of_reverse_row_groups verifies that reverse_pages can be set without reverse_row_groups. However, in try_reverse_output(), both flags are always set together:

let new_source = self
    .clone()
    .with_reverse_row_groups(true)
    .with_reverse_pages(true);

Questions:

  • Is it semantically valid to reverse pages without reversing row groups?
  • Would this produce incorrect results when eventually implemented?
  • If they should always be set together, should the test be removed or should there be validation to prevent invalid states?

Recommendation: Either:

  1. Add a comment explaining why the test allows this combination even though it won't be used in practice, OR
  2. Add validation to prevent invalid flag combinations, OR
  3. Remove the test if the combination is not meaningful

2. Dead code attributes may hide issues ⚠️ Low Priority

Location:

  • datafusion/datasource-parquet/src/opener.rs:125
  • datafusion/datasource-parquet/src/opener.rs:1153

The #[expect(dead_code)] attributes suppress warnings for unused code. While this makes sense for Phase 1, it could hide issues if:

  • The field is accidentally not propagated through some code path
  • Future refactoring breaks the plumbing

Recommendation: Consider adding TODO comments referencing the Phase 2 work or the related issue number (e.g., // TODO: Phase 2 - Use in actual page reversal implementation #19486)

3. Missing field documentation in ParquetOpener ℹ️ Documentation

Location: datafusion/datasource-parquet/src/opener.rs:123-126

The reverse_pages field in ParquetOpener has a comment, but it's less detailed than in ParquetSource. For consistency and maintainability, the documentation should be equally thorough in both places.

Current:

/// Whether to read pages in reverse order within row groups
/// Phase 1: Infrastructure flag for future page-level reverse support

Suggested:

/// If true, read pages within row groups in reverse order.
/// Used in sort pushdown phase 1 to optimize ORDER BY ... DESC on sorted data.
/// Note: This is infrastructure for phase 1; actual page reversal implementation
/// will be completed in Phase 2 when arrow-rs provides page-level API support.

4. Display format consistency ℹ️ Minor

The display formatting is good, but I notice both flags are shown independently. Consider if there's value in showing them as a combined concept (e.g., reverse_scan=true (row_groups+pages)) for better UX in query plans. This is purely a suggestion and the current approach is also valid.


🔒 Security Considerations

✅ No security concerns identified. The changes are:

  • Internal optimization flags with no user input
  • No new attack surfaces
  • No credential handling or sensitive data exposure

⚡ Performance Considerations

No performance impact in Phase 1 - The flag is set but not used in actual execution, so there's no runtime overhead.

📋 Future consideration: When Phase 2 is implemented, ensure that:

  • Page-level reversal doesn't negate row group statistics benefits
  • Memory usage remains bounded when reversing pages
  • The interaction between page filtering and reversal is efficient

🧪 Test Coverage

Excellent test coverage:

  • 4 new unit tests for the reverse_pages flag
  • 27 updated snapshot tests showing the flag in query plans
  • Tests cover all basic operations: default value, setter, clone, independence

⚠️ Potential gap: Consider adding a test that verifies both flags are set together in try_reverse_output(). Example:

#[test]
fn test_try_reverse_output_sets_both_flags() {
    use arrow::datatypes::{Schema, Field, DataType};
    
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int32, false),
    ]));
    let source = ParquetSource::new(schema);
    
    // Simulate try_reverse_output() behavior
    let reversed = source.clone()
        .with_reverse_row_groups(true)
        .with_reverse_pages(true);
    
    assert!(reversed.reverse_row_groups());
    assert!(reversed.reverse_pages());
}

📝 Code Quality

Style & Conventions

✅ Follows Rust and DataFusion conventions
✅ Consistent naming and structure
✅ Proper use of #[cfg(test)] for test-only methods
✅ Good use of builder pattern

Potential Improvements

  1. Consider adding a helper method to set both flags together since they're always used together:

    pub(crate) fn with_reverse_scan(mut self, enable: bool) -> Self {
        self.reverse_row_groups = enable;
        self.reverse_pages = enable;
        self
    }

    This would make the intent clearer and reduce the risk of forgetting to set both flags.

  2. Comment in TODO (line 843) could be more specific about Phase 2 scope:

    // TODO Phase 2: Add support for:
    // - Actual page-level reversal implementation (requires arrow-rs API support)
    // - File reordering based on min/max statistics
    // - Detection of exact ordering (return Exact to remove Sort operator)
    // - Partial sort pushdown for prefix matches

🎯 Summary & Recommendation

Verdict: ✅ Approve with minor suggestions

This is a well-implemented Phase 1 foundation for page-level reverse ordering. The code is clean, well-tested, and follows existing patterns. The main concern is clarifying the semantic validity of setting reverse_pages independently of reverse_row_groups.

Action Items (Optional improvements):

  1. Address the flag independence question (see Issue 18481: feat: Add selectivity metric to NestedLoopJoinExec for EXPLAIN ANALYZE #1 above)
  2. Add TODO comments with issue references for Phase 2 work
  3. Consider adding a helper method to set both flags together
  4. Enhance documentation consistency between ParquetOpener and ParquetSource

Merge Readiness

The PR is technically sound and ready to merge once the flag independence question is clarified. The current implementation is safe since the flags aren't used in actual execution yet.


Great work on the implementation! The phased approach and thorough testing demonstrate good engineering practices. 🎉

@augmentcode
Copy link

augmentcode bot commented Jan 16, 2026

🤖 Augment PR Summary

Summary: Extends sort pushdown (phase 1) for Parquet scans by introducing a new reverse_pages flag alongside existing reverse_row_groups support.

Changes:

  • Adds reverse_pages plumbing to ParquetSource and propagates it into ParquetOpener
  • Updates DataSourceExec formatting so EXPLAIN plans show reverse_pages=true when enabled
  • Updates Parquet sort-order pushdown to set both reverse_row_groups and reverse_pages for reverse-order requests (still returns Inexact)
  • Adds unit tests validating default/clone/setter behavior for reverse_pages
  • Refreshes physical optimizer and sqllogictest golden outputs to include reverse_pages=true

Technical Notes: This PR appears to be “infrastructure only” for page-level reversal; query plans now advertise the flag even though actual page reversal can be implemented in a later phase (e.g., depending on arrow-rs capabilities).

🤖 Was this summary useful? React with 👍 or 👎

Copy link

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

let new_source = self
.clone()
.with_reverse_row_groups(true)
.with_reverse_pages(true);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverse_pages is now enabled here and surfaced in EXPLAIN, but it doesn’t appear to be consumed in ParquetOpener yet (only reverse_row_groups affects the prepared plan), so it currently looks like a plan-only flag. Consider clarifying in the docs/comments that this doesn’t change scan order yet (also applies to other locations in the PR).

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments