Skip to content

Conversation

@martin-g
Copy link
Collaborator

@martin-g martin-g commented Oct 28, 2025

8496: To review by AI


Note

Adds Arrow-hinted handling for repeated Parquet fields as lists (incl. nested), updates field-id propagation to list containers, and introduces extensive list/map compatibility tests.

  • Schema conversion (parquet → arrow):
    • Treat repeated primitives/structs as Arrow lists using embedded Arrow schema hints via treat_repeated_as_list_arrow_hint.
    • New ParquetField::into_list_with_arrow_list_hint to build List/LargeList/FixedSizeList from hints.
    • Update visitors (visit_primitive, visit_struct, visit_list, visit_map) to unwrap list hints, validate types, and construct appropriate Arrow types.
    • convert_field now takes &ParquetField and supports add_field_id to control metadata propagation.
  • Field ID metadata:
    • Only attach Parquet field IDs to the outer list container for repeated fields, not the element.
  • Tests:
    • Add comprehensive cases covering backward-compatible LIST/MAP encodings, nested repeated structures, inferred Arrow schema overrides (e.g., Binary→Utf8), and field-id preservation.

Written by Cursor Bugbot for commit 4d7485a. This will update automatically on new commits. Configure here.

@coderabbitai
Copy link

coderabbitai bot commented Oct 28, 2025

Walkthrough

Enhanced Parquet to Arrow schema conversion with hint-driven logic for repeated types and lists. Introduced a context flag to track list handling behavior, added a new helper function for list conversion, revised the convert_field signature to preserve field IDs, and extended visitor traversal logic to utilize hints when converting repeated structures and nested types.

Changes

Cohort / File(s) Summary
Schema conversion implementation
parquet/src/arrow/schema/complex.rs
Introduced into_list_with_arrow_list_hint helper for converting Parquet fields to Arrow lists using data type hints. Added treat_repeated_as_list_arrow_hint context flag to control repeated type handling. Extended Visitor traversal logic to propagate hints through visit_primitive, visit_struct, visit_map, and visit_list. Revised convert_field signature to accept add_field_id parameter for field ID preservation. Updated convert_schema and convert_type entry points to initialize context with appropriate hint values.
Tests and validation
parquet/src/arrow/schema/complex.rs
Added comprehensive test module validating backward-compatibility scenarios for lists, maps, and nested structures, including round-trip conversions, expected Arrow schema shapes for various Parquet configurations, field ID preservation verification, and tests for inferred schema changes with alternate arrow hints.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-8496-2025-10-28-20-40-29

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. No suggestions at this time.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
parquet/src/arrow/schema/complex.rs (4)

75-121: New into_list_with_arrow_list_hint: solid, matches semantics; a couple of nits

  • Logic correctly copies element metadata (when hinted), keeps element non-nullable, and adds field-id only to the outer list via add_field_id=false. LGTM.
  • Nits:
    • Consider using the crate Result alias for consistency: Result instead of Result<Self, ParquetError>.
    • The list-child extraction pattern appears multiple times across the file; a small helper (e.g., fn list_child(hint: &DataType) -> Option<&Field>) would DRY things up.

313-319: Propagate the repeated-as-list flag instead of forcing true for struct children

Hard-coding treat_repeated_as_list_arrow_hint: true for all struct children may change behavior when convert_type() (which starts with false) traverses nested structs. Prefer propagating the parent flag to avoid surprising conversions.

Apply this diff:

-            let child_ctx = VisitorContext {
+            let child_ctx = VisitorContext {
                 rep_level,
                 def_level,
                 data_type,
-                treat_repeated_as_list_arrow_hint: true,
+                treat_repeated_as_list_arrow_hint: context.treat_repeated_as_list_arrow_hint,
             };

If you intended to always enable list-hint unwrapping under structs, please add a brief comment explaining why, and consider a targeted test for convert_type() on nested repeated fields without hints.


656-694: convert_field: consider applying extension metadata even when a hint is present

In the Some(hint) branch you preserve dict metadata and copy hint metadata, but you don’t run try_add_extension_type. If the hint lacks extension metadata derivable from parquet_type (e.g., logical/extension types), you may miss it.

Suggestion: call try_add_extension_type after merging hint metadata, so parquet-derived extension metadata is still applied unless the hint already specifies it.

Apply this diff:

-        Some(hint) => {
+        Some(hint) => {
             // If the inferred type is a dictionary, preserve dictionary metadata
             #[allow(deprecated)]
-            let field = match (&data_type, hint.dict_id(), hint.dict_is_ordered()) {
+            let field = match (&data_type, hint.dict_id(), hint.dict_is_ordered()) {
                 (DataType::Dictionary(_, _), Some(id), Some(ordered)) =>
                 {
                     #[allow(deprecated)]
                     Field::new_dict(name, data_type, nullable, id, ordered)
                 }
                 _ => Field::new(name, data_type, nullable),
             };
-
-            Ok(field.with_metadata(hint.metadata().clone()))
+            // Merge hint metadata first, then attempt to add extension metadata from parquet_type
+            let merged = field.with_metadata(hint.metadata().clone());
+            try_add_extension_type(merged, parquet_type)
         }

If precedence should always favor the embedded Arrow schema, document that decision and add a test asserting no extension metadata is added when a hint is present.


739-1798: Tests: great coverage; consider a couple of extras

  • Coverage is strong for back-compat lists/maps, nested repeated, field-id placement, and list type (List/LargeList/FixedSizeList) inference.
  • Please add:
    • A negative test asserting the specific error when a repeated field receives a non-list hint.
    • A test for extension metadata behavior when a hint is present vs absent (to lock in the desired precedence).

If you want, I can sketch those tests quickly.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 477bfe4 and 4d7485a.

📒 Files selected for processing (1)
  • parquet/src/arrow/schema/complex.rs (16 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
parquet/src/arrow/schema/complex.rs (2)
arrow-schema/src/field.rs (4)
  • new (192-202)
  • metadata (373-375)
  • metadata (963-967)
  • with_metadata (366-369)
parquet/src/arrow/schema/primitive.rs (1)
  • convert_primitive (27-36)
🔇 Additional comments (14)
parquet/src/arrow/schema/complex.rs (14)

152-158: Context flag docs are clear

The flag and its docs make intent explicit. No code change needed.


201-219: Deriving primitive arrow type from list hints is correct

For REPEATED primitives with a list hint, unwrapping the inner field type before convert_primitive is the right call and surfaces helpful errors on mismatches.

Please confirm apply_hint in convert_primitive tolerates all element-level coercions you expect here (e.g., BYTE_ARRAY→Utf8, Decimal, etc.).


233-236: Choosing into_list_with_arrow_list_hint only when hints apply

Branching to the new list helper only when the flag is set avoids behavior changes for callers that don’t use hints.


253-271: Struct element hint unwrapping looks right

Unwrapping the list hint to a Struct inner type and validating arity catches common schema mismatches early.


321-325: Preserving field IDs at struct children

Passing add_field_id=true here ensures parquet field ids land on Arrow fields at the correct (outer) level. Matches tests.


342-345: Struct→list conversion with hints mirrors primitive path

Consistent with primitive handling; inner element gets no field-id, outer list does.


439-453: Map context propagation: key=false, value=true

Correct: keys can’t be repeated; values may contain repeated structures requiring list-hint handling.


460-467: Field-id placement for map key/value

  • Key: explicitly non-nullable and add_field_id=true.
  • Value: add_field_id=true.
    Both align with spec and your tests.

549-550: List primitive branch: disabling list-hint unwrapping is correct

Inside an explicit LIST, we shouldn’t unwrap a second list level unless the child itself is repeated—this respects the spec.


582-582: List one-tuple/compat branch: also correct to disable unwrapping

Same rationale as above.


602-603: Enable list-hint unwrapping for nested list items

Turning it back on for the item traversal allows nested lists to use LargeList/FixedSizeList hints.


606-608: add_field_id=true for list item field

Ensures field-id is attached to the array field (outer list) not the element; aligns with tests asserting only the list carries the id.


711-716: Top-level: enable list-hint unwrapping

Setting the flag to true at the root matches the PR goal of honoring embedded Arrow list forms.


728-734: convert_type: disable list-hint unwrapping

Correct: when no embedded Arrow schema is provided, don’t unwrap by default.

@martin-augment
Copy link
Owner

cursor review

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no bugs!


@claude
Copy link

claude bot commented Oct 28, 2025

Pull Request Review: Backward Compatible Parquet Schema Conversion

Summary

This PR adds support for converting backward-compatible repeated Parquet fields to Arrow lists using embedded Arrow schema hints. The implementation handles repeated primitives, structs, and nested structures while properly managing field ID metadata propagation.

Code Quality & Best Practices ✅

Strengths:

  • Well-structured code with clear separation of concerns
  • Good documentation with links to the Parquet format specification (lines 78-80)
  • Consistent error handling using Result<T, ParquetError> pattern
  • Proper use of Rust idioms (pattern matching, Arc for shared ownership)
  • Clean function signatures with descriptive parameter names

Suggestions:

  1. Line 90-92: The error message says "should be validated earlier" but validation happens at runtime. Consider adding a debug_assert or moving validation earlier in the call chain:

    Some(_) => return Err(general_err!(
        "Invalid list_data_type: expected List, LargeList, or FixedSizeList variant"
    )),
  2. Lines 201-216: The primitive_arrow_data_type extraction logic is duplicated in visit_struct (lines 253-270). Consider extracting to a helper function:

    fn extract_element_type_from_list(data_type: Option<&DataType>) -> Result<Option<DataType>> {
        match data_type {
            Some(DataType::List(f)) | Some(DataType::LargeList(f)) | Some(DataType::FixedSizeList(f, _)) => {
                Ok(Some(f.data_type().clone()))
            }
            Some(d) => Err(arrow_err!("Expected list type, got {}", d)),
            None => Ok(None),
        }
    }

Potential Bugs & Issues ⚠️

  1. Field ID Propagation (line 100, 658): The add_field_id parameter controls whether field IDs are attached. The logic looks correct but ensure this doesn't break existing code that expects field IDs on elements. Consider adding a test that specifically validates field ID metadata placement.

  2. Nested Repeated Structures: Lines 233-235 and 342-344 handle repeated primitives/structs by wrapping them in lists. Verify that deeply nested repeated structures (e.g., repeated group { repeated group { ... } }) are handled correctly. The test at line 738 covers nested lists but may need more edge cases.

  3. Error Messages: Lines 209-212 and 260-265 have slightly different error messages for the same validation. Ensure consistency:

    // Line 209
    "incompatible arrow schema, expected list got {} for repeated primitive field"
    // Line 260
    "incompatible arrow schema, expected list got {} for repeated struct field"

    These are fine but could share a helper for consistency.

Test Coverage 🧪

Excellent test coverage! The PR includes:

  • Backward-compatible list encodings (lines 771-868)
  • Backward-compatible map encodings (lines 870-962)
  • Nested repeated structures (lines 738-746)
  • Field ID preservation (lines 748-795)
  • Type inference from Arrow schema hints (lines 797-945)
  • All 5 backward compatibility rules from the Parquet spec

Additional test suggestions:

  1. Error path testing: Add tests that verify proper error handling when incompatible Arrow schema hints are provided (e.g., providing a non-list type for a repeated field).

  2. Edge case: Test a schema with mixed repeated and non-repeated fields at the same level:

    message schema {
        repeated int32 repeated_col;
        optional int32 optional_col;
        required int32 required_col;
    }
  3. Deeply nested: Test 3+ levels of nested repeated groups to ensure recursion is stable.

  4. Projection: Verify that the new treat_repeated_as_list_arrow_hint logic works correctly with ProjectionMask filtering.

Performance Considerations 🚀

Positive:

  • No unnecessary allocations in the hot path
  • Appropriate use of references (&self at line 657)
  • Cloning only when necessary

Concerns:

  1. Lines 96-102: convert_field is called during list conversion, which could involve recursive traversal. For deeply nested schemas, this might impact performance. Consider profiling with complex schemas.

  2. VisitorContext cloning: The context is passed by value and contains data_type: Option<DataType>. DataType contains Arc-wrapped fields so cloning is cheap, but verify with benchmarks if this shows up in profiles.

Security Concerns 🔒

No significant security issues identified. The code:

  • Properly validates input schema compatibility (lines 86-94, 204-216)
  • Uses safe Rust idioms with no unsafe blocks
  • Handles errors gracefully without panicking in library code
  • Good separation between schema validation and data processing

Minor note: The error at line 90 mentions "Internal error" which suggests this shouldn't happen. Consider using debug_assert! for debug builds or adding validation earlier.

Architecture & Design 🏗️

Well-designed changes:

  • The treat_repeated_as_list_arrow_hint flag cleanly controls the new behavior
  • Backward compatibility is maintained by only applying the new logic when the flag is set
  • The visitor pattern is used consistently throughout

Suggestions:

  1. Consider documenting when treat_repeated_as_list_arrow_hint should be true vs false. Currently it's set differently in various contexts (lines 318, 439, 451, 581, 602, 732).

  2. The function into_list_with_arrow_list_hint is quite similar to into_list. Consider refactoring to reduce duplication:

    fn into_list(self, name: &str) -> Self {
        self.into_list_with_arrow_list_hint(..., None).unwrap()
    }

Documentation 📝

Good:

  • Function-level documentation with links to spec
  • Clear comments explaining complex logic

Could improve:

  • Add module-level documentation explaining the backward compatibility handling
  • Document the treat_repeated_as_list_arrow_hint field in VisitorContext more thoroughly
  • Add examples in doc comments for into_list_with_arrow_list_hint

Recommendations

Before merging:

  1. Add error path tests for invalid schema hints
  2. Consider extracting the repeated element type extraction logic to reduce duplication
  3. Add a benchmark for schema conversion with deeply nested repeated fields

Nice to have:

  1. Module-level documentation explaining the backward compatibility strategy
  2. Refactor into_list and into_list_with_arrow_list_hint to reduce duplication

Overall Assessment ⭐

This is a high-quality PR that addresses a complex schema compatibility issue with careful attention to detail. The extensive test coverage (1183 additions!) demonstrates thorough validation of the implementation. The code is well-structured, follows Rust best practices, and maintains backward compatibility.

Recommendation: Approve with minor suggestions

The core logic is sound and the test coverage is comprehensive. The suggested improvements are mostly about code organization and additional edge case testing rather than critical issues.

Great work on tackling this complex schema conversion problem! 🎉

@github-actions
Copy link

Review

  • I didn’t spot any functional issues—the new treat_repeated_as_list_arrow_hint flow and the broader test suite cover the tricky repeated-column back-compat cases cleanly.
  • Optional follow-up: add a test that exercises the FixedSizeList arm of into_list_with_arrow_list_hint (parquet/src/arrow/schema/complex.rs:109) so that branch is covered alongside the List/LargeList cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants