fix(parquet): converting parquet schema with backward compatible repeated struct/primitive with provided arrow schema #12
…ated struct/primitive with provided arrow schema closes: - apache#8495
…primitive-with-inferred-schema # Conflicts: # parquet/src/arrow/schema/complex.rs
…primitive-with-inferred-schema
Walkthrough
Enhanced Parquet-to-Arrow schema conversion with hint-driven logic for repeated types and lists. Introduced a context flag to track list handling behavior, added a new helper function for list conversion, revised the …
Changes
Review completed. No suggestions at this time.
Actionable comments posted: 0
🧹 Nitpick comments (4)
parquet/src/arrow/schema/complex.rs (4)
75-121: New into_list_with_arrow_list_hint: solid, matches semantics; a couple of nits
- Logic correctly copies element metadata (when hinted), keeps element non-nullable, and adds field-id only to the outer list via add_field_id=false. LGTM.
- Nits:
- Consider using the crate Result alias for consistency: Result<Self> instead of Result<Self, ParquetError>.
- The list-child extraction pattern appears multiple times across the file; a small helper (e.g., fn list_child(hint: &DataType) -> Option<&Field>) would DRY things up.
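The helper suggested in the second nit could look roughly like the sketch below. The `DataType` and `Field` types here are small stand-ins written for this example so that it compiles on its own; they are not the real arrow-schema types, and only the function name and signature come from the review comment.

```rust
// Stand-in types: a deliberately tiny model of an Arrow data type and field,
// defined here only so the sketch is self-contained (assumption, not the
// real arrow-schema definitions).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Utf8,
    List(Box<Field>),
    LargeList(Box<Field>),
    FixedSizeList(Box<Field>, i32),
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
}

/// Returns the child field of any list-like hint, or `None` when the hint
/// is not a list type.
pub fn list_child(hint: &DataType) -> Option<&Field> {
    match hint {
        DataType::List(child)
        | DataType::LargeList(child)
        | DataType::FixedSizeList(child, _) => Some(&**child),
        _ => None,
    }
}
```

Each call site that currently repeats the three-way match could then branch on `list_child(hint)` and handle the `None` case with its own error.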
313-319: Propagate the repeated-as-list flag instead of forcing true for struct children
Hard-coding treat_repeated_as_list_arrow_hint: true for all struct children may change behavior when convert_type() (which starts with false) traverses nested structs. Prefer propagating the parent flag to avoid surprising conversions.
Apply this diff:
-        let child_ctx = VisitorContext {
+        let child_ctx = VisitorContext {
             rep_level,
             def_level,
             data_type,
-            treat_repeated_as_list_arrow_hint: true,
+            treat_repeated_as_list_arrow_hint: context.treat_repeated_as_list_arrow_hint,
         };

If you intended to always enable list-hint unwrapping under structs, please add a brief comment explaining why, and consider a targeted test for convert_type() on nested repeated fields without hints.
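To make the difference between forcing and propagating the flag concrete, here is a minimal sketch. The struct is a trimmed-down, hypothetical version of `VisitorContext` (the `data_type` hint is left out), and `for_struct_child` is a name invented for this example.

```rust
// Hypothetical, trimmed-down context: the real VisitorContext also carries a
// data_type hint, omitted here to keep the sketch self-contained.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct VisitorContext {
    pub rep_level: i16,
    pub def_level: i16,
    pub treat_repeated_as_list_arrow_hint: bool,
}

impl VisitorContext {
    /// Builds the context for a struct child, inheriting the parent's flag
    /// rather than forcing it to `true`.
    pub fn for_struct_child(&self, rep_level: i16, def_level: i16) -> Self {
        VisitorContext {
            rep_level,
            def_level,
            treat_repeated_as_list_arrow_hint: self.treat_repeated_as_list_arrow_hint,
        }
    }
}
```

With this shape, a traversal that starts with the flag set to `false` keeps it `false` for every nested struct child, which is the behavior the comment argues for.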
656-694: convert_field: consider applying extension metadata even when a hint is present
In the Some(hint) branch you preserve dict metadata and copy hint metadata, but you don’t run try_add_extension_type. If the hint lacks extension metadata derivable from parquet_type (e.g., logical/extension types), you may miss it.
Suggestion: call try_add_extension_type after merging hint metadata, so parquet-derived extension metadata is still applied unless the hint already specifies it.
Apply this diff:
-        Some(hint) => {
+        Some(hint) => {
             // If the inferred type is a dictionary, preserve dictionary metadata
             #[allow(deprecated)]
-            let field = match (&data_type, hint.dict_id(), hint.dict_is_ordered()) {
+            let field = match (&data_type, hint.dict_id(), hint.dict_is_ordered()) {
                 (DataType::Dictionary(_, _), Some(id), Some(ordered)) => {
                     #[allow(deprecated)]
                     Field::new_dict(name, data_type, nullable, id, ordered)
                 }
                 _ => Field::new(name, data_type, nullable),
             };
-
-            Ok(field.with_metadata(hint.metadata().clone()))
+            // Merge hint metadata first, then attempt to add extension metadata from parquet_type
+            let merged = field.with_metadata(hint.metadata().clone());
+            try_add_extension_type(merged, parquet_type)
         }

If precedence should always favor the embedded Arrow schema, document that decision and add a test asserting no extension metadata is added when a hint is present.
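The precedence described in the suggestion (hint metadata wins, Parquet-derived metadata only fills what the hint left unset) can be illustrated with plain string maps. This is a sketch of the merge rule only: `merge_metadata` is a hypothetical helper, and it does not model what `try_add_extension_type` actually does.

```rust
use std::collections::HashMap;

/// Merges metadata so that hint entries win and Parquet-derived entries only
/// fill keys the hint left unset. Both inputs are plain string maps here
/// (an assumption made to keep the sketch self-contained).
pub fn merge_metadata(
    hint: &HashMap<String, String>,
    parquet_derived: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut merged = hint.clone();
    for (key, value) in parquet_derived {
        // `or_insert_with` leaves an existing hint value untouched.
        merged.entry(key.clone()).or_insert_with(|| value.clone());
    }
    merged
}
```

A test that locks in the chosen precedence would assert on exactly this: which value survives when both sides define the same key.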
739-1798: Tests: great coverage; consider a couple of extras
- Coverage is strong for back-compat lists/maps, nested repeated, field-id placement, and list type (List/LargeList/FixedSizeList) inference.
- Please add:
- A negative test asserting the specific error when a repeated field receives a non-list hint.
- A test for extension metadata behavior when a hint is present vs absent (to lock in the desired precedence).
If you want, I can sketch those tests quickly.
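As a rough idea of the shape the negative test could take, the sketch below uses a stand-in `Hint` enum and a hypothetical `element_hint_for_repeated` function. The error text is a placeholder; the real test would assert on whatever message the PR's code produces.

```rust
// Stand-in for an Arrow type hint; not the real arrow-schema DataType.
#[derive(Debug, Clone, PartialEq)]
pub enum Hint {
    Int32,
    List(Box<Hint>),
    LargeList(Box<Hint>),
}

/// Returns the element hint for a repeated field, or an error naming the
/// offending hint when it is not a list type. The message is a placeholder.
pub fn element_hint_for_repeated(hint: &Hint) -> Result<&Hint, String> {
    match hint {
        Hint::List(inner) | Hint::LargeList(inner) => Ok(&**inner),
        other => Err(format!(
            "incompatible arrow schema: expected a list hint, got {:?}",
            other
        )),
    }
}

// Negative test: a repeated field paired with a non-list hint must fail,
// and the error should say what was expected.
#[test]
fn repeated_field_with_non_list_hint_is_rejected() {
    let err = element_hint_for_repeated(&Hint::Int32).unwrap_err();
    assert!(err.contains("expected a list hint"));
}
```

Asserting on a stable fragment of the message, rather than the whole string, keeps the test from breaking on minor wording changes.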
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
parquet/src/arrow/schema/complex.rs(16 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
parquet/src/arrow/schema/complex.rs (2)
arrow-schema/src/field.rs (4)
- new (192-202)
- metadata (373-375)
- metadata (963-967)
- with_metadata (366-369)
parquet/src/arrow/schema/primitive.rs (1)
- convert_primitive (27-36)
🔇 Additional comments (14)
parquet/src/arrow/schema/complex.rs (14)
152-158: Context flag docs are clear
The flag and its docs make intent explicit. No code change needed.
201-219: Deriving primitive arrow type from list hints is correct
For REPEATED primitives with a list hint, unwrapping the inner field type before convert_primitive is the right call and surfaces helpful errors on mismatches.
Please confirm apply_hint in convert_primitive tolerates all element-level coercions you expect here (e.g., BYTE_ARRAY→Utf8, Decimal, etc.).
233-236: Choosing into_list_with_arrow_list_hint only when hints apply
Branching to the new list helper only when the flag is set avoids behavior changes for callers that don’t use hints.
253-271: Struct element hint unwrapping looks right
Unwrapping the list hint to a Struct inner type and validating arity catches common schema mismatches early.
321-325: Preserving field IDs at struct children
Passing add_field_id=true here ensures parquet field ids land on Arrow fields at the correct (outer) level. Matches tests.
342-345: Struct→list conversion with hints mirrors primitive path
Consistent with primitive handling; inner element gets no field-id, outer list does.
439-453: Map context propagation: key=false, value=true
Correct: keys can’t be repeated; values may contain repeated structures requiring list-hint handling.
460-467: Field-id placement for map key/value
- Key: explicitly non-nullable and add_field_id=true.
- Value: add_field_id=true.
Both align with spec and your tests.
549-550: List primitive branch: disabling list-hint unwrapping is correct
Inside an explicit LIST, we shouldn’t unwrap a second list level unless the child itself is repeated—this respects the spec.
582-582: List one-tuple/compat branch: also correct to disable unwrapping
Same rationale as above.
602-603: Enable list-hint unwrapping for nested list items
Turning it back on for the item traversal allows nested lists to use LargeList/FixedSizeList hints.
606-608: add_field_id=true for list item field
Ensures field-id is attached to the array field (outer list) not the element; aligns with tests asserting only the list carries the id.
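The field-id placement rule noted in the comments above (the outer list carries the id, the element does not) can be sketched as follows. `SketchField` and `wrap_in_list` are hypothetical stand-ins for this example; the metadata key `PARQUET:field_id` is the one the parquet crate uses for field ids.

```rust
use std::collections::HashMap;

// Metadata key under which the parquet crate stores Parquet field ids.
pub const PARQUET_FIELD_ID_META_KEY: &str = "PARQUET:field_id";

/// Stand-in for an Arrow field: a name, an optional list element, and
/// string metadata. Not the real arrow-schema Field.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchField {
    pub name: String,
    pub element: Option<Box<SketchField>>,
    pub metadata: HashMap<String, String>,
}

/// Wraps `element` in a list field, attaching the Parquet field id (if any)
/// to the outer list only.
pub fn wrap_in_list(name: &str, mut element: SketchField, field_id: Option<i32>) -> SketchField {
    // The element never carries the id, even if one was attached earlier.
    element.metadata.remove(PARQUET_FIELD_ID_META_KEY);
    let mut metadata = HashMap::new();
    if let Some(id) = field_id {
        metadata.insert(PARQUET_FIELD_ID_META_KEY.to_string(), id.to_string());
    }
    SketchField {
        name: name.to_string(),
        element: Some(Box::new(element)),
        metadata,
    }
}
```

Tests for this behavior typically check both sides: the id is present on the list field and absent on its element.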
711-716: Top-level: enable list-hint unwrapping
Setting the flag to true at the root matches the PR goal of honoring embedded Arrow list forms.
728-734: convert_type: disable list-hint unwrapping
Correct: when no embedded Arrow schema is provided, don’t unwrap by default.
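As a compact summary of where the comments above say the flag is on or off, the sketch below encodes those sites in a hypothetical enum. Neither `TraversalSite` nor `unwraps_list_hint` exists in the PR; struct children are left out because their setting is the subject of the propagation nitpick.

```rust
/// Hypothetical labels for the traversal sites discussed in the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TraversalSite {
    TopLevel,
    ConvertType,
    MapKey,
    MapValue,
    ListPrimitiveElement,
    ListOneTupleElement,
    NestedListItem,
}

/// Whether list-hint unwrapping is enabled at a site, per the review notes.
pub fn unwraps_list_hint(site: TraversalSite) -> bool {
    match site {
        TraversalSite::TopLevel
        | TraversalSite::MapValue
        | TraversalSite::NestedListItem => true,
        TraversalSite::ConvertType
        | TraversalSite::MapKey
        | TraversalSite::ListPrimitiveElement
        | TraversalSite::ListOneTupleElement => false,
    }
}
```

Writing the sites out as an exhaustive match makes it harder to add a new traversal path without deciding explicitly how the flag should behave there.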
cursor review
✅ Bugbot reviewed your changes and found no bugs!
Pull Request Review: Backward Compatible Parquet Schema Conversion
Summary
This PR adds support for converting backward-compatible repeated Parquet fields to Arrow lists using embedded Arrow schema hints. The implementation handles repeated primitives, structs, and nested structures while properly managing field ID metadata propagation.
Code Quality & Best Practices ✅
Strengths:
Suggestions:
Potential Bugs & Issues
8496: To review by AI
Note
Adds Arrow-hinted handling for repeated Parquet fields as lists (incl. nested), updates field-id propagation to list containers, and introduces extensive list/map compatibility tests.
- … treat_repeated_as_list_arrow_hint.
- … ParquetField::into_list_with_arrow_list_hint to build List/LargeList/FixedSizeList from hints.
- … (visit_primitive, visit_struct, visit_list, visit_map) to unwrap list hints, validate types, and construct appropriate Arrow types.
- convert_field now takes &ParquetField and supports add_field_id to control metadata propagation.
Written by Cursor Bugbot for commit 4d7485a.