fix: Reproduce nested partition columns pruning data validation failure #17759
+69
−1
Describe the issue this Pull Request addresses
There's a change in behavior for SparkHoodieTableFileIndex since 0.14.1. The StructType(partitionFields) it returns no longer carries the full path of each partition field, which causes data validation failures. This behavior was changed as part of this PR: https://github.com/apache/hudi/pull/9863/changes
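To make the difference concrete, here is a minimal sketch in plain Python (not Spark or Hudi code) of what goes wrong when a partition schema keeps only the leaf name of a nested field. The schema, the column names, and the helpers `resolve` and `partition_fields` are all hypothetical and exist only for illustration.

```python
# Hypothetical table schema: a top-level "level" column plus a nested
# partition column "nested_record.level". All names are illustrative.
table_schema = {
    "id": "string",
    "level": "int",                        # top-level data column
    "nested_record": {"level": "string"},  # nested partition column
}
partition_paths = ["nested_record.level"]


def resolve(schema, dotted_path):
    """Walk a dotted path through the nested dict and return the leaf type."""
    node = schema
    for part in dotted_path.split("."):
        node = node[part]
    return node


def partition_fields(schema, paths, keep_full_path):
    """Build (name, type) pairs for a partition schema.

    With keep_full_path=False only the leaf name is kept, which models
    the behavior described above.
    """
    fields = []
    for path in paths:
        name = path if keep_full_path else path.split(".")[-1]
        fields.append((name, resolve(schema, path)))
    return fields


full = partition_fields(table_schema, partition_paths, keep_full_path=True)
leaf = partition_fields(table_schema, partition_paths, keep_full_path=False)

# Leaf-only names can shadow an unrelated top-level column.
top_level_names = set(table_schema)
collisions = [name for name, _ in leaf if name in top_level_names]
```

In this sketch the leaf-only variant produces a partition field named `level` of type `string`, which is indistinguishable by name from the top-level `level` column of type `int`. The full-path variant keeps the two apart.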
Summary and Changelog
If a table has a nested partition column whose leaf name conflicts with another top-level field, the partitionedSchema passed to the new file group reader is incorrect. When I tried reverting the previous change, I found another issue: we rely on HoodieSchemaConversionUtils.convertStructTypeToHoodieSchema to get requestedSchema in buildReaderWithPartitionValues, but this fails because HoodieSchema doesn't like dots in the names.
Looking for guidance or feedback on how to read nested partition columns through the parquet reader.
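The dotted-name failure can be illustrated with a small sketch. It assumes the restriction is the Avro-style rule for field names (a letter or underscore first, then letters, digits, or underscores). Whether HoodieSchema applies exactly this rule is an assumption here, and `is_valid_field_name` and `to_nested` are hypothetical helpers, not Hudi APIs.

```python
import re

# Avro-style name rule: [A-Za-z_][A-Za-z0-9_]*
# Assumption: the schema conversion rejects dotted names for this reason.
AVRO_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def is_valid_field_name(name):
    """Return True if the name satisfies the Avro-style name rule."""
    return AVRO_NAME.match(name) is not None


def to_nested(dotted_path, leaf_type):
    """Express a dotted path as nested single-field structs.

    Each path segment becomes its own field, so no field name
    contains a dot.
    """
    node = leaf_type
    for part in reversed(dotted_path.split(".")):
        node = {part: node}
    return node


flat_ok = is_valid_field_name("nested_record.level")
nested = to_nested("nested_record.level", "string")
segments_ok = all(
    is_valid_field_name(part) for part in "nested_record.level".split(".")
)
```

Under that assumption, a flat field named `nested_record.level` is rejected, while each segment is valid on its own when the path is represented as a nested structure. This only shows where the tension comes from. It is not a proposed fix for buildReaderWithPartitionValues.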
Impact
High
Risk Level
High
Documentation Update
None.
Contributor's checklist