-
Notifications
You must be signed in to change notification settings - Fork 2.5k
feat(schema): Migrate HoodieFileGroupReader and related classes to use HoodieSchema #17536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(schema): Migrate HoodieFileGroupReader and related classes to use HoodieSchema #17536
Conversation
9a2446d to
467eac4
Compare
voonhous
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some minor NIT comments.
...t/hudi-flink-client/src/main/java/org/apache/hudi/util/OrderingValueEngineTypeConverter.java
Show resolved
Hide resolved
...ient/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java
Show resolved
Hide resolved
|
@hudi-bot run azure |
7ba5787 to
31461f2
Compare
|
LGTM. @balaji-varadarajan-ai @yihua do you wanna take another pass? |
| HoodieSchema dataSchema = HoodieSchemaCache.intern(HoodieSchemaUtils.addMetadataFields(HoodieSchema.parse(dataWriteConfig.getWriteSchema()), dataWriteConfig.allowOperationMetadataField())); | ||
| HoodieSchema requestedSchema = metaClient.getTableConfig().populateMetaFields() ? getRecordKeySchema() | ||
| : HoodieSchemaUtils.projectSchema(dataSchema, Arrays.asList(metaClient.getTableConfig().getRecordKeyFields().orElse(new String[0]))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To follow up in a separate PR: We should start adding more methods in HoodieSchema so we can avoid nested method calls (i.e., AUtil.method1(BUtil.method2(C.method3(x), y), z) which can be improved by x.method(y, z)) that reduce readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea was to have this be an easy replacement by copying over the methods to similarly named classes. I agree we can avoid these utils in the future though.
...lient-common/src/main/java/org/apache/hudi/metadata/SecondaryIndexRecordGenerationUtils.java
Show resolved
Hide resolved
...n/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
Show resolved
Hide resolved
| import org.apache.hudi.common.table.read.BufferedRecord; | ||
| import org.apache.hudi.common.util.Option; | ||
|
|
||
| import org.apache.avro.Schema; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious to see how many classes still import org.apache.avro.Schema after quite a few refactoring PRs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably have at least 10 more PRs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite a lot of refactoring, reminding me of the changes introducing the HoodieStorage abstraction :)
hudi-common/src/main/java/org/apache/hudi/common/table/PartitionPathParser.java
Show resolved
Hide resolved
| // Iterate over the paths | ||
| logFormatReaderWrapper = new HoodieLogFormatReader(storage, logFiles, | ||
| readerSchema, reverseReader, bufferSize, shouldLookupRecords(), recordKeyField, internalSchema); | ||
| readerSchema.toAvroSchema(), reverseReader, bufferSize, shouldLookupRecords(), recordKeyField, internalSchema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should log reader also take HoodieSchema? Is the related refactoring separated to another PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I started a separate branch to keep the size of this change smaller: #17548
| .withDataSchema(HoodieSchema.fromAvroSchema(dataAvroSchema)) | ||
| .withRequestedSchema(HoodieSchema.fromAvroSchema(requestedAvroSchema)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the Avro schema conversion be removed in a separate PR so the StructType schema is directly converted to HoodieSchema?
val requestedSchema = StructType(requiredSchema.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
val requestedAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(requestedSchema, sanitizedTableName), exclusionFields)
val dataAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName), exclusionFields)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rahil-c can you add this to your PR for the spark reader changes?
yihua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Describe the issue this Pull Request addresses
Addresses #17469
The
HoodieFileGroupReaderis our main reader abstraction but it is still using the Avro schema instead of the internalHoodieSchema.Summary and Changelog
HoodieFileGroupReaderandFileGroupReaderSchemaHandlerto operate solely onHoodieSchemainstead of the Avro schema class.HoodieSchema. If the caller requires some schema manipulation or fetching before the call to theHoodieFileGroupReader, this is also updated to useHoodieSchemaImpact
Migrates core reader paths to use new
HoodieSchemaRisk Level
Low
Documentation Update
Contributor's checklist