@the-other-tim-brown
Contributor

@the-other-tim-brown the-other-tim-brown commented Dec 9, 2025

Describe the issue this Pull Request addresses

Addresses #17469
The HoodieFileGroupReader is our main reader abstraction but it is still using the Avro schema instead of the internal HoodieSchema.

Summary and Changelog

  • Updates the HoodieFileGroupReader and FileGroupReaderSchemaHandler to operate solely on HoodieSchema instead of the Avro schema class.
  • Updates callers to pass in HoodieSchema. If a caller manipulates or fetches a schema before calling the HoodieFileGroupReader, that logic is also updated to use HoodieSchema.

Impact

Migrates the core reader paths to the new HoodieSchema.

Risk Level

Low

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Dec 9, 2025
@the-other-tim-brown the-other-tim-brown force-pushed the hoodie-schema-file-group-reader branch from 9a2446d to 467eac4 Compare December 10, 2025 02:49
Member

@voonhous voonhous left a comment

Added some minor NIT comments.

@the-other-tim-brown
Contributor Author

@hudi-bot run azure

@apache apache deleted a comment from hudi-bot Dec 10, 2025
@the-other-tim-brown the-other-tim-brown force-pushed the hoodie-schema-file-group-reader branch from 7ba5787 to 31461f2 Compare December 11, 2025 02:30
@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@voonhous
Member

LGTM. @balaji-varadarajan-ai @yihua do you wanna take another pass?

Comment on lines +874 to +876
HoodieSchema dataSchema = HoodieSchemaCache.intern(HoodieSchemaUtils.addMetadataFields(HoodieSchema.parse(dataWriteConfig.getWriteSchema()), dataWriteConfig.allowOperationMetadataField()));
HoodieSchema requestedSchema = metaClient.getTableConfig().populateMetaFields() ? getRecordKeySchema()
: HoodieSchemaUtils.projectSchema(dataSchema, Arrays.asList(metaClient.getTableConfig().getRecordKeyFields().orElse(new String[0])));
Contributor

To follow up in a separate PR: we should start adding more methods to HoodieSchema so we can avoid nested utility calls that hurt readability. For example, AUtil.method1(BUtil.method2(C.method3(x), y), z) could become x.method(y, z).

Contributor Author

The idea was to have this be an easy replacement by copying over the methods to similarly named classes. I agree we can avoid these utils in the future though.
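
The readability point in this thread can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in: `Schema` is a toy list of field names rather than Hudi's HoodieSchema, and `SchemaUtils.addMetaFields`, `SchemaUtils.project`, `withMetaFields`, and the instance `project` are illustrative names, not existing Hudi methods. It only shows how thin instance methods that delegate to the copied-over utilities would turn an inside-out call chain into a left-to-right one.

```java
import java.util.ArrayList;
import java.util.List;

public class FluentSchemaSketch {

  // Toy stand-in for a schema: an immutable, ordered list of field names.
  static final class Schema {
    final List<String> fields;

    Schema(List<String> fields) {
      this.fields = List.copyOf(fields);
    }

    // Hypothetical instance methods that simply delegate to the static helpers.
    Schema withMetaFields(boolean includeOperation) {
      return SchemaUtils.addMetaFields(this, includeOperation);
    }

    Schema project(List<String> names) {
      return SchemaUtils.project(this, names);
    }
  }

  // Static-utility style, similar in shape to the nested calls in the diff.
  static final class SchemaUtils {
    static Schema addMetaFields(Schema schema, boolean includeOperation) {
      List<String> out = new ArrayList<>();
      out.add("_meta_key"); // illustrative metadata field name
      if (includeOperation) {
        out.add("_meta_operation"); // illustrative metadata field name
      }
      out.addAll(schema.fields);
      return new Schema(out);
    }

    // Keeps only the requested fields, preserving the schema's own field order.
    static Schema project(Schema schema, List<String> names) {
      List<String> out = new ArrayList<>();
      for (String field : schema.fields) {
        if (names.contains(field)) {
          out.add(field);
        }
      }
      return new Schema(out);
    }
  }

  // Nested form: has to be read from the innermost call outwards.
  static Schema nestedStyle(Schema base) {
    return SchemaUtils.project(SchemaUtils.addMetaFields(base, false), List.of("_meta_key", "id"));
  }

  // Chained form: the same steps, read left to right.
  static Schema fluentStyle(Schema base) {
    return base.withMetaFields(false).project(List.of("_meta_key", "id"));
  }

  public static void main(String[] args) {
    Schema base = new Schema(List.of("id", "name"));
    // Both styles are expected to produce the same projected field list.
    System.out.println(nestedStyle(base).fields.equals(fluentStyle(base).fields));
  }
}
```

Because the instance methods only delegate, a change like this could be introduced without touching the existing utility call sites.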

import org.apache.hudi.common.table.read.BufferedRecord;
import org.apache.hudi.common.util.Option;

import org.apache.avro.Schema;
Contributor

Curious to see how many classes still import org.apache.avro.Schema after quite a few refactoring PRs.

Contributor Author

We probably have at least 10 more PRs to go.

Contributor

Quite a lot of refactoring, reminding me of the changes introducing the HoodieStorage abstraction :)

// Iterate over the paths
logFormatReaderWrapper = new HoodieLogFormatReader(storage, logFiles,
- readerSchema, reverseReader, bufferSize, shouldLookupRecords(), recordKeyField, internalSchema);
+ readerSchema.toAvroSchema(), reverseReader, bufferSize, shouldLookupRecords(), recordKeyField, internalSchema);
Contributor

Should the log reader also take HoodieSchema? Has that refactoring been split into a separate PR?

Contributor Author

Yes, I started a separate branch to keep the size of this change smaller: #17548
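
The interim state discussed in this thread, where a migrated caller converts back to the old schema type only at the call into a not-yet-migrated component, can be sketched as follows. All types are hypothetical stand-ins: `LegacySchema` plays the role of the Avro schema, `NewSchema` the role of HoodieSchema, and `LegacyLogReader` the role of a reader that still takes the old type. Whether HoodieSchema wraps an Avro schema internally in this way is an assumption of the sketch, not something stated in this conversation.

```java
import java.util.List;

public class SchemaBoundarySketch {

  // Stand-in for the legacy schema type.
  static final class LegacySchema {
    final List<String> fieldNames;

    LegacySchema(List<String> fieldNames) {
      this.fieldNames = List.copyOf(fieldNames);
    }
  }

  // Stand-in for the new schema type; here it simply wraps the legacy one (an assumption).
  static final class NewSchema {
    private final LegacySchema delegate;

    private NewSchema(LegacySchema delegate) {
      this.delegate = delegate;
    }

    static NewSchema fromLegacy(LegacySchema legacy) {
      return new NewSchema(legacy);
    }

    LegacySchema toLegacy() {
      return delegate;
    }
  }

  // Component that has not been migrated yet: it still accepts the legacy type.
  static final class LegacyLogReader {
    private final LegacySchema schema;

    LegacyLogReader(LegacySchema schema) {
      this.schema = schema;
    }

    int fieldCount() {
      return schema.fieldNames.size();
    }
  }

  // Migrated component: stores only the new type and converts at one call site.
  static final class MigratedScanner {
    private final NewSchema readerSchema;

    MigratedScanner(NewSchema readerSchema) {
      this.readerSchema = readerSchema;
    }

    LegacyLogReader openLogReader() {
      // The only place where the legacy type reappears during the migration.
      return new LegacyLogReader(readerSchema.toLegacy());
    }
  }

  public static void main(String[] args) {
    NewSchema schema = NewSchema.fromLegacy(new LegacySchema(List.of("id", "name")));
    System.out.println(new MigratedScanner(schema).openLogReader().fieldCount());
  }
}
```

Keeping the conversion at a single call site means the follow-up change only has to update that one line once the downstream reader accepts the new type.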

Comment on lines +264 to +265
.withDataSchema(HoodieSchema.fromAvroSchema(dataAvroSchema))
.withRequestedSchema(HoodieSchema.fromAvroSchema(requestedAvroSchema))
Contributor

Could the Avro schema conversion be removed in a separate PR so the StructType schema is directly converted to HoodieSchema?

val requestedSchema = StructType(requiredSchema.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
val requestedAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(requestedSchema, sanitizedTableName), exclusionFields)
val dataAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName), exclusionFields)

Contributor Author

@rahil-c can you add this to your PR for the spark reader changes?

Contributor

@yihua yihua left a comment

LGTM

@yihua yihua merged commit 1479548 into apache:master Dec 12, 2025
130 of 137 checks passed
4 participants