Conversation

@voonhous (Member) commented Dec 9, 2025

Describe the issue this Pull Request addresses

Reference issue: #14282

This PR focuses on migrating usages of AvroSchemaUtils to the internal HoodieSchema abstraction and HoodieSchemaUtils.

Key Changes:

  1. Migration: Migrated logic in key write/read paths (e.g., HoodieTable, HoodieWriteHandle, HoodieRowParquetWriteSupport) to use HoodieSchema and HoodieSchemaCompatibility.
  2. Full Qualification: For classes where migration was not immediately feasible, calls to AvroSchemaUtils static functions have been fully qualified. This explicitly marks technical debt and makes these usages easily searchable for future refactoring.
  3. Utility Enhancements: Added necessary bridge methods to HoodieSchemaUtils and HoodieSchemaCompatibility to support HoodieSchema objects directly.
Classes NOT included in migration (Fully Qualified)

The following classes retain AvroSchemaUtils usage but are now fully qualified:

  1. AvroRecordContext
  2. TestAvroSchemaUtils
  3. AvroSchemaUtils (Self)
  4. AvroSchemaComparatorForRecordProjection
  5. HoodieAvroUtils
  6. HoodieSchemaCompatibility
  7. MissingSchemaFieldException
  8. HoodieSchemaUtils
  9. HiveAvroSerializer
  10. HiveTypeUtils
  11. SchemaBackwardsCompatibilityException
  12. AvroSchemaRepair
  13. TestAvroSchemaRepair
  14. TestHoodieSchemaCompatibility (For consistency testing)
Classes Addressed in dependent PR #17536
  1. FileGroupReaderSchemaHandler
  2. OrderingValueEngineTypeConverter
  3. ParquetRowIndexBasedSchemaHandler
  4. HoodieAvroReaderContext
Specific Ignored Usages (getAvroRecordQualifiedName)

The following classes ignore AvroSchemaUtils.getAvroRecordQualifiedName:

  • BaseHoodieWriteClient
  • HoodieCatalog
  • HoodieHiveCatalog
  • HoodieTableFactory
  • StreamSync
  • TestHoodieTableFactory

Summary and Changelog

This PR is a refactoring effort to improve schema abstraction within Hudi. By moving away from raw Avro utils, we pave the way for better type safety and cleaner internal APIs.

Impact

  • Internal API Change: Methods in HoodieSchemaUtils and HoodieSchemaCompatibility now play a larger role in schema validation and evolution logic.
  • No User-Facing Change: This is a code health and refactoring PR; there are no changes to public configs or external behaviors.

Risk Level

Low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Dec 9, 2025
@voonhous voonhous marked this pull request as draft December 10, 2025 04:21
@voonhous voonhous force-pushed the phase-17-AvroSchemaUtils-removal branch 4 times, most recently from 228d0dc to ac69a39 Compare December 10, 2025 05:39
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:L PR with lines of changes in (300, 1000] labels Dec 10, 2025
@voonhous voonhous force-pushed the phase-17-AvroSchemaUtils-removal branch 2 times, most recently from 1531814 to 72aa418 Compare December 10, 2025 06:29
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Dec 10, 2025
@voonhous voonhous force-pushed the phase-17-AvroSchemaUtils-removal branch 8 times, most recently from f9ba546 to 89cd44a Compare December 10, 2025 11:06
@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Dec 10, 2025
@voonhous voonhous force-pushed the phase-17-AvroSchemaUtils-removal branch from 00b71d6 to a1695a2 Compare December 10, 2025 16:42
@voonhous voonhous marked this pull request as ready for review December 10, 2025 16:47
@apache apache deleted a comment from hudi-bot Dec 10, 2025
Comment on lines 1010 to 1012
HoodieSchemaField tableField = tableSchema.getField(columnName).get();

if (tableField == null) {
Contributor

Let's update this to first return an Option and then we can check if it is present instead of checking for null here.

Member Author

Done
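The pattern the reviewer asks for can be sketched with `java.util.Optional` (Hudi's own `Option` type behaves analogously; the `OptionLookup` class and `getField` helper here are hypothetical, not Hudi APIs):

```java
import java.util.Map;
import java.util.Optional;

// Sketch: return an Optional from the lookup so absence is explicit in the
// signature, instead of returning a nullable value and null-checking at the
// call site. Class and method names are illustrative only.
public class OptionLookup {
    private static final Map<String, String> FIELDS = Map.of("ts", "long");

    static Optional<String> getField(String name) {
        return Optional.ofNullable(FIELDS.get(name));
    }

    public static void main(String[] args) {
        // Call sites check presence instead of comparing against null.
        System.out.println(OptionLookup.getField("ts").isPresent());
        System.out.println(OptionLookup.getField("missing").isPresent());
    }
}
```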

if (writerField != null && !tableField.schema().equals(writerField.schema())) {
// Check if this is just making the field nullable/non-nullable, which is safe from SI perspective
if (getNonNullTypeFromUnion(tableField.schema()).equals(getNonNullTypeFromUnion(writerField.schema()))) {
HoodieSchema nonNullTableField = HoodieSchemaUtils.getNonNullTypeFromUnion(tableField.schema());
Contributor

Let's use the getNonNullType method on the HoodieSchema when possible

Member Author

Done


@Override
public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema schema) throws IOException {
//TODO boundary to revisit in later pr to use HoodieSchema directly
Contributor

Let's remove this TODO now

Member Author

Done

* @return the writer field, if any does correspond, or None.
*/
public static HoodieSchemaField lookupWriterField(final HoodieSchema writerSchema, final HoodieSchemaField readerField) {
assert (writerSchema.getType() == HoodieSchemaType.RECORD);
Contributor

Let's use the ValidationUtils here so we can return a more customized error message if the type is not RECORD

Member Author

Done
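The difference between the bare `assert` and an explicit precondition check can be sketched as follows; the `checkArgument` helper here mirrors the shape of Hudi's `ValidationUtils.checkArgument` but is written out locally as an assumption, not the actual Hudi implementation:

```java
// Sketch: `assert` statements are disabled at runtime unless the JVM is
// started with -ea, while an explicit precondition check always runs and
// carries a descriptive message for the caller.
public class Preconditions {
    static void checkArgument(boolean condition, String message) {
        if (!condition) {
            throw new IllegalArgumentException(message);
        }
    }

    // Illustrative stand-in for validating that a schema type is RECORD.
    static void requireRecord(String schemaType) {
        checkArgument("RECORD".equals(schemaType), schemaType + " is not a record");
    }

    public static void main(String[] args) {
        requireRecord("RECORD"); // passes silently
        try {
            requireRecord("STRING");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```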

*/
public static HoodieSchemaField lookupWriterField(final HoodieSchema writerSchema, final HoodieSchemaField readerField) {
assert (writerSchema.getType() == HoodieSchemaType.RECORD);
final List<HoodieSchemaField> writerFields = new ArrayList<>();
Contributor

It looks like the list is expected to have 0 or 1 elements so an Option may fit this usecase better, what do you think?

Member Author

Yeap, makes sense!
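The suggestion can be sketched with `java.util.Optional`: when a collection can only ever hold zero or one element, an `Optional` models the intent more precisely than a `List`, and call sites cannot accidentally assume multiple hits. The `findUnique` helper below is illustrative, not the actual `lookupWriterField` implementation:

```java
import java.util.Optional;

// Sketch: instead of accumulating 0-or-1 matches into a List, return the
// single match (or absence) directly as an Optional.
public class SingleResult {
    static Optional<String> findUnique(String[] fields, String name) {
        for (String f : fields) {
            if (f.equals(name)) {
                return Optional.of(f);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] fields = {"id", "ts"};
        System.out.println(SingleResult.findUnique(fields, "ts").isPresent());
        System.out.println(SingleResult.findUnique(fields, "name").isPresent());
    }
}
```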

Comment on lines 61 to 64
Option<HoodieSchemaField> fieldOpt = schema.getField(partitionField);
// if the field is not present in the schema, we assume it is a string
Schema fieldSchema = field == null ? Schema.create(Schema.Type.STRING) : getNonNullTypeFromUnion(field.schema());
LogicalType logicalType = fieldSchema.getLogicalType();
HoodieSchema fieldSchema = fieldOpt.isEmpty() ? HoodieSchema.create(HoodieSchemaType.STRING) : HoodieSchemaUtils.getNonNullTypeFromUnion(fieldOpt.get().schema());
HoodieSchemaType logicalType = fieldSchema.getType();
Contributor

Suggested change
Option<HoodieSchemaField> fieldOpt = schema.getField(partitionField);
// if the field is not present in the schema, we assume it is a string
Schema fieldSchema = field == null ? Schema.create(Schema.Type.STRING) : getNonNullTypeFromUnion(field.schema());
LogicalType logicalType = fieldSchema.getLogicalType();
HoodieSchema fieldSchema = fieldOpt.isEmpty() ? HoodieSchema.create(HoodieSchemaType.STRING) : HoodieSchemaUtils.getNonNullTypeFromUnion(fieldOpt.get().schema());
HoodieSchemaType logicalType = fieldSchema.getType();
Option<HoodieSchemaField> field = schema.getField(partitionField);
// if the field is not present in the schema, we assume it is a string
HoodieSchema fieldSchema = field.map(f -> f.schema().getNonNullType()).orElseGet(() -> HoodieSchema.create(HoodieSchemaType.STRING));

Member Author

Done

|| logicalType instanceof LogicalTypes.TimeMicros
|| logicalType instanceof LogicalTypes.LocalTimestampMicros
|| logicalType instanceof LogicalTypes.LocalTimestampMillis;
private static boolean isTimeBasedLogicalType(HoodieSchemaType logicalType) {
Contributor

Suggested change
private static boolean isTimeBasedLogicalType(HoodieSchemaType logicalType) {
private static boolean isTimeBasedType(HoodieSchemaType type) {

The logical type is an Avro concept. We have proper types for date, time, timestamp, etc now

Member Author

Will modify. Once your PR is merged, I'll prioritise your changes over mine for this.

assertEquals(f.schema(), HoodieSchema.createNullable(HoodieSchemaType.STRING));

// case5: user_partition is in originSchema, but partition_path is in originSchema
String[] pts4 = {"user_partition", "partition_path"};
Contributor

Do you think this is an error in the original test? It seems like this should be used below.

Member Author

Using pts4 for the last test will cause the test to fail. I checked the git blame: this variable has been there since the creation of the file. I hazard that this is a WIP variable that the original author forgot to remove.

}

@Override
public HoodieSchema processSchema(HoodieSchema schema) {
Contributor

This was not overridden on purpose in case a user is extending the deprecated method.

Member Author (Dec 11, 2025)

Okay, will remove it then; I'll keep the AvroSchemaUtils usage in the old deprecated method, but will fully qualify it.

@voonhous voonhous force-pushed the phase-17-AvroSchemaUtils-removal branch from 8e94f63 to 0354017 Compare December 11, 2025 02:56
Comment on lines 89 to 90
.or(() -> Option.of(schema))
.get();
Contributor

Nitpick: while we're updating this, let's change this to use orElseGet instead of or and get

Member Author

Done
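The nitpick can be illustrated with `java.util.Optional`, whose `or`/`orElseGet` methods behave analogously to Hudi's `Option` (the `resolveSchema` helper below is hypothetical): `orElseGet` returns the fallback in one terminal operation and cannot throw `NoSuchElementException`, unlike chaining `.or(...)` with a trailing `.get()`.

```java
import java.util.Optional;

// Sketch: prefer a single orElseGet over or(...).get().
public class OrElseGetDemo {
    static String resolveSchema(Optional<String> override, String fallback) {
        // Before: override.or(() -> Optional.of(fallback)).get();
        // After: no intermediate Optional, no unchecked get().
        return override.orElseGet(() -> fallback);
    }

    public static void main(String[] args) {
        System.out.println(resolveSchema(Optional.of("writerSchema"), "tableSchema"));
        System.out.println(resolveSchema(Optional.empty(), "tableSchema"));
    }
}
```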

private Option<HoodieSchema> getTableCreateSchemaWithoutMetaField() {
return metaClient.getTableConfig().getTableCreateSchema()
.map(HoodieSchema::fromAvroSchema)
.or(Option.empty());
Contributor

We shouldn't need the or in this case

Member Author

Done

AvroSchemaUtils.checkSchemaCompatible(tableSchema, writerSchema, shouldValidate, allowProjection, getDropPartitionColNames());

HoodieSchema writerSchema = HoodieSchemaUtils.createHoodieWriteSchema(config.getSchema(), false);
HoodieSchema tableSchema = HoodieSchema.createHoodieWriteSchema(existingTableSchema.get().toString(), false);
Contributor

Instead of converting existingTableSchema to a string, let's create a method that takes in a HoodieSchema similar to the AvroSchemaUtils

Member Author

It invokes org.apache.hudi.avro.HoodieAvroUtils#addMetadataFields(org.apache.avro.Schema, boolean). I don't think we need a new method; let's just call org.apache.hudi.common.schema.HoodieSchemaUtils#addMetadataFields(org.apache.hudi.common.schema.HoodieSchema).

private final boolean writeLegacyListFormat;
private final ValueWriter[] rootFieldWriters;
private final Schema avroSchema;
private final HoodieSchema hoodieSchema;
Contributor

nitpick: let's update hoodieSchema variable names to simply schema?

Member Author

Done

return (row, ordinal) -> recordConsumer.addLong((long) timestampRebaseFunction.apply(row.getLong(ordinal)));
} else if (logicalType.getName().equals(LogicalTypes.timestampMillis().getName())) {
return (row, ordinal) -> recordConsumer.addLong(DateTimeUtils.microsToMillis((long) timestampRebaseFunction.apply(row.getLong(ordinal))));
if (resolvedSchema instanceof HoodieSchema.Timestamp) {
Contributor

let's use the HoodieSchemaType instead of instanceof checks to determine the type of the field

Member Author

Done
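The reviewer's point about preferring a type enum over `instanceof` chains can be sketched as follows; the `SchemaType` enum and `describe` method here merely stand in for `HoodieSchemaType` and the real dispatch logic, which are Hudi internals:

```java
// Sketch: switching on a type enum keeps the dispatch in one place and lets
// the compiler warn about unhandled cases, unlike chained instanceof checks
// against concrete subclasses.
public class TypeDispatch {
    enum SchemaType { TIMESTAMP, LONG, STRING }

    static String describe(SchemaType type) {
        switch (type) {
            case TIMESTAMP:
                return "time-based";
            case LONG:
                return "numeric";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(SchemaType.TIMESTAMP));
        System.out.println(describe(SchemaType.STRING));
    }
}
```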

return (row, ordinal) -> recordConsumer.addLong(row.getLong(ordinal));
} else if (logicalType.getName().equals(LogicalTypes.localTimestampMillis().getName())) {
return (row, ordinal) -> recordConsumer.addLong(DateTimeUtils.microsToMillis(row.getLong(ordinal)));
if (resolvedSchema instanceof HoodieSchema.Timestamp) {
Contributor

Similarly here let's use the type

Member Author

Done

ValidationUtils.checkArgument(writerSchema.getType() == HoodieSchemaType.RECORD, writerSchema + " is not a record");
Option<HoodieSchemaField> result = Option.empty();
final Option<HoodieSchemaField> directOpt = writerSchema.getField(readerField.name());
if (directOpt.isPresent()) {
Contributor

Do we need a separate variable for the result? Can we just use directOpt directly?

Member Author

Nope, simplified this. Was too fixated on keeping directOpt final, my bad.

Object javaInput = ObjectInspectorConverters.getConverter(writableOIOld, oldObjectInspector).convert(oldWritable);
if (isDecimalSchema(oldSchema)) {
javaInput = HoodieAvroUtils.DECIMAL_CONVERSION.toFixed(getDecimalValue(javaInput, oldSchema), oldSchema, oldSchema.getLogicalType());
if (oldSchema instanceof HoodieSchema.Decimal) {
Contributor

Let's use the schema type in this class as well

Member Author

Done

@voonhous voonhous force-pushed the phase-17-AvroSchemaUtils-removal branch 6 times, most recently from e88014e to 219eb35 Compare December 12, 2025 08:20
@voonhous (Member Author)
Had to squash as I had too many commits and resolving conflicts one by one was too painful.

@voonhous voonhous force-pushed the phase-17-AvroSchemaUtils-removal branch from 219eb35 to 264c04c Compare December 12, 2025 08:27
@voonhous voonhous changed the title feat(schema): phase 17 - Remove AvroSchemaUtils usage feat(schema): phase 17 - Remove AvroSchemaUtils usage (part 1) Dec 12, 2025
import static org.apache.hudi.avro.HoodieAvroUtils.toJavaDate;
import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;

public class HoodieArrayWritableAvroUtils {
Contributor

Should we update this class name to just be HoodieArrayWritableSchemaUtils

Member Author

Yeap, cool with me.

HoodieSchemaField oldField = oldFieldOpt.get();
values[i] = rewriteRecordWithNewSchema(arrayWritable.get()[oldField.pos()], oldField.schema(), newField.schema(), renameCols, fieldNames);
} else if (newField.defaultVal() instanceof JsonProperties.Null) {
} else if (newField.defaultVal().isPresent() && newField.defaultVal().get() instanceof JsonProperties.Null) {
Contributor

JsonProperties needs to be HoodieJsonProperties now

Member Author

Changing this to HoodieSchema.NULL_VALUE; they are the same underlying value.

* @return true if reader schema can read data written with writer schema
* @throws IllegalArgumentException if schemas are null
*/
public static boolean isSchemaCompatible(HoodieSchema readerSchema, HoodieSchema writerSchema,
Contributor

Are there existing test cases we can migrate to TestHoodieSchemaCompatibility as part of this PR?

Member Author

Nope, I'll add some.

@voonhous voonhous force-pushed the phase-17-AvroSchemaUtils-removal branch 4 times, most recently from fb8531b to 4567f05 Compare December 13, 2025 09:27
- Fix checkstyle
- Add tests
- Address comments
- Fix failing azure IT
- Fix compilation errors after rebase
- Fix failing TestHoodieFileGroupReaderOnFlink tests
- Fix failing TestHoodieFileGroupReaderOnHive
- Fix failing TestPartitionPathParser tests
- Address comments
- Fix Java17 compilation error
- Fix checkstyle
- Fix direct #get() usages in HoodieRowParquetWRiteSupport
- Fix removal
- Remove AvroSchemaUtils usage from TestParquetUtils
- Remove AvroSchemaUtils usage from HoodieSparkParquetReader
- Remove AvroSchemaUtils usage from HoodieTable
- Remove AvroSchemaUtils usage from HoodieTestDataGenerator
- Fix tests from missing namespace and doc
- Remove Avro.Schema and AvroSchemaUtils usage from HoodieArrayWritableAvroUtils
- Adapt KafkaOffsetPostProcessor
- Ignore AvroSchemaUtils#getAvroRecordQualifiedName usages
- Remove AvroSchemaUtils from TestHoodieCommitMetadata
- Remove AvroSchemaUtils from HoodieRowParquetWriteSupport
- Remove AvroSchemaUtils from HoodieRealtimeRecordReaderUtils
- Remove AvroSchemaUtils from TestMergeHandle
- Ignore ParquetRowIndexBasedSchemaHandler
- Add ignore flag in FileGroupReaderSchemaHandler
- Add ignore flag in OrderingValueEngineTypeConverter
- Remove AvroSchemaUtils usage from ConcurrentSchemaEvolutionTableSchemaGetter and TableSchemaResolver
- Remove AvroSchemaUtils usages from PartitionPathParser and TestPartitionPathParser
- Remove AvroSchemaUtils usages from TestHoodieSchemaCompatibility
- Remove AvroSchemaUtils usages from TestSparkSortAndSizeClustering
- Remove AvroSchemaUtils usages from TestHoodieAvroReaderContext
- Ignore AvroSchemaRepair and TestAvroSchemaRepair
- Remove AvroSchemaUtils usage from FlinkRowDataReaderContext
- Ignore SchemaBackwardsCompatibilityException
- Ignore HiveAvroSerializer and HiveTypeUtils + fully qualify AvroSchemaUtils usages
- Ignore HoodieSchemaUtils
- Ignore MissingSchemaFieldException
- Remove AvroSchemaUtils from HoodieMergeHelper
- Ignore classes:
    - HoodieAvroReaderContext
    - AvroRecordContext
    - HoodieSchemaCompatibility
- Fully qualify AvroSchemaUtils call in HoodieAvroUtils
- Remove flags from *AvroSchemaUtils
- Add more references to AvroSchemaUtils missed out on first scan
- Remove AvroSchemaUtils and Avro.Schema usage in HoodieTable and HoodieWriteHandle
- Remove AvroSchemaUtils and Avro.Schema usage in TestTableSchemaEvolution
- Find all AvroSchemaUtils usages and prefix them with comment /*~~>*/
@voonhous voonhous force-pushed the phase-17-AvroSchemaUtils-removal branch from 4567f05 to 2654cac Compare December 14, 2025 11:21
} else {
throw new UnsupportedOperationException("Unsupported Avro logical type for TimestampType: " + logicalType);
// Default to micros precision when no timestamp schema is available
return (row, ordinal) -> recordConsumer.addLong((long) timestampRebaseFunction.apply(row.getLong(ordinal)));
Contributor

Previously it looks like this threw an UnsupportedOperationException, should we keep that?

Member Author (Dec 15, 2025)

Sure, I believe it's unlikely that we'll step into this clause, so an exception might be safer.

Member Author (Dec 15, 2025)

I was trying to capture the case of logicalType == null. I am not sure if HoodieSchema will resolve an Avro schema to a case where logicalType == null.

Member Author

Comment addressed.

@voonhous voonhous force-pushed the phase-17-AvroSchemaUtils-removal branch from bcd1764 to 3be646d Compare December 15, 2025 16:29
@hudi-bot (Collaborator)
CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@voonhous (Member Author)

CI + Azure CI passed.

Squashing and merging this in.

@voonhous voonhous merged commit bf70d4c into apache:master Dec 16, 2025
135 of 137 checks passed
@voonhous voonhous deleted the phase-17-AvroSchemaUtils-removal branch December 16, 2025 02:33
@voonhous voonhous linked an issue Jan 3, 2026 that may be closed by this pull request

Labels

size:XL PR with lines of changes > 1000

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Phase 17: AvroSchemaUtils Method Removal

3 participants