@voonhous
Member

@voonhous voonhous commented Dec 12, 2025

Describe the issue this Pull Request addresses

Reference issue: #14282

This PR focuses on migrating usages of AvroSchemaUtils to the internal HoodieSchema abstraction and HoodieSchemaUtils.

Key Changes:

  1. Migration: Migrated logic in key write/read paths (e.g., HoodieTable, HoodieWriteHandle, HoodieRowParquetWriteSupport) to use HoodieSchema and HoodieSchemaCompatibility.
  2. Full Qualification: For classes where migration was not immediately feasible, calls to AvroSchemaUtils static functions have been fully qualified. This explicitly marks technical debt and makes these usages easily searchable for future refactoring.
  3. Utility Enhancements: Added necessary bridge methods to HoodieSchemaUtils and HoodieSchemaCompatibility to support HoodieSchema objects directly.
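To illustrate item 2, a fully qualified call spells out the package at the call site, so a plain text search for the qualified name finds every remaining usage without relying on import statements. A minimal sketch, using java.util.Collections as a stand-in for AvroSchemaUtils (the class name FullyQualifiedCallSketch and the method largest are hypothetical):

```java
import java.util.List;

final class FullyQualifiedCallSketch {

  // Imported style would read `Collections.max(values)` after
  // `import java.util.Collections;`, which hides the package at the call site.
  // Fully qualified style keeps the package visible and searchable.
  static int largest(List<Integer> values) {
    return java.util.Collections.max(values);
  }

  public static void main(String[] args) {
    System.out.println(largest(List.of(3, 7, 5))); // prints 7
  }
}
```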

Classes NOT included in migration (Fully Qualified)

The following classes retain AvroSchemaUtils usage but are now fully qualified:

  1. TestAvroSchemaUtils
  2. AvroSchemaUtils (Self)
  3. AvroSchemaComparatorForRecordProjection
  4. HoodieAvroUtils
  5. HoodieSchemaCompatibility
  6. MissingSchemaFieldException
  7. HoodieSchemaUtils
  8. SchemaBackwardsCompatibilityException
  9. AvroSchemaRepair
  10. TestAvroSchemaRepair
  11. TestHoodieSchemaCompatibility (For consistency testing)

Specific Ignored Usages (getAvroRecordQualifiedName)

Usages of AvroSchemaUtils.getAvroRecordQualifiedName in the following classes are ignored in this PR:

  • BaseHoodieWriteClient
  • HoodieCatalog
  • HoodieHiveCatalog
  • HoodieTableFactory
  • StreamSync
  • TestHoodieTableFactory

Summary and Changelog

This PR is a refactoring effort to improve schema abstraction within Hudi. By moving away from raw Avro utils, we pave the way for better type safety and cleaner internal APIs.

Impact

  • Internal API Change: Methods in HoodieSchemaUtils and HoodieSchemaCompatibility now play a larger role in schema validation and evolution logic.
  • No User-Facing Change: This is a code health and refactoring PR; there are no changes to public configs or external behaviors.

Risk Level

Low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Dec 12, 2025
@voonhous
Member Author

This PR is stacked on top of #17535, please ONLY merge this after #17535 is merged.

@voonhous voonhous force-pushed the phase-17-P2-AvroSchemaUtils-removal branch 7 times, most recently from 64ac0dd to 1d7fdbc Compare December 16, 2025 02:38
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:XL PR with lines of changes > 1000 labels Dec 16, 2025
@voonhous voonhous force-pushed the phase-17-P2-AvroSchemaUtils-removal branch 2 times, most recently from 311cfd7 to 4b9814f Compare December 17, 2025 11:49
@voonhous voonhous force-pushed the phase-17-P2-AvroSchemaUtils-removal branch from 4b9814f to 91797f9 Compare December 19, 2025 18:59
@voonhous
Member Author

Rebased

@voonhous voonhous force-pushed the phase-17-P2-AvroSchemaUtils-removal branch 2 times, most recently from f395ea6 to 2ca8f9f Compare December 20, 2025 08:05
.orElse(null);

if (nonNullType == null) {
throw new org.apache.hudi.internal.schema.HoodieSchemaException(
Contributor

Similarly here, let's import the class

Member Author

Done

HoodieSchema hoodieResult = HoodieSchemaUtils.resolveUnionSchema(hoodieFieldSchema, "TypeA");

// Should produce equivalent schemas
assertEquals(avroResult.toString(), hoodieResult.toString());
Contributor

Can we assert on the objects being equal instead of the string representation by calling .toAvroSchema?
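One possible motivation for this request, sketched under assumptions: a string rendering can hide differences that equals() still sees. The classes below are hypothetical stand-ins, not Hudi or Avro classes, and the lossy toString is contrived to make the point.

```java
import java.util.Objects;

/** Hypothetical stand-in for a schema type; not a Hudi or Avro class. */
final class SketchType {
  private final String name;
  private final boolean nullable;

  SketchType(String name, boolean nullable) {
    this.name = name;
    this.nullable = nullable;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof SketchType)) {
      return false;
    }
    SketchType that = (SketchType) other;
    return nullable == that.nullable && name.equals(that.name);
  }

  @Override
  public int hashCode() {
    return Objects.hash(name, nullable);
  }

  // Deliberately lossy rendering: nullability is not printed.
  @Override
  public String toString() {
    return name;
  }
}

final class EqualityAssertionSketch {

  // Compares only the rendered text, so it misses the nullability difference.
  static boolean sameString(SketchType a, SketchType b) {
    return a.toString().equals(b.toString());
  }

  // Compares every field through equals().
  static boolean sameObject(SketchType a, SketchType b) {
    return a.equals(b);
  }

  public static void main(String[] args) {
    SketchType required = new SketchType("TypeA", false);
    SketchType optional = new SketchType("TypeA", true);
    System.out.println(sameString(required, optional)); // prints true
    System.out.println(sameObject(required, optional)); // prints false
  }
}
```

In a test, the object comparison corresponds to calling assertEquals on the two schema objects rather than on their toString() output.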

Member Author

Done

Contributor

bump on this, it looks like it is not in the latest commit

Member Author

@voonhous voonhous Dec 23, 2025

Oh, removing this as we're removing HoodieAvroUtils#resolveUnionSchema. But I've verified that the change passes before removing it:

image

@voonhous voonhous force-pushed the phase-17-P2-AvroSchemaUtils-removal branch from 2ca8f9f to 41163c4 Compare December 22, 2025 13:43
Contributor

@the-other-tim-brown the-other-tim-brown left a comment

Overall, LGTM. Just one minor item remaining on the tests

@voonhous voonhous force-pushed the phase-17-P2-AvroSchemaUtils-removal branch 3 times, most recently from c16ca0a to 13ac494 Compare December 23, 2025 07:37
@apache apache deleted a comment from hudi-bot Dec 23, 2025
@voonhous voonhous force-pushed the phase-17-P2-AvroSchemaUtils-removal branch from 000b93c to b9e691a Compare December 23, 2025 14:14
HoodieSchema schema = HoodieSchema.parse(schemaWithTimestampMicros);

// HiveTypeUtils.generateColumnTypes throws an exception for timestamp-micros since it's not supported by AvroSerDe
assertThrows(Exception.class, () -> {
Contributor

I think for this case it will currently fall back to the underlying primitive and return a long
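The fallback described here can be sketched roughly as follows. This is not the HiveTypeUtils code: the class, the enum, the hiveTypeName helper, and the table of supported logical types are all hypothetical and chosen only for illustration.

```java
import java.util.Map;

final class LogicalTypeFallbackSketch {

  /** Hypothetical subset of underlying primitive types. */
  enum Primitive { INT, LONG }

  // Assumed, illustrative table of logical types that get a dedicated type name.
  private static final Map<String, String> SUPPORTED = Map.of(
      "date", "date",
      "timestamp-millis", "timestamp");

  static String hiveTypeName(Primitive primitive, String logicalType) {
    if (logicalType != null && SUPPORTED.containsKey(logicalType)) {
      return SUPPORTED.get(logicalType);
    }
    // Absent or unsupported logical type: fall back to the underlying primitive
    // instead of throwing.
    switch (primitive) {
      case INT:
        return "int";
      case LONG:
        return "bigint";
      default:
        throw new IllegalArgumentException("Unsupported primitive: " + primitive);
    }
  }

  public static void main(String[] args) {
    System.out.println(hiveTypeName(Primitive.LONG, "timestamp-micros")); // prints bigint
    System.out.println(hiveTypeName(Primitive.LONG, "timestamp-millis")); // prints timestamp
  }
}
```

Under this behaviour, a test for timestamp-micros would assert on the returned long type rather than expect an exception.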

Member Author

You're right, running the same code on master, I am getting this. Let me recheck all the logical types on master.

image

Member Author

testGenerateColumnTypesForTimeMicros()
testGenerateColumnTypesForTimeMillis()
testGenerateColumnTypesForTimestampMicros()

These 3 tests are failing on master without the HiveTypeUtils migration. Let me fix them.

Member Author

Modified the test + fixed the HiveTypeUtils#generateTypeInfo.


HoodieSchemaType type = schema.getType();
if (type == DECIMAL && AvroSerDe.DECIMAL_TYPE_NAME
.equalsIgnoreCase((String) schema.getProp(AvroSerDe.AVRO_PROP_LOGICAL_TYPE))) {
Contributor

These logical type checks seem to be very avro specific, can we just rely on the HoodieSchemaType now?

Member Author

Yeap, it's doable for all the types in this function except VARCHAR and CHAR.
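A rough sketch of what that could look like, assuming VARCHAR and CHAR have no dedicated member in the type enum and are still told apart by a string property. The enum, the describe helper, and the "logicalType" property key below are hypothetical stand-ins, not the HoodieSchemaType or AvroSerDe definitions.

```java
import java.util.Map;

final class TypeDispatchSketch {

  /** Hypothetical type enum standing in for HoodieSchemaType. */
  enum SketchSchemaType { DECIMAL, DATE, STRING }

  static String describe(SketchSchemaType type, Map<String, String> props) {
    switch (type) {
      // Types with their own enum member need no property lookup.
      case DECIMAL:
        return "decimal";
      case DATE:
        return "date";
      // Assumed case: varchar and char share the string type, so the
      // property check remains for them.
      case STRING:
        String logical = props.get("logicalType");
        if ("varchar".equalsIgnoreCase(logical)) {
          return "varchar";
        }
        if ("char".equalsIgnoreCase(logical)) {
          return "char";
        }
        return "string";
      default:
        throw new IllegalArgumentException("Unsupported type: " + type);
    }
  }

  public static void main(String[] args) {
    System.out.println(describe(SketchSchemaType.DECIMAL, Map.of())); // prints decimal
    System.out.println(describe(SketchSchemaType.STRING, Map.of("logicalType", "varchar"))); // prints varchar
  }
}
```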

Member Author

Done, PTAL

Contributor

@yihua yihua left a comment

LGTM

@voonhous voonhous force-pushed the phase-17-P2-AvroSchemaUtils-removal branch 2 times, most recently from badeef9 to 82775b6 Compare December 24, 2025 07:57
@voonhous voonhous force-pushed the phase-17-P2-AvroSchemaUtils-removal branch from 82775b6 to 03b8fb4 Compare December 24, 2025 11:30
@voonhous
Member Author

voonhous commented Dec 24, 2025

Gosh, the CI is timing out. I'm going to increase the timeout to check whether the Azure CI can succeed without throwing any errors.

NOTE: Before we merge this in, we will need to undo the last commit on this PR to revert the Azure CI timeout.

Edit to add: Not reverting Azure CI timeout increase here as increasing it to 3 hours is better than trying to get it to pass for 8 hours.

@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@voonhous
Member Author

Azure CI + GitHub CI are all green! (FINALLY)

image

Merging this PR in.

@voonhous voonhous merged commit 91c15b1 into apache:master Dec 24, 2025
390 of 417 checks passed
@voonhous voonhous deleted the phase-17-P2-AvroSchemaUtils-removal branch December 24, 2025 18:19
@voonhous voonhous linked an issue Jan 3, 2026 that may be closed by this pull request
PavithranRick pushed a commit to PavithranRick/hudi that referenced this pull request Jan 8, 2026
…e#17581)

* Remove AvroSchemaUtils from HiveAvroSerializer and HiveTypeUtils

* Address comments

* Address comments

* Remove AvroSchemaUtils#resolveUnionSchema

* Add more test to cover generateColumnTypes

* Fix HiveTypeUtils#generateTypeInfo behaviour and ensure that it is not changed

* Address comments

* Increase Azure CI timeout

Labels

size:L PR with lines of changes in (300, 1000]

Development

Successfully merging this pull request may close these issues.

Phase 17: AvroSchemaUtils Method Removal

4 participants