@voonhous (Member) commented Dec 15, 2025

Describe the issue this Pull Request addresses

Reference issue: #14283

Remove methods that were migrated to HoodieSchemaUtils, consolidate the remaining Avro-specific utilities, and update documentation. The scope here covers only:

  1. hudi-cli
  2. hudi-client-common

Specific Ignored Usages

HoodieAvroUtils usages in the following classes are intentionally left unmigrated:

  1. HoodieCDCLogger
  2. KeyGenUtils
  3. TimestampBasedAvroKeyGenerator
  4. TestHoodieAvroParquetWriter
  5. RawTripTestPayloadKeyGenerator

Key Changes:

  1. Migration: Swapping out HoodieAvroUtils wherever possible.
  2. Full Qualification: For classes where migration was not immediately feasible, calls to HoodieAvroUtils static functions have been fully qualified. This explicitly marks technical debt and makes these usages easily searchable for future refactoring.

Summary and Changelog

Swap out HoodieAvroUtils for its HoodieSchema equivalents.

Impact

None

Risk Level

Low

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@voonhous force-pushed the phase-18-HoodieAvroUtils-removal branch from 4d983c9 to 9df0e72 Dec 15, 2025 16:06
@github-actions bot added the size:XL (PR with lines of changes > 1000) label Dec 15, 2025
@voonhous force-pushed the phase-18-HoodieAvroUtils-removal branch 2 times, most recently from 96738d6 to 4aadac8 Dec 16, 2025 02:53
@voonhous changed the title from "feat(schema): Phase - 18 hoodie avro utils removal (hudi-client-common)" to "feat(schema): Phase 18 - hoodie avro utils removal (hudi-client-common)" Dec 18, 2025
@voonhous force-pushed the phase-18-HoodieAvroUtils-removal branch 2 times, most recently from f440c96 to 5f6daa5 Dec 20, 2025 09:41
@voonhous changed the title from "feat(schema): Phase 18 - hoodie avro utils removal (hudi-client-common)" to "feat(schema): Phase 18 - HoodieAvroUtils removal (hudi-client-common)" Dec 20, 2025
@voonhous changed the title from "feat(schema): Phase 18 - HoodieAvroUtils removal (hudi-client-common)" to "feat(schema): Phase 18 - HoodieAvroUtils removal (Part 1)" Dec 20, 2025
@voonhous
Member Author

@hudi-bot run azure

.map(this::handlePartitionColumnsIfNeeded);
}

public Option<Schema> getTableAvroSchemaIfPresent(boolean includeMetadataFields, Option<HoodieInstant> instant) {
Contributor

Should we mark this as deprecated?

Member Author (@voonhous, Dec 23, 2025)

Yes, will remove this directly.

Member Author

Will also do an Avro.Schema -> HoodieSchema migration for this class.

if (!historySchemaStr.isEmpty() || Boolean.parseBoolean(config.getString(HoodieCommonConfig.RECONCILE_SCHEMA.key()))) {
InternalSchema internalSchema;
Schema avroSchema = HoodieAvroUtils.createHoodieWriteSchema(config.getSchema(), config.allowOperationMetadataField());
HoodieSchema schema = HoodieSchemaUtils.addMetadataFields(HoodieSchema.parse(config.getSchema()), config.allowOperationMetadataField());
Contributor

Should we just use the HoodieSchemaUtils#createHoodieWriteSchema here?

Member Author

Yeap, definitely!

return sourceFields.stream().allMatch(fieldToIndex -> {
Schema schema = getNestedFieldSchemaFromWriteSchema(tableSchema, fieldToIndex);
return isSecondaryIndexSupportedType(schema);
Option<Pair<String, HoodieSchemaField>> schema = HoodieSchemaUtils.getNestedField(tableSchema, fieldToIndex);
Contributor

If the option is empty, should we return false here?

Member Author

Yeap, no harm being more defensive here.

Member Author

I'd opt for throwing an error, like what the original HoodieAvroUtils#createHoodieWriteSchema does, similar to the comment for line 154.

return sourceFields.stream().anyMatch(fieldToIndex -> {
Schema schema = getNestedFieldSchemaFromWriteSchema(tableSchema, fieldToIndex);
return schema.getType() != Schema.Type.RECORD && schema.getType() != Schema.Type.ARRAY && schema.getType() != Schema.Type.MAP;
Option<Pair<String, HoodieSchemaField>> nestedFieldOpt = HoodieSchemaUtils.getNestedField(tableSchema, fieldToIndex);
Contributor

Let's throw an exception if the option is not present?

Member Author

Done!

case TIME:
return true;
case TIMESTAMP:
// LOCAL timestamps are not supported
Contributor

Should we file a follow up ticket to add support for this?

Member Author

Yeap, we can. I'm just transferring the test expectations to actual code since I don't recall seeing it documented anywhere other than the tests here. (I have tagged you separately for this.)

String recordNamespace = "hoodie." + tableName;

return AvroConversionUtils.convertStructTypeToAvroSchema(parquetSchema, structName, recordNamespace);
return HoodieSchema.fromAvroSchema(AvroConversionUtils.convertStructTypeToAvroSchema(parquetSchema, structName, recordNamespace));
Contributor

You can use HoodieSchemaConversionUtils now to convert directly to HoodieSchema

Member Author

Done!

return null;
}
Object value = currentRecord.get(field.pos());
Object value = currentRecord.get(fieldOpt.get().pos());
Contributor

nitpick: let's create a local variable field = fieldOpt.get()

Member Author

Done

Schema nullableSchema = Schema.createUnion(Schema.create(Schema.Type.NULL),fieldSchema);
public static HoodieSchema createSchemaWithDefaultValue(TypeDescription orcSchema, String recordName, String namespace, boolean nullable) {
HoodieSchema hoodieSchema = createSchemaWithNamespace(orcSchema,recordName,namespace);
List<HoodieSchemaField> fields = new ArrayList<>();
Contributor

nitpick: when we know the size of the list, we should initialize the array list with that size

Member Author

Done!
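
As an illustration of the nitpick above, here is a minimal sketch of pre-sizing an ArrayList when the final element count is known up front. The class and method names (PresizedListSketch, upperCaseAll) are hypothetical and are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical example class; not part of the Hudi code base. */
final class PresizedListSketch {

  private PresizedListSketch() {
  }

  /** Builds a result list whose capacity matches the input size, avoiding internal array resizes. */
  static List<String> upperCaseAll(List<String> names) {
    // The output has exactly one element per input element, so the size is known in advance.
    List<String> result = new ArrayList<>(names.size());
    for (String name : names) {
      result.add(name.toUpperCase(Locale.ROOT));
    }
    return result;
  }
}
```

The capacity argument only affects allocation behaviour; the returned list is otherwise identical to one created with the no-argument constructor.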

@voonhous force-pushed the phase-18-HoodieAvroUtils-removal branch from 5f6daa5 to b8b088e Dec 23, 2025 14:45
@voonhous changed the base branch from master to phase-17-P2-AvroSchemaUtils-removal Dec 23, 2025 14:45
@voonhous
Member Author

Note: this is a stacked PR, the base of this needs to be modified after #17581 is merged.

Comment on lines -237 to -240
Schema decimal = LogicalTypes.decimal(10, 2).addToSchema(Schema.create(Schema.Type.BYTES));
Schema uuid = LogicalTypes.uuid().addToSchema(Schema.create(Schema.Type.STRING));
Schema localTimestampMillis = LogicalTypes.localTimestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
Schema localTimestampMicros = LogicalTypes.localTimestampMicros().addToSchema(Schema.create(Schema.Type.LONG));
Member Author

@the-other-tim-brown These are the original tests, which expect local-timestamp to be unsupported.

@voonhous force-pushed the phase-18-HoodieAvroUtils-removal branch from 5a8eb8b to 762b1e3 Dec 23, 2025 19:09
@voonhous changed the base branch from phase-17-P2-AvroSchemaUtils-removal to master Dec 24, 2025 06:40
@voonhous force-pushed the phase-18-HoodieAvroUtils-removal branch 3 times, most recently from 89da765 to a7fd9fa Dec 27, 2025 08:23
Comment on lines 175 to 185
case DECIMAL:
case TIME:
case TIMESTAMP:
Contributor

I think we should break up the logical type method now that we can handle the types in the switch statement more easily and perform the checks here

Member Author

Good catch, let me add that in and replace logicalTypeEquals accordingly.

Member Author (@voonhous, Dec 29, 2025)

Done

if (s1IsDecimal) {
HoodieSchema.Decimal d1 = (HoodieSchema.Decimal) s1;
HoodieSchema.Decimal d2 = (HoodieSchema.Decimal) s2;
// Check if both use same underlying representation (FIXED vs BYTES)
Contributor

If they are both fixed, the fixed size should also be compared

Member Author

Sorry, I pushed the changes to the wrong branch. Can you please review again?
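
The review point about decimals can be sketched as follows: two decimal schemas are treated as equal only if they use the same underlying representation and, when both are FIXED, the same fixed size. This is a simplified stand-in and not the PR's implementation. The DecimalInfo class, the Repr enum, and the precision and scale checks are assumptions made for illustration.

```java
/** Hypothetical, simplified model of decimal schema comparison; not the Hudi implementation. */
final class DecimalEqualsSketch {

  private DecimalEqualsSketch() {
  }

  /** Assumed physical representations of a decimal logical type. */
  enum Repr { FIXED, BYTES }

  /** Assumed minimal view of a decimal schema. fixedSize is only meaningful for FIXED. */
  static final class DecimalInfo {
    final int precision;
    final int scale;
    final Repr repr;
    final int fixedSize;

    DecimalInfo(int precision, int scale, Repr repr, int fixedSize) {
      this.precision = precision;
      this.scale = scale;
      this.repr = repr;
      this.fixedSize = fixedSize;
    }
  }

  static boolean decimalEquals(DecimalInfo d1, DecimalInfo d2) {
    // Assumption: precision and scale must match for the logical types to be equal.
    if (d1.precision != d2.precision || d1.scale != d2.scale) {
      return false;
    }
    // Both sides must use the same underlying representation (FIXED vs BYTES).
    if (d1.repr != d2.repr) {
      return false;
    }
    // When both are FIXED, the fixed size is compared as well, as requested in the review.
    return d1.repr != Repr.FIXED || d1.fixedSize == d2.fixedSize;
  }
}
```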

return logicalTypeSchemaEquals(s1, s2);
}

private static boolean logicalTypeSchemaEquals(HoodieSchema s1, HoodieSchema s2) {
Contributor

I think we should break these up into functions for each type so we can directly call them from schemaEqualsInternal

Member Author

Yeap, I've broken them up. I pushed to the wrong branch (origin instead of voon). The changes should be reflected now, my bad.

@voonhous force-pushed the phase-18-HoodieAvroUtils-removal branch from 7cac821 to 1235700 Dec 29, 2025 04:00
Comment on lines 182 to 184
case DATE:
case UUID:
return logicalTypeSchemaEquals(s1, s2);
Contributor

Can these two be moved up to use primitiveSchemaEquals?

Member Author

Done.

*/
public class HoodieSchemaComparatorForSchemaEvolution {

protected HoodieSchemaComparatorForSchemaEvolution() {
Contributor

Let's make this private?

Member Author

Done

HoodieSchema.parse(timeMicros)
));
}
} No newline at end of file
Contributor

nitpick: newline at end of file

Member Author

Done

Schema userSchema = new Schema.Parser().parse(writeConfig.getSchema());
if (!HoodieAvroUtils.getNullSchema().equals(userSchema)) {
HoodieSchema userSchema = HoodieSchema.parse(writeConfig.getSchema());
if (!HoodieSchema.create(HoodieSchemaType.NULL).equals(userSchema)) {
Contributor

nit: reuse HoodieSchema.NULL_SCHEMA

Member Author

Done

.name("localTimestampMillisField").type(localTimestampMillis).noDefault()
.name("localTimestampMicrosField").type(localTimestampMicros).noDefault()
.endRecord();
HoodieSchemaField.of("decimalField", decimal),
Contributor

Should we test both FIXED and BYTES for decimal?

Member Author

Done

*/
@Test
public void testIsEligibleForExpressionIndexWithNullableFields() {
// An int with default 0 must have the int type defined first.
Contributor

Is this a restriction of Avro?

Member Author

Yes.

public void testIsEligibleForExpressionIndexWithNullableFields() {
// An int with default 0 must have the int type defined first.
// If null is defined first, which HoodieSchema#createNullable does, an error will be thrown
HoodieSchema nullableIntWithDefault = HoodieSchema.createUnion(HoodieSchema.create(HoodieSchemaType.INT), HoodieSchema.create(HoodieSchemaType.NULL));
Contributor

Does HoodieSchema.createNullable work in this case, instead of calling HoodieSchema.createUnion?

Member Author

Nope, it's a restriction of Avro, here's the error if we use HoodieSchema.createNullable:

[image: screenshot of the error]
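
The ordering restriction described in this thread can be modelled with a small check: a field default is accepted only when its type matches the first branch of the union. This is a sketch of the rule as stated in the conversation, not Avro's validation code, and the class name, method name, and string type names are hypothetical. Whether a given Avro version enforces exactly this rule is an assumption.

```java
import java.util.List;

/** Hypothetical model of the "default must match the first union branch" rule discussed above. */
final class UnionDefaultSketch {

  private UnionDefaultSketch() {
  }

  /** Returns true when the default value's type corresponds to the first branch of the union. */
  static boolean defaultMatchesFirstBranch(List<String> unionBranches, Object defaultValue) {
    if (unionBranches.isEmpty()) {
      return false;
    }
    String first = unionBranches.get(0);
    if (defaultValue == null) {
      return "null".equals(first);
    }
    if (defaultValue instanceof Integer) {
      return "int".equals(first);
    }
    // Other types are omitted to keep the sketch short.
    return false;
  }
}
```

Under this model, a union of [int, null] accepts a default of 0, while [null, int] does not, which matches why the test builds the union with createUnion and puts INT first.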

return schema.getType().hashCode();
}
}
} No newline at end of file
Contributor

nit: let's add a new line at the end of the file

Member Author

Done

Comment on lines 287 to 285
private boolean fixedSchemaEquals(HoodieSchema s1, HoodieSchema s2) {
return validateFixed(s1, s2);
}
Contributor

This API seems redundant?

Member Author

Will move validateFixed's logic into fixedSchemaEquals, i.e. validateFixed will be removed.

return true; // Regular LONG
case DOUBLE:
return true; // Support DOUBLE type
case DATE:
Contributor

Test says FLOAT is supported but it seems that FLOAT type check is missing here?

  public void testIsEligibleForSecondaryIndexWithUnsupportedDataTypes() {
    // Given: A schema with unsupported data types for secondary index (Boolean, Decimal)
    // Note: Float and Double are now supported

Member Author

Let me verify this; the comment is a little confusing. It says Float and Double are now supported, but the test uses assertThrows for Float and assertDoesNotThrow for Double. I might need to check separately with @linliu-code about what's going on here.

Member Author

Since the comment and configs suggest that Float is supported, I will add a case for Float.

Collaborator

@voonhous, I do not see any reason that Float cannot be supported. Please add a Float case.

Member Author

Sure, fixed the test and enabled Float support.
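
The outcome of this thread can be sketched as a type switch in which FLOAT joins the supported types. This is a simplified illustration that covers only types mentioned in the conversation. The FieldType enum, the isLocalTimestamp flag, and the method name are hypothetical, and the real check in Hudi may cover more types or be structured differently.

```java
/** Hypothetical sketch of a secondary index type check; not the Hudi implementation. */
final class SecondaryIndexTypeSketch {

  private SecondaryIndexTypeSketch() {
  }

  /** Assumed subset of schema types, limited to those mentioned in the review thread. */
  enum FieldType { BOOLEAN, FLOAT, DOUBLE, LONG, DECIMAL, TIME, TIMESTAMP }

  static boolean isSupported(FieldType type, boolean isLocalTimestamp) {
    switch (type) {
      case LONG:
      case FLOAT: // added after the review discussion
      case DOUBLE:
      case TIME:
        return true;
      case TIMESTAMP:
        // LOCAL timestamps are not supported, per the code comment quoted earlier.
        return !isLocalTimestamp;
      default:
        // BOOLEAN and DECIMAL are listed as unsupported in the quoted test comment.
        return false;
    }
  }
}
```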

@ParameterizedTest
@MethodSource("schemaTestParams")
void testGetTableAvroSchema(Schema inputSchema, boolean includeMetadataFields, Schema expectedSchema) throws Exception {
void testGetTableAvroSchema(HoodieSchema inputSchema, boolean includeMetadataFields, HoodieSchema expectedSchema) throws Exception {
Contributor

Rename the test methods so that there's no Avro in the method names?

Member Author

Done!

* @return true if each field's data types are supported, false otherwise
*/
public static boolean validateDataTypeForSecondaryOrExpressionIndex(List<String> sourceFields, Schema tableSchema) {
public static boolean validateDataTypeForSecondaryOrExpressionIndex(List<String> sourceFields, HoodieSchema tableSchema) {
Contributor

We need to follow up separately why there are two methods: validateDataTypeForSecondaryIndex, validateDataTypeForSecondaryOrExpressionIndex. From the naming, they seem to overlap.

Member Author (@voonhous, Dec 30, 2025)

Sure, will add a sub task for this.

#17750

Comment on lines 141 to 144
Option<Pair<String, HoodieSchemaField>> schema = HoodieSchemaUtils.getNestedField(tableSchema, fieldToIndex);
if (schema.isEmpty()) {
throw new HoodieException("Failed to get schema. Not a valid field name: " + fieldToIndex);
}
Contributor

Should be wrapped into a util getFieldOrThrow?

Member Author

Good idea!
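
A helper of the kind suggested here could look roughly like the sketch below. It uses java.util.Optional and a Map as stand-ins for Hudi's Option and HoodieSchema so that it is self-contained. The class name, both method signatures, and the exception type are assumptions, and the actual getFieldOrThrow in the PR may differ.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in: a schema is modelled as a map from dotted field path to type name. */
final class FieldLookupSketch {

  private FieldLookupSketch() {
  }

  /** Mirrors the optional-returning lookup in the quoted snippet; empty when the path is unknown. */
  static Optional<String> getNestedField(Map<String, String> schema, String fieldPath) {
    return Optional.ofNullable(schema.get(fieldPath));
  }

  /** Wraps the lookup and the emptiness check so that callers fail fast with the field name. */
  static String getFieldOrThrow(Map<String, String> schema, String fieldPath) {
    return getNestedField(schema, fieldPath)
        .orElseThrow(() -> new IllegalArgumentException(
            "Failed to get schema. Not a valid field name: " + fieldPath));
  }
}
```

Centralising the check this way keeps the error message consistent across call sites that previously repeated the isEmpty test.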

List<String> symbols1 = s1.getEnumSymbols();
List<String> symbols2 = s2.getEnumSymbols();

// Quick size check before creating sets
Contributor

nit: Comment is outdated mentioning sets, but List is used.

Member Author

Fixed comment.

Additional compilation error fixes
- AvroOrcUtils
- HoodieBootstrapSchemaProvider
- HoodieSparkBootstrapSchemaProvider
- TestOrcBootstrap
- Enable float support in secondary index
- Fix test: test Secondary Index With All DataTypes
- Fix test: testValidateDataTypeForSecondaryIndex
@voonhous force-pushed the phase-18-HoodieAvroUtils-removal branch from 8224033 to c94921b Dec 31, 2025 03:53
@hudi-bot
Collaborator

CI report:

@voonhous
Member Author

CI green, merging this in.

@voonhous merged commit 5124ad7 into apache:master Dec 31, 2025
124 of 210 checks passed
@voonhous linked an issue Jan 3, 2026 that may be closed by this pull request
PavithranRick pushed a commit to PavithranRick/hudi that referenced this pull request Jan 8, 2026
@voonhous deleted the phase-18-HoodieAvroUtils-removal branch January 15, 2026 17:11
Labels

size:XL PR with lines of changes > 1000

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Phase 18: HoodieAvroUtils Method Removal

5 participants