-
Notifications
You must be signed in to change notification settings - Fork 2.5k
feat(schema): Phase 18 - HoodieAvroUtils removal (Part 1) #17599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(schema): Phase 18 - HoodieAvroUtils removal (Part 1) #17599
Conversation
4d983c9 to
9df0e72
Compare
96738d6 to
4aadac8
Compare
f440c96 to
5f6daa5
Compare
|
@hudi-bot run azure |
| .map(this::handlePartitionColumnsIfNeeded); | ||
| } | ||
|
|
||
| public Option<Schema> getTableAvroSchemaIfPresent(boolean includeMetadataFields, Option<HoodieInstant> instant) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mark this as deprecated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will remove this directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will also do a Avro.Schema -> HoodieSchema migration for this class.
| if (!historySchemaStr.isEmpty() || Boolean.parseBoolean(config.getString(HoodieCommonConfig.RECONCILE_SCHEMA.key()))) { | ||
| InternalSchema internalSchema; | ||
| Schema avroSchema = HoodieAvroUtils.createHoodieWriteSchema(config.getSchema(), config.allowOperationMetadataField()); | ||
| HoodieSchema schema = HoodieSchemaUtils.addMetadataFields(HoodieSchema.parse(config.getSchema()), config.allowOperationMetadataField()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just use the HoodieSchemaUtils#createHoodieWriteSchema here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap! Defn!
| return sourceFields.stream().allMatch(fieldToIndex -> { | ||
| Schema schema = getNestedFieldSchemaFromWriteSchema(tableSchema, fieldToIndex); | ||
| return isSecondaryIndexSupportedType(schema); | ||
| Option<Pair<String, HoodieSchemaField>> schema = HoodieSchemaUtils.getNestedField(tableSchema, fieldToIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the option is empty, should we return false here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap, no harm being more defensive here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd opt for throwing an error like what the original HoodieAvroUtils#createHoodieWriteSchema does. similar to the comment for line 154.
| return sourceFields.stream().anyMatch(fieldToIndex -> { | ||
| Schema schema = getNestedFieldSchemaFromWriteSchema(tableSchema, fieldToIndex); | ||
| return schema.getType() != Schema.Type.RECORD && schema.getType() != Schema.Type.ARRAY && schema.getType() != Schema.Type.MAP; | ||
| Option<Pair<String, HoodieSchemaField>> nestedFieldOpt = HoodieSchemaUtils.getNestedField(tableSchema, fieldToIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's throw an exception if the option is not present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| case TIME: | ||
| return true; | ||
| case TIMESTAMP: | ||
| // LOCAL timestamps are not supported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we file a follow up ticket to add support for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap, we can. I'm just transferring the test expectations to actual code since i don't recall seeing it documented anywhere other than tests here. (Have tagged u separately for this)
| String recordNamespace = "hoodie." + tableName; | ||
|
|
||
| return AvroConversionUtils.convertStructTypeToAvroSchema(parquetSchema, structName, recordNamespace); | ||
| return HoodieSchema.fromAvroSchema(AvroConversionUtils.convertStructTypeToAvroSchema(parquetSchema, structName, recordNamespace)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use HoodieSchemaConversionUtils now to convert directly to HoodieSchema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| return null; | ||
| } | ||
| Object value = currentRecord.get(field.pos()); | ||
| Object value = currentRecord.get(fieldOpt.get().pos()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: let's create a local variable field = fieldOpt.get()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| Schema nullableSchema = Schema.createUnion(Schema.create(Schema.Type.NULL),fieldSchema); | ||
| public static HoodieSchema createSchemaWithDefaultValue(TypeDescription orcSchema, String recordName, String namespace, boolean nullable) { | ||
| HoodieSchema hoodieSchema = createSchemaWithNamespace(orcSchema,recordName,namespace); | ||
| List<HoodieSchemaField> fields = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: when we know the size of the list, we should initialize the array list with that size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!,
5f6daa5 to
b8b088e
Compare
|
Note: this is a stacked PR, the base of this needs to be modified after #17581 is merged. |
| Schema decimal = LogicalTypes.decimal(10, 2).addToSchema(Schema.create(Schema.Type.BYTES)); | ||
| Schema uuid = LogicalTypes.uuid().addToSchema(Schema.create(Schema.Type.STRING)); | ||
| Schema localTimestampMillis = LogicalTypes.localTimestampMillis().addToSchema(Schema.create(Schema.Type.LONG)); | ||
| Schema localTimestampMicros = LogicalTypes.localTimestampMicros().addToSchema(Schema.create(Schema.Type.LONG)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@the-other-tim-brown Original tests where it expects local-timestamp to be unsupported.
5a8eb8b to
762b1e3
Compare
89da765 to
a7fd9fa
Compare
| case DECIMAL: | ||
| case TIME: | ||
| case TIMESTAMP: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should break up the logical type method now that we can handle the types in the switch statement more easily and perform the checks here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, let me add that in and replace logicalTypeEquals accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| if (s1IsDecimal) { | ||
| HoodieSchema.Decimal d1 = (HoodieSchema.Decimal) s1; | ||
| HoodieSchema.Decimal d2 = (HoodieSchema.Decimal) s2; | ||
| // Check if both use same underlying representation (FIXED vs BYTES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they are both fixed, the fixed size should also be compared
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, i pushed to the changes to wrong branch... Can you please review again.
| return logicalTypeSchemaEquals(s1, s2); | ||
| } | ||
|
|
||
| private static boolean logicalTypeSchemaEquals(HoodieSchema s1, HoodieSchema s2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should break these up into functions for each type so we can directly call them from schemaEqualsInternal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap, broken them up, i pushed to the wrong branch. (origin instead of voon). Changes should be reflected now, my bad.
7cac821 to
1235700
Compare
| case DATE: | ||
| case UUID: | ||
| return logicalTypeSchemaEquals(s1, s2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these two be moved up to use primitiveSchemaEquals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| */ | ||
| public class HoodieSchemaComparatorForSchemaEvolution { | ||
|
|
||
| protected HoodieSchemaComparatorForSchemaEvolution() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| HoodieSchema.parse(timeMicros) | ||
| )); | ||
| } | ||
| } No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: newline at end of file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| Schema userSchema = new Schema.Parser().parse(writeConfig.getSchema()); | ||
| if (!HoodieAvroUtils.getNullSchema().equals(userSchema)) { | ||
| HoodieSchema userSchema = HoodieSchema.parse(writeConfig.getSchema()); | ||
| if (!HoodieSchema.create(HoodieSchemaType.NULL).equals(userSchema)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: reuse HoodieSchema.NULL_SCHEMA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| .name("localTimestampMillisField").type(localTimestampMillis).noDefault() | ||
| .name("localTimestampMicrosField").type(localTimestampMicros).noDefault() | ||
| .endRecord(); | ||
| HoodieSchemaField.of("decimalField", decimal), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we test both FIXED and BYTES for decimal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| */ | ||
| @Test | ||
| public void testIsEligibleForExpressionIndexWithNullableFields() { | ||
| // An int with default 0 must have the int type defined first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a restriction of Avro?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
| public void testIsEligibleForExpressionIndexWithNullableFields() { | ||
| // An int with default 0 must have the int type defined first. | ||
| // If null is defined first, which HoodieSchema#createNullable does, an error will be thrown | ||
| HoodieSchema nullableIntWithDefault = HoodieSchema.createUnion(HoodieSchema.create(HoodieSchemaType.INT), HoodieSchema.create(HoodieSchemaType.NULL)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does HoodieSchema.createNullable work in this case, instead of calling HoodieSchema.createUnion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return schema.getType().hashCode(); | ||
| } | ||
| } | ||
| } No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's add a new line at the end of the file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| private boolean fixedSchemaEquals(HoodieSchema s1, HoodieSchema s2) { | ||
| return validateFixed(s1, s2); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API seems redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will pop validateFixed's logic to fixedSchemaEquals, i.e. validateFixed will be removed
| return true; // Regular LONG | ||
| case DOUBLE: | ||
| return true; // Support DOUBLE type | ||
| case DATE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test says FLOAT is supported but it seems that FLOAT type check is missing here?
public void testIsEligibleForSecondaryIndexWithUnsupportedDataTypes() {
// Given: A schema with unsupported data types for secondary index (Boolean, Decimal)
// Note: Float and Double are now supported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me verify this, the comment is a little confusing. It says: Float and Double are now supported, but the test itself for Float is a test for assertThrows but for Double, it's a assertDoesNotThrow. Might need to check separately with @linliu-code what's going on here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since comment and configs suggests that Float is supported, i will add a case Float.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@voonhous , i do not see there are any reason that Float cannot be supported. Please add a Float case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, fixed the test and enabled Float support.
| @ParameterizedTest | ||
| @MethodSource("schemaTestParams") | ||
| void testGetTableAvroSchema(Schema inputSchema, boolean includeMetadataFields, Schema expectedSchema) throws Exception { | ||
| void testGetTableAvroSchema(HoodieSchema inputSchema, boolean includeMetadataFields, HoodieSchema expectedSchema) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename the test methods so that there's no Avro in the method names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
| * @return true if each field's data types are supported, false otherwise | ||
| */ | ||
| public static boolean validateDataTypeForSecondaryOrExpressionIndex(List<String> sourceFields, Schema tableSchema) { | ||
| public static boolean validateDataTypeForSecondaryOrExpressionIndex(List<String> sourceFields, HoodieSchema tableSchema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to follow up separately why there are two methods: validateDataTypeForSecondaryIndex, validateDataTypeForSecondaryOrExpressionIndex. From the naming, they seem to overlap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will add a sub task for this.
| Option<Pair<String, HoodieSchemaField>> schema = HoodieSchemaUtils.getNestedField(tableSchema, fieldToIndex); | ||
| if (schema.isEmpty()) { | ||
| throw new HoodieException("Failed to get schema. Not a valid field name: " + fieldToIndex); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be wrapped into a util getFieldOrThrow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea!
| List<String> symbols1 = s1.getEnumSymbols(); | ||
| List<String> symbols2 = s2.getEnumSymbols(); | ||
|
|
||
| // Quick size check before creating sets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Comment is outdated mentioning sets, but List is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed comment.
c477154 to
3bd409f
Compare
a4939d1 to
8224033
Compare
Additional compilation error fixes - AvroOrcUtils - HoodieBootstrapSchemaProvider - HoodieSparkBootstrapSchemaProvider - TestOrcBootstrap - Enable float support in secondary index - Fix test: test Secondary Index With All DataTypes - Fix test: testValidateDataTypeForSecondaryIndex
8224033 to
c94921b
Compare
Additional compilation error fixes - AvroOrcUtils - HoodieBootstrapSchemaProvider - HoodieSparkBootstrapSchemaProvider - TestOrcBootstrap - Enable float support in secondary index - Fix test: test Secondary Index With All DataTypes - Fix test: testValidateDataTypeForSecondaryIndex


Describe the issue this Pull Request addresses
Reference issue: #14283
Remove methods that were migrated to
HoodieSchemaUtils, consolidate remaining Avro-specific utilities, update documentation. The scope here only covers:hudi-clihudi-client-commonSpecific Ignored Usages
The following classes ignore
HoodieAvroUtils:HoodieCDCLoggerKeyGenUtilsTimestampBasedAvroKeyGeneratorTestHoodieAvroParquetWriterRawTripTestPayloadKeyGeneratorKey Changes:
HoodieAvroUtilsstatic functions have been fully qualified. This explicitly marks technical debt and makes these usages easily searchable for future refactoring.Summary and Changelog
Swap out HoodieAvroUtils to HoodieSchema equivalent.
Impact
None
Risk Level
Low
Documentation Update
Contributor's checklist