merge hive native schema and avro schema literal if they are inconsistent #56
…ey are not consistent, the merging logic is to use hive schema as the source of truth for types, while augmenting other metadata/attributes such as casing, docstring, default value etc from avro schema
wmoustafa
left a comment
Overall, it seems there are many opportunities to simplify. In addition to the regular comments, I have pointed out the places where the patch can be simplified. Let us simplify and see how it looks then.
| // We don't cache the structType because otherwise it could be possible that a field | ||
| // "lastname" is of type "firstname", where firstname is a compiled class. | ||
| // This will lead to ambiguity. |
As per a previous comment in #55, we may remove this comment if it cannot be explained.
| List<String> fieldNames = typeInfo.getAllStructFieldNames(); | ||
| for (String fieldName : fieldNames) { |
Merge these two lines into one.
| List<String> fieldNames = typeInfo.getAllStructFieldNames(); | ||
| for (String fieldName : fieldNames) { | ||
| final TypeInfo fieldTypeInfo = typeInfo.getStructFieldTypeInfo(fieldName); |
It is better to iterate over fieldNames and the field type infos in parallel. Looking at the implementation of getStructFieldTypeInfo(), it scans the list again on every call.
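A minimal sketch of the parallel iteration suggested here. It uses plain Java lists as stand-ins for the struct's name list and type list (in Hive these would presumably come from `getAllStructFieldNames()` and `getAllStructFieldTypeInfos()` on the struct type info). The `describeFields` helper and the string-valued "types" are hypothetical and exist only for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParallelFieldIteration {
  // Hypothetical helper: pairs each field name with its type by index,
  // instead of looking the type up by name on every iteration.
  public static List<String> describeFields(List<String> fieldNames, List<String> fieldTypes) {
    if (fieldNames.size() != fieldTypes.size()) {
      throw new IllegalArgumentException("names and types must have the same length");
    }
    List<String> result = new ArrayList<>();
    for (int i = 0; i < fieldNames.size(); i++) {
      // One indexed access per list; no per-field search by name.
      result.add(fieldNames.get(i) + ":" + fieldTypes.get(i));
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList("id", "name");
    List<String> types = Arrays.asList("int", "string");
    System.out.println(describeFields(names, types));
  }
}
```

With a by-name lookup that scans the list, the loop does work proportional to the square of the field count; walking both lists by index keeps it linear.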
| ObjectInspector.Category category = typeInfo.getCategory(); | ||
| switch (category) { |
No need for a designated variable. You may combine these into one line.
| private static final String BYTE_TYPE_NAME = "byte"; | ||
| static Schema convertTypeInfoToAvroSchema(TypeInfo typeInfo, String recordNamespace, String recordName) { | ||
| Schema schema; |
No need for this line. Just return the value immediately inside each case branch.
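A rough sketch of what returning directly from each case branch could look like. The `Category` enum below is a stand-in for Hive's `ObjectInspector.Category`, and `avroKindFor` with its string results is a hypothetical simplification of the converter, not the code in this patch:

```java
public class ReturnPerBranch {
  // Stand-in for Hive's ObjectInspector.Category; an assumption for this sketch.
  public enum Category { PRIMITIVE, LIST, MAP, STRUCT, UNION }

  // Hypothetical converter: each branch returns directly, so no mutable
  // local is declared up front and assigned inside the switch.
  public static String avroKindFor(Category category) {
    switch (category) {
      case PRIMITIVE:
        return "primitive";
      case LIST:
        return "array";
      case MAP:
        return "map";
      case STRUCT:
        return "record";
      case UNION:
        return "union";
      default:
        throw new IllegalArgumentException("Unsupported category: " + category);
    }
  }

  public static void main(String[] args) {
    System.out.println(avroKindFor(Category.STRUCT));
  }
}
```

The same shape also allows switching on the accessor call itself rather than on a separate local variable.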
| } catch (Exception e) { | ||
| tableSchema = avroSchema; | ||
| } |
This is weird. What is the reason we need to do this?
| boolean isHiveSchemaEvolved = | ||
| LegacyHiveSchemaUtils.isRecordSchemaEvolved(avroSchemaWithoutNullable, hiveSchemaWithoutNullable); |
Why are we describing this in terms of evolution? We are simply comparing a Hive schema to a supplemental schema. They either match or do not match (according to criteria defined in the method). There is no evolution involved, and no notion of new or old.
| boolean isHiveSchemaEvolved = | ||
| LegacyHiveSchemaUtils.isRecordSchemaEvolved(avroSchemaWithoutNullable, hiveSchemaWithoutNullable); | ||
| if (isHiveSchemaEvolved) { |
I am not sure why this check is required. Could you explain why it matters? If it turns out that it is not needed, this patch will be significantly simplified.
| for (Schema.Field oldField : oldSchemaFields) { | ||
| Schema.Field newField = newSchemaFieldsMap.get(oldField.name().toLowerCase()); | ||
| if (isSchemaEvolved(oldField.schema(), newField.schema())) { |
I do not think there is a reason to use the word "evolved" in this patch.
| return Schema.createUnion(unionSchemas); | ||
| } | ||
| static boolean isRecordSchemaEvolved(Schema oldSchema, Schema newSchema) { |
Why do we have two methods with slightly different names, isRecordSchemaEvolved and isSchemaEvolved?
The bugs and differences I observed through the tests implemented in #57:
Bugs
Differences
We may need more comprehensive testing for deeper levels of nesting of lists/maps/structs. I didn't feel it was absolutely necessary in #57 because of the way the visitor works, but it might be helpful to add them.
Add support for merging the Hive native schema and the Avro schema literal when they are inconsistent. The merging logic uses the Hive schema as the source of truth for types, while augmenting other metadata/attributes such as casing, docstrings, and default values from the Avro schema.
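To make the merging rule concrete, here is a simplified sketch under stated assumptions. It is not the implementation in this patch: the `Field` class is a hypothetical flat field model (not Avro's `Schema.Field`), types are plain strings, nesting is ignored, and fields that appear only in the Avro schema are dropped, which is a choice made for this illustration rather than something the description specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class SchemaMergeSketch {
  // Hypothetical, simplified field model; not the Avro Schema.Field API.
  public static final class Field {
    public final String name;
    public final String type;
    public final String doc;
    public final Object defaultValue;

    public Field(String name, String type, String doc, Object defaultValue) {
      this.name = name;
      this.type = type;
      this.doc = doc;
      this.defaultValue = defaultValue;
    }
  }

  // hiveFields maps Hive column name -> type name, in column order.
  // Types always come from Hive; casing, doc, and default come from the
  // Avro field whose name matches case-insensitively, when one exists.
  public static List<Field> merge(Map<String, String> hiveFields, List<Field> avroFields) {
    Map<String, Field> avroByLowerName = new HashMap<>();
    for (Field f : avroFields) {
      avroByLowerName.put(f.name.toLowerCase(Locale.ROOT), f);
    }
    List<Field> merged = new ArrayList<>();
    for (Map.Entry<String, String> e : hiveFields.entrySet()) {
      Field avro = avroByLowerName.get(e.getKey().toLowerCase(Locale.ROOT));
      if (avro == null) {
        // No Avro counterpart: keep the Hive name and type, no extra metadata.
        merged.add(new Field(e.getKey(), e.getValue(), null, null));
      } else {
        merged.add(new Field(avro.name, e.getValue(), avro.doc, avro.defaultValue));
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> hive = new LinkedHashMap<>();
    hive.put("firstname", "string");
    hive.put("age", "bigint");
    List<Field> avro = Arrays.asList(new Field("firstName", "string", "Given name", ""));
    for (Field f : merge(hive, avro)) {
      System.out.println(f.name + " " + f.type + " " + f.doc);
    }
  }
}
```

In this sketch the Hive column `firstname` picks up the Avro casing `firstName` and its docstring, while `age`, which has no Avro counterpart, keeps its Hive name and type unchanged.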