Skip to content

Conversation

@rzhang10
Copy link
Member

add support to merge hive native schema and avro schema literal if they are not consistent, the merging logic is to use hive schema as the source of truth for types, while augmenting other metadata/attributes such as casing, docstring, default value etc from avro schema.

…ey are not consistent, the merging logic is to use hive schema as the source of truth for types, while augmenting other metadata/attributes such as casing, docstring, default value etc from avro schema
Copy link

@wmoustafa wmoustafa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, seems there are lots of opportunities to simplify. In addition to regular comments, I have pointed out the places where the patch can be simplified. Let us simplify and see how it looks then.

Comment on lines +52 to +54
// We don't cache the structType because otherwise it could be possible that a field
// "lastname" is of type "firstname", where firstname is a compiled class.
// This will lead to ambiguity.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per a previous comment in #55, we may remove this comment if it is not explainable.

Comment on lines +112 to +113
List<String> fieldNames = typeInfo.getAllStructFieldNames();
for (String fieldName : fieldNames) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge these two lines into one.


List<String> fieldNames = typeInfo.getAllStructFieldNames();
for (String fieldName : fieldNames) {
final TypeInfo fieldTypeInfo = typeInfo.getStructFieldTypeInfo(fieldName);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to iterate on fieldNames and fieldTypeInfo in parallel. Looking at the implementation of getStructFieldTypeInfo(), it iterates on the list again for every call.

Comment on lines +48 to +50
ObjectInspector.Category category = typeInfo.getCategory();

switch (category) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for designated variable. You may combine to one line.

private static final String BYTE_TYPE_NAME = "byte";

static Schema convertTypeInfoToAvroSchema(TypeInfo typeInfo, String recordNamespace, String recordName) {
Schema schema;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this line. Just return the value immediately inside each case branch.

Comment on lines +73 to +75
} catch (Exception e) {
tableSchema = avroSchema;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird. What is the reason we need to do this?

Comment on lines +67 to +68
boolean isHiveSchemaEvolved =
LegacyHiveSchemaUtils.isRecordSchemaEvolved(avroSchemaWithoutNullable, hiveSchemaWithoutNullable);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we describing this in terms of evolution? We are simply comparing a Hive schema to a Supplemental schema. They either match or not match (according to some criteria that is defined in the method). Otherwise, there is no evolution, new, old, etc.

boolean isHiveSchemaEvolved =
LegacyHiveSchemaUtils.isRecordSchemaEvolved(avroSchemaWithoutNullable, hiveSchemaWithoutNullable);

if (isHiveSchemaEvolved) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why this check is required. Could you explain why it matters? If it turns it is not needed, this patch will be significantly simplified.

for (Schema.Field oldField : oldSchemaFields) {
Schema.Field newField = newSchemaFieldsMap.get(oldField.name().toLowerCase());

if (isSchemaEvolved(oldField.schema(), newField.schema())) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think there is a reason to used the word evolved in this patch.

return Schema.createUnion(unionSchemas);
}

static boolean isRecordSchemaEvolved(Schema oldSchema, Schema newSchema) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have two methods with slightly different names isRecordSchemaEvolved and isSchemaEvolved?

@shardulm94
Copy link
Contributor

shardulm94 commented Feb 22, 2021

The bugs and differences I observed through the tests implemented in #57

Bugs

  • TestMergeHiveSchemaWithAvro#shouldRetainExtraFieldsFromHive If there are extra top level struct fields in Hive, the nullability of those fields is not handled correctly. e.g if the extra field is a struct<a:int>, the struct is marked as optional, but a is marked as required. a should also be nullable.
  • TestMergeHiveSchemaWithAvro#shouldHandleLists and TestMergeHiveSchemaWithAvro#shouldHandleMaps If there are extra top level list or map fields in Hive, there are not retained in the resultant schema.
  • TestMergeHiveSchemaWithAvro#shouldSanitizeIncompatibleFieldNames If the table contains field names not allowed by Avro identifier spec i.e. [A-Za-z0-9_], they are not sanitized and Hive to Avro conversion fails

Differences

  • TestMergeHiveSchemaWithAvro#shouldFailIfContainerTypesDontMatch This implementation does not fail if Hive type is a container type viz. STRUCT, LIST and MAP and if they Avro type does not match.

We may need more comprehensive testing for deeper levels of nesting of list/map/structs. I didn't feel is was absolutely necessary in #57 because of the way the visitor works, but it might be helpful to add them.

@wmoustafa wmoustafa closed this Apr 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants