merge hive native schema and avro schema literal if they are inconsistent #56

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Closed

rzhang10 wants to merge 2 commits into linkedin:master from rzhang10:rzhang-merge-schema

Member

rzhang10 commented Feb 18, 2021

add support to merge hive native schema and avro schema literal if they are not consistent, the merging logic is to use hive schema as the source of truth for types, while augmenting other metadata/attributes such as casing, docstring, default value etc from avro schema.

rzhang10 added 2 commits

February 18, 2021 13:21


          add support to merge hive native schema and avro schema literal if th…

51f3e1a

…ey are not consistent, the merging logic is to use hive schema as the source of truth for types, while augmenting other metadata/attributes such as casing, docstring, default value etc from avro schema


          make methods in HiveTypeToAvroType static

37867ff

wmoustafa requested changes

View reviewed changes

wmoustafa left a comment

Overall, seems there are lots of opportunities to simplify. In addition to regular comments, I have pointed out the places where the patch can be simplified. Let us simplify and see how it looks then.

hive-metastore/src/main/java/org/apache/iceberg/hive/legacy/HiveTypeToAvroType.java

Comment on lines +52 to +54

    
                      // We don't cache the structType because otherwise it could be possible that a field

                      // "lastname" is of type "firstname", where firstname is a compiled class.

                      // This will lead to ambiguity.

wmoustafa Feb 20, 2021

As per a previous comment in #55, we may remove this comment if it is not explainable.

hive-metastore/src/main/java/org/apache/iceberg/hive/legacy/HiveTypeToAvroType.java

Comment on lines +112 to +113

    
                  List<String> fieldNames =  typeInfo.getAllStructFieldNames();

                  for (String fieldName : fieldNames) {

wmoustafa Feb 20, 2021

Merge these two lines into one.

hive-metastore/src/main/java/org/apache/iceberg/hive/legacy/HiveTypeToAvroType.java

    
                  List<String> fieldNames =  typeInfo.getAllStructFieldNames();

                  for (String fieldName : fieldNames) {

                    final TypeInfo fieldTypeInfo = typeInfo.getStructFieldTypeInfo(fieldName);

wmoustafa Feb 20, 2021

It is better to iterate on fieldNames and fieldTypeInfo in parallel. Looking at the implementation of getStructFieldTypeInfo(), it iterates on the list again for every call.

hive-metastore/src/main/java/org/apache/iceberg/hive/legacy/HiveTypeToAvroType.java

Comment on lines +48 to +50

    
                  ObjectInspector.Category category = typeInfo.getCategory();

                  switch (category) {

wmoustafa Feb 20, 2021

No need for designated variable. You may combine to one line.

hive-metastore/src/main/java/org/apache/iceberg/hive/legacy/HiveTypeToAvroType.java

    
                private static final String BYTE_TYPE_NAME = "byte";

                static Schema convertTypeInfoToAvroSchema(TypeInfo typeInfo, String recordNamespace, String recordName) {

                  Schema schema;

wmoustafa Feb 20, 2021

No need for this line. Just return the value immediately inside each case branch.

hive-metastore/src/main/java/org/apache/iceberg/hive/legacy/LegacyHiveTableUtils.java

Comment on lines +73 to +75

    
                      } catch (Exception e) {

                        tableSchema = avroSchema;

                      }

wmoustafa Feb 20, 2021

This is weird. What is the reason we need to do this?

hive-metastore/src/main/java/org/apache/iceberg/hive/legacy/LegacyHiveTableUtils.java

Comment on lines +67 to +68

    
                    boolean isHiveSchemaEvolved =

                        LegacyHiveSchemaUtils.isRecordSchemaEvolved(avroSchemaWithoutNullable, hiveSchemaWithoutNullable);

wmoustafa Feb 20, 2021

Why are we describing this in terms of evolution? We are simply comparing a Hive schema to a Supplemental schema. They either match or not match (according to some criteria that is defined in the method). Otherwise, there is no evolution, new, old, etc.

hive-metastore/src/main/java/org/apache/iceberg/hive/legacy/LegacyHiveTableUtils.java

    
                    boolean isHiveSchemaEvolved =

                        LegacyHiveSchemaUtils.isRecordSchemaEvolved(avroSchemaWithoutNullable, hiveSchemaWithoutNullable);

                    if (isHiveSchemaEvolved) {

wmoustafa Feb 20, 2021

I am not sure why this check is required. Could you explain why it matters? If it turns it is not needed, this patch will be significantly simplified.

hive-metastore/src/main/java/org/apache/iceberg/hive/legacy/LegacyHiveSchemaUtils.java

    
                  for (Schema.Field oldField : oldSchemaFields) {

                    Schema.Field newField = newSchemaFieldsMap.get(oldField.name().toLowerCase());

                    if (isSchemaEvolved(oldField.schema(), newField.schema())) {

wmoustafa Feb 20, 2021

I do not think there is a reason to used the word evolved in this patch.

hive-metastore/src/main/java/org/apache/iceberg/hive/legacy/LegacyHiveSchemaUtils.java

    
                  return Schema.createUnion(unionSchemas);

                }

                static boolean isRecordSchemaEvolved(Schema oldSchema, Schema newSchema) {

wmoustafa Feb 20, 2021

Why do we have two methods with slightly different names isRecordSchemaEvolved and isSchemaEvolved?

Contributor

shardulm94 commented Feb 22, 2021 •

edited

Loading

The bugs and differences I observed through the tests implemented in #57

Bugs

TestMergeHiveSchemaWithAvro#shouldRetainExtraFieldsFromHive If there are extra top level struct fields in Hive, the nullability of those fields is not handled correctly. e.g if the extra field is a struct<a:int>, the struct is marked as optional, but a is marked as required. a should also be nullable.
TestMergeHiveSchemaWithAvro#shouldHandleLists and TestMergeHiveSchemaWithAvro#shouldHandleMaps If there are extra top level list or map fields in Hive, there are not retained in the resultant schema.
TestMergeHiveSchemaWithAvro#shouldSanitizeIncompatibleFieldNames If the table contains field names not allowed by Avro identifier spec i.e. [A-Za-z0-9_], they are not sanitized and Hive to Avro conversion fails

Differences

TestMergeHiveSchemaWithAvro#shouldFailIfContainerTypesDontMatch This implementation does not fail if Hive type is a container type viz. STRUCT, LIST and MAP and if they Avro type does not match.

We may need more comprehensive testing for deeper levels of nesting of list/map/structs. I didn't feel is was absolutely necessary in #57 because of the way the visitor works, but it might be helpful to add them.

wmoustafa closed this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet