Hive: Implement Deserializer for Hive writes by pvary · Pull Request #1854 · apache/iceberg

pvary · 2020-12-01T15:59:59Z

Implements the Deserializer for Hive writes, and creates the corresponding test suite

pvary · 2020-12-02T08:26:12Z

@marton-bod, @lcspinter: Could you please review?
Thanks,
Peter

marton-bod

generally looks good, just a few minor questions

pvary · 2020-12-02T16:58:10Z

@rdblue, @shardulm94: Could you please review if you have some time? This is a part of #1407 (Hive: HiveIcebergOutputFormat first implementation for handling Hive inserts into unpartitioned Iceberg tables)

Thanks!
Peter

rdblue · 2020-12-05T01:45:48Z

mr/src/main/java/org/apache/iceberg/mr/hive/Deserializer.java

+    Object value(Object object);
+  }
+
+  private static FieldDeserializer deserializer(Type type, ObjectInspector fieldInspector) throws SerDeException {


Most of the other modules use the visitor pattern to traverse a type. That keeps the logic for traversing a schema in just one place so you don't need to mix it in with your domain-specific code.

It looks like this method is called from within the StructDeserializer constructor, so there is a recursive traversal of the schema that goes back and forth between object constructors and this method. That's a bit hard to follow, so I think it would be simpler if you took the visitor approach.

A GenericAvroReader is similar to what you're building here, so take a look at that as an example: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/avro/GenericAvroReader.java
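To illustrate the suggestion above, here is a minimal, self-contained sketch of the visitor idea. It does not use Iceberg's real classes: `Type`, `Primitive`, `ListType`, `Struct`, `Visitor` and `Describe` are all hypothetical stand-ins invented for this example. The point is only that `visit` is the single place that walks the type tree, while the domain-specific logic lives in the callbacks.

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaVisitorSketch {
  // Hypothetical, simplified type model; NOT Iceberg's real Type classes.
  interface Type {}

  static final class Primitive implements Type {
    final String name;

    Primitive(String name) {
      this.name = name;
    }
  }

  static final class ListType implements Type {
    final Type element;

    ListType(Type element) {
      this.element = element;
    }
  }

  static final class Struct implements Type {
    final List<String> names;
    final List<Type> types;

    Struct(List<String> names, List<Type> types) {
      this.names = names;
      this.types = types;
    }
  }

  // Domain-specific callbacks; each receives the results already computed for its children.
  abstract static class Visitor<R> {
    abstract R primitive(Primitive primitive);

    abstract R list(ListType list, R elementResult);

    abstract R struct(Struct struct, List<R> fieldResults);
  }

  // The only method that knows how to traverse the type tree.
  static <R> R visit(Type type, Visitor<R> visitor) {
    if (type instanceof Struct) {
      Struct struct = (Struct) type;
      List<R> fieldResults = new ArrayList<>();
      for (Type fieldType : struct.types) {
        fieldResults.add(visit(fieldType, visitor));
      }
      return visitor.struct(struct, fieldResults);
    } else if (type instanceof ListType) {
      ListType list = (ListType) type;
      return visitor.list(list, visit(list.element, visitor));
    }
    return visitor.primitive((Primitive) type);
  }

  // Example visitor: renders a type as a string without doing any traversal itself.
  static final class Describe extends Visitor<String> {
    @Override
    String primitive(Primitive primitive) {
      return primitive.name;
    }

    @Override
    String list(ListType list, String elementResult) {
      return "list<" + elementResult + ">";
    }

    @Override
    String struct(Struct struct, List<String> fieldResults) {
      StringBuilder sb = new StringBuilder("struct<");
      for (int i = 0; i < fieldResults.size(); i++) {
        if (i > 0) {
          sb.append(",");
        }
        sb.append(struct.names.get(i)).append(":").append(fieldResults.get(i));
      }
      return sb.append(">").toString();
    }
  }
}
```

A deserializer built this way would presumably return a per-type conversion object from each callback instead of a string, so the constructors no longer need to call back into the traversal.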

Ok. I moved to the Visitor pattern. The code is quite nice and compact 😄

Hive struct field names do not match the Iceberg struct field names in the case of the main struct, which contains the result columns of a query. In this case we have the following col names: 0:col1, 1:col2, and they should be matched by the positions of the fields and not by the names.

Handling this will add more complexity. I propose to handle this in the next PR. What do you think?

rdblue · 2020-12-05T01:54:16Z

mr/src/test/java/org/apache/iceberg/mr/hive/TestDeserializer.java

+              PrimitiveObjectInspectorFactory.writableLongObjectInspector,
+              PrimitiveObjectInspectorFactory.writableStringObjectInspector
+          ));
+


Nit: double whitespace.

Shall we use 2 spaces for Continuation indent as well?

Continuation indents are 2 indents, which are 4 spaces. What you have here looks correct to me.

Then I am not sure what the comment was about. 😢

I think it was an extra newline.

rdblue · 2020-12-05T01:57:14Z

Thanks, @pvary! This looks great so far. I had a few comments, but it looks like a good improvement!

rdblue · 2020-12-08T21:28:05Z

mr/src/main/java/org/apache/iceberg/mr/hive/Deserializer.java

+          return null;
+        }
+
+        List<Object> result = new ArrayList<>();


Nit: prefer Lists.newArrayList().

I remember that on some other PR I got a review from someone else asking to avoid unnecessary Guava uses.
I am fine with both, but I think we should stick to one or the other. Shall it be Lists.newArrayList then?

We don't want to bring in any additional Guava classes if possible, but using the ones that are already there is a good thing. For this case, we can easily replace map class implementations with an import change later.

rdblue · 2020-12-08T21:29:07Z

mr/src/main/java/org/apache/iceberg/mr/hive/Deserializer.java

+  }
+
+  private static class PartnerObjectInspectorByNameAccessors
+      implements SchemaWithPartnerVisitor.PartnerAccessors<ObjectInspector> {


Minor: This could probably be a singleton.
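A rough sketch of what the singleton suggestion could look like, assuming the accessors class holds no state. The `PartnerAccessors` interface below is a simplified, hypothetical stand-in with a single method, not the real `SchemaWithPartnerVisitor.PartnerAccessors`, and a `Map` stands in for a struct ObjectInspector.

```java
import java.util.Map;

public class SingletonAccessorsSketch {
  // Hypothetical, simplified interface; the real PartnerAccessors has a different shape.
  interface PartnerAccessors<P> {
    P fieldPartner(P partnerStruct, String name);
  }

  // Stateless, so a single shared instance is enough.
  static final class ByNameAccessors implements PartnerAccessors<Object> {
    static final ByNameAccessors INSTANCE = new ByNameAccessors();

    private ByNameAccessors() {
    }

    @Override
    public Object fieldPartner(Object partnerStruct, String name) {
      // A Map stands in for a struct inspector here; look the child up by field name.
      if (partnerStruct instanceof Map) {
        return ((Map<?, ?>) partnerStruct).get(name);
      }
      return null;
    }
  }
}
```

Callers would then pass `ByNameAccessors.INSTANCE` instead of creating a new accessors object for every traversal.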

rdblue · 2020-12-08T23:48:06Z

mr/src/main/java/org/apache/iceberg/mr/hive/Deserializer.java

+   * wrapper around the ObjectInspector. This wrapper uses the Iceberg schema column names instead of the Hive column
+   * names for {@link #getStructFieldRef(String) getStructFieldRef}
+   */
+  private static class FixNameMappingObjectInspector extends StructObjectInspector {


Don't all of the structs need to be wrapped by this?

rdblue · 2020-12-08T23:51:01Z

I don't have any major concerns about this, so I'm going to merge it to unblock the other work. Thanks @pvary!

It would be nice to figure out a better way to traverse the object inspector tree that doesn't require the FixNameMappingObjectInspector. I'm not quite sure what that is, but I think it might be a matter of converting the name into a position and accessing the object inspectors by position. That can be done later, though.
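One possible reading of "converting the name into a position" is sketched below. This is an assumption about the idea, not code from the PR: `positionsByName` and `partnerByPosition` are hypothetical helpers, and the partner list is generic so that plain strings can stand in for the Hive ObjectInspectors.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PositionalPartnerSketch {
  // Maps each Iceberg field name to its position within the struct.
  static Map<String, Integer> positionsByName(List<String> icebergFieldNames) {
    Map<String, Integer> positions = new HashMap<>();
    for (int i = 0; i < icebergFieldNames.size(); i++) {
      positions.put(icebergFieldNames.get(i), i);
    }
    return positions;
  }

  // Looks up the partner (for example a field inspector) by position, ignoring the Hive-side name.
  static <P> P partnerByPosition(List<P> partnerFields, Map<String, Integer> positions, String icebergName) {
    Integer position = positions.get(icebergName);
    if (position == null || position >= partnerFields.size()) {
      return null;
    }
    return partnerFields.get(position);
  }
}
```

This only works if both sides list their fields in the same order, which is the ordering assumption questioned later in this thread.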

pvary · 2020-12-09T07:42:51Z

Thanks @rdblue for the review and the merge!

In Hive you always need to provide the name-value pair for structs. AFAIK the only exception is the Schema level struct.
I have been thinking about 2 other solutions:

Move field matching to be based on the ordering of the fields. We can do it by adding a new input parameter to SchemaWithPartnerVisitor<P, R>.PartnerAccessors<P>.fieldPartner(P partnerStruct, int fieldId, String name, int fieldOrderId). I do not like this because I do not feel that Struct fields should have specific ordering. It could work, because we are the ones who generate the ObjectInspectors, but still...
Add the field name to the ObjectInspectors themselves when generating them for the Schema. I do not like this because then we will have specific ObjectInspector instances we have to create and drop.

This is why I have kept the current approach, which has its own drawbacks. I am happy to go for a more general solution if we find one.

Hive: Implement Deserializer for Hive writes

2a85e72

github-actions bot added the MR label Dec 1, 2020

pvary mentioned this pull request Dec 1, 2020

Hive: HiveIcebergOutputFormat first implementation for handling Hive inserts into unpartitioned Iceberg tables #1407

Closed

marton-bod reviewed Dec 2, 2020

Addressed Marton's comments

ac3f608

Checkstyle

2eea356

marton-bod mentioned this pull request Dec 3, 2020

Hive: OutputCommitter implementation for Hive writes #1861

Merged