Hive: Implement Deserializer for Hive writes#1854
Conversation
|
@marton-bod, @lcspinter: Could you please review? |
marton-bod
left a comment
There was a problem hiding this comment.
generally looks good, just a few minor questions
...n/java/org/apache/iceberg/mr/hive/serde/objectinspector/IcebergTimestampObjectInspector.java
Outdated
Show resolved
Hide resolved
|
@rdblue, @shardulm94: Could you please review if you have some time? This is a part of #1407 (Hive: HiveIcebergOutputFormat first implementation for handling Hive inserts into unpartitioned Iceberg tables) Thanks! |
| Object value(Object object); | ||
| } | ||
|
|
||
| private static FieldDeserializer deserializer(Type type, ObjectInspector fieldInspector) throws SerDeException { |
There was a problem hiding this comment.
Most of the other modules use the visitor pattern to traverse a type. That keeps the logic for traversing a schema in just one place so you don't need to mix it in with your domain-specific code.
It looks like this method is called from with the StructDeserializer constructor, so there is a recursive traversal of the schema that goes back and forth between object constructors and this method. That's a bit hard to follow, so I think it would be simpler if you took the visitor approach.
A GenericAvroReader is similar to what you're building here, so take a look at that as an example: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/avro/GenericAvroReader.java
There was a problem hiding this comment.
Ok. I moved to the Visitor pattern. The the code is quite nice, and compact 😄
Hive struct names are not matched to the Iceberg struct name fields in case of the main struct which contains the result columns of a query. In this case we have the following col names: 0:col1, 1:col2 and they should be matched by the positions of the fields and not by the names.
Handling this will add more complexity. I propose to handle this in the next PR. What do you think?
.../main/java/org/apache/iceberg/mr/hive/serde/objectinspector/IcebergWriteObjectInspector.java
Outdated
Show resolved
Hide resolved
mr/src/test/java/org/apache/iceberg/mr/hive/HiveIcebergTestUtils.java
Outdated
Show resolved
Hide resolved
| PrimitiveObjectInspectorFactory.writableLongObjectInspector, | ||
| PrimitiveObjectInspectorFactory.writableStringObjectInspector | ||
| )); | ||
|
|
There was a problem hiding this comment.
Shall we use 2 spaces for Continuation indent as well?
There was a problem hiding this comment.
Continuation indents are 2 indents, which are 4 spaces. What you have here looks correct to me.
There was a problem hiding this comment.
Then I am not sure what was the comment about. 😢
There was a problem hiding this comment.
I think it was an extra newline.
mr/src/test/java/org/apache/iceberg/mr/hive/TestDeserializer.java
Outdated
Show resolved
Hide resolved
mr/src/test/java/org/apache/iceberg/mr/hive/TestDeserializer.java
Outdated
Show resolved
Hide resolved
mr/src/test/java/org/apache/iceberg/mr/hive/TestDeserializer.java
Outdated
Show resolved
Hide resolved
|
Thanks, @pvary! This looks great so far. I had a few comments, but it looks like a good improvement! |
| return null; | ||
| } | ||
|
|
||
| List<Object> result = new ArrayList<>(); |
There was a problem hiding this comment.
Nit: prefer Lists.newArrayList().
There was a problem hiding this comment.
I remember on some other PR got a review from someone else to avoid unnecessary guava uses.
I am find with both, but I think we should stick to one or the other. Shall it be Lists.newArrayList then?
There was a problem hiding this comment.
We don't want to bring in any additional Guava classes if possible, but using the ones that are already there is a good thing. For this case, we can easily replace map class implementations with an import change later.
| } | ||
|
|
||
| private static class PartnerObjectInspectorByNameAccessors | ||
| implements SchemaWithPartnerVisitor.PartnerAccessors<ObjectInspector> { |
There was a problem hiding this comment.
Minor: This could probably be a singleton.
| * wrapper around the ObjectInspector. This wrapper uses the Iceberg schema column names instead of the Hive column | ||
| * names for {@link #getStructFieldRef(String) getStructFieldRef} | ||
| */ | ||
| private static class FixNameMappingObjectInspector extends StructObjectInspector { |
There was a problem hiding this comment.
Don't all of the structs need to be wrapped by this?
|
I don't have any major concerns about this, so I'm going to merge it to unblock the other work. Thanks @pvary! It would be nice to figure out a better way to traverse the object inspector tree that doesn't require the |
|
Thanks @rdblue for the review and the merge! In Hive you always need to provide the name-value pair for structs. AFAIK the only exception is the Schema level struct.
So this is why I have kept the current one which has its own drawbacks, so I am happy to go for a more general solution if we find one. |
Implements the Deserializer for Hive writes, and creates the corresponding test suite