PARQUET-286: Update String support to match upstream Avro.#201
PARQUET-286: Update String support to match upstream Avro.#201rdblue wants to merge 2 commits intoapache:masterfrom
Conversation
|
@SinghAsDev could you take a look at this? |
be50c40 to
33634c1
Compare
|
Thanks for the review, @SinghAsDev. Good call on adding tests that verify both Avro and Parquet behavior. I've done that and added comments to clarify the method that determines the String type. I didn't separate the behavior into specific, generic, and reflect blocks because I thought that would lead to a lot of duplication, though. @isnotinvain or @tomwhite, could you take a look at this to get it merged? It is a blocker for 1.8.0. |
There was a problem hiding this comment.
Do these three keys come from avro, or can we choose any name we'd like for them? If we can choose the name, they should probably be "parquet.avro.*" and be dot separated right?
There was a problem hiding this comment.
Those are from Avro. I was avoiding having them in here, but I needed them to mimic Avro's String deserialization. Most of the time, we use a call to Avro's getClass(Schema) rather than using them directly, but the method we need isn't public.
|
+1, just one minor question I'm not very familiar w/ avro's APIs but everything LGTM as far as I can tell. |
This adds getStringableClass, which determines what String representation upstream Avro would use. Specific and reflect will use an alternative String class if java-class is set that is instantiated using a constructor that takes a String. Otherwise, reflect will always use String and both specific and generic will use Utf8 or String depending on whether avro.java.string is set to "string". The new string representations required two new converters: one for Utf8 and one for stringable classes (those with constructors that take a single String). The converters have also been refactored so that all binary converters now implement dictionary support.
33634c1 to
ee71cc7
Compare
The new test validates that the behavior for generic, specific, and reflect matches Avro's behavior. This required a fix to make Stringable work. When a reflected class uses the Stringable annotation, Avro doesn't include the type in the field's schema. Instead, the accessor is built using the expected type. Parquet needs to mimic this behavior, so the record converter will detect the Stringable annotation and replace a FieldStringConverter with a FieldStringableConverter.
ee71cc7 to
beb5a44
Compare
This adds getStringableClass, which determines what String
representation upstream Avro would use. Specific and reflect will use an
alternative String class if java-class is set that is instantiated using
a constructor that takes a String. Otherwise, reflect will always use
String and both specific and generic will use Utf8 or String depending
on whether avro.java.string is set to "string".
The new string representations required two new converters: one for Utf8
and one for stringable classes (those with constructors that take a
single String). The converters have also been refactored so that all
binary converters now implement dictionary support.