GH-3116: Implement the Variant binary encoding #3117
base: master
Thanks for working on this @gene-db! I left some comments, but this is looking good
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
if (index < 0 || index >= size) {
  throw malformedVariant();
}
This looks inconsistent with `getFieldAtIndex`, where we return `null`. Let's raise an exception at line 220 as well.
`getFieldAtIndex` is a little bit different, since if a field doesn't exist in a variant value, that doesn't mean the variant value is malformed. This dictionary case is different because we are expecting an id in the dictionary to exist, but it doesn't.
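To make the distinction in this thread concrete, here is a minimal sketch. `LookupSketch`, `keyForId`, and `fieldAtIndex` are hypothetical stand-ins rather than the PR's classes, and the nested exception only mimics `MalformedVariantException`. A field id that the data itself references must resolve in the dictionary, while a caller probing for a missing field is not a data error.

```java
import java.util.List;

// Hypothetical stand-in; it only illustrates the two failure modes discussed above.
final class LookupSketch {
  // Assumed shape: signals that the encoded bytes are internally inconsistent.
  static final class MalformedVariantException extends RuntimeException {
    MalformedVariantException(String message) {
      super(message);
    }
  }

  // A field id read from the value bytes must exist in the metadata dictionary.
  static String keyForId(List<String> dictionary, int id) {
    if (id < 0 || id >= dictionary.size()) {
      throw new MalformedVariantException(
          "Field id " + id + " is not in the dictionary of size " + dictionary.size());
    }
    return dictionary.get(id);
  }

  // A caller asking for a position the object does not have is not a data error.
  static String fieldAtIndex(List<String> fields, int index) {
    if (index < 0 || index >= fields.size()) {
      return null;
    }
    return fields.get(index);
  }
}
```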
parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java
// If the value doesn't fit any integer type, parse it as decimal or floating instead.
parseAndAppendFloatingPoint(parser);
I think this is lossy, and I'd rather raise an exception.
Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.
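A rough sketch of the fallback order under discussion, not the PR's parser: `NumberFallbackSketch` and `parseNumber` are hypothetical names, and the 38-digit decimal limit is an assumption. Only the last branch is lossy, which is the case the reviewer is concerned about.

```java
import java.math.BigDecimal;

final class NumberFallbackSketch {
  // Assumption: the widest decimal holds 38 digits of precision.
  static final int MAX_DECIMAL_PRECISION = 38;

  // Returns a Long, a BigDecimal, or a Double, whichever first holds the token.
  static Object parseNumber(String token) {
    BigDecimal d = new BigDecimal(token);
    if (d.scale() <= 0) {
      try {
        return d.longValueExact(); // integer that fits in 64 bits
      } catch (ArithmeticException e) {
        // Too large for a long; fall through to decimal or double.
      }
    }
    if (d.scale() >= 0
        && d.scale() <= MAX_DECIMAL_PRECISION
        && d.precision() <= MAX_DECIMAL_PRECISION) {
      return d; // exact decimal representation
    }
    return d.doubleValue(); // lossy: digits beyond double precision are dropped
  }
}
```

Raising an exception instead would replace the final `return` with a `throw`, at the cost of rejecting some syntactically valid JSON numbers.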
parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java
 * Builder for creating Variant value and metadata.
 */
public class VariantBuilder {
  public VariantBuilder(boolean allowDuplicateKeys) {
Why would we allow this? This isn't allowed by the spec.
This is not for writing duplicate keys in the Variant value itself, but for parsing JSON strings. JSON strings might have duplicate keys, and this flag controls the behavior when encountering duplicate keys.
I added a comment to clarify.
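One way such a flag could behave is sketched below. The thread does not say which duplicate is kept, so "last value wins" is an assumption here, and `DuplicateKeySketch` and `collectFields` are hypothetical names. Either way the resulting object holds each key at most once, which is what the spec requires of the encoded value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DuplicateKeySketch {
  // Folds parsed (key, value) pairs into one object, applying the flag on repeats.
  static Map<String, String> collectFields(String[][] pairs, boolean allowDuplicateKeys) {
    Map<String, String> fields = new LinkedHashMap<>();
    for (String[] pair : pairs) {
      String key = pair[0];
      if (fields.containsKey(key) && !allowDuplicateKeys) {
        throw new IllegalArgumentException("Duplicate key in JSON object: " + key);
      }
      // Assumption: when duplicates are allowed, the last value replaces earlier ones.
      fields.put(key, pair[1]);
    }
    return fields;
  }
}
```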
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
  return Arrays.copyOfRange(value, pos, pos + size);
}

public byte[] getMetadata() {
The use of `byte[]` seems awkward given the assumptions that are made. It looks like the intent is for `value` and `metadata` to either be two separate arrays starting at offset 0, or a single array with `metadata` coming first followed by `value` at `pos` (but in this case, the array is passed to the constructor twice).
A more common pattern would be to specify each array along with an offset and a length, so that there are no implicit assumptions about the array contents.
Where do we assume that `metadata` and `value` are in the same array? I don't think we are making that assumption.
The `pos` part in `getValue()` is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.
> Where do we assume that `metadata` and `value` are in the same array? I don't think we are making that assumption.
I was referring to the possible values and intent for the `pos` argument and trying to understand your intent from this code. But that isn't the point I was trying to make.
The point here is that it is more common in Java to pass byte arrays with offset and length, rather than requiring that arrays are copied before passing them in. I think the use of 0-offset byte arrays is limiting.
I'm not entirely sure what the proposal is. Is this saying we should not return a `byte[]`, but something else?
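As one reading of the offset-and-length suggestion, the sketch below shows a holder that accepts a region of a larger array and can expose it without copying. `SliceSketch` and its methods are hypothetical names, not an API proposed in the thread.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class SliceSketch {
  private final byte[] buffer;
  private final int offset;
  private final int length;

  // The caller hands over a region of a larger array; nothing is copied here.
  SliceSketch(byte[] buffer, int offset, int length) {
    if (offset < 0 || length < 0 || offset > buffer.length - length) {
      throw new IllegalArgumentException(String.format(
          "Invalid range: offset=%d, length=%d, array length=%d",
          offset, length, buffer.length));
    }
    this.buffer = buffer;
    this.offset = offset;
    this.length = length;
  }

  // Zero-copy, read-only view of just this region.
  ByteBuffer asReadOnlyBuffer() {
    return ByteBuffer.wrap(buffer, offset, length).slice().asReadOnlyBuffer();
  }

  // Explicit copy for callers that need a standalone 0-offset array.
  byte[] toByteArray() {
    return Arrays.copyOfRange(buffer, offset, offset + length);
  }
}
```

Returning a `ByteBuffer` view instead of a `byte[]` is one possible answer to the last question in the thread; the thread itself does not settle on it.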
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
 * @return the JSON representation of the variant
 * @throws MalformedVariantException if the variant is malformed
 */
public String toJson(ZoneId zoneId) {
I added the `toJson()` which defaults to `+00:00`. The options are there for engines to choose the behavior, while sharing the same implementation.
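A small sketch of how a no-argument `toJson()` can delegate to `toJson(ZoneId)` with a `+00:00` default. `TimestampJsonSketch`, the microsecond field, and the output pattern are assumptions for illustration; the PR's actual timestamp formatting may differ.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

final class TimestampJsonSketch {
  // Assumed output shape: microsecond precision with an explicit numeric offset.
  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSSSSxxx");

  private final long microsSinceEpoch;

  TimestampJsonSketch(long microsSinceEpoch) {
    this.microsSinceEpoch = microsSinceEpoch;
  }

  // Default overload: render in +00:00.
  String toJson() {
    return toJson(ZoneOffset.UTC);
  }

  // Engines that want a session time zone pass it explicitly.
  String toJson(ZoneId zoneId) {
    Instant instant = Instant.EPOCH.plus(microsSinceEpoch, ChronoUnit.MICROS);
    return '"' + FORMAT.format(instant.atZone(zoneId)) + '"';
  }
}
```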
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
These can go according to spotless:
Error: src/main/java/org/apache/parquet/variant/Variant.java
Error: @@ -23,8 +23,6 @@
Error: import·java.io.CharArrayWriter;
Error: import·java.io.IOException;
Error: import·java.math.BigDecimal;
Error: -import·java.nio.ByteBuffer;
Error: -import·java.nio.ByteOrder;
Error: import·java.time.*;
Error: import·java.time.format.DateTimeFormatter;
Error: import·java.time.format.DateTimeFormatterBuilder;
Error: Run 'mvn spotless:apply' to fix these violations.
Error: -> [Help 1]
Fixed.
 * An exception indicating that the Variant is malformed.
 */
public class MalformedVariantException extends RuntimeException {
  public MalformedVariantException() {
Is this necessary? I generally consider no-arg constructors for exception classes to be an anti-pattern because people use them without thinking about what helpful error message should be included.
Removed.
/**
 * @return the type id that was unknown
 */
public int getTypeId() {
Nit: I don't recommend using `get`, although it is popular. There is almost always a more specific verb that is better, and if not it can usually be omitted for a cleaner method name.
Renamed to `typeId()`.
/**
 * @return the number of object fields in the variant. `getType()` must be `Type.OBJECT`.
 */
public int objectSize() {
The spec uses `num_elements` for this. I think it would be better to be clear because `objectSize` could return a length in bytes rather than number of fields or elements.
Renamed to `numObjectElements` (and the array one to `numArrayElements`).
public int getDictionaryIdAtIndex(int index) {
  return VariantUtil.handleObject(value, pos, (size, idSize, offsetSize, idStart, offsetStart, dataStart) -> {
    if (index < 0 || index >= size) {
      throw VariantUtil.malformedVariant();
Why is this a malformed variant, but the same issue with a field or array results in `null`?
I also don't think that this can assume that the variant is malformed in this case. The `index` is passed in by the caller and this is a public method. The index may not exist, but if so the problem is in the caller, not the data.
Actually, I don't see any uses of this. Can it be removed?
Yeah, it can be removed.
 */
public class VariantSizeLimitException extends RuntimeException {
  public VariantSizeLimitException(long sizeLimitBytes, long estimatedSizeBytes) {
    super(String.format(
Why is there a size limit imposed by Parquet? We don't do this anywhere else and it isn't in the spec.
I think it would be reasonable for an engine to want to not create arbitrarily large variant values and set a maximum size limit. How can an engine configure a limit, when this variant library is doing the parsing and creation of the Variant value?
The size limit is configurable when creating the variant builder, and the library is just trying to enforce the caller-configured limit.
Would the alternative be to just let the library build the entire value, and then the caller checks the size? That would be unbounded in terms of memory usage, and would negatively affect stability.
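The argument for enforcing the limit inside the builder can be sketched as follows. `BoundedBuilderSketch` and its nested exception are hypothetical and much simpler than the PR's `VariantBuilder`. The check runs before each write, so memory use stays bounded by the caller-configured limit instead of being discovered after the whole value is built.

```java
import java.io.ByteArrayOutputStream;

final class BoundedBuilderSketch {
  // Stand-in for VariantSizeLimitException; the message format is an assumption.
  static final class SizeLimitException extends RuntimeException {
    SizeLimitException(long sizeLimitBytes, long estimatedSizeBytes) {
      super(String.format(
          "Variant size %d bytes exceeds the configured limit of %d bytes",
          estimatedSizeBytes, sizeLimitBytes));
    }
  }

  private final long sizeLimitBytes;
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();

  // The limit comes from the caller (for example, an engine setting).
  BoundedBuilderSketch(long sizeLimitBytes) {
    this.sizeLimitBytes = sizeLimitBytes;
  }

  // Rejects the append before any bytes are buffered.
  void append(byte[] encoded) {
    long newSize = (long) out.size() + encoded.length;
    if (newSize > sizeLimitBytes) {
      throw new SizeLimitException(sizeLimitBytes, newSize);
    }
    out.write(encoded, 0, encoded.length);
  }

  byte[] build() {
    return out.toByteArray();
  }
}
```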
 */
public class VariantUtil {
  public static final int BASIC_TYPE_BITS = 2;
  public static final int BASIC_TYPE_MASK = 0x3;
Nit: I think it's easier to read if masks like this use `0b00000011` instead of `0x3`.
Updated.
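For illustration, a sketch of how the binary-literal mask reads when splitting a header byte. `HeaderSketch`, `basicType`, and `typeInfo` are hypothetical helpers; the layout (basic type in the two low bits, type-specific info in the remaining six) follows the constants quoted above.

```java
final class HeaderSketch {
  static final int BASIC_TYPE_BITS = 2;
  // The binary literal shows at a glance which bits of the header byte are kept.
  static final int BASIC_TYPE_MASK = 0b00000011;

  // Low two bits of the header byte.
  static int basicType(byte header) {
    return header & BASIC_TYPE_MASK;
  }

  // Remaining six bits; the & 0xFF avoids sign extension before shifting.
  static int typeInfo(byte header) {
    return (header & 0xFF) >>> BASIC_TYPE_BITS;
  }
}
```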
/** False value. Empty content. */
public static final int FALSE = 2;
/** 1-byte little-endian signed integer. */
public static final int INT1 = 3;
Minor: In the spec these are `INT8`, `INT16`, `INT32`, etc. that match the Parquet physical types.
Renamed.
}

public static MalformedVariantException malformedVariant() {
  return new MalformedVariantException();
What about the variant is malformed?
Removed.
}

public static MalformedVariantException malformedVariant(String message) {
  return new MalformedVariantException(message);
What is the utility of this if it just calls a constructor?
Removed.
  return new MalformedVariantException();
}

public static UnknownVariantTypeException unknownPrimitiveTypeInVariant(int id) {
This is also just a call to the constructor. I'd remove this.
Removed.
 * @throws MalformedVariantException if the index is out of bound
 */
public static void checkIndex(int pos, int length) {
  if (pos < 0 || pos >= length) throw malformedVariant();
Style: Use curly braces even if they are not needed.
I also think this has the same issue I pointed out above, which is that there is no reason to think variant bytes are malformed just because the caller is incorrect. Instead this would normally be an `IllegalArgumentException`. Throwing `MalformedVariantException` assumes too much about how this is called.
Switched to `IllegalArgumentException`.
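A sketch of what the method might look like after both review points are applied (braces, and `IllegalArgumentException` with context). The class name and the message wording are assumptions, not the merged code.

```java
final class IndexCheckSketch {
  // An out-of-range position is treated as a caller error, not as malformed data.
  static void checkIndex(int pos, int length) {
    if (pos < 0 || pos >= length) {
      throw new IllegalArgumentException(
          String.format("Invalid byte-array offset (%d). length: %d", pos, length));
    }
  }
}
```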
}

private static IllegalStateException unexpectedType(Type type) {
  return new IllegalStateException("Expect type to be " + type);
Isn't this a malformed variant?
Updated to `MalformedVariantException`.
 */
private static void checkDecimal(BigDecimal d, int maxPrecision) {
  if (d.precision() > maxPrecision || d.scale() > maxPrecision) {
    throw malformedVariant();
This should have a helpful error message about the data and what went wrong.
Added error message.
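The kind of message the reviewer asks for could look like the sketch below. `DecimalCheckSketch` and the exact wording are hypothetical; the thread only records that an error message was added, not what it says.

```java
import java.math.BigDecimal;

final class DecimalCheckSketch {
  // Stand-in for MalformedVariantException with a required message.
  static final class MalformedVariantException extends RuntimeException {
    MalformedVariantException(String message) {
      super(message);
    }
  }

  // Same condition as the reviewed code, but the failure names the offending values.
  static void checkDecimal(BigDecimal d, int maxPrecision) {
    if (d.precision() > maxPrecision || d.scale() > maxPrecision) {
      throw new MalformedVariantException(String.format(
          "Decimal %s has precision %d and scale %d, but the maximum allowed is %d",
          d.toPlainString(), d.precision(), d.scale(), maxPrecision));
    }
  }
}
```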
case INT2:
case INT4:
case INT8:
  return Type.LONG;
I don't like that this implementation doesn't allow the caller to know the actual type or to get the value as an int32 or int16 when that is what is stored. This forces the caller to use a long value even for int8.
Like I've pointed out before, I don't think that storage should modify values passed to it. In this case, the values may not be modified, but it isn't possible to tell what the values were, which is basically the same problem.
I updated to return all the integer types, and added the corresponding `get*` in `Variant`.
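A sketch of the updated mapping, in which each stored width reports its own type rather than collapsing to `LONG`. The enum constants `BYTE`, `SHORT`, `INT`, and `LONG` are hypothetical names, and the ids after `INT8 = 3` are assumed to be consecutive, continuing from the `INT1 = 3` constant quoted earlier in the review.

```java
final class IntegerTypeSketch {
  // Hypothetical logical types; the PR's Type enum may use different names.
  enum Type { BYTE, SHORT, INT, LONG }

  // Type ids named as in the spec; values after INT8 are assumed consecutive.
  static final int INT8 = 3;
  static final int INT16 = 4;
  static final int INT32 = 5;
  static final int INT64 = 6;

  // Each stored width maps to its own type, so callers can tell what was written.
  static Type typeOf(int typeId) {
    switch (typeId) {
      case INT8:
        return Type.BYTE;
      case INT16:
        return Type.SHORT;
      case INT32:
        return Type.INT;
      case INT64:
        return Type.LONG;
      default:
        throw new IllegalArgumentException("Not an integer type id: " + typeId);
    }
  }
}
```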
  });
}

private static Variant getElementAtIndex(
This is only used by `getElementAtIndex(int)` and in the JSON conversion code. The method above, `getElementAtIndex(int)`, isn't used at all, nor is `arraySize`. Does that mean this is not tested?
I added unit tests for both `getElementAtIndex` and `arraySize`.
  gen.flush();
  return writer.toString();
} catch (IOException e) {
  throw new RuntimeException(e);
Use `RuntimeIOException` instead. It's also nice to add context.
Updated this one and others to use `RuntimeIOException`.
  }
  gen.writeEndObject();
} catch (IOException e) {
  throw new RuntimeException(e);
Rather than adding try/catch inline, I think it makes more sense for `toJsonImpl` to throw `IOException` so this can be handled in the wrapper methods that are public.
Done.
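The structure suggested here can be sketched as below: the private implementation declares `IOException`, and the public wrapper is the single place that translates it and adds context. This sketch uses the JDK's `java.io.UncheckedIOException` as a stand-in for the `RuntimeIOException` mentioned in the previous thread, writes to a plain `Writer` rather than a JSON generator, and does no string escaping; `JsonWrapperSketch` is a hypothetical name.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;

final class JsonWrapperSketch {
  // Public wrapper: the only place where the checked exception is translated.
  static String toJson(Map<String, Integer> fields) {
    StringWriter writer = new StringWriter();
    try {
      toJsonImpl(fields, writer);
      return writer.toString();
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to convert variant to JSON", e);
    }
  }

  // Implementation: declares IOException instead of catching it inline.
  private static void toJsonImpl(Map<String, Integer> fields, Writer out) throws IOException {
    out.write('{');
    boolean first = true;
    for (Map.Entry<String, Integer> field : fields.entrySet()) {
      if (!first) {
        out.write(',');
      }
      first = false;
      // No escaping of keys: this sketch assumes plain ASCII field names.
      out.write('"' + field.getKey() + "\":" + field.getValue());
    }
    out.write('}');
  }
}
```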
 * @return the JSON representation of the variant
 * @throws MalformedVariantException if the variant is malformed
 */
public String toJson(ZoneId zoneId) {
This is behavior chosen by the caller/engine, but how can we avoid making the engine re-implement variant navigation during to_json evaluation? Without this option, the engine would have to reimplement variant navigation to produce the JSON it wants, or modify the JSON string.
Alternatively, do we allow the engine to pass in implementations of scalar-to-JSON-string, and have the Parquet library just run it for each scalar? I wanted to avoid arbitrary possible representations for scalar-to-JSON conversions.
~ specific language governing permissions and limitations
~ under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
Thanks a lot @gene-db for driving the reference implementation.
I have a general question on the requirement: we mostly implement Parse_Json() in this PR. Are we required to construct variants with richer types (date, timestamp, etc.)? That may be out of scope for this PR. I have an implementation in Iceberg (apache/iceberg#11857) that adds the full support. As I discussed with @rdblue, that may not be required for Iceberg, but I can include such an implementation in Parquet after this PR if needed.
I don't think `parse_json` should be trying to determine what type a particular JSON string is supposed to be. The JSON spec doesn't have the richer types, so `parse_json` will not try to guess what the strings might be. It might be error-prone and would be costly in terms of performance. Therefore, `parse_json` will only use a subset of the variant types.
This PR also supports the variant builder, which supports creating variant values with all of the variant types.
Rationale for this change
This is a reference implementation for the Variant binary format.
What changes are included in this PR?
A new module for encoding/decoding the Variant binary format.
Are these changes tested?
Added unit tests
Are there any user-facing changes?
No
Closes #3116