GH-3116: Implement the Variant binary encoding #3117

gene-db · 2025-01-07T21:14:58Z

Rationale for this change

This is a reference implementation for the Variant binary format.

What changes are included in this PR?

A new module for encoding/decoding the Variant binary format.

Are these changes tested?

Added unit tests

Are there any user-facing changes?

No

Closes #3116

Fokko

Thanks for working on this @gene-db! I left some comments, but this is looking good

Fokko · 2025-01-20T15:56:17Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+      if (index < 0 || index >= size) {
+        throw malformedVariant();
+      }


This looks inconsistent with the getFieldAtIndex where we return a null. Let's raise an exception at line 220 as well.

getFieldAtIndex is a little bit different, since if a field doesn't exist in a variant value, that doesn't mean the variant value is malformed. This dictionary case is different because we are expecting an id in the dictionary to exist, but it doesn't.
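The distinction drawn in this reply can be sketched roughly as follows. This is only an illustration and not the PR's code: `FieldLookupSketch`, its fields, and the nested `MalformedVariantException` stand-in are hypothetical. Only the policy itself (null for a field the caller asked for that is absent, an exception for an id that the data references but the dictionary lacks) is taken from the discussion.

```java
import java.util.List;

// Hypothetical, simplified object: field ids that point into a key dictionary.
final class FieldLookupSketch {
  // Stand-in for the PR's exception type (assumption: message-only constructor).
  static class MalformedVariantException extends RuntimeException {
    MalformedVariantException(String message) {
      super(message);
    }
  }

  private final List<String> dictionary;
  private final int[] fieldIds;
  private final Object[] values;

  FieldLookupSketch(List<String> dictionary, int[] fieldIds, Object[] values) {
    this.dictionary = dictionary;
    this.fieldIds = fieldIds;
    this.values = values;
  }

  // The id comes from the encoded value, so a missing id means corrupt data.
  String keyForId(int id) {
    if (id < 0 || id >= dictionary.size()) {
      throw new MalformedVariantException(
          "Field id " + id + " is not in the dictionary of size " + dictionary.size());
    }
    return dictionary.get(id);
  }

  // The key comes from the caller, so absence is a normal outcome: return null.
  Object getFieldByKey(String key) {
    for (int i = 0; i < fieldIds.length; i++) {
      if (keyForId(fieldIds[i]).equals(key)) {
        return values[i];
      }
    }
    return null;
  }
}
```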

Fokko · 2025-01-23T10:46:16Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+          // If the value doesn't fit any integer type, parse it as decimal or floating instead.
+          parseAndAppendFloatingPoint(parser);


I think this is lossy, and I'd rather raise an exception

Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.
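To make the trade-off concrete, here is a minimal sketch of such a fallback for an integer literal that is too large for 64 bits. It is not the PR's `parseAndAppendFloatingPoint`; the class and method names are hypothetical, and the 38-digit decimal limit is an assumption used for illustration. The last branch is the lossy one that the review comment objects to.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

final class NumberFallbackSketch {
  // Assumed maximum decimal precision; treat this as illustrative.
  static final int MAX_DECIMAL_PRECISION = 38;

  // Returns a Long if the literal fits in 64 bits, else a BigDecimal if it
  // fits the assumed precision, else a Double (which may lose digits).
  static Object parseIntegerLiteral(String literal) {
    BigInteger big = new BigInteger(literal);
    // bitLength() excludes the sign bit, so <= 63 means "fits in a long".
    if (big.bitLength() <= 63) {
      return big.longValueExact();
    }
    BigDecimal decimal = new BigDecimal(big);
    if (decimal.precision() <= MAX_DECIMAL_PRECISION) {
      return decimal;
    }
    return big.doubleValue(); // lossy fallback
  }
}
```

Raising an exception instead, as the reviewer prefers, would replace the last two branches with a `throw`.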

Fokko · 2025-01-23T12:57:11Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+ * Builder for creating Variant value and metadata.
+ */
+public class VariantBuilder {
+  public VariantBuilder(boolean allowDuplicateKeys) {


Why would we allow this? This isn't allowed by the spec

This is not for writing duplicate keys in the Variant value itself, but for parsing JSON strings. JSON strings might have duplicate keys, and this flag controls the behavior when encountering duplicate keys.

I added a comment to clarify.
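As a rough illustration of what such a flag could control, the sketch below collects already-parsed JSON fields into a map. The thread does not spell out the exact policy, so two things here are assumptions: when duplicates are allowed the last value wins, and when they are not an exception is thrown. `DuplicateKeySketch` and `collectFields` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DuplicateKeySketch {
  // Builds the field map for one JSON object from its parsed key/value pairs.
  static Map<String, Object> collectFields(
      List<? extends Map.Entry<String, Object>> parsed, boolean allowDuplicateKeys) {
    Map<String, Object> fields = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : parsed) {
      if (fields.containsKey(entry.getKey()) && !allowDuplicateKeys) {
        throw new IllegalArgumentException("Duplicate key in JSON object: " + entry.getKey());
      }
      // Assumed policy: a later duplicate overwrites the earlier value, so the
      // resulting object never contains the same key twice.
      fields.put(entry.getKey(), entry.getValue());
    }
    return fields;
  }
}
```

Either way, the encoded Variant object ends up with unique keys, which is the point of the reply above.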

rdblue · 2025-01-23T23:46:42Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    return Arrays.copyOfRange(value, pos, pos + size);
+  }
+
+  public byte[] getMetadata() {


The use of byte[] seems awkward given the assumptions that are made. It looks like the intent is for value and metadata to either be two separate arrays starting at offset 0, or a single array with metadata coming first followed by value at pos (but in this case, the array is passed to the constructor twice).

A more common pattern would be to specify each array along with an offset and a length, so that there are no implicit assumptions about the array contents.

Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

The pos part in getValue() is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.

Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

I was referring to the possible values and intent for the pos argument and trying to understand your intent from this code. But that isn't the point I was trying to make.

The point here is that it is more common in Java to pass byte arrays with offset and length, rather than requiring that arrays are copied before passing them in. I think the use of 0-offset byte arrays is limiting.

I'm not entirely sure what the proposal is. Is this saying we should not return a byte[], but something else?
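One possible reading of the offset-and-length suggestion, sketched with the standard `java.nio.ByteBuffer` API: the caller hands over a region of a larger array and the library works on a view of it without copying. `SliceSketch` and `view` are hypothetical names, and this is not necessarily the shape the reviewer had in mind.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SliceSketch {
  // Returns a read-only, little-endian view over array[offset, offset + length).
  // No bytes are copied; index 0 of the view is array[offset].
  static ByteBuffer view(byte[] array, int offset, int length) {
    return ByteBuffer.wrap(array, offset, length)
        .slice()
        .asReadOnlyBuffer()
        // Set the byte order last: new views default to big-endian.
        .order(ByteOrder.LITTLE_ENDIAN);
  }
}
```

With this pattern, value and metadata can live anywhere in a shared array, and a "sub-variant" is just another view with a different offset and length.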

gene-db

@Fokko @rdblue Thanks for the reviews! I updated the PR.

gene-db · 2025-02-05T22:22:42Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+   * @return the JSON representation of the variant
+   * @throws MalformedVariantException if the variant is malformed
+   */
+  public String toJson(ZoneId zoneId) {


I added a toJson() overload that defaults to +00:00. The options are there so that engines can choose the behavior while sharing the same implementation.
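A small sketch of the overload pattern described here, limited to a single timestamp scalar. The names and the output format are assumptions for illustration; only the idea of a zone-taking method plus a default of +00:00 comes from the comment.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

final class TimestampJsonSketch {
  // Lowercase "xxx" prints a zero offset as "+00:00" rather than "Z".
  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSSSSSxxx");

  // Convenience overload: render in UTC when the engine has no preference.
  static String timestampToJson(long microsSinceEpoch) {
    return timestampToJson(microsSinceEpoch, ZoneOffset.UTC);
  }

  // Engine-chosen zone; both overloads share the same formatting code.
  static String timestampToJson(long microsSinceEpoch, ZoneId zoneId) {
    Instant instant = Instant.EPOCH.plus(microsSinceEpoch, ChronoUnit.MICROS);
    return '"' + FORMAT.format(instant.atZone(zoneId)) + '"';
  }
}
```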

Fokko · 2025-02-20T18:48:38Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;


Suggested change

import java.nio.ByteBuffer;

import java.nio.ByteOrder;

These can go according to spotless:

Error: src/main/java/org/apache/parquet/variant/Variant.java
Error: @@ -23,8 +23,6 @@
Error:  import·java.io.CharArrayWriter;
Error:  import·java.io.IOException;
Error:  import·java.math.BigDecimal;
Error: -import·java.nio.ByteBuffer;
Error: -import·java.nio.ByteOrder;
Error:  import·java.time.*;
Error:  import·java.time.format.DateTimeFormatter;
Error:  import·java.time.format.DateTimeFormatterBuilder;
Error: Run 'mvn spotless:apply' to fix these violations.
Error: -> [Help 1]

rdblue · 2025-02-24T23:01:06Z

parquet-variant/src/main/java/org/apache/parquet/variant/MalformedVariantException.java

+ * An exception indicating that the Variant is malformed.
+ */
+public class MalformedVariantException extends RuntimeException {
+  public MalformedVariantException() {


Is this necessary? I generally consider no-arg constructors for exception classes to be an anti-pattern because people use them without thinking about what helpful error message should be included.
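A minimal sketch of what the reviewer is asking for: an exception type that can only be constructed with a message, so every throw site has to say what was wrong. The outer class, the helper method, and the message wording are hypothetical.

```java
final class ExceptionSketch {
  // No no-arg constructor: callers must describe the problem.
  static class MalformedVariantException extends RuntimeException {
    MalformedVariantException(String message) {
      super(message);
    }

    MalformedVariantException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Hypothetical check of an offset that was read from the encoded data.
  static void checkOffsetFromData(int offset, int dataLength) {
    if (offset < 0 || offset > dataLength) {
      throw new MalformedVariantException(String.format(
          "Offset %d read from the value is outside the data section of %d bytes",
          offset, dataLength));
    }
  }
}
```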

rdblue · 2025-02-24T23:02:11Z

parquet-variant/src/main/java/org/apache/parquet/variant/UnknownVariantTypeException.java

+  /**
+   * @return the type id that was unknown
+   */
+  public int getTypeId() {


Nit: I don't recommend using get although it is popular. There is almost always a more specific verb that is better, and if not it can usually be omitted for a cleaner method name.

Renamed to typeId().

rdblue · 2025-02-24T23:08:05Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+  /**
+   * @return the number of object fields in the variant. `getType()` must be `Type.OBJECT`.
+   */
+  public int objectSize() {


The spec uses num_elements for this. I think it would be better to be clear because objectSize could return a length in bytes rather than number of fields or elements.

Renamed to numObjectElements (and the array one to numArrayElements).

rdblue · 2025-02-24T23:16:37Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+  public int getDictionaryIdAtIndex(int index) {
+    return VariantUtil.handleObject(value, pos, (size, idSize, offsetSize, idStart, offsetStart, dataStart) -> {
+      if (index < 0 || index >= size) {
+        throw VariantUtil.malformedVariant();


Why is this a malformed variant, but the same issue with a field or array results in null?

I also don't think that this can assume that the variant is malformed in this case. The index is passed in by the caller and this is a public method. The index may not exist, but if so the problem is in the caller not the data.

Actually, I don't see any uses of this. Can it be removed?

Yeah, it can be removed.

rdblue · 2025-02-24T23:28:18Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantSizeLimitException.java

+ */
+public class VariantSizeLimitException extends RuntimeException {
+  public VariantSizeLimitException(long sizeLimitBytes, long estimatedSizeBytes) {
+    super(String.format(


Why is there a size limit imposed by Parquet? We don't do this anywhere else and it isn't in the spec.

I think it would be reasonable for an engine to want to not create arbitrarily large variant values and set a maximum size limit. How can an engine configure a limit, when this variant library is doing the parsing and creation of the Variant value?

The size limit is configurable when creating the variant builder, and the library is just trying to enforce the caller-configured limit.

Would the alternative be to just let the library build the entire value, and then have the caller check the size? That would be unbounded in terms of memory usage, and would negatively affect stability.
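The argument for enforcing the limit inside the builder can be sketched as below: the check runs before each append, so memory use stays bounded by the caller-configured limit. `SizeLimitSketch` and `BoundedBuffer` are hypothetical, and the exception message is invented; only the constructor's two parameters mirror the diff above.

```java
import java.io.ByteArrayOutputStream;

final class SizeLimitSketch {
  static class VariantSizeLimitException extends RuntimeException {
    VariantSizeLimitException(long sizeLimitBytes, long estimatedSizeBytes) {
      super(String.format(
          "Variant size %d bytes exceeds the configured limit of %d bytes",
          estimatedSizeBytes, sizeLimitBytes));
    }
  }

  // Hypothetical write buffer that rejects an append before growing past the limit.
  static class BoundedBuffer {
    private final long sizeLimitBytes;
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    BoundedBuffer(long sizeLimitBytes) {
      this.sizeLimitBytes = sizeLimitBytes;
    }

    void append(byte[] bytes) {
      long estimated = (long) out.size() + bytes.length;
      if (estimated > sizeLimitBytes) {
        throw new VariantSizeLimitException(sizeLimitBytes, estimated);
      }
      out.write(bytes, 0, bytes.length);
    }

    int size() {
      return out.size();
    }
  }
}
```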

rdblue · 2025-02-24T23:29:01Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+ */
+public class VariantUtil {
+  public static final int BASIC_TYPE_BITS = 2;
+  public static final int BASIC_TYPE_MASK = 0x3;


Nit: I think it's easier to read if masks like this use 0b00000011 instead of 0x3.
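For comparison, here is how the mask reads with a binary literal. The sketch assumes the layout implied by `BASIC_TYPE_BITS` and `BASIC_TYPE_MASK`, namely that the low two bits of the header byte hold the basic type and the remaining six bits hold a type-specific header. `VALUE_HEADER_MASK` and the method names are hypothetical.

```java
final class HeaderBitsSketch {
  static final int BASIC_TYPE_BITS = 2;
  // Same value as 0x3, but the bit positions are visible at a glance.
  static final int BASIC_TYPE_MASK = 0b00000011;
  // Hypothetical mask for the six bits left after shifting out the basic type.
  static final int VALUE_HEADER_MASK = 0b00111111;

  static int basicType(byte header) {
    return header & BASIC_TYPE_MASK;
  }

  static int valueHeader(byte header) {
    // The mask also discards sign-extension bits from the byte-to-int promotion.
    return (header >> BASIC_TYPE_BITS) & VALUE_HEADER_MASK;
  }
}
```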

rdblue · 2025-02-24T23:29:58Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+  /** False value. Empty content. */
+  public static final int FALSE = 2;
+  /** 1-byte little-endian signed integer. */
+  public static final int INT1 = 3;


Minor: In the spec these are INT8, INT16, INT32, etc., which match the Parquet physical types.

rdblue · 2025-02-24T23:31:55Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+  }
+
+  public static MalformedVariantException malformedVariant() {
+    return new MalformedVariantException();


What about the variant is malformed?

rdblue · 2025-02-24T23:32:10Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+  }
+
+  public static MalformedVariantException malformedVariant(String message) {
+    return new MalformedVariantException(message);


What is the utility of this if it just calls a constructor?

rdblue · 2025-02-24T23:36:12Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+    return new MalformedVariantException();
+  }
+
+  public static UnknownVariantTypeException unknownPrimitiveTypeInVariant(int id) {


This is also just a call to the constructor. I'd remove this.

rdblue · 2025-02-25T00:04:47Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+   * @throws MalformedVariantException if the index is out of bound
+   */
+  public static void checkIndex(int pos, int length) {
+    if (pos < 0 || pos >= length) throw malformedVariant();


Style: Use curly braces even if they are not needed.

I also think this has the same issue I pointed out above, which is that there is no reason to think variant bytes are malformed just because the caller is incorrect. Instead this would normally be an IllegalArgumentException. Throwing MalformedVariantException assumes too much about how this is called.

Switched to IllegalArgumentException.
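A sketch of what the revised check might look like after this change, with braces and a message that names both the index and the bound. The message wording is an assumption; the thread only confirms the switch to `IllegalArgumentException`.

```java
final class CheckIndexSketch {
  // A bad index is the caller's mistake, not evidence of corrupt data.
  static void checkIndex(int pos, int length) {
    if (pos < 0 || pos >= length) {
      throw new IllegalArgumentException(
          String.format("Invalid index: %d (must be in [0, %d))", pos, length));
    }
  }
}
```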

rdblue · 2025-02-25T00:07:39Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+  }
+
+  private static IllegalStateException unexpectedType(Type type) {
+    return new IllegalStateException("Expect type to be " + type);


Isn't this a malformed variant?

Updated to MalformedVariantException

rdblue · 2025-02-25T00:08:31Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+   */
+  private static void checkDecimal(BigDecimal d, int maxPrecision) {
+    if (d.precision() > maxPrecision || d.scale() > maxPrecision) {
+      throw malformedVariant();


This should have a helpful error message about the data and what went wrong.

Added error message.
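The thread does not show the message that was added, so the following is only a guess at the kind of detail that helps: the offending value, its precision and scale, and the limit. The nested exception class is a stand-in for the PR's type.

```java
import java.math.BigDecimal;

final class DecimalCheckSketch {
  static class MalformedVariantException extends RuntimeException {
    MalformedVariantException(String message) {
      super(message);
    }
  }

  // Same condition as in the diff above; only the message is new (and assumed).
  static void checkDecimal(BigDecimal d, int maxPrecision) {
    if (d.precision() > maxPrecision || d.scale() > maxPrecision) {
      throw new MalformedVariantException(String.format(
          "Decimal %s has precision %d and scale %d, but both must be at most %d",
          d, d.precision(), d.scale(), maxPrecision));
    }
  }
}
```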

rdblue · 2025-02-25T00:28:09Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+          case INT2:
+          case INT4:
+          case INT8:
+            return Type.LONG;


I don't like that this implementation doesn't allow the caller to know the actual type or to get the value as an int32 or int16 when that is what is stored. This forces the caller to use a long value even for int8.

Like I've pointed out before, I don't think that storage should modify values passed to it. In this case, the values may not be modified, but it isn't possible to tell what the values were, which is basically the same problem.

I updated to return all the integer types, and added the corresponding get* in Variant.
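A rough sketch of the direction described in this reply: each stored width maps to its own type, and a narrow getter returns the value at the width it was stored. The enum constants, the method names, and the ids for `INT2`, `INT4`, and `INT8` are assumptions for illustration (only `INT1 = 3` appears in the diff earlier in this thread).

```java
final class IntegerTypeSketch {
  // Hypothetical type names; the PR may use different ones.
  enum Type { BYTE, SHORT, INT, LONG }

  static final int INT1 = 3; // from the diff in this thread
  static final int INT2 = 4; // assumed
  static final int INT4 = 5; // assumed
  static final int INT8 = 6; // assumed

  // Reports the stored width instead of collapsing everything to LONG.
  static Type integerType(int typeId) {
    switch (typeId) {
      case INT1:
        return Type.BYTE;
      case INT2:
        return Type.SHORT;
      case INT4:
        return Type.INT;
      case INT8:
        return Type.LONG;
      default:
        throw new IllegalArgumentException("Not an integer type id: " + typeId);
    }
  }

  // Reads numBytes little-endian bytes and sign-extends the result.
  static long readSignedLittleEndian(byte[] value, int pos, int numBytes) {
    long result = 0;
    for (int i = 0; i < numBytes - 1; i++) {
      result |= (long) (value[pos + i] & 0xFF) << (8 * i);
    }
    // The top byte is not masked, so its sign carries into the upper bits.
    result |= (long) value[pos + numBytes - 1] << (8 * (numBytes - 1));
    return result;
  }

  // Narrow getter: the caller gets an int16 back as a short.
  static short getShort(byte[] value, int pos) {
    return (short) readSignedLittleEndian(value, pos, 2);
  }
}
```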

rdblue · 2025-02-25T00:31:51Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    });
+  }
+
+  private static Variant getElementAtIndex(


This is only used by getElementAtIndex(int) and in the JSON conversion code. The method above, getElementAtIndex(int) isn't used at all, nor is arraySize. Does that mean this is not tested?

I added unit tests for both getElementAtIndex and arraySize.

rdblue · 2025-02-25T00:33:27Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+      gen.flush();
+      return writer.toString();
+    } catch (IOException e) {
+      throw new RuntimeException(e);


Use RuntimeIOException instead. It's also nice to add context.

Updated this one and the others to use RuntimeIOException.

rdblue · 2025-02-25T00:35:04Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+            }
+            gen.writeEndObject();
+          } catch (IOException e) {
+            throw new RuntimeException(e);


Rather than adding try/catch inline, I think it makes more sense for toJsonImpl to throw IOException so this can be handled in the wrapper methods that are public.
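The structure suggested here can be sketched as follows: the recursive implementation declares `IOException`, and only the public wrapper catches it and adds context. To keep the example self-contained, `java.io.UncheckedIOException` stands in for the `RuntimeIOException` mentioned in this thread, and a flat map of integers stands in for a variant object. All names are hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;

final class ToJsonSketch {
  // Public wrapper: the single place where the checked exception is handled.
  static String toJson(Map<String, Integer> fields) {
    StringWriter writer = new StringWriter();
    try {
      toJsonImpl(fields, writer);
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to convert variant to JSON", e);
    }
    return writer.toString();
  }

  // The implementation propagates IOException instead of wrapping it inline.
  // Keys are written without escaping, which is fine only for this sketch.
  private static void toJsonImpl(Map<String, Integer> fields, Writer out) throws IOException {
    out.write('{');
    boolean first = true;
    for (Map.Entry<String, Integer> entry : fields.entrySet()) {
      if (!first) {
        out.write(',');
      }
      first = false;
      out.write('"' + entry.getKey() + "\":" + entry.getValue());
    }
    out.write('}');
  }
}
```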

gene-db · 2025-02-27T21:01:39Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+   * @return the JSON representation of the variant
+   * @throws MalformedVariantException if the variant is malformed
+   */
+  public String toJson(ZoneId zoneId) {


This behavior is chosen by the caller/engine, but how can we avoid making the engine reimplement variant navigation during to_json evaluation? Without this option, the engine would have to either reimplement variant navigation to produce the JSON it wants or modify the JSON string afterwards.

Alternatively, do we allow the engine to pass in implementations of scalar-to-JSON-string conversion, and have the Parquet library just run them for each scalar? I wanted to avoid arbitrary representations for scalar-to-JSON conversions.

aihuaxu · 2025-03-03T18:50:57Z

parquet-variant/pom.xml

+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">


Thanks a lot @gene-db for driving the reference implementation.

I have a general question on the requirement: this PR mostly implements Parse_Json(). Are we required to construct variants with richer types (date, timestamp, etc.)? That may be out of scope for this PR. I have an implementation in Iceberg (apache/iceberg#11857) that adds the full support. As I discussed with @rdblue, that may not be required for Iceberg, but I can contribute such an implementation to Parquet after this PR if needed.

I don't think parse_json should be trying to determine what type a particular JSON string is supposed to be. The JSON spec doesn't have the richer types, so parse_json will not try to guess what the strings might be. It might be error-prone and would be costly in terms of performance. Therefore, parse_json will only use a subset of the variant types.

This PR also supports the variant builder, which supports creating variant values with all of the variant types.

gene-db added 8 commits January 6, 2025 13:21

Implement Variant encoding

c3c71b7

remove optional

c5d19e6

split test

0086b34

cleanup

5af337f

cleanup comment

5997732

Run mvn spotless:apply

de96bac

Fix dependencies

848ddcb

Fix tests for older jdk versions

1a448ea

Fokko reviewed Jan 23, 2025

rdblue reviewed Jan 23, 2025

rdblue reviewed Jan 24, 2025

gene-db added 2 commits February 5, 2025 15:05

Address PR comments

2056297

Add new variant types

1ea911c

gene-db commented Feb 5, 2025

gene-db requested review from Fokko and rdblue February 6, 2025 03:05

Fix tests for older JDK versions

cb954a6

cashmand reviewed Feb 11, 2025

Return UUID

db6b98e

gene-db requested a review from cashmand February 13, 2025 18:59

cashmand suggested changes Feb 13, 2025

Return java.util.UUID

c220c3c

Fokko reviewed Feb 20, 2025

rdblue reviewed Feb 24, 2025

rdblue reviewed Feb 25, 2025

gene-db added 2 commits February 25, 2025 10:11

mvn spotless:apply

2318114

review feedback

553dbe9

gene-db commented Feb 27, 2025

gene-db requested review from Fokko and rdblue February 27, 2025 21:11

aihuaxu reviewed Mar 3, 2025
