GH-3116: Implement the Variant binary encoding #3117

Open. gene-db wants to merge 15 commits into master.

@gene-db commented Jan 7, 2025

Rationale for this change

This is a reference implementation for the Variant binary format.

What changes are included in this PR?

A new module for encoding/decoding the Variant binary format.

Are these changes tested?

Added unit tests

Are there any user-facing changes?

No

Closes #3116

Contributor

@Fokko left a comment

Thanks for working on this @gene-db! I left some comments, but this is looking good

Comment on lines 238 to 240
if (index < 0 || index >= size) {
  throw malformedVariant();
}
Contributor

This looks inconsistent with the getFieldAtIndex where we return a null. Let's raise an exception at line 220 as well.

Author

getFieldAtIndex is a little bit different, since if a field doesn't exist in a variant value, that doesn't mean the variant value is malformed. This dictionary case is different because we are expecting an id in the dictionary to exist, but it doesn't.
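
The distinction drawn here can be sketched as follows. This is a minimal illustration, not the PR's code: `LookupSketch`, `fieldByName`, `keyForId`, and the nested exception are hypothetical stand-ins, and the dictionary is modeled as a plain list.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical stand-ins for illustration; not the classes from the PR. */
final class LookupSketch {
  static final class MalformedVariantException extends RuntimeException {
    MalformedVariantException(String message) {
      super(message);
    }
  }

  /** A missing field is a normal outcome of a lookup, so it is reported as null. */
  static Object fieldByName(Map<String, Object> fields, String name) {
    return fields.get(name);
  }

  /** An id stored in the value must resolve in the dictionary; if not, the data is inconsistent. */
  static String keyForId(List<String> dictionary, int id) {
    if (id < 0 || id >= dictionary.size()) {
      throw new MalformedVariantException(
          "Dictionary id " + id + " is out of range for a dictionary of size " + dictionary.size());
    }
    return dictionary.get(id);
  }
}
```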

Comment on lines +551 to +552
// If the value doesn't fit any integer type, parse it as decimal or floating instead.
parseAndAppendFloatingPoint(parser);
Contributor

I think this is lossy, and I'd rather raise an exception

Author

Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.
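
A rough sketch of the fallback order being discussed, under stated assumptions: `NumberFallbackSketch` and `parseNumber` are hypothetical, the input is a bare number string rather than a streaming parser token, and the maximum decimal precision of 38 is an assumption. The last branch is where precision can be lost, which is the concern raised above.

```java
import java.math.BigDecimal;

/** Hypothetical illustration of integer -> decimal -> floating-point fallback. */
final class NumberFallbackSketch {
  // Assumption: the largest decimal precision the encoding can hold.
  static final int MAX_DECIMAL_PRECISION = 38;

  static Object parseNumber(String text) {
    try {
      return Long.parseLong(text);
    } catch (NumberFormatException e) {
      // Not an integer that fits in 64 bits; try the wider representations below.
    }
    BigDecimal d = new BigDecimal(text);
    if (d.scale() >= 0 && d.scale() <= MAX_DECIMAL_PRECISION && d.precision() <= MAX_DECIMAL_PRECISION) {
      return d; // exact
    }
    return Double.parseDouble(text); // may round or overflow to infinity
  }
}
```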

 * Builder for creating Variant value and metadata.
 */
public class VariantBuilder {
  public VariantBuilder(boolean allowDuplicateKeys) {
Contributor

Why would we allow this? This isn't allowed by the spec

Author

This is not for writing duplicate keys in the Variant value itself, but for parsing JSON strings. JSON strings might have duplicate keys, and this flag controls the behavior when encountering duplicate keys.

I added a comment to clarify.
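
As an illustration of what such a flag might control while parsing JSON, here is a minimal sketch. `DuplicateKeySketch` and `collect` are hypothetical, and the "last value wins" policy when duplicates are allowed is an assumption; the thread does not state which value the PR keeps.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a duplicate-key policy for parsed JSON objects. */
final class DuplicateKeySketch {
  /** Each entry is a {key, value} pair in document order. */
  static Map<String, String> collect(List<String[]> entries, boolean allowDuplicateKeys) {
    Map<String, String> fields = new LinkedHashMap<>();
    for (String[] entry : entries) {
      if (fields.containsKey(entry[0]) && !allowDuplicateKeys) {
        throw new IllegalArgumentException("Duplicate key in JSON object: " + entry[0]);
      }
      // Assumption: when duplicates are allowed, the last value wins.
      fields.put(entry[0], entry[1]);
    }
    return fields;
  }
}
```

Either way, the resulting object holds each key once, so the written Variant value itself never contains duplicate keys.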

return Arrays.copyOfRange(value, pos, pos + size);
}

public byte[] getMetadata() {
Contributor

The use of byte[] seems awkward given the assumptions that are made. It looks like the intent is for value and metadata to either be two separate arrays starting at offset 0, or a single array with metadata coming first followed by value at pos (but in this case, the array is passed to the constructor twice).

A more common pattern would be to specify each array along with an offset and a length, so that there are no implicit assumptions about the array contents.

Author

Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

The pos part in getValue() is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.

Contributor

> Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

I was referring to the possible values and intent for the pos argument and trying to understand your intent from this code. But that isn't the point I was trying to make.

The point here is that it is more common in Java to pass byte arrays with offset and length, rather than requiring that arrays are copied before passing them in. I think the use of 0-offset byte arrays is limiting.
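
A minimal sketch of the offset-and-length pattern being suggested, using only `java.nio.ByteBuffer`. `SliceSketch` and `view` are hypothetical names and this is not the API proposed in the PR; it only shows how a sub-range can be exposed without copying.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration: expose a sub-range of an array without copying it. */
final class SliceSketch {
  /** Returns a read-only view of bytes [offset, offset + length). */
  static ByteBuffer view(byte[] array, int offset, int length) {
    if (offset < 0 || length < 0 || offset > array.length - length) {
      throw new IllegalArgumentException(
          "Invalid range: offset=" + offset + ", length=" + length + ", array length=" + array.length);
    }
    // wrap() sets position and limit to the range; slice() rebases it so index 0 is `offset`.
    return ByteBuffer.wrap(array, offset, length).slice().asReadOnlyBuffer();
  }
}
```

With this shape, a caller holding metadata and value in one larger buffer can pass each region directly instead of copying it into a 0-offset array first.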

Author

I'm not entirely sure what the proposal is. Is this saying we should not return a byte[], but something else?

Author

@gene-db left a comment

@Fokko @rdblue Thanks for the reviews! I updated the PR.

* @return the JSON representation of the variant
* @throws MalformedVariantException if the variant is malformed
*/
public String toJson(ZoneId zoneId) {
Author

I added the toJson() which defaults to +00:00. The options are there for engines to choose the behavior, while sharing the same implementation.
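
To make the option concrete, here is a small sketch of a zone-dependent rendering with a +00:00 default. `ZoneSketch`, `timestampToJson`, and the format pattern are hypothetical; the PR's actual timestamp formatting is not shown in this thread.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical illustration of a zone-aware JSON rendering with a UTC default. */
final class ZoneSketch {
  // Pattern letter 'x' prints a zero offset as +00:00 rather than Z.
  private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ssxxx");

  /** Defaults to +00:00 when the caller does not choose a zone. */
  static String timestampToJson(Instant ts) {
    return timestampToJson(ts, ZoneOffset.UTC);
  }

  static String timestampToJson(Instant ts, ZoneId zoneId) {
    return "\"" + FORMAT.format(ts.atZone(zoneId)) + "\"";
  }
}
```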

@gene-db requested review from Fokko and rdblue February 6, 2025 03:05
@gene-db requested a review from cashmand February 13, 2025 18:59

Comment on lines 26 to 27
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
Contributor

Suggested change
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

These can go according to spotless:

Error:      src/main/java/org/apache/parquet/variant/Variant.java
Error:          @@ -23,8 +23,6 @@
Error:           import·java.io.CharArrayWriter;
Error:           import·java.io.IOException;
Error:           import·java.math.BigDecimal;
Error:          -import·java.nio.ByteBuffer;
Error:          -import·java.nio.ByteOrder;
Error:           import·java.time.*;
Error:           import·java.time.format.DateTimeFormatter;
Error:           import·java.time.format.DateTimeFormatterBuilder;
Error:  Run 'mvn spotless:apply' to fix these violations.
Error:  -> [Help 1]

Author

Fixed.

* An exception indicating that the Variant is malformed.
*/
public class MalformedVariantException extends RuntimeException {
public MalformedVariantException() {
Contributor

Is this necessary? I generally consider no-arg constructors for exception classes to be an anti-pattern because people use them without thinking about what helpful error message should be included.

Author

Removed.

/**
* @return the type id that was unknown
*/
public int getTypeId() {
Contributor

Nit: I don't recommend using get although it is popular. There is almost always a more specific verb that is better, and if not it can usually be omitted for a cleaner method name.

Author

Renamed to typeId().

/**
* @return the number of object fields in the variant. `getType()` must be `Type.OBJECT`.
*/
public int objectSize() {
Contributor

The spec uses num_elements for this. I think it would be better to be clear because objectSize could return a length in bytes rather than number of fields or elements.

Author

Renamed to numObjectElements (and the array one to numArrayElements).

public int getDictionaryIdAtIndex(int index) {
  return VariantUtil.handleObject(value, pos, (size, idSize, offsetSize, idStart, offsetStart, dataStart) -> {
    if (index < 0 || index >= size) {
      throw VariantUtil.malformedVariant();
Contributor

Why is this a malformed variant, but the same issue with a field or array results in null?

I also don't think that this can assume that the variant is malformed in this case. The index is passed in by the caller and this is a public method. The index may not exist, but if so the problem is in the caller not the data.

Contributor

Actually, I don't see any uses of this. Can it be removed?

Author

Yeah, it can be removed.

*/
public class VariantSizeLimitException extends RuntimeException {
public VariantSizeLimitException(long sizeLimitBytes, long estimatedSizeBytes) {
super(String.format(
Contributor

Why is there a size limit imposed by Parquet? We don't do this anywhere else and it isn't in the spec.

Author

I think it would be reasonable for an engine to want to not create arbitrarily large variant values and set a maximum size limit. How can an engine configure a limit, when this variant library is doing the parsing and creation of the Variant value?

The size limit is configurable when creating the variant builder, and the library is just trying to enforce the caller-configured limit.

Would the alternative be to just let the library build the entire value, and then have the caller check the size? That would be unbounded in terms of memory usage, and would negatively affect stability.
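
A minimal sketch of enforcing a caller-configured limit during construction rather than after it. `BoundedBufferSketch` is hypothetical, and `IllegalStateException` stands in for the PR's `VariantSizeLimitException`; the point is only that the check runs before each append grows the buffer.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration: reject an append before the buffer grows past the limit. */
final class BoundedBufferSketch {
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();
  private final long sizeLimitBytes;

  BoundedBufferSketch(long sizeLimitBytes) {
    this.sizeLimitBytes = sizeLimitBytes;
  }

  void append(byte[] bytes) {
    long estimatedSizeBytes = (long) out.size() + bytes.length;
    if (estimatedSizeBytes > sizeLimitBytes) {
      // Stand-in for the PR's VariantSizeLimitException.
      throw new IllegalStateException(String.format(
          "Variant size of %d bytes exceeds the configured limit of %d bytes",
          estimatedSizeBytes, sizeLimitBytes));
    }
    out.write(bytes, 0, bytes.length);
  }

  int size() {
    return out.size();
  }
}
```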

*/
public class VariantUtil {
public static final int BASIC_TYPE_BITS = 2;
public static final int BASIC_TYPE_MASK = 0x3;
Contributor

Nit: I think it's easier to read if masks like this use 0b00000011 instead of 0x3.

Author

Updated.
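
For illustration, a short sketch of how a binary-literal mask reads when splitting a header byte. `HeaderSketch`, `basicType`, and `typeInfo` are hypothetical names, and the layout (basic type in the low two bits, remaining six bits above it) is an assumption based on the constant names shown in the diff.

```java
/** Hypothetical illustration of splitting a header byte with a binary-literal mask. */
final class HeaderSketch {
  static final int BASIC_TYPE_BITS = 2;
  static final int BASIC_TYPE_MASK = 0b00000011;

  /** Assumption: the basic type occupies the low two bits of the header byte. */
  static int basicType(byte header) {
    return header & BASIC_TYPE_MASK;
  }

  /** The remaining six bits, shifted down; the byte is widened unsigned first. */
  static int typeInfo(byte header) {
    return (header & 0xFF) >>> BASIC_TYPE_BITS;
  }
}
```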

/** False value. Empty content. */
public static final int FALSE = 2;
/** 1-byte little-endian signed integer. */
public static final int INT1 = 3;
Contributor

Minor: In the spec these are INT8, INT16, INT32, etc that match the Parquet physical types.

Author

Renamed.

}

public static MalformedVariantException malformedVariant() {
return new MalformedVariantException();
Contributor

What about the variant is malformed?

Author

Removed.

}

public static MalformedVariantException malformedVariant(String message) {
return new MalformedVariantException(message);
Contributor

What is the utility of this if it just calls a constructor?

Author

Removed.

return new MalformedVariantException();
}

public static UnknownVariantTypeException unknownPrimitiveTypeInVariant(int id) {
Contributor

This is also just a call to the constructor. I'd remove this.

Author

Removed.

* @throws MalformedVariantException if the index is out of bound
*/
public static void checkIndex(int pos, int length) {
  if (pos < 0 || pos >= length) throw malformedVariant();
Contributor

@rdblue Feb 25, 2025

Style: Use curly braces even if they are not needed.

I also think this has the same issue I pointed out above, which is that there is no reason to think variant bytes are malformed just because the caller is incorrect. Instead this would normally be an IllegalArgumentException. Throwing MalformedVariantException assumes too much about how this is called.

Author

Switched to IllegalArgumentException.
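
One way the revised check might look, with both review points applied (curly braces, and an argument error rather than a data error). This is a sketch, not the merged code; `IndexCheckSketch` and the message wording are hypothetical.

```java
/** Hypothetical illustration of the revised bounds check. */
final class IndexCheckSketch {
  /** A bad index is a caller error, so it is reported as an illegal argument. */
  static void checkIndex(int pos, int length) {
    if (pos < 0 || pos >= length) {
      throw new IllegalArgumentException(
          String.format("Invalid index: %d (must be in [0, %d))", pos, length));
    }
  }
}
```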

}

private static IllegalStateException unexpectedType(Type type) {
  return new IllegalStateException("Expect type to be " + type);
Contributor

Isn't this a malformed variant?

Author

Updated to MalformedVariantException

*/
private static void checkDecimal(BigDecimal d, int maxPrecision) {
  if (d.precision() > maxPrecision || d.scale() > maxPrecision) {
    throw malformedVariant();
Contributor

This should have a helpful error message about the data and what went wrong.

Author

Added error message.
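
A sketch of what a more helpful message could contain, keeping the condition from the diff. `DecimalCheckSketch`, the nested exception class, and the message wording are hypothetical; the message actually added in the PR is not shown in this thread.

```java
import java.math.BigDecimal;

/** Hypothetical illustration of a decimal check that explains what went wrong. */
final class DecimalCheckSketch {
  static final class MalformedVariantException extends RuntimeException {
    MalformedVariantException(String message) {
      super(message);
    }
  }

  static void checkDecimal(BigDecimal d, int maxPrecision) {
    if (d.precision() > maxPrecision || d.scale() > maxPrecision) {
      throw new MalformedVariantException(String.format(
          "Decimal %s has precision %d and scale %d; both must be at most %d",
          d, d.precision(), d.scale(), maxPrecision));
    }
  }
}
```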

case INT2:
case INT4:
case INT8:
  return Type.LONG;
Contributor

I don't like that this implementation doesn't allow the caller to know the actual type or to get the value as an int32 or int16 when that is what is stored. This forces the caller to use a long value even for int8.

Like I've pointed out before, I don't think that storage should modify values passed to it. In this case, the values may not be modified, but it isn't possible to tell what the values were, which is basically the same problem.

Author

I updated to return all the integer types, and added the corresponding get* in Variant.
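
A minimal sketch of a mapping that preserves the stored integer width instead of collapsing every width to a long. The `Type` enum, `typeOf`, and the id values for the 2-, 4-, and 8-byte integers are assumptions for illustration (only the id 3 for the 1-byte integer appears in the diff above), and the constants use the post-rename spec-style names.

```java
/** Hypothetical illustration: keep the stored integer width visible to the caller. */
final class IntTypeSketch {
  enum Type { BYTE, SHORT, INT, LONG }

  // Assumption: ids continue sequentially from the 1-byte integer id shown in the diff.
  static final int INT8 = 3;
  static final int INT16 = 4;
  static final int INT32 = 5;
  static final int INT64 = 6;

  static Type typeOf(int typeInfo) {
    switch (typeInfo) {
      case INT8:
        return Type.BYTE;
      case INT16:
        return Type.SHORT;
      case INT32:
        return Type.INT;
      case INT64:
        return Type.LONG;
      default:
        throw new IllegalArgumentException("Not an integer type id: " + typeInfo);
    }
  }
}
```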

});
}

private static Variant getElementAtIndex(
Contributor

This is only used by getElementAtIndex(int) and in the JSON conversion code. The method above, getElementAtIndex(int) isn't used at all, nor is arraySize. Does that mean this is not tested?

Author

I added unit tests for both getElementAtIndex and arraySize.

  gen.flush();
  return writer.toString();
} catch (IOException e) {
  throw new RuntimeException(e);
Contributor

Use RuntimeIOException instead. It's also nice to add context.

Author

Updated this one and the others to use RuntimeIOException.

}
gen.writeEndObject();
} catch (IOException e) {
throw new RuntimeException(e);
Contributor

Rather than adding try/catch inline, I think it makes more sense for toJsonImpl to throw IOException so this can be handled in the wrapper methods that are public.

Author

Done.
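
The structure suggested above can be sketched like this: the recursive implementation declares `IOException`, and only the public wrapper converts it, adding context. `ToJsonSketch` is hypothetical, it serializes a plain list rather than a Variant, and `java.io.UncheckedIOException` stands in for the `RuntimeIOException` mentioned in the review.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

/** Hypothetical illustration: handle IOException once, in the public wrapper. */
final class ToJsonSketch {
  static String toJson(List<Integer> values) {
    StringWriter writer = new StringWriter();
    try {
      toJsonImpl(values, writer);
    } catch (IOException e) {
      // Stand-in for RuntimeIOException; the message adds context for the caller.
      throw new UncheckedIOException("Failed to convert variant to JSON", e);
    }
    return writer.toString();
  }

  /** No inline try/catch here; the checked exception propagates to the wrapper. */
  private static void toJsonImpl(List<Integer> values, Writer out) throws IOException {
    out.write('[');
    for (int i = 0; i < values.size(); i++) {
      if (i > 0) {
        out.write(',');
      }
      out.write(Integer.toString(values.get(i)));
    }
    out.write(']');
  }
}
```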

Author

@gene-db left a comment

@Fokko @rdblue Thanks for the reviews! I updated the PR.

* @return the JSON representation of the variant
* @throws MalformedVariantException if the variant is malformed
*/
public String toJson(ZoneId zoneId) {
Author

This behavior is chosen by the caller/engine, but how can we avoid making the engine reimplement variant navigation during to_json evaluation? Without this option, the engine would have to reimplement variant navigation to produce the JSON it wants, or modify the JSON string.

Alternatively, do we allow the engine to pass in implementations of scalar-to-JSON-string conversion, and have the Parquet library just run them for each scalar? I wanted to avoid arbitrary possible representations for scalar-to-JSON conversions.

@gene-db requested review from Fokko and rdblue February 27, 2025 21:11
~ specific language governing permissions and limitations
~ under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
@aihuaxu Mar 3, 2025

Thanks a lot @gene-db for driving the reference implementation.

I have a general question about the requirements: this PR mostly implements Parse_Json(). Are we also required to construct variants with richer types such as date and timestamp? That may be out of scope for this PR. I have an implementation in Iceberg (apache/iceberg#11857) that adds the full support. As I discussed with @rdblue, that may not be required for Iceberg, but I can include such an implementation in Parquet after this PR if needed.

Author

I don't think parse_json should be trying to determine what type a particular JSON string is supposed to be. The JSON spec doesn't have the richer types, so parse_json will not try to guess what the strings might be. It might be error-prone and would be costly in terms of performance. Therefore, parse_json will only use a subset of the variant types.

This PR also supports the variant builder, which supports creating variant values with all of the variant types.

Successfully merging this pull request may close these issues.

Implement the Variant binary encoding
5 participants