# OpenAPI: Add ContentFile types to spec for the PreplanTable and PlanTable API #9717

## Conversation
open-api/rest-catalog-open-api.yaml (outdated)

```
@@ -3324,6 +3348,97 @@ components:
    type: integer
    format: int64

TypeValue:
  oneOf:
    - $ref: '#/components/schemas/PrimitiveTypeValue'
```
---
My take is that we should have one for each type, for example `#/components/schemas/DecimalTypeValue` and `#/components/schemas/TimestampTypeValue`, although they are all strings, and then we can document how each type is serialized differently in the definition section of each type value.
---
thoughts? @danielcweeks @nastra @amogh-jahagirdar
---
I think documenting how the different types are serialized is important, but just to make sure: `DecimalTypeValue` and `TimestampTypeValue` are just examples, right? We don't need those data types spec'd out for what we're trying to do here, right?
---
So to summarize, the two options we are discussing so far:

Option 1: a different schema per type

```yaml
TypeValue:
  oneOf:
    - $ref: '#/components/schemas/DecimalTypeValue'
    - $ref: '#/components/schemas/TimestampTypeValue'
    - ...

DecimalTypeValue:
  type: string
  description: // describe serialization for decimal
  ...

TimestampTypeValue:
  type: string
  description: // describe serialization for timestamp
  ...
```

Option 2: the same schema, just document the different serializations

```yaml
TypeValue:
  oneOf:
    - $ref: '#/components/schemas/PrimitiveTypeValue'
    - ...

PrimitiveTypeValue:
  description: // describe serialization for all primitive types
  oneOf:
    - string
    - ...
```
---
Per the conversation in the prior PR about passing content files to the service, the first approach is beneficial if there is a use case for building custom deserializers.
---
Ah, my bad, I only saw `ContentFile`. For the actual update we need the schema/partition spec, so yes, these data types do need to be spec'd out in REST.

Option 1 seems better to me; it has less indirection than option 2, and the types are laid out explicitly.
---
Yeah, +1 for option 1. @geruh, I would suggest we first move in that direction and make the corresponding changes while waiting for more feedback.
---
+1 for option 1.
I also commented on #9695 about the non-primitive types. I think that structs should use a simple array of these primitive values.
@jackye1995, should we move these additions to a separate PR so that the scan APIs are not blocked by the append API?
---
The append API change is pretty minimal if we exclude these type changes, which is why it was kept here. But it seems we have other thoughts about treating the append API as a separate API, so in that case, +1, let's separate these two changes.

We can either create a new PR for the append API or for the content file. @geruh, up to you.
---
I changed the focus of this PR to the `ContentFile` and type spec changes, to preserve the comments. I'll open a new PR later for the Append changes.

---

Will take a look at this PR tomorrow morning @geruh @jackye1995!
---
open-api/rest-catalog-open-api.yaml

```
    type: integer
    format: int64

FloatTypeValue:
```
---
I know this is based on the Java implementation, but would this really work? How do we preserve float precision in JSON? Did it cause any issues in the past for the Flink use case? @stevenzwu
---
I doubt this would have been an issue for Flink because the tasks probably don't need to contain stats. Stats would be used at planning time on the coordinator/driver, but aren't needed on the task side.
---
I think if we need a lossless spec in JSON, we should probably change this definition. For example, we can store it as a byte string, or a long value which has the same binary representation. Any thoughts in particular?
---
The single-value JSON serialization should be lossless; you just need to encode values with enough precision to capture the whole value.

While not every decimal number has an exact binary representation, each binary representation can be expressed as a decimal number. Here's a Stack Overflow answer I found to sanity-check that: https://stackoverflow.com/questions/68943707/are-there-any-binary-values-that-dont-have-exact-representation-in-decimal
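For example, a quick Java check of that claim (a minimal sketch; the `BigDecimal(double)` constructor exposes the exact binary value of a double):

```java
import java.math.BigDecimal;

public class ExactDecimal {
    public static void main(String[] args) {
        // BigDecimal(double) preserves the exact binary value, so printing
        // it shows the full decimal expansion of the IEEE 754 double
        // closest to 0.1.
        System.out.println(new BigDecimal(0.1));
        // 0.1000000000000000055511151231257827021181583404541015625
    }
}
```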
---
My understanding is that we need to consider both directions, because this is a wire protocol and you expect the other side to deserialize to exactly the same value. Suppose A sends B a JSON value of the upper bound `{ 1: 0.1 }`, where column 1 is of float type; the other side cannot really reconstruct the original float binary.

Maybe I am misunderstanding what you mean by "each binary representation can be expressed as a decimal number"; could you help walk through that?
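For context, a minimal Java sketch of the round-trip property being debated (assuming both sides use standard IEEE 754 parsing, as `Float.toString`/`Float.parseFloat` do):

```java
public class FloatRoundTrip {
    public static void main(String[] args) {
        float original = 0.1f;
        // Float.toString emits the shortest decimal string that uniquely
        // identifies the float value...
        String wire = Float.toString(original); // "0.1"
        // ...so parsing it back yields exactly the same bits.
        float restored = Float.parseFloat(wire);
        System.out.println(
            Float.floatToIntBits(original) == Float.floatToIntBits(restored)); // true
    }
}
```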
---
I think there is a potential issue with this in the current implementation. We directly use Jackson's `writeNumber(float)`, which has some default rounding behavior if I remember correctly. Let me double-check.
---
I did some experiments, and I think the current `SingleValueParser` implementation has some problems.

The Jackson `writeNumber(float/double)` method just writes the string form of the number by calling the `Float/Double.toString()` method internally. The linked Javadoc explains how the string conversion is done.

The conversion fundamentally has two problems: (1) data that is too big or too small (outside the 10^-3 to 10^7 range) is written in scientific notation, and the JSON representation will be a string rather than a number; for example, 10^20 is written as `"1.0E20"`. (2) The result is lossy, because it is the nearest approximation to the true value; this is not serializing the float/double to the exact decimal representation we discussed above. For example, there is a very small chance that the value `3.0` is serialized as `2.99999999999999`, and when deserialized back it is probably still the `3.0` double value, but sometimes it will be just `2.99999999999999`. This becomes a correctness issue for use cases like row-level filtering, where a user can define a filter against a double like `a < 3` and get an unexpected result. We actually saw this exact issue in the past in Lake Formation row-level filtering with Athena, so I suggest we be very cautious here.

In general, I think achieving the true decimal representation will actually be more space- and compute-intensive than just storing the binary representation. We can easily store the binary form of a double by using `Double.doubleToRawLongBits` to store a long value in the serialized form, and deserialize it back using the reverse `longBitsToDouble` method. I think we should consider using this approach.
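A minimal sketch of that bit-level round trip, assuming the long is what would travel in the JSON payload:

```java
public class DoubleBitsRoundTrip {
    public static void main(String[] args) {
        double original = 0.1;

        // Serialize: capture the exact IEEE 754 bit pattern as a long,
        // which survives JSON as an ordinary integer token.
        long bits = Double.doubleToRawLongBits(original);

        // Deserialize: restore the double from the same bit pattern.
        double restored = Double.longBitsToDouble(bits);

        System.out.println(Long.toHexString(bits));                  // 3fb999999999999a
        System.out.println(Double.compare(original, restored) == 0); // true
    }
}
```

One caveat, foreshadowing the RFC 8259 point below: a raw 64-bit long can exceed the 2^53 integer range that JSON parsers reliably preserve, so the long may need to be carried as a string.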
---
> We can easily store the binary form of a double by using `Double.doubleToRawLongBits` to store a long value in the serialized form, and deserialize it back using the reverse `longBitsToDouble` method.

@jackye1995 I thought that won't conform to the JSON spec: https://datatracker.ietf.org/doc/html/rfc8259#section-6
> This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.
---
@jackye1995, we should make sure that we produce values correctly and don't modify them in serialization or deserialization.

However, for the purpose of this spec I think it is okay to move forward. Both float and double are part of the OpenAPI Spec (OAS), so I think they can be exchanged correctly. We just need to make sure our implementation does so.
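For reference, a sketch of what the OAS-native definitions could look like (assuming plain JSON numbers; `DoubleTypeValue` is a hypothetical companion to the `FloatTypeValue` shown in the diff above):

```yaml
FloatTypeValue:
  type: number
  format: float
DoubleTypeValue:
  type: number
  format: double
```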
---
Sure, we can move forward with this first; I am not fully convinced yet, but I don't want to block on this for too long. I think the issues I see are mostly due to Jackson's serialization implementation not complying with the standard you quoted above. We can address that in subsequent implementation PRs.
---

Mostly looks good to me. I flagged a couple of minor things. Also, I don't think that we resolved this thread: #9717 (comment)
---
Looks good to me!

---

This looks good to me now. Since we have 3 approvals, I'll go ahead and merge it.
---

Let's split the REST API changes from the implementation, since there is some overlap with the REST scan API. We are now serializing DataFiles using the `ContentFileParser` and passing them to the service. `ContentFileParser` requires additional information, such as the `Schema` and `PartitionSpec` of the table, to successfully round-trip the updates and apply them to a table. Furthermore, this change includes a dependency on the models for the `SingleValueParser`, which is used to serialize the values of the partition spec, so the models follow how its `toJson` method writes the data.
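With the proposed spec, a serialized data file might look like the sketch below; the field names follow my reading of `ContentFileParser`'s JSON output and should be treated as an assumption rather than the authoritative shape:

```json
{
  "spec-id": 0,
  "content": "data",
  "file-path": "s3://bucket/warehouse/db/table/data/00000-0.parquet",
  "file-format": "parquet",
  "partition": [42],
  "file-size-in-bytes": 1024,
  "record-count": 100
}
```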
These model changes stem from the conversation here: #9292
cc @jackye1995 @rdblue @danielcweeks @rahil-c
Testing

OpenAPI: ran `make install`, `make lint`, and `make generate`.