# OpenAPI: Add ContentFile types to spec for the PreplanTable and PlanTable API #9717

## Conversation
open-api/rest-catalog-open-api.yaml (outdated)

```
@@ -3324,6 +3348,97 @@ components:
    type: integer
    format: int64

TypeValue:
  oneOf:
    - $ref: '#/components/schemas/PrimitiveTypeValue'
```
---
My take is that we should have one for each type, for example `#/components/schemas/DecimalTypeValue` and `#/components/schemas/TimestampTypeValue`, although they are all strings, and then we can document how each type is serialized differently in the definition section of each type value.
---
thoughts? @danielcweeks @nastra @amogh-jahagirdar
---
I think documenting how the different types are serialized is important, but just to make sure: `DecimalTypeValue` and `TimestampTypeValue` are just examples, right? We don't need those data types spec'd out for what we're trying to do here, right?
---
So to summarize, the two options we are discussing so far:

Option 1: a different schema per type

```yaml
TypeValue:
  oneOf:
    - $ref: '#/components/schemas/DecimalTypeValue'
    - $ref: '#/components/schemas/TimestampTypeValue'
    - ...

DecimalTypeValue:
  type: string
  description: // describe serialization for decimal
  ...

TimestampTypeValue:
  type: string
  description: // describe serialization for timestamp
  ...
```

Option 2: the same schema, just document the different serializations

```yaml
TypeValue:
  oneOf:
    - $ref: '#/components/schemas/PrimitiveTypeValue'
    - ...

PrimitiveTypeValue:
  description: // describe serialization for all primitive types
  oneOf:
    - string
    - ...
```
---
Per the conversation in the prior PR about passing content files to the service, the first approach is beneficial if there is a use case for building custom deserializers.
---
Ah, my bad, I only saw `ContentFile`. For the actual update we need the schema/partition spec, so yes, these data types do need to be spec'd out in REST.

Option 1 seems better to me; it has less indirection than option 2, and the types are laid out explicitly.
---
Yeah, +1 for option 1. @geruh, I would suggest we first move in that direction and make the corresponding changes while waiting for more feedback.
---
+1 for option 1.
I also commented on #9695 about the non-primitive types. I think that structs should use a simple array of these primitive values.
@jackye1995, should we move these additions to a separate PR so that the scan APIs are not blocked by the append API?
---
The append API change is pretty minimal if we exclude these type changes, which is why it was kept here. But it seems we have other thoughts about treating the append API as a separate API, so in that case, +1, let's separate these two changes.

We can either create a new PR for the append API or for the content file. @geruh, up to you.
---
I changed the focus of this PR to the `ContentFile` and type spec changes, to preserve the comments. I'll open a new PR later for the Append changes.

---

Will take a look at this PR tomorrow morning @geruh @jackye1995!
---
open-api/rest-catalog-open-api.yaml

```
    type: integer
    format: int64

FloatTypeValue:
```
---
I know this is based on the Java implementation, but would this really work? How do we preserve float precision in JSON? Did it cause any issues in the past for the Flink use case? @stevenzwu
---
I doubt this would have been an issue for Flink because the tasks probably don't need to contain stats. Stats would be used at planning time on the coordinator/driver, but aren't needed on the task side.
---
I think if we need a lossless spec in JSON, we should probably change this definition. For example, we can store it as a byte string, or a long value which has the same binary representation. Any thoughts in particular?
---
The single-value JSON serialization should be lossless; you just need to encode values with enough precision to capture the whole value.

While not every decimal number has an exact binary representation, each binary representation can be expressed as a decimal number. Here's a Stack Overflow answer I found to sanity-check that: https://stackoverflow.com/questions/68943707/are-there-any-binary-values-that-dont-have-exact-representation-in-decimal
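For example, a quick Java check of that claim (a minimal sketch; the `BigDecimal(double)` constructor exposes the exact binary value of a double):

```java
import java.math.BigDecimal;

public class ExactDecimal {
    public static void main(String[] args) {
        // BigDecimal(double) preserves the exact binary value, so printing
        // it shows the full decimal expansion of the IEEE 754 double
        // closest to 0.1.
        System.out.println(new BigDecimal(0.1));
        // 0.1000000000000000055511151231257827021181583404541015625
    }
}
```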
---
My understanding is that we need to consider both directions, because this is a wire protocol and you expect the other side to deserialize to exactly the same value. Suppose A sends B a JSON value of the upper bound `{ 1: 0.1 }`, where column 1 is of float type; the other side cannot really reconstruct the original float binary.

Maybe I am misunderstanding what you mean by "each binary representation can be expressed as a decimal number"; could you help walk through that?
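For context, a minimal Java sketch of the round-trip property being debated (assuming both sides use standard IEEE 754 parsing, as `Float.toString`/`Float.parseFloat` do):

```java
public class FloatRoundTrip {
    public static void main(String[] args) {
        float original = 0.1f;
        // Float.toString emits the shortest decimal string that uniquely
        // identifies the float value...
        String wire = Float.toString(original); // "0.1"
        // ...so parsing it back yields exactly the same bits.
        float restored = Float.parseFloat(wire);
        System.out.println(
            Float.floatToIntBits(original) == Float.floatToIntBits(restored)); // true
    }
}
```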
---
I think there is a potential issue with this in the current implementation. We directly use Jackson's `writeNumber(float)`, which has some default rounding behavior if I remember correctly. Let me double-check.
---
I did some experiments, and I think the current `SingleValueParser` implementation has some problems.

The Jackson `writeNumber(float/double)` method just writes the string form of the number by calling the `Float/Double.toString()` method internally. The linked Javadoc explains how the string conversion is done.

The conversion fundamentally has two problems: (1) data that is too big or too small (outside the 10^-3 to 10^7 range) is written in scientific notation, and the JSON representation will be a string rather than a number; for example, 10^20 is written as `"1.0E20"`. (2) The result is lossy, because it is the nearest approximation to the true value; this is not serializing the float/double to the exact decimal representation we discussed above. For example, there is a very small chance that the value `3.0` is serialized as `2.99999999999999`, and when deserialized back it is probably still the `3.0` double value, but sometimes it will be just `2.99999999999999`. This becomes a correctness issue for use cases like row-level filtering, where a user can define a filter against a double like `a < 3` and get an unexpected result. We actually saw this exact issue in the past in Lake Formation row-level filtering with Athena, so I suggest we be very cautious here.

In general, I think achieving the true decimal representation will actually be more space- and compute-intensive than just storing the binary representation. We can easily store the binary form of a double by using `Double.doubleToRawLongBits` to store a long value in the serialized form, and deserialize it back using the reverse `longBitsToDouble` method. I think we should consider using this approach.
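A minimal sketch of that bit-level round trip, assuming the long is what would travel in the JSON payload:

```java
public class DoubleBitsRoundTrip {
    public static void main(String[] args) {
        double original = 0.1;

        // Serialize: capture the exact IEEE 754 bit pattern as a long,
        // which survives JSON as an ordinary integer token.
        long bits = Double.doubleToRawLongBits(original);

        // Deserialize: restore the double from the same bit pattern.
        double restored = Double.longBitsToDouble(bits);

        System.out.println(Long.toHexString(bits));                  // 3fb999999999999a
        System.out.println(Double.compare(original, restored) == 0); // true
    }
}
```

One caveat, foreshadowing the RFC 8259 point below: a raw 64-bit long can exceed the 2^53 integer range that JSON parsers reliably preserve, so the long may need to be carried as a string.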
---
> We can easily store the binary form of a double by using `Double.doubleToRawLongBits` to store a long value in the serialized form, and deserialize it back using the reverse `longBitsToDouble` method.

@jackye1995 I thought that won't conform to the JSON spec: https://datatracker.ietf.org/doc/html/rfc8259#section-6
> This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.
---
@jackye1995, we should make sure that we produce values correctly and don't modify them in serialization or deserialization.

However, for the purpose of this spec I think it is okay to move forward. Both float and double are part of the OpenAPI Spec (OAS), so I think they can be exchanged correctly. We just need to make sure our implementation does so.
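For reference, a sketch of what the OAS-native definitions could look like (assuming plain JSON numbers; `DoubleTypeValue` is a hypothetical companion to the `FloatTypeValue` shown in the diff above):

```yaml
FloatTypeValue:
  type: number
  format: float
DoubleTypeValue:
  type: number
  format: double
```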
---
Sure, we can move forward with this first; I am not fully convinced yet, but I don't want to block on this for too long. I think the issues I see are mostly due to Jackson's serialization implementation not complying with the standard you quoted above. We can address that in subsequent implementation PRs.
---

Mostly looks good to me. I flagged a couple of minor things. Also, I don't think that we resolved this thread: #9717 (comment)
---
Looks good to me!

---

This looks good to me now. Since we have 3 approvals, I'll go ahead and merge it.
---

Let's split the REST API changes from the implementation, since there is some overlap with the REST scan API. We are now serializing DataFiles using the `ContentFileParser` and passing them to the service. `ContentFileParser` requires additional information, such as the `Schema` and `PartitionSpec` of the table, to successfully round-trip the updates and apply them to a table. Furthermore, this change includes a dependency on the models for the `SingleValueParser`, which is used to serialize the values of the partition spec, so the models follow how its `toJson` method writes the data.
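With the proposed spec, a serialized data file might look like the sketch below; the field names follow my reading of `ContentFileParser`'s JSON output and should be treated as an assumption rather than the authoritative shape:

```json
{
  "spec-id": 0,
  "content": "data",
  "file-path": "s3://bucket/warehouse/db/table/data/00000-0.parquet",
  "file-format": "parquet",
  "partition": [42],
  "file-size-in-bytes": 1024,
  "record-count": 100
}
```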
These model changes stem from the conversation here: #9292
cc @jackye1995 @rdblue @danielcweeks @rahil-c
Testing

OpenAPI: ran `make install`, `make lint`, and `make generate`.