Spec: add sort order to spec #2055

yyanyy · 2021-01-09T01:19:31Z

Add sort order descriptions to the spec

site/docs/spec.md

rdblue · 2021-01-11T18:39:20Z

site/docs/spec.md

@@ -480,6 +499,8 @@ Table metadata consists of the following fields:
 | _optional_ | _optional_ | **`current-snapshot-id`**| `long` ID of the current table snapshot. |
 | _optional_ | _optional_ | **`snapshots`**| A list of valid snapshots. Valid snapshots are snapshots for which all data files exist in the file system. A data file must not be deleted from the file system until the last snapshot in which it was listed is garbage collected. |
 | _optional_ | _optional_ | **`snapshot-log`**| A list (optional) of timestamp and snapshot ID pairs that encodes changes to the current snapshot for the table. Each time the current-snapshot-id is changed, a new entry should be added with the last-updated-ms and the new current-snapshot-id. When snapshots are expired from the list of valid snapshots, all entries before a snapshot that has expired should be removed. |
+| _optional_ | _optional_ | **`sort-orders`**| A list of sort orders, stored as full sort order objects. |
+| _optional_ | _optional_ | **`default-sort-order-id`**| Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. |


@aokolnychyi, do you think that sort order information should be optional in v2?

I think that we need to clarify this a bit more. Sort order is optional when v2 is reading because v1 metadata is still valid. But a v2 writer must always write the sort-orders and default-sort-order-id fields. I think we need to change this to required for v2 and make sure there is text stating that these requirements are for writers and that v2 readers must accept all valid v1 metadata.

I think the last sentence in the current formatv2 section basically coverers the v2 readers accepting valid v1 metadata part. I'll add sort-orders and default-sort-order-id to the list of this section for now, and do you think if we should move this section/add more explanation about this in other places?

Yes, since that section is intended to clarify, not specify. That doesn't need to be part of this PR, though.

jackye1995 · 2021-01-11T21:35:57Z

site/docs/spec.md

+|Field|JSON representation|Possible values|
+|--- |--- |--- |
+|**`direction`**|`JSON string`|`"asc", "desc"`|
+|**`null-order`**|`JSON string`|`"nulls-first", "nulls-last"`|


how is NaN ordered? In java, Double.NaN is considered to be equal to itself and greater than all other double values including Double.POSITIVE_INFINITY. If this is the definition we follow, we should document it here. Otherwise something similar like a NaNOrder class should be defined in sort order.

Thanks, I think this is a good point I missed. I'm hoping to treat NaN as the largest value by default, but I'm not familiar with engines' implementations to say if it's something that could easily be achieved without much effort. @rdblue @aokolnychyi Do you have any recommendation to this? Thanks!

The Float.compare implementation in Java will sort NaN larger than any other value and -NaN smaller than any other value. I think we should go with that.

public static int compare(float f1, float f2) { if (f1 < f2) return -1; // Neither val is NaN, thisVal is smaller if (f1 > f2) return 1; // Neither val is NaN, thisVal is larger // Cannot use floatToRawIntBits because of possibility of NaNs. int thisBits = Float.floatToIntBits(f1); int anotherBits = Float.floatToIntBits(f2); return (thisBits == anotherBits ? 0 : // Values are equal (thisBits < anotherBits ? -1 : // (-0.0, 0.0) or (!NaN, NaN) 1)); // (0.0, -0.0) or (NaN, !NaN) }

I think that produces: -NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN

Yeah this order sounds good to me

A bit more on this: the IEEE 754 spec is quoted in this Rust issue: rust-lang/rust#5585

It confirms that -NaN should sort before all values, and +NaN should sort after all values. So we can note in the doc that floating point ordering should follow the IEEE 754 totalOrder predicate.

site/docs/spec.md

rdblue · 2021-01-16T20:10:46Z

site/docs/spec.md

+
+Order id `0` is reserved for the unsorted order. 
+
+A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.


Looks good!

site/docs/spec.md

rdblue · 2021-01-16T20:11:49Z

Thanks @yyanyy! This looks very close to being ready.

rdblue · 2021-01-22T19:44:41Z

Merged! Thanks @yyanyy! And thanks to @jackye1995 for helping to review!

findepi · 2021-07-29T12:05:32Z

site/docs/spec.md

+
+Order id `0` is reserved for the unsorted order. 
+
+Sorting floating-point numbers should produce the following behavior: `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. This aligns with the implementation of Java floating-point types comparisons. 


The -NaN is a bit ambiguous:

it can be read as a result of applying unary minus to a NaN value. In Java, -Double.NaN is not distinguishable from Double.NaN (is exact same value bitwise).

my JVM returns true from Double.doubleToRawLongBits(-Double.NaN) == Double.doubleToRawLongBits(Double.NaN)

it can be read as a "a IEEE 754 NaN value that has a sign bit set". For exampe

for example, in my JVM, Double.longBitsToDouble(0xfff8000000000000L) is such value, if I read this correctly. and so, Double.isNaN(Double.longBitsToDouble(0xfff8000000000000L)) is true (while also Double.doubleToRawLongBits(Double.longBitsToDouble(0xfff8000000000000L)) != Double.doubleToRawLongBits(Double.NaN) is true, i.e. this expression constructs a NaN value that;s bitwise distinguishable from Double.NaN)

In both cases however, "Java floating-point types comparisons" seems to sort all NaN values as peers, and "greater" than positive infinity:

System.out.println(Double.compare(Double.POSITIVE_INFINITY, Double.NaN)); // -1 System.out.println(Double.compare(Double.NaN, -Double.NaN)); // 0 System.out.println(Double.compare(Double.NaN, Double.longBitsToDouble(0xfff8000000000000L))); // 0 System.out.println(Double.compare(Double.NaN, Double.longBitsToDouble(0xfff8000012340000L))); // 0

if you want to tinker with this further, you can find https://github.com/findepi/urandom-bits/blob/417d908ebf5ef43f2d0c5187a5d8914b0139d8f6/src/main/java/findepi/java/NanOrder.java a useful starting point.

if your thinking matches mine, here is the PR #2891

Thanks for the input! You are right that in Java -Double.NaN equals Double.NaN so I think we should drop the mentioning of -NaN in spec. For the interpretation of having the sign bit set, I did some quick reading and it seems like NaN can be represented by a lot of ways (from here it says anything that looks like x111 1111 1axx xxxx xxxx xxxx xxxx xxxx where x means anything would be NaN), and the sign bit for NaN seems to have no meaning. In this case I think we probably want to avoid mentioning -NaN in spec completely.

@yyanyy you're right -- from representation perspective, NaN value still has a sign bit that can be set or unset, but this indeed has no meaning, at least in Java

add sort order to spec

9d8037c

github-actions bot added the docs label Jan 9, 2021

yyanyy changed the title ~~add sort order to spec~~ Spec: add sort order to spec Jan 9, 2021

rdblue reviewed Jan 11, 2021

View reviewed changes

site/docs/spec.md Outdated Show resolved Hide resolved

rdblue reviewed Jan 11, 2021

View reviewed changes

site/docs/spec.md Outdated Show resolved Hide resolved

rdblue reviewed Jan 11, 2021

View reviewed changes

jackye1995 reviewed Jan 11, 2021

View reviewed changes

update based on comments

796a91b

rdblue reviewed Jan 16, 2021

View reviewed changes

site/docs/spec.md Outdated Show resolved Hide resolved

rdblue reviewed Jan 16, 2021

View reviewed changes

site/docs/spec.md Outdated Show resolved Hide resolved

jackye1995 mentioned this pull request Jan 16, 2021

API: handle NaN as min/max stats in evaluators #2069

Merged

update based on comments

1c76ebd

rdblue approved these changes Jan 22, 2021

View reviewed changes

Merge branch 'master' into sort-order-spec

01080d3

rdblue merged commit ac9e722 into apache:master Jan 22, 2021

jackye1995 mentioned this pull request Feb 2, 2021

Doc: improve description for sort order fields in spec #2201

Closed

XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021

Spec: Add sort order to spec (apache#2055)

5b40137

findepi reviewed Jul 29, 2021

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spec: add sort order to spec #2055

Spec: add sort order to spec #2055

yyanyy commented Jan 9, 2021

rdblue Jan 11, 2021

rdblue Jan 16, 2021 •

edited

Loading

yyanyy Jan 20, 2021

rdblue Jan 22, 2021

jackye1995 Jan 11, 2021

yyanyy Jan 12, 2021

rdblue Jan 16, 2021

jackye1995 Jan 16, 2021

rdblue Jan 22, 2021

rdblue Jan 16, 2021

rdblue commented Jan 16, 2021

rdblue commented Jan 22, 2021

findepi Jul 29, 2021

findepi Jul 29, 2021

findepi Jul 29, 2021

yyanyy Jul 29, 2021

findepi Jul 30, 2021


		Order id `0` is reserved for the unsorted order.

		A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.


		Order id `0` is reserved for the unsorted order.

		Sorting floating-point numbers should produce the following behavior: `-NaN` < `-Infinity` < `-value` < `-0` < `0` < `value` < `Infinity` < `NaN`. This aligns with the implementation of Java floating-point types comparisons.

Spec: add sort order to spec #2055

Spec: add sort order to spec #2055

Conversation

yyanyy commented Jan 9, 2021

Choose a reason for hiding this comment

rdblue Jan 16, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdblue commented Jan 16, 2021

rdblue commented Jan 22, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdblue Jan 16, 2021 •

edited

Loading