Avoid floating point number ordering NaN semantics
This patch prohibits the use of NaNs in ordering semantics for
floating point numbers, including in sort_columns, lower_bounds,
lower_bound, upper_bounds, and upper_bound. It additionally requires
that those fields respect the IEEE 754 totalOrder predicate, which
defines negative zero as being ordered before positive zero.
That requirement will be invisible on the read path for processes
that use the numeric less-than, rather than totalOrder, since the
numeric comparators consider negative zero as ordered neither before
nor after positive zero.
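
The difference between the two comparisons can be illustrated with a minimal Python sketch. The helper name `total_order_key` is hypothetical and not part of the patch. The sketch assumes NaN has already been excluded, as the patch requires, and relies on the fact that for non-NaN binary floats numeric order and `totalOrder` differ only at zero.

```python
import math


def total_order_key(x: float) -> tuple:
    """Sort key that places -0.0 before +0.0 for non-NaN floats.

    Hypothetical helper for illustration only.
    """
    if math.isnan(x):
        raise ValueError("NaN must not take part in ordering")
    # Numeric order already matches totalOrder everywhere except at zero,
    # so the sign bit is used only as a tie-breaker: negative sign sorts first.
    sign_rank = 0 if math.copysign(1.0, x) < 0 else 1
    return (x, sign_rank)


# The numeric comparators see the two zeros as equal ...
numeric_equal = (-0.0 == 0.0)
numeric_less = (-0.0 < 0.0)
# ... while the totalOrder-style key orders -0.0 strictly first.
total_less = total_order_key(-0.0) < total_order_key(0.0)
ordered = sorted([0.0, 1.0, -0.0, -1.0], key=total_order_key)
```

A reader that sorts or filters with the plain `<` operator never observes the tie-breaker, which is why the requirement is invisible on that read path.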
site/docs/spec.md: 12 additions & 9 deletions
@@ -206,19 +206,22 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 |**`104 file_size_in_bytes`**|`long`| Total file size in bytes |
 |~~**`105 block_size_in_bytes`**~~|`long`|**Deprecated. Always write a default value and do not read.**|
 |**`106 file_ordinal`**|`optional int`| Ordinal of the file w.r.t files with the same partition tuple and snapshot id |
-|**`107 sort_columns`**|`optional list`| Columns the file is sorted by |
+|**`107 sort_columns`**|`optional list`| Columns the file is sorted by [2]. If a column has type `float` or `double` and contains `NaN`, it must not be in `sort_columns`.|
 |**`108 column_sizes`**|`optional map`| Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro). |
 |**`109 value_counts`**|`optional map`| Map from column id to number of values in the column (including null values) |
 |**`110 null_value_counts`**|`optional map`| Map from column id to number of null values in the column |
 |~~**`111 distinct_counts`**~~|`optional map`|**Deprecated. Do not use.**|
-|**`125 lower_bounds`**|`optional map<126: int, 127: binary>`| Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file. |
-|**`128 upper_bounds`**|`optional map<129: int, 130: binary>`| Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file. |
+|**`112 nan_value_counts`**|`optional map`| Map from column id to number of NaN values in the column |
+|**`125 lower_bounds`**|`optional map<126: int, 127: binary>`| Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file. [3]|
+|**`128 upper_bounds`**|`optional map<129: int, 130: binary>`| Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-NaN values in the column for the file. [3]|
 |**`131 key_metadata`**|`optional binary`| Implementation-specific key metadata for encryption |
 |**`132 split_offsets`**|`optional list`| Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending. |
 
 Notes:
 
 1. Single-value serialization for lower and upper bounds is detailed in Appendix D.
+2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate.
+3. Just as for `float` or `double` columns in `sort_columns`, `-0.0` is considered to be strictly less than `+0.0`, following IEEE 754's `totalOrder` predicate.
 
 The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec for the manifest file.
 
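
As a rough sketch of how a writer might derive the per-column metrics changed in this hunk, the Python below computes value, null, and NaN counts plus bounds for one `float`/`double` column. The function name `column_metrics` and the returned dictionary layout are hypothetical and are not the spec's serialized form (bounds are left as Python floats rather than the binary encoding from Appendix D). Returning `None` for both bounds when the column has no non-null, non-NaN values is an assumption of this sketch.

```python
import math
from typing import Dict, Optional, Sequence


def _total_order_key(x: float) -> tuple:
    # Tie-break equal values by sign so that -0.0 sorts before +0.0.
    return (x, 0 if math.copysign(1.0, x) < 0 else 1)


def column_metrics(values: Sequence[Optional[float]]) -> Dict[str, object]:
    """Hypothetical per-column metrics for one float/double column."""
    null_count = 0
    nan_count = 0
    ordered = []  # only non-null, non-NaN values take part in the bounds
    for v in values:
        if v is None:
            null_count += 1
        elif math.isnan(v):
            nan_count += 1
        else:
            ordered.append(v)
    lower = min(ordered, key=_total_order_key) if ordered else None
    upper = max(ordered, key=_total_order_key) if ordered else None
    return {
        "value_count": len(values),  # includes null values
        "null_value_count": null_count,
        "nan_value_count": nan_count,
        "lower_bound": lower,
        "upper_bound": upper,
    }


metrics = column_metrics([0.0, None, float("nan"), -0.0, 2.5])
```

With both zeros present, the lower bound comes out as `-0.0` rather than `+0.0`, which is the behaviour note 3 asks for.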
@@ -296,16 +299,16 @@ Manifest list files store `manifest_file`, a struct with the following fields:
 
 `field_summary` is a struct with the following fields
 |**`509 contains_null`**|`boolean`| Whether the manifest contains at least one partition with a null value for the field. |
+|**`510 lower_bound`**|`optional bytes`[1]| Lower bound for the non-null, non-NaN values in the partition field. If present, must be null if all values are null and NaN if all non-null values are NaN. [2]|
+|**`511 upper_bound`**|`optional bytes`[1]| Upper bound for the non-null, non-NaN values in the partition field. If present, must be null if all values are null and NaN if all non-null values are NaN. [2]|
 
 Notes:
 
 1. Lower and upper bounds are serialized to bytes using the single-object serialization in Appendix D. The type of used to encode the value is the type of the partition field data.
-
+2. If -0.0 is a value of the partition field, the `lower_bound` must not be +0.0, and if +0.0 is a value of the partition field, the `upper_bound` must not be -0.0.
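
The `field_summary` rules above can be sketched in Python as follows. The function name `summarize_partition_field` and its tuple return shape are hypothetical, and the bounds are kept as Python floats instead of the serialized bytes the spec describes. It is only one reading of the added text: null bounds when every value is null, NaN bounds when every non-null value is NaN, and otherwise the minimum and maximum under an ordering that puts -0.0 before +0.0.

```python
import math
from typing import Iterable, Optional, Tuple


def _total_order_key(x: float) -> tuple:
    # Tie-break equal values by sign so that -0.0 sorts before +0.0.
    return (x, 0 if math.copysign(1.0, x) < 0 else 1)


def summarize_partition_field(
    values: Iterable[Optional[float]],
) -> Tuple[bool, Optional[float], Optional[float]]:
    """Return (contains_null, lower_bound, upper_bound) for one partition field."""
    all_values = list(values)
    non_null = [v for v in all_values if v is not None]
    contains_null = len(non_null) < len(all_values)
    if not non_null:
        # All values are null: both bounds are null.
        return contains_null, None, None
    ordered = [v for v in non_null if not math.isnan(v)]
    if not ordered:
        # All non-null values are NaN: both bounds are NaN.
        return contains_null, math.nan, math.nan
    lower = min(ordered, key=_total_order_key)
    upper = max(ordered, key=_total_order_key)
    return contains_null, lower, upper


has_null, lower, upper = summarize_partition_field([0.0, -0.0, None])
```

Because the sign is used as a tie-breaker, a field containing both zeros never gets `+0.0` as its lower bound or `-0.0` as its upper bound, which satisfies note 2.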