Commit efd02a5

Avoid floating point number ordering NaN semantics
This patch prohibits the use of NaNs in ordering semantics for floating point numbers, including in sort_columns, lower_bounds, lower_bound, upper_bounds, and upper_bound. It additionally requires that those fields respect the IEEE 754 totalOrder predicate, which defines negative zero as being ordered before positive zero. That requirement will be invisible on the read path for processes that use the numeric less-than, rather than totalOrder, since the numeric comparators consider negative zero as ordered neither before nor after positive zero.
1 parent 77a456a commit efd02a5
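
The distinction the commit message draws between numeric less-than and `totalOrder` can be illustrated with a minimal Python sketch. The helper `total_order_key` is a hypothetical name introduced here, not part of the patch or of any Iceberg library; it assumes 64-bit IEEE 754 doubles and maps each value to an integer whose natural order agrees with `totalOrder`.

```python
import struct

def total_order_key(x: float) -> int:
    """Hypothetical helper: integer key whose order matches IEEE 754 totalOrder."""
    # Reinterpret the 64-bit pattern of the double as a signed integer.
    (bits,) = struct.unpack(">q", struct.pack(">d", x))
    # Negative floats have the sign bit set; flipping the other 63 bits
    # makes larger magnitudes sort lower, as totalOrder requires.
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits

# Numeric comparison: neither zero is ordered before the other.
numeric_less = -0.0 < 0.0
numeric_equal = -0.0 == 0.0

# totalOrder-style comparison: negative zero strictly precedes positive zero.
total_less = total_order_key(-0.0) < total_order_key(0.0)
```

A reader that compares with the numeric operators sees the two zeros as equal, so a writer that follows `totalOrder` remains compatible with it.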

1 file changed: +12 -9 lines changed

site/docs/spec.md

Lines changed: 12 additions & 9 deletions
@@ -206,19 +206,22 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | **`104 file_size_in_bytes`** | `long` | Total file size in bytes |
 | ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default value and do not read.** |
 | **`106 file_ordinal`** | `optional int` | Ordinal of the file w.r.t files with the same partition tuple and snapshot id |
-| **`107 sort_columns`** | `optional list` | Columns the file is sorted by |
+| **`107 sort_columns`** | `optional list` | Columns the file is sorted by [2]. If a column has type `float` or `double` and contains `NaN`, it must not be in `sort_columns`. |
 | **`108 column_sizes`** | `optional map` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro). |
 | **`109 value_counts`** | `optional map` | Map from column id to number of values in the column (including null values) |
 | **`110 null_value_counts`** | `optional map` | Map from column id to number of null values in the column |
 | ~~**`111 distinct_counts`**~~ | `optional map` | **Deprecated. Do not use.** |
-| **`125 lower_bounds`** | `optional map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all values in the column for the file. |
-| **`128 upper_bounds`** | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file. |
+| **`112 nan_value_counts`** | `optional map` | Map from column id to number of NaN values in the column |
+| **`125 lower_bounds`** | `optional map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file. [3] |
+| **`128 upper_bounds`** | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-NaN values in the column for the file. [3] |
 | **`131 key_metadata`** | `optional binary` | Implementation-specific key metadata for encryption |
 | **`132 split_offsets`** | `optional list` | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending. |
 
 Notes:
 
 1. Single-value serialization for lower and upper bounds is detailed in Appendix D.
+2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate.
+3. Just as for `float` or `double` columns in `sort_columns`, `-0.0` is considered to be strictly less than `+0.0`, following IEEE 754's `totalOrder` predicate.
 
 The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec for the manifest file.
 
@@ -296,16 +299,16 @@ Manifest list files store `manifest_file`, a struct with the following fields:
 
 `field_summary` is a struct with the following fields
 
-| Field id, name | Type | Description |
-|-------------------------|-------------------------|---------------------------------------------------------------------------------------------|
-| **`509 contains_null`** | `boolean` | Whether the manifest contains at least one partition with a null value for the field. |
-| **`510 lower_bound`** | `optional bytes` [1] | Lower bound for the non-null values in the partition field, or null if all values are null. |
-| **`511 upper_bound`** | `optional bytes` [1] | Upper bound for the non-null values in the partition field, or null if all values are null. |
+| Field id, name | Type | Description |
+|-------------------------|-------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **`509 contains_null`** | `boolean` | Whether the manifest contains at least one partition with a null value for the field. |
+| **`510 lower_bound`** | `optional bytes` [1] | Lower bound for the non-null, non-NaN values in the partition field. If present, must be null if all values are null and NaN if all non-null values are NaN. [2] |
+| **`511 upper_bound`** | `optional bytes` [1] | Upper bound for the non-null, non-NaN values in the partition field. If present, must be null if all values are null and NaN if all non-null values are NaN. [2] |
 
 Notes:
 
 1. Lower and upper bounds are serialized to bytes using the single-object serialization in Appendix D. The type of used to encode the value is the type of the partition field data.
-
+2. If -0.0 is a value of the partition field, the `lower_bound` must not be +0.0, and if +0.0 is a value of the partition field, the `upper_bound` must not be -0.0.
 
 ### Table Metadata
 
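
The bounds rules introduced in the diff above (ignore nulls and NaNs, count NaNs separately, and treat `-0.0` as strictly less than `+0.0`) can be sketched in Python as follows. The function `float_column_stats`, its return keys, and `total_order_key` are hypothetical names chosen for illustration, not Iceberg APIs. The sketch follows the `manifest_entry` wording and returns no bound when a column has no non-null, non-NaN values; the `field_summary` rule that a bound is NaN when all non-null values are NaN is not modeled.

```python
import math
import struct

def total_order_key(x: float) -> int:
    """Hypothetical helper: integer key whose order matches IEEE 754 totalOrder."""
    (bits,) = struct.unpack(">q", struct.pack(">d", x))
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF  # reverse the order of negative values
    return bits

def float_column_stats(values):
    """Hypothetical sketch: null count, NaN count, and bounds for one float column."""
    non_null = [v for v in values if v is not None]
    # Only non-null, non-NaN values take part in the bounds.
    ordered = [v for v in non_null if not math.isnan(v)]
    return {
        "null_count": len(values) - len(non_null),
        "nan_count": len(non_null) - len(ordered),
        # Ordering by the totalOrder key keeps -0.0 below +0.0.
        "lower": min(ordered, key=total_order_key) if ordered else None,
        "upper": max(ordered, key=total_order_key) if ordered else None,
    }

stats = float_column_stats([0.0, None, float("nan"), -0.0, 1.5])
```

Because the minimum and maximum are taken under the `totalOrder` key, a column containing both zeros yields `-0.0` as its lower bound and never `-0.0` as its upper bound, which is what note 2 of `field_summary` asks for.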