Commit

update lower_bounds and upper_bounds
szehon-ho committed Dec 21, 2024
1 parent 71a1992 commit 314c687
Showing 1 changed file with 9 additions and 6 deletions.
15 changes: 9 additions & 6 deletions format/spec.md
@@ -589,8 +589,8 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
| _optional_ | _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column |
| _optional_ | _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column |
| _optional_ | _optional_ | _optional_ | **`111 distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts |
-| _optional_ | _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2]. See [7][9] for`geometry` and `geography`. |
-| _optional_ | _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2]. See [8][9] for `geometry` and `geography`. |
+| _optional_ | _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2]. See [7][11] for`geometry` and [8][12] for `geography`. |
+| _optional_ | _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2]. See [9][11] for `geometry` and [10][12] for `geography`. |
| _optional_ | _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption |
| _optional_ | _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>` | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending |
| | _optional_ | _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file |
@@ -608,10 +608,13 @@ Notes:
4. Position delete metadata can use `referenced_data_file` when all deletes tracked by the entry are in a single data file. Setting the referenced file is required for deletion vectors.
5. The `content_offset` and `content_size_in_bytes` fields are used to reference a specific blob for direct access to a deletion vector. For deletion vectors, these values are required and must exactly match the `offset` and `length` stored in the Puffin footer for the deletion vector blob.
6. The following field ids are reserved on `data_file`: 141.
-7. `geometry` and `geography`, this is a point: X = westernmost bound of all geometries in file, Y = northernmost bound of all geometries in file, Z is min value for all component points of all geometries in the file, M is min value of all component points of all geometries in the file. See Appendix D for encoding.
-8. `geometry` and `geography`, this is a point: X = easternmost bound of all geometries in file, Y = southernmost bound of all geometries in file, Z is max value for all component points of all geometries in the file, M is max value of all component points of all geometries in the file. See Appendix D for encoding.
-9. `geometry` and `geography`, the concepts of westernmost and easternmost values are explicitly introduced to address cases involving anti-meridian crossing, where the `lower_bound` may be greater than `upper_bound`. For `geography`, the canonical ranges for the bounding box covering all points in the coordinate system is [-180 180] for the west-east range and [-90 90] for the south-north range.
-10. The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec.
+7. `geometry`: this is a point: X, Y, Z, and M are the lower bound of all component points of all geometries in file.
+8. `geography`: this is a point: X = westernmost bound of all geometries in file, Y = northernmost bound of all geometries in file, Z and M are the min value for all component points of all geometries in the file.
+9. `geometry`: this is a point: X, Y, Z, and M take the upper bound of all component points of all geometries in file.
+10. `geography`: this is a point: X = easternmost bound of all geometries in file, Y = southernmost bound of all geometries in file, Z and M are the max value for all component points of all geometries in the file.
+11. `geometry`: For the X value only, the lower_bound's X value (xmin) may be greater than the upper_bound's X value, and a geometry in the file may match if it contains an X such that `x >= xmin` OR `x <= xmax`. Note this definition is agnostic to coordinate system.
+12. `geography`, the concepts of westernmost, easternmost, northernmost, and southernmost are explicitly introduced to address cases involving anti-meridian crossing, implying that the X and Y values of `lower_bound` may be greater than `upper_bound`. The canonical ranges for these points in the coordinate system is [-180 180] for west-east and [-90 90] for the south-north.
+13. The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec.

The column metrics maps are used when filtering to select both data and delete files. For delete files, the metrics must store bounds and counts for all deleted rows, or must be omitted. Storing metrics for deleted rows ensures that the values can be used during job planning to find delete files that must be merged during a scan.
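
The wraparound rule in the new note 11 (a file may match when `x >= xmin` OR `x <= xmax` if `xmin` is greater than `xmax`) can be illustrated with a small sketch. This is not taken from the spec or from any Iceberg implementation: the function names `x_may_be_in_file` and `x_range_may_match` are hypothetical, the bounds are assumed to be already decoded into plain floats, and the query interval is assumed not to wrap around itself.

```python
def x_may_be_in_file(x: float, xmin: float, xmax: float) -> bool:
    """Return True if a single X value may occur in a file with these X bounds.

    xmin is the X of lower_bounds and xmax is the X of upper_bounds
    (hypothetical, already-decoded floats).
    """
    if xmin <= xmax:
        # Ordinary case: the bounds describe one contiguous interval.
        return xmin <= x <= xmax
    # Wraparound case: xmin > xmax, so the covered set is
    # everything at or above xmin plus everything at or below xmax.
    return x >= xmin or x <= xmax


def x_range_may_match(xmin: float, xmax: float,
                      query_lo: float, query_hi: float) -> bool:
    """Return True if the query interval [query_lo, query_hi] may overlap the file.

    Assumes query_lo <= query_hi (the query itself does not wrap).
    """
    if query_lo > query_hi:
        raise ValueError("this sketch only handles non-wrapping query intervals")
    if xmin <= xmax:
        # Standard interval-overlap test.
        return query_lo <= xmax and query_hi >= xmin
    # Wraparound: overlap with either the upper piece or the lower piece.
    return query_hi >= xmin or query_lo <= xmax
```

With bounds `xmin = 170` and `xmax = -170`, a file would be kept for a query around X = 175 or X = -175 and could be skipped for a query around X = 0. A planner that does not recognize the `xmin > xmax` case would treat the interval as empty and wrongly prune such files.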
