PARQUET-2471: Add geometry logical type #240

wgtmac · 2024-05-10T14:56:04Z

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet can support geometry type natively.

jiayuasu · 2024-05-10T20:13:12Z

@wgtmac Thanks for the work. On the other hand, I'd like to highlight that GeoParquet (https://github.com/opengeospatial/geoparquet/tree/main) has been around for a while, and a lot of geospatial software has started to support reading and writing it.

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

jiayuasu · 2024-05-10T20:15:13Z

Geo Iceberg does not need to conform to GeoParquet because people should not directly use a parquet reader to read iceberg parquet files anyways. So that's a separate story.

wgtmac · 2024-05-11T01:23:58Z

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

@jiayuasu That's why I asked about the possibility of direct compliance with the GeoParquet spec in the Iceberg design doc. I don't intend to create a new spec. Instead, it would be good if the proposal here could meet the requirements of both Iceberg and GeoParquet, or share the common parts so that conversion between Iceberg Parquet and GeoParquet is lightweight. We do need advice from the GeoParquet community to make this possible.

szehon-ho

From the Iceberg side, I am excited about this. I think defining the type formally in parquet-format will make geospatial interop easier in the long run, and it will also unlock row group filtering, for example for Iceberg's add_file for Parquet files. Perhaps there can be conversion utilities for GeoParquet if we go ahead with this, and I would definitely like to see what they think.

I'm new to the Parquet side, so I had some questions.

src/main/thrift/parquet.thrift

pitrou · 2024-05-15T08:24:29Z

@paleolimbot is quite knowledgeable on the topic and could probably give useful feedback.

pitrou · 2024-05-15T08:36:13Z

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

paleolimbot

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

In reading this I do wonder if there should just be an extension mechanism here instead of attempting to enumerate all possible encodings in this repo. The people that are engaged and working on implementations are the right people to engage here, which is why GeoParquet and GeoArrow have been successful (we've engaged the people who care about this, and they are generally not paying attention to apache/parquet-format nor apache/arrow).

There are a few things that this PR solves in a way that might not be possible using EXTENSION, which is that of column statistics. It would be nice to have some geo-specific things there (although maybe that can also be part of the extension mechanism). Another thing that comes up frequently is where to put a spatial index (rtree)...I don't think there's any good place for that at the moment.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata. The things we do in the GeoParquet standard (store bounding boxes, refer to columns by name) become stale given the ways that schema metadata is typically propagated through projections and concatenations.
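To make the staleness concern concrete, here is a minimal sketch. It assumes the schema-level `geo` key layout as I understand it from GeoParquet 1.0 (a JSON string with a `columns` map keyed by column name); the sample values and the helper `stale_geo_columns` are hypothetical and only for illustration.

```python
import json

# Illustrative schema-level metadata, keyed by column name (assumed layout).
geo = {
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Point"],
            "bbox": [-180.0, -90.0, 180.0, 90.0],
        }
    },
}
file_metadata = {"geo": json.dumps(geo)}


def stale_geo_columns(file_metadata, schema_column_names):
    """Return names listed in the 'geo' metadata that are missing from the schema."""
    columns = json.loads(file_metadata["geo"])["columns"]
    return sorted(set(columns) - set(schema_column_names))
```

A tool that renames `geometry` to `geom` but copies the key-value metadata through unchanged leaves `stale_geo_columns(file_metadata, ["id", "geom"])` reporting `["geometry"]`. The copied `bbox` has the same problem after a filter: nothing ties it to the rows that remain.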

src/main/thrift/parquet.thrift

wgtmac · 2024-05-17T15:46:24Z

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

@pitrou Yes, that might be an option. Then we can perhaps use the same json string defined in the iceberg doc. @jiayuasu @szehon-ho WDYT?

EDIT: I think we can remove informative attributes such as subtype, orientation, and edges. Perhaps encoding can be removed as well if we only support WKB. Dimension is something that we must be aware of, because we need to build the bbox, which depends on whether the coordinates are represented as xy, xyz, xym, or xyzm.
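As a rough illustration of why the dimension matters for the bbox, the sketch below decodes ISO WKB points and accumulates per-axis bounds. It assumes the ISO WKB type-code convention (Point is 1, with +1000 for Z, +2000 for M, +3000 for ZM) and handles points only; the helper names are hypothetical and not part of this PR.

```python
import struct

# ISO WKB type-code offsets mapped to the coordinate layout.
_DIMS = {0: "xy", 1000: "xyz", 2000: "xym", 3000: "xyzm"}


def decode_point(wkb):
    """Return (layout, coords) for an ISO WKB Point."""
    order = "<" if wkb[0] == 1 else ">"  # byte 0: 1 = little-endian, 0 = big-endian
    (code,) = struct.unpack_from(order + "I", wkb, 1)
    base, offset = code % 1000, code - code % 1000
    if base != 1:
        raise ValueError("only Point is handled in this sketch")
    layout = _DIMS[offset]
    coords = struct.unpack_from(order + "d" * len(layout), wkb, 5)
    return layout, coords


def bbox(points):
    """Per-axis (min, max) over WKB points; axes absent from the data are omitted."""
    box = {}
    for wkb in points:
        layout, coords = decode_point(wkb)
        for axis, value in zip(layout, coords):
            lo, hi = box.get(axis, (value, value))
            box[axis] = (min(lo, value), max(hi, value))
    return box


def encode_point(layout, *coords):
    """Little-endian ISO WKB Point, used here only to build sample input."""
    code = 1 + {v: k for k, v in _DIMS.items()}[layout]
    return struct.pack("<BI" + "d" * len(coords), 1, code, *coords)
```

The third ordinate of an `xyz` point and of an `xym` point occupies the same bytes, so only the dimension tells the writer whether it belongs in the Z range or the M range of the bounding box.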

wgtmac · 2024-05-17T15:54:38Z

Another thing that comes up frequently is where to put a spatial index (rtree)

I thought this can be something similar to the page index or bloom filter in parquet, which are stored somewhere between row groups or before the footer. It can be row group level or file level as well.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata.

I think we really need your advice here. If you were to rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge? @paleolimbot @jiayuasu

paleolimbot · 2024-05-17T19:48:56Z

If you rethink the design of GeoParquet, how can it do better if parquet format has some geospatial knowledge?

The main reason the schema-level metadata had to exist is that there was no way to put anything custom at the column level to give geometry-aware readers extra metadata about the column (CRS being the main one) and global column statistics (bbox). Bounding boxes at the feature level (worked around as a separate column) are the second somewhat ugly thing; they give reasonable row group statistics for many things people might want to store. It seems like this PR would solve most of that.

I am not sure that a new logical type will catch on to the extent that GeoParquet will, although I'm new to this community and I may be very wrong. The GeoParquet working group is enthusiastic and encodings/strategies for storing/querying geospatial datasets in a data lake context are evolving rapidly. Even though it is a tiny bit of a hack, using extra columns and schema-level metadata to encode these things is very flexible and lets implementations be built on top of a number of underlying readers/underlying versions of the Parquet format.

wgtmac · 2024-05-18T02:46:21Z

@paleolimbot I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial. For additional (informative) attributes of the geometry type, if some of them are stable and make sense to store them natively into parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems and at the same time it takes little effort to convert to GeoParquet.

Kontinuation · 2024-05-18T06:15:01Z

Another thing that comes up frequently is where to put a spatial index (rtree)

I thought this can be something similar to the page index or bloom filter in parquet, which are stored somewhere between row groups or before the footer. It can be row group level or file level as well.

The bounding-box-based sort order defined for the geometry logical type is already good enough for row-group-level and page-level data skipping. Spatial index such as R-tree may not be suitable for Parquet. I am aware that flatgeobuf has an optional static packed Hilbert R-tree index, but for the index to be effective, flatgeobuf supports random access to records and does not support compression. The minimal granularity of reading data in Parquet files is the data page, and pages are usually compressed, so it is impossible to access records within pages randomly.
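The data skipping described above can be sketched roughly as follows. This is a simplified 2D illustration, not the statistics layout proposed in the PR: `BBox` and `row_groups_to_read` are hypothetical names, and a missing bounding box is treated as "must read".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other):
        """True when the two boxes overlap or touch."""
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)


def row_groups_to_read(row_group_boxes, query):
    """Indices of row groups (or pages) that may contain matching geometries."""
    return [
        i for i, box in enumerate(row_group_boxes)
        if box is None or box.intersects(query)  # no stats: cannot skip
    ]
```

The same test applies at the page level; everything inside a surviving page still has to be decompressed and scanned, which is the granularity limit noted above.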

paleolimbot · 2024-05-20T02:43:39Z

I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet.

I agree! I think first-class geometry support is great and I'm happy to help wherever I can. I see GeoParquet as a way for existing spatial libraries to leverage Parquet and is not well-suited to Parquet-native things like Iceberg (although others working on GeoParquet may have a different view).

Extension mechanisms are nice because they allow an external community to hash out the discipline-specific details where these evolve at an orthogonal rate to that of the format (e.g., GeoParquet), which generally results in buy-in. I'm not familiar with the speed at which the changes proposed here can evolve (or how long it generally takes readers to implement them), but if @pitrou's suggestion of encoding the type information or statistics in serialized form makes it easier for this to evolve it could provide some of that benefit.

Spatial index such as R-tree may not be suitable for Parquet

I also agree here (but it did come up a lot of times in the discussions around GeoParquet). I think developers of Parquet-native workflows are well aware that there are better formats for random access.

paleolimbot · 2024-05-21T13:32:08Z

I think we really need your advice here. If you were to rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge?

I opened up opengeospatial/geoparquet#222 to collect some thoughts on this...we discussed it at our community call and I think we mostly just never considered that the Parquet standard would be interested in supporting a first-class data type. I've put my thoughts there but I'll let others add their own opinions.

src/main/thrift/parquet.thrift

jorisvandenbossche · 2024-05-21T15:20:13Z

Just to ensure my understanding is correct:

This is proposing to add a new logical type annotating the BYTE_ARRAY physical type. For readers that expect just such a BYTE_ARRAY column (e.g. existing GeoParquet implementations), that is compatible if the column would start having a logical type as well? (although I assume this might depend on how the specific parquet reader implementation deals with an unknown logical type, i.e. error about that or automatically fall back to the physical type).
For such "legacy" readers (just reading the WKB values from a binary column), the only thing that actually changes (apart from the logical type annotation) are the values of the statistics? Now, I assume that right now no GeoParquet reader is using the statistics of the binary column, because the physical statistics for BYTE_ARRAY ("unsigned byte-wise comparison") are essentially useless in the case those binary blobs represent WKB geometries. So again that should probably not give any compatibility issues?
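A small sketch of why the physical BYTE_ARRAY statistics say little about WKB geometries: it builds little-endian WKB points and takes the unsigned byte-wise min and max, which is how Python orders `bytes` objects. The helpers `wkb_point` and `x_of` are hypothetical and exist only for this demonstration.

```python
import struct


def wkb_point(x, y):
    """Little-endian WKB for a 2D point (byte order 1, type code 1)."""
    return struct.pack("<BIdd", 1, 1, x, y)


def x_of(wkb):
    """Read the x ordinate back out of a little-endian WKB point."""
    return struct.unpack_from("<d", wkb, 5)[0]


points = [wkb_point(x, 0.0) for x in (1.0, 2.0, -1.0)]

# bytes objects compare lexicographically as unsigned bytes,
# the same ordering used for BYTE_ARRAY min/max statistics.
byte_min, byte_max = min(points), max(points)
```

Here the byte-wise minimum is the point with x = 2.0 and the byte-wise maximum is the point with x = -1.0, the reverse of the spatial extent, so a reader cannot prune anything from those values.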

jorisvandenbossche · 2024-05-21T16:03:09Z

although I assume this might depend on how the specific parquet reader implementation deals with an unknown logical type, i.e. error about that or automatically fall back to the physical type

To answer this part myself, at least for the Parquet C++ implementation, it seems an error is raised for unknown logical types, and it doesn't fall back to the physical type. So that does complicate the compatibility story ..

wgtmac · 2024-05-21T16:09:38Z

@jorisvandenbossche I think your concern makes sense. It would be a bug if parquet-cpp fails on an unknown logical type, and we need to fix that. I also have a concern about a new ColumnOrder and need to do some testing. Adding a new logical type should not break anything for legacy readers.

jornfranke · 2024-05-21T19:55:14Z

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet can support geometry type natively.

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

szehon-ho · 2024-05-21T21:14:39Z

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

Yes, there is now a concrete proposal, apache/iceberg#10260, and the current plan is to bring it up in the next community sync.

cholmes · 2024-05-23T20:55:53Z

Thanks for doing this @wgtmac - it's awesome to see this proposal! I helped initiate GeoParquet, and hope we can fully support your effort.

@paleolimbot I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial.

That makes sense, but I think we're also happy to have GeoParquet replaced! As long as it can 'scale up' to meet all the crazy things that hard core geospatial people need, while also being accessible to everyone else. If Parquet had geospatial types from the start we wouldn't have started GeoParquet. We spent a lot of time and effort trying to get the right balance between making it easy to implement for those who don't care about the complexity of geospatial (edges, coordinate reference systems, epochs, winding), while also having the right options to handle it for those who do. My hope has been that the decisions we made there will make it easier to add geospatial support to any new format - like that a 'geo-ORC' could use the same fields and options that we added.

For additional (informative) attributes of the geometry type, if some of them are stable and make sense to store them natively into parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems and at the same time it takes little effort to convert to GeoParquet.

Sounds great! Happy to have GeoParquet be a place to 'try out' things. But I think ideally the surface area of 'GeoParquet' would be very minimal or not even exist, and that Parquet would just be the ideal format to store geospatial data in. And I think if we can align well between this proposal and GeoParquet that should be possible.

wgtmac · 2024-10-13T14:39:10Z

Update:

Added a section for the GEOMETRY type in LogicalTypes.md, as suggested by @gszadovszky. This makes LogicalTypes.md the single source of truth and greatly simplifies the addition to the Thrift file.
Removed GeometryStatistics from ColumnIndex for now, as we don't have any benchmark data to justify it. We can add it in a follow-up if necessary. cc @rdblue @emkornfield
Added an explanation of X/Y/Z/M values in LogicalTypes.md. cc @paleolimbot @emkornfield

wgtmac · 2024-10-13T14:43:21Z

Created two pull requests:

Replace the Edges enumeration values: wgtmac/parquet-format#6 for the renaming of edges enumeration values.

CRS encoding and permutation wgtmac/parquet-format#7 for the renaming of CRS type field as CRS encoding, and the addition of a permutation field.

@desruisseaux Thanks for opening two PRs to clarify them! FYI that I have just moved the spec from parquet.thrift to LogicalTypes.md in this PR. I'm not sure if you have time to refactor the above two PRs to reflect this change. Or is it better to discuss these topics in the GeoParquet community which might be a better place?

src/main/thrift/parquet.thrift

rdblue · 2024-10-14T22:57:47Z

src/main/thrift/parquet.thrift

+ * GeometryEncoding and Edges are required. CRS is optional.
+ *
+ * Once CRS is set, it MUST be a key to an entry in the `key_value_metadata`
+ * field of `FileMetaData`.


Why is it required that the CRS is embedded in file metadata? Isn't it clear if the CRS is a well-known one like OGC:CRS84? It seems to me that this resolution should be out of scope. Parquet can encourage that the CRS is documented in file metadata, but other systems could store the definition in a different location. For example, Iceberg could store this in a table property instead of in each data file.

I would prefer to define this string property as a "Coordinate reference system identifier" and not specify how to exchange the PROJJSON or other format definition. I would also add a note that people are encouraged to store it in a location along with the file or table metadata.

paleolimbot · 2024-10-15

A string property of "Coordinate reference system identifier" (with a convention, either within this spec or outside it, of where in the file to look for the full definition) would allow for enough detail for GeoSpatial libraries to leverage Parquet.

The need for embedding a full CRS description somewhere that is programmatically accessible by a Parquet implementation is to ensure a producer's intent can be faithfully transported to the consumer. In the C++ implementation we can attach this as extension type metadata that can pass through a pipeline to a consumer that does not have access to the original context (e.g., constructing a GeoPandas GeoDataFrame from a Parquet file that was read and filtered using a non-spatial tool like pyarrow). If that needs to be an external convention (e.g., one that we define in GeoParquet) to get consensus here, that is OK (even though I think having that convention in the Parquet specification itself would result in less misinterpreted data).

Alternatively, would removing any conventions or requirements around the string crs be acceptable? (i.e., the producer puts what it needs to put there to ensure that the coordinates in this column are not misinterpreted by the consumer, which may be an identifier or a full CRS definition according to the requirements of the producer?).

jiayuasu · 2024-10-16

@rdblue Although the GeoParquet community would really appreciate the possibility of embedding a full CRS description somewhere in Parquet, we understand that compromises need to be made sometimes.

Like @paleolimbot said, would it be acceptable if we removed any conventions or requirements around the string crs and only allowed this single value in the column metadata?

This means the writer can put whatever they want there, but they will need to communicate its meaning to the reader via another channel.

wgtmac · 2024-10-16

The need for embedding a full CRS description somewhere that is programmatically accessible by a Parquet implementation is to ensure a producer's intent can be faithfully transported by the consumer.

To achieve this, is it possible to reserve some crs values or at least some prefixes? For example, Iceberg may store iceberg.xxx to crs where xxx is an arbitrary crs identifier defined in its table metadata. Similarly, GeoParquet may set geoparquet.xxx to crs and the key must exist in the Parquet file metadata and its associated value is the full CRS.

This still causes fragmentation but it looks better than a strong enforcement. WDYT? @rdblue @jiayuasu @paleolimbot @szehon-ho
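A minimal sketch of how a reader might resolve the `crs` string under the prefix idea floated above. Nothing here is specified by the PR: the function `resolve_crs`, the plain-dict metadata arguments, and the treatment of unprefixed values as opaque identifiers are all assumptions made for illustration.

```python
def resolve_crs(crs, file_metadata, table_metadata=None):
    """Resolve a column's crs string using the hypothetical prefix convention."""
    if crs.startswith("geoparquet."):
        # The whole string is a key in the Parquet file's key_value_metadata,
        # and its value is the full CRS definition.
        if crs not in file_metadata:
            raise KeyError(f"{crs!r} not found in file metadata")
        return file_metadata[crs]
    if crs.startswith("iceberg."):
        # The suffix is an identifier defined in the table metadata.
        name = crs[len("iceberg."):]
        if table_metadata is None or name not in table_metadata:
            raise KeyError(f"{name!r} not found in table metadata")
        return table_metadata[name]
    # No reserved prefix: hand the value back untouched (e.g. "OGC:CRS84").
    return crs
```

For example, `resolve_crs("OGC:CRS84", {})` returns the identifier unchanged, while `resolve_crs("geoparquet.crs0", {})` raises because the promised key is absent. The fragmentation trade-off is visible in the two branches: each ecosystem needs its own lookup path.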

src/main/thrift/parquet.thrift

rdblue · 2024-10-14T23:05:07Z

src/main/thrift/parquet.thrift

@@ -286,6 +313,9 @@ struct Statistics {
 7: optional bool is_max_value_exact;
 /** If true, min_value is the actual minimum value for a column */
 8: optional bool is_min_value_exact;
+
+ /** statistics specific to geometry logical type */
+ 9: optional GeometryStatistics geometry_stats;


I don't think this is the right place to include GeometryStatistics. There are a couple reasons:

This is unnecessary nesting to get to the geo stats, making them harder to find

Nesting within Statistics includes geo stats in places where simple Statistics make sense, but geo stats do not. For example, this would be included in page headers in addition to ColumnMetaData (and the page index already removed these)

This doesn't match the approach used for SizeStatistics, which was included directly in ColumnMetaData

I think that this should match the addition of SizeStatistics and should be included as a field in ColumnMetaData.

wgtmac · 2024-10-16

I agree. Actually, it was a little awkward to nest the geo stats inside the common stats when I did the PoC implementation. I have moved them to ColumnMetaData now.

src/main/thrift/parquet.thrift

wgtmac force-pushed the geo branch from 4d36df9 to ad29afd Compare May 10, 2024 15:01

szehon-ho reviewed May 11, 2024

wgtmac marked this pull request as ready for review May 11, 2024 16:13

wgtmac changed the title ~~WIP: Add geometry logical type~~ PARQUET-2471: Add geometry logical type May 11, 2024

emkornfield reviewed May 14, 2024

emkornfield reviewed May 14, 2024

emkornfield reviewed May 14, 2024

paleolimbot reviewed May 15, 2024

paleolimbot mentioned this pull request May 21, 2024

Thoughts about a first-class GEOMETRY data type in Parquet? opengeospatial/geoparquet#222

jorisvandenbossche reviewed May 21, 2024

paleolimbot mentioned this pull request May 21, 2024

[Parquet][C++] Behaviour of unknown logical type when encountered in Parquet reader apache/arrow#41764

wgtmac force-pushed the geo branch from 745ecb1 to f71c010 Compare May 25, 2024 15:22

zhangfengcdt and others added 17 commits October 13, 2024 11:37

Update covering and geometry type protocol based on comments (#2)

1aaaca8

Add the new suggestion according to the meeting with Snowflake (#3)

ee5b2df

change metadata to string type and rewording WKB description

19cc081

add example for crs

16c5868

reword crs

56a65de

clarify WKB

f28b282

clarify coverings

5127702

Update the suggestion for bbox stats (#4)

298ab64

* Add the new suggestion according to the meeting with Snowflake * Refine the description according to the suggestion

Update src/main/thrift/parquet.thrift

41c6394

Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>

Update src/main/thrift/parquet.thrift

d86abe4

Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>

Update src/main/thrift/parquet.thrift

c7a4f4c

Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>

Update src/main/thrift/parquet.thrift

f20f685

Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>

address feedback about edges and wkb

dbf9d54

add geoparquet column metadata back

b4296aa

Update the spec according to the new feedback (#5)

9bcea6e

* Update the spec according to the new feedback * Fix typo

Update src/main/thrift/parquet.thrift

99f0403

Co-authored-by: emkornfield <emkornfield@gmail.com>

Update src/main/thrift/parquet.thrift

dbb78cf

Co-authored-by: emkornfield <emkornfield@gmail.com>

wgtmac force-pushed the geo branch from c855ca7 to dbb78cf Compare October 13, 2024 03:39

wgtmac added 2 commits October 13, 2024 21:04

add description to LogicalTypes.md

25df0ff

add explanation for Z & M values

d349727

szehon-ho reviewed Oct 14, 2024

rdblue reviewed Oct 14, 2024

rdblue reviewed Oct 14, 2024

move geo stats to ColumnMetaData

9ea6559

jiayuasu reviewed Oct 16, 2024

wgtmac and others added 2 commits October 17, 2024 09:15

Update src/main/thrift/parquet.thrift

011de45

Co-authored-by: Jia Yu <jiayu@wherobots.com>

fix typo

6425a3c
