
Add geometry type to iceberg #2586

Open
x-malet opened this issue May 12, 2021 · 24 comments

x-malet commented May 12, 2021

Hi everyone,

I was playing with Trino and Spark to manipulate geospatial data and store it in an Iceberg table, but the geometry type is not supported yet. Is there a plan to add or support it?

I store the geometries as binary for now, and while that's fast, it would be easier to store them as a proper geometry type and not have to convert them in every geospatial operation (intersections, overlaps, etc.).
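
To make the conversion overhead concrete, here is a minimal sketch of that round trip using shapely (the geometries here are made up):

from shapely import wkb
from shapely.geometry import Point, Polygon

# A geometry stored as binary (WKB) in the table today:
stored = wkb.dumps(Point(1.0, 1.0))

# Every spatial operation has to deserialize the bytes first...
geom = wkb.loads(stored)
hit = geom.intersects(Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]))

# ...and serialize again before writing the result back.
stored = wkb.dumps(geom)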

Thanks!

kbendick (Contributor) commented May 13, 2021

I'm not necessarily opposed to the idea, but support for geospatial types seems like something that would need to be added to the various query engines (Trino, Spark, etc.).

If they supported geospatial data structures, or even just had geospatial functions like PostGIS does, then I think it would definitely make sense to add support for them in Iceberg.

However, Iceberg is simply a table format for use by the various big data query engines (Spark, Trino, Flink, Hive, ...). Without first-class support for geospatial types in those engines, I can't see what you'd be able to do with the stored geometry types other than fall back to binary or string types for manipulation.

It's Spark/Trino that is responsible for the query processing. Iceberg is more for the table definition as well as the source and the sink (and occasionally things like specialized joins, or specialized commands to update a table's metadata, which goes along with being a table definition).

So I can't see support for geometric shapes being added to Iceberg unless the major query engines that we support add them.

I know there's spark-gis (Spark is very popular, so there are many Spark libraries), but is geospatial data supported outside of the occasional third-party Spark lib? For example, does Trino have support for geometry operations (intersections, overlaps, contains, and so on)?

x-malet (Author) commented May 13, 2021

Hi @kbendick, thanks for your answer. Yes, there is support for geospatial data in Trino (Trino geospatial functions), and Spark now has a major project supported by Apache (Apache Sedona). The problem is that every time you want to process a massive dataset stored as binary, you have to convert it from binary to geometry and then process it.

On the other hand, I think that storing a geometry in a binary format is more exchangeable between systems and allows more flexibility...
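
As a sketch of what that per-query translation looks like with Apache Sedona on Spark (the database, table, and column names here are hypothetical):

from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator

spark = SparkSession.builder.appName("geo-demo").getOrCreate()
SedonaRegistrator.registerAll(spark)  # registers Sedona's ST_ SQL functions

# Hypothetical table with a WKB-encoded binary column `geom_wkb`: every
# query pays for ST_GeomFromWKB before the actual predicate can run.
spark.sql("""
    SELECT id
    FROM db.places
    WHERE ST_Intersects(
        ST_GeomFromWKB(geom_wkb),
        ST_GeomFromText('POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))'))
""").show()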

badbye commented Aug 31, 2022

To fully support geometry, there are lots of things to do.

  1. Add geometry type.
  2. Partitioning.
  3. Filtering.
  4. Writing and reading.

First, we must figure out how to store geometry in Parquet and Avro files. GeoMesa has already done it, and GeoParquet is trying to establish a standard. What about Avro? No idea yet.
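
One possibility for Avro, purely a sketch and not an established convention, is to store WKB in a bytes field tagged with a custom logical type (the logicalType name here is made up; readers that don't recognize it fall back to plain bytes):

import io
from fastavro import parse_schema, writer
from shapely import wkb
from shapely.geometry import Point

# Made-up convention: WKB bytes annotated with a custom logical type.
schema = parse_schema({
    "type": "record",
    "name": "Feature",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "geometry",
         "type": {"type": "bytes", "logicalType": "geometry-wkb"}},
    ],
})

buf = io.BytesIO()
writer(buf, schema, [{"id": 1, "geometry": wkb.dumps(Point(1.0, 1.0))}])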

Second, use query engines like Spark to read data from sources and write geometry records into files (since Iceberg only offers APIs to append files, not records).

Finally, (conditional) reading is not that hard to do.

My team is working on it. Hopefully we can finish it by the end of 2022.

tpolong commented Jan 9, 2023

@badbye How is it going?

badbye commented Jan 10, 2023

> @badbye How is it going?

It's done, but it only supports the Spark engine; there is still a lot of work to do.
We are still discussing whether to open an MR. It may introduce more maintenance burden than the features it brings.

We may release the code in a few months, in another repo.

badbye commented Mar 3, 2023

Hi @x-malet @tpolong @kbendick, we have developed a project named GeoLake. You can try it in this Docker environment: https://github.com/spatialx-project/docker-spark-geolake

We will release our code soon in this repo: https://github.com/spatialx-project/geolake.

tpolong commented Mar 3, 2023 via email

vcschapp commented
@badbye, did you abandon efforts to get this into the main Iceberg product?

badbye commented Apr 23, 2023

> @badbye, did you abandon efforts to get this into the main Iceberg product?

We really hope to be able to merge into the main branch. However, it requires a lot of work: writing a proposal, splitting the code and adapting it to the latest main-branch code, submitting merge requests, and dealing with code-review feedback.

Currently our team does not have enough time, and there is no clear plan for when we will do it. We have released the code at https://github.com/spatialx-project/geolake, which is fully compatible with Iceberg 0.15. Anyone is welcome to use it and contribute.

jornfranke commented
I think it would be good to have geospatial support in Apache Iceberg, although it is certainly a more complex feature.
While the spatialx-project seems to have done a lot of useful implementation work, it is a bit difficult to use, as the changes made to Iceberg are unclear, and I could not find any documentation on how geometry is added to Iceberg.

I propose that this documentation be started so this issue can move on. @badbye, does it make sense to you for you or me to create a Google Docs (or CryptPad, https://cryptpad.fr/) document that is viewable by everybody, similar to how other specs are done in Iceberg? Happy to also help with the writing and structuring.

One could initially have it as follows:

What benefit does one get from using Apache Iceberg with geospatial data instead of, for instance, simply GeoParquet? I would think about:

  • Support writing of individual rows (this can be useful in streaming scenarios, e.g. Internet of Things devices communicating their position).
  • Natively query the geospatial data without manual, error-prone conversion (e.g. using the right CRS when loading) and with much higher performance.
    ... maybe you have some other motivations for why you started GeoLake?

One can also think about other features (I would not add them in the first spec due to complexity):

  • Partitioning of data according to spatial (location) criteria (see also: Add an spatial index in geoparquet format as optional opengeospatial/geoparquet#13 (comment)), which seems to be supported by GeoLake (I wonder whether we can instead/additionally use the z-ordering feature of Iceberg to reuse existing Iceberg functionality?)
  • Loading/storing rasters (at the moment all proposals, including GeoParquet, cover only vector data); this is more complex, as the raster should be split into small, equal tiles, etc.

I suggest that a public Google Doc be started where one can spell out what it would mean for Iceberg to add geospatial support.

szehon-ho (Collaborator) commented
Hi, we are also interested in collaborating on this topic. I was looking at this with @hsiang-c, who is also working with geo data; we are starting a POC of geo data on Iceberg.

We were reading a previous thread on the subject: https://lists.apache.org/thread/t6nk5t5j1p302hmcjs77lndm07ssk8cl. It seems @rdblue was thinking that an alternative to adding a geo type was to use the binary type with WKB encoding. That way we don't have to add new types to Iceberg. Also, I noticed that changing a partition transform from taking one primitive to taking a complex type is probably a bigger change and can be avoided if possible, as it may raise more questions.

I also agree with the thread that the benefit in Iceberg will be having some geo_partition partition transform, to avoid the user having to manually generate a partition column. There are also a few open questions for the geo_partition transform: since there are several formats, we need to decide which one we support, or, my thought, leave it as a pluggable implementation (a toy sketch of the idea is below).
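
For illustration, a toy version of what such a transform could compute. This is not Iceberg's transform API, just the idea of mapping a geometry to a space-filling-curve cell:

from shapely import wkb
from shapely.geometry import Point

def geo_partition(geom_wkb: bytes, bits: int = 16) -> int:
    """Map a geometry (as WKB) to the Z-order (Morton) cell of its centroid.

    Toy sketch of a geo_partition transform. Real designs (XZ2, geohash,
    Hilbert curves, ...) differ, which is why pluggability comes up.
    """
    g = wkb.loads(geom_wkb)
    # Quantize lon/lat into the integer range [0, 2^bits).
    qx = int((g.centroid.x + 180.0) / 360.0 * ((1 << bits) - 1))
    qy = int((g.centroid.y + 90.0) / 180.0 * ((1 << bits) - 1))
    cell = 0
    for i in range(bits):  # interleave the bits of qx and qy
        cell |= ((qx >> i) & 1) << (2 * i)
        cell |= ((qy >> i) & 1) << (2 * i + 1)
    return cell

geo_partition(wkb.dumps(Point(8.5, 47.4)))  # nearby geometries share cells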

badbye commented Aug 15, 2023

@jornfranke Thanks for your suggestion. I will write a doc this week.

badbye commented Aug 23, 2023

Hi @jornfranke, I wrote a doc: https://docs.google.com/document/d/1vCCVFayEck5f1uAMAmPVFiAbgJoR9x6BdAaj3Tgr5XU/edit?usp=sharing
If you have any questions or ideas, feel free to comment and contribute.

jornfranke commented
Wow, great. Will look into it soon.

jornfranke commented
I think it is a good start. I would like your opinion on the following:

  • From my point of view we need to support some spatial metadata, especially the CRS. I propose to reuse the same as in the GeoParquet definition: https://geoparquet.org/releases/v1.0.0-rc.1/ (a sketch of how GeoParquet embeds this metadata follows this list).
  • What do you propose as the underlying storage format? You mention three (GeoParquet, SpatialParquet, and GeoLake Parquet) and you implemented GeoParquet, GeoParquet (bbox), and GeoLake Parquet. I propose to reduce this to one. At the moment it looks to me like GeoParquet has the largest community and is also supported in other systems (e.g. GeoPandas), which may make it easier to use in the Iceberg ecosystem (e.g. https://py.iceberg.apache.org/).
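
For reference, a minimal sketch of how GeoParquet-style metadata (including the CRS slot) attaches to a file via pyarrow; the "geo" key and field names follow the v1.0.0-rc.1 spec, abbreviated here:

import json
import pyarrow as pa
from pyarrow import parquet
from shapely import wkb
from shapely.geometry import Point

table = pa.table({"geometry": pa.array([wkb.dumps(Point(1.0, 1.0))],
                                       type=pa.binary())})

# GeoParquet keeps its spatial metadata (CRS, encoding, bbox, ...) as JSON
# under the "geo" key of the Parquet file metadata.
geo = {
    "version": "1.0.0-rc.1",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": []}},
}
parquet.write_table(
    table.replace_schema_metadata({"geo": json.dumps(geo)}),
    "example.parquet",
)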

Generally, I propose going with a roadmap that has a simple first release, to make it easier for people from the Iceberg project to review and to get initial feedback from the Iceberg community, e.g.:
First release: storage backend GeoParquet (including GeoParquet metadata). Supported ecosystem: Apache Sedona on Spark.
Second release: add XZ partitioning. Supported ecosystem: Apache Sedona on Spark and Flink, plus PyIceberg.
Third release: include raster data (here the challenge is to split a big raster into multiple tiles that are transparently read as one, cf. https://sedona.apache.org/1.4.1/tutorial/storing-blobs-in-parquet/)...?

This is just an example; the details can change.

paleolimbot (Member) commented
Hi all, I'm sorry to be late to this conversation!

For the last few years we have been working on conventions for geospatial data in Apache Arrow, and we are nearing a first release of the specification (see https://github.com/geoarrow/geoarrow). The metadata specification is closely related (identical, as long as you specify an explicit CRS) to GeoParquet's metadata, and it defines Arrow extension types/memory-layout conventions for WKT, WKB, Point, Linestring, Polygon, MultiPoint, MultiLinestring, and MultiPolygon. The next release of the GeoParquet specification will probably include one or more of these as options, because the Point/Linestring/Polygon/MultiPoint/MultiLinestring/MultiPolygon types have a storage type that encodes coordinates in Parquet such that bounding boxes are written as column statistics out of the box by most Parquet writers.

At heart, GeoParquet is a file-level specification, whereas GeoArrow is a column-level specification. I wonder if there would be a lower barrier to entry if the first step were to add GeoArrow types (maybe WKB to start) as a first-class type? The MultiPoint, MultiLinestring, and MultiPolygon types would also give you bounding-box information via the built-in system for nested column statistics (but are more complicated... nested lists of structs). I believe they also align with DuckDB's internal memory layout for (non-serialized) geo types in its geo extension.
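
To make the statistics point concrete, here is a small sketch with plain pyarrow: a point column stored as struct<x, y> (the GeoArrow point layout) gets per-field min/max statistics, which together form a bounding box (the file name is arbitrary):

import pyarrow as pa
from pyarrow import parquet

# Points in the GeoArrow point layout: struct<x: double, y: double>.
pts = pa.array(
    [{"x": 0.0, "y": 0.0}, {"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 1.0}],
    type=pa.struct([("x", pa.float64()), ("y", pa.float64())]),
)
parquet.write_table(pa.table({"geometry": pts}), "points.parquet")

# Most Parquet writers emit min/max statistics per leaf column, so the
# x and y ranges of each row group are a free bounding box.
rg = parquet.ParquetFile("points.parquet").metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.statistics.min, col.statistics.max)
#> geometry.x 0.0 3.0
#> geometry.y 0.0 2.0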

I'll be presenting on this at Community Over Code in October and look forward to chatting about all of this there!

jornfranke commented
I think it is a good proposal, and I think we need to get some feedback from the Apache Iceberg team. Apache Iceberg is based on storage backends such as Parquet, Avro, and ORC (https://iceberg.apache.org/spec/#appendix-a-format-specific-requirements); there can also be custom ones. Parquet is the most popular one, though.

Using GeoArrow can make sense here, because all geo data would be stored the same way in all formats. Of course, it would then not be 100% compatible with GeoParquet.

However, I think the benefit of having this harmonized across all storage backends would outweigh that. Nevertheless, I am interested in opinions.

I would also be interested in an opinion from the Apache Iceberg team. The GeoLake solution is great, but it was developed without the Apache Iceberg team. What are the expectations now:

  • Continue to write the specification of GeoLake here, use the GeoLake changes as they are, and add them to Iceberg
  • Develop a new specification that is lightweight, with minimal features at the start, and evolve the implementation over time? (One can then include GeoArrow in this discussion.)
  • ... something else

jiayuasu (Member) commented
For folks who are interested in geometry/raster + Iceberg: we @wherobots have implemented an Iceberg-compatible spatial table format called Havasu: https://docs.wherobots.services/1.2.0/references/havasu/introduction/

The reader and writer implementation for Sedona + Apache Spark is available in our cloud database SedonaDB. You can try it out immediately on Wherobots Cloud.

EternalDeiwos commented Nov 18, 2023

> I'll be presenting on this at Community Over Code in October and look forward to chatting about all of this there!

@paleolimbot This sounds like a really great summary of current progress. Any chance there's a recording of this somewhere?

paleolimbot (Member) commented
I think there is a recording, but I'm not sure if it has been posted yet. The slides are here: https://dewey.dunnington.ca/slides/geoarrow2023. GeoArrow for Python is on PyPI/conda and can generate examples of what the Parquet files would look like and what the memory layout would be:

import geoarrow.pyarrow as ga
import pyarrow as pa
from geoarrow.pyarrow import io
from pyarrow import parquet

extension_array = ga.as_geoarrow(["POLYGON ((0 0, 1 0, 0 1, 0 0))"])
extension_array.type
#> PolygonType(geoarrow.polygon)
extension_array.type.storage_type
#> ListType(list<rings: list<vertices: struct<x: double, y: double>>>)
extension_array.geobuffers()
#> [None,
#>  array([0, 1], dtype=int32),
#>  array([0, 4], dtype=int32),
#>  array([0., 1., 0., 0.]),
#>  array([0., 0., 1., 0.])]

# Parquet with extension type
table = pa.table([extension_array], names=["geometry"])
parquet.write_table(table, "ext.parquet")

# GeoParquet (no extension type, but with 'geo' metadata for GeoParquet)
io.write_geoparquet_table(table, "geo.parquet")

An example of how to find column statistics and use them is here: https://github.com/geoarrow/geoarrow-python/blob/main/geoarrow-pyarrow/src/geoarrow/pyarrow/dataset.py#L434-L468

ajantha-bhat (Member) commented Nov 18, 2023

I found this recently.

wherobots/havasu@main/spec.md

jiayuasu (Member) commented
@ajantha-bhat Yes, I have mentioned that above. The reader and writer implementation of Havasu in Sedona + Apache Spark is available in our cloud database SedonaDB. You can try it out immediately on Wherobots Cloud.


github-actions bot commented

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale'; commenting on the issue is preferred when possible.

github-actions bot added and then removed the stale label Jun 14, 2024
gregorywaynepower commented
Just want to make sure folks who care about geospatial support are pointed to #10260.
