A "nested list array" representation for geometries #4

Closed
jorisvandenbossche opened this issue Apr 12, 2020 · 57 comments


@jorisvandenbossche
Contributor

jorisvandenbossche commented Apr 12, 2020

One possible memory layout (cf. #3) is an array of "nested lists", which is natively supported in Apache Arrow.
It is the format that spatialpandas started using recently (in combination with datashader for visualization).

In this issue, I try to provide a clear description of what this format exactly is, and we can discuss potential advantages/disadvantages.


The idea here is similar to how e.g. GeoJSON stores the coordinates or how WKT displays them. Using an example from the Wikipedia page for MultiPolygon:

{
    "type": "MultiPolygon", 
    "coordinates": [
        [
            [[40, 40], [20, 45], [45, 30], [40, 40]]
        ], 
        [
            [[20, 35], [10, 30], [10, 10], [30, 5], [45, 20], [20, 35]], 
            [[30, 20], [20, 15], [20, 25], [30, 20]]
        ]
    ]
}

The MultiPolygon's coordinates are represented as a nested list: a list of geometry parts (polygons), each consisting of a list of rings (at least one exterior ring, potentially additional interior rings), and each ring consisting of a list of coordinates.

An array of such nested lists can be represented efficiently in memory in a contiguous buffer for the coordinates, with offsets determining where the lists start.
In the Apache Arrow standard memory layout ("Arrow Columnar Format"), such data can be stored as what is called a "Variable-size List Array" (https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).

In the case of an array of MultiPolygons, you would get a ListArray with 3 levels of nesting.

So something like this (the single multipolygon from above):

[[[[40, 40], [20, 45], [45, 30], [40, 40]]], [[[20, 35], [10, 30], [10, 10], [30, 5], [45, 20], [20, 35]], [[30, 20], [20, 15], [20, 25], [30, 20]]]]

is actually stored in memory as

values: [40, 40, 20, 45, 45, 30, 40, 40, 20, 35, 10, 30, 10, 10, 30, 5, 45, 20, 20, 35, 30, 20, 20, 15, 20, 25, 30, 20, ... ]
ring indices into values: [0, 8, 20, 28, ...]
parts indices into rings: [0, 1, 3, ...]
geometry indices into parts: [0, 2, ...]

with other MultiPolygons appended to those arrays.
In Apache Arrow this is a list<list<list<double>>> List Type. The innermost nested arrays store the x/y coordinates in interleaved ordering. And each additional nesting level is represented by an array of offsets.
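To make the offsets concrete, here is a minimal pure-Python sketch (illustrative only, not the actual Arrow buffers; the function name is invented) that flattens GeoJSON-style MultiPolygon coordinates into the four arrays shown above:

```python
def flatten_multipolygons(multipolygons):
    """Flatten GeoJSON-style MultiPolygon coordinates into one
    interleaved values buffer plus three offset arrays, mirroring
    Arrow's list<list<list<double>>> layout."""
    values = []           # interleaved x, y, x, y, ...
    ring_offsets = [0]    # where each ring starts in values
    part_offsets = [0]    # where each polygon (part) starts in rings
    geom_offsets = [0]    # where each geometry starts in parts
    for multipolygon in multipolygons:
        for polygon in multipolygon:
            for ring in polygon:
                for x, y in ring:
                    values.extend([x, y])
                ring_offsets.append(len(values))
            part_offsets.append(len(ring_offsets) - 1)
        geom_offsets.append(len(part_offsets) - 1)
    return values, ring_offsets, part_offsets, geom_offsets
```

Running this on the single MultiPolygon above reproduces exactly the offsets listed: `[0, 8, 20, 28]`, `[0, 1, 3]`, and `[0, 2]`.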

This is an example for MultiPolygons (where 3 levels of nesting are required). All other geometry types can be stored similarly, each requiring a different (but predefined) number of nesting levels.
For example, an array of LineStrings can be represented as a ListArray with a single level of nesting. Each element in the array is 1 list of coordinates (with x, y, x, y, ... coordinates of one linestring interleaved), requiring only one array of offsets keeping track of the number of coordinates of each linestring.

More specifically, the ListArray for each geometry type has the following number of nesting levels:

  • MultiPoint/LineString/LinearRing: one level of nesting.
  • MultiLineString/Polygon: two levels of nesting.
  • MultiPolygon: three levels of nesting.

Point arrays can be represented as a Fixed Size list array (since we know that each entry will have 2 coordinates), which means it is effectively stored as the x/y coordinates in a single interleaved array.
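A sketch of the point case (helper names invented for illustration): with a fixed size of 2, no offsets buffer is needed at all, since point i always lives at a fixed position in the interleaved array.

```python
def points_to_interleaved(points):
    # Fixed-size layout: every entry contributes exactly 2 values,
    # so no offsets buffer is required.
    return [coord for point in points for coord in point]

def get_point(values, i, size=2):
    # Point i occupies values[size*i : size*(i+1)].
    return values[size * i : size * (i + 1)]
```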


Potential advantages of this kind of format:

  • It is natively supported in the Apache Arrow memory layout, which means it can also be efficiently stored in formats like Parquet and Feather, or be used in systems that rely on Apache Arrow for data transfer. To be clear, WKB can also be natively stored in Apache Arrow, as a variable-size binary array.

  • The coordinates are accessible as a single flat array of floats. This means they can be directly interpreted as values without needing a WKB decoder. And this can also be beneficial for high-performance processing of those coordinates. For example, this is the format that is currently used by @jonmmease in the datashader visualization library to ingest geometry data, where it is efficient to iterate over the coords/offset arrays.

There are also some clear limitations / constraints:

  • The ListArray itself does not store any information on the geometry types. This means that 1) such an array can only store homogeneous geometry types (no mixed geometry types in one column) and 2) the geometry type needs to be stored in metadata in order to correctly interpret the list array.

  • Since it doesn't support mixed geometries, it also does not support GeometryCollection objects.

@kylebarron
Member

kylebarron commented Apr 13, 2020

This is interesting; thanks for writing this up. My comments here are generally in comparison with the geometry representation thoughts I wrote here.

Nested vs flat lists

The biggest contrast between my linked thoughts and your description is that I proposed a struct of flat lists and you propose nested lists.

To make the difference concrete, let's take the above MultiPolygon example. With nested lists that might look like

[
    [
        [[40, 40], [20, 45], [45, 30], [40, 40]]
    ], 
    [
        [[20, 35], [10, 30], [10, 10], [30, 5], [45, 20], [20, 35]], 
        [[30, 20], [20, 15], [20, 25], [30, 20]]
    ]
]

with a struct of flat lists as I laid out, that would look like

{
  type: 'Polygon',
  positions: [40, 40, 20, 45, 45, 30, 40, 40, 20, 35, 10, 30, 10, 10, 30, 5, 45, 20, 20, 35, 30, 20, 20, 15, 20, 25, 30, 20],
  size: 2,
  polygonIndices: [0, 4, 14],
  ringIndices: [0, 4, 10, 14]
}

While nested lists might internally store positions in the same format as this flat list, storing them purely as a flat list with metadata essentially allows lower-level access.

I think most disadvantages of the nested list format could be solved by wrapping it in a struct.
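As an illustration of that lower-level access, a hedged sketch (assuming 2D interleaved positions; the helper name is invented) of slicing the k-th ring straight out of the struct above, where the index arrays count points rather than floats:

```python
def ring_coords(geom, k, dims=2):
    # ringIndices entries count points, not floats, so scale by the
    # number of dimensions to index into the interleaved positions.
    start = geom["ringIndices"][k] * dims
    end = geom["ringIndices"][k + 1] * dims
    return geom["positions"][start:end]
```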

Multi-geometries for free

One thing I like about the flat lists is that multi-geometries are the same layout as their single-part geometry counterparts. With nested lists, multi-part geometries always have an extra nested list level. In contrast, with flat lists, the number of polygon geometries is defined exclusively by the polygonIndices array. Setting it to length 2 implies a single polygon (with possibly multiple rings), while setting to length >2 implies a MultiPolygon. But the format of positions doesn't change either way.
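A one-line sketch of that inference (the function name is invented):

```python
def geometry_kind(polygon_indices):
    # n+1 offsets delimit n polygon parts; more than one part implies
    # a MultiPolygon, while the positions layout is identical either way.
    num_parts = len(polygon_indices) - 1
    return "Polygon" if num_parts == 1 else "MultiPolygon"
```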

2D vs N-D

Point arrays can be represented as a Fixed Size list array (since we know that each entry will have 2 coordinates), which means it is effectively stored as the x/y coordinates in a single interleaved array.

I would prefer that representations support N-D coordinates, so that the same memory format could support 2, 3, or 4 dimensions without any changes.

Flat array

The coordinates are accessible as a single flat array of floats. This means they can be directly interpreted as values without needing a WKB decoder. And this can also be beneficial for high-performance processing of those coordinates. For example, this is the format that is currently used by @jonmmease in the datashader visualization library to ingest geometry data, where it is efficient to iterate over the coords/offset arrays.

I can't tell if this is true? If coordinates are stored as nested arrays as above, is it possible to extract the core flat array representation without processing?

Heterogeneous types

The ListArray itself does not store any information on the geometry types. This means that 1) such an array can only store homogeneous geometry types (no mixed geometry types in one column) and 2) the geometry type needs to be stored in metadata in order to correctly interpret the list array.

Even sticking with nested arrays, it wouldn't add much space to wrap it in a struct... Something like

{
  // Or, to save space, type: 3
  // https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
  type: 'Polygon',
  geometry: [[[...]]]
}

GeometryCollections

Since it doesn't support mixed geometries, it also does not support GeometryCollection objects.

Yes, this is true. I'm not sure how important it would be to support them. I've never actually seen them in the wild.

@jorisvandenbossche
Contributor Author

OK, that's interesting!

Your flat lists could basically be seen as the different arrays that make up the internal representation of a ListArray, but put together in a Struct? (Although the exact interpretation of the indices is different: in your example they all point into the coordinates array, while for the ListArray, each level of offsets points into the next level down.)

One thing I like about the flat lists is that multi-geometries are the same layout as their single-part geometry counterparts.

That's something you could do with the ListArray as well, by representing a Polygon in the same way as the MultiPolygon above, but with only one sub-polygon. That would mean that you can't distinguish a Polygon from a MultiPolygon with 1 part, but that's maybe not a problem (it's in any case a way to store a mixed Polygon/MultiPolygon array).

So I think this is equivalent between the ListArray and your proposal? The polygonIndices would also not be needed if you only allow Polygons and not MultiPolygons (which is basically the extra level of nesting in the ListArray for MultiPolygons).

2D vs N-D

I would prefer that representations support N-D coordinates, so that the same memory format could support 2, 3, or 4 dimensions without any changes.

Yes, my example used x, y (2D), but this can extend to any number of dimensions. We can store in metadata what the dimension is, but you could also infer it from the length of the fixed size list (in case of points).

Flat array

I can't tell if this is true? If coordinates are stored as nested arrays as above, is it possible to extract the core flat array representation without processing?

Yes, that's how ListArray is stored in memory in the Arrow spec (see the "values" in my memory layout example for the MultiPolygon).
The same would be true for a StructArray (with your example struct layout).

Heterogeneous types

Even sticking with nested arrays, it wouldn't add much space to wrap it in a struct... Something like

The main problem for Arrow is that all values in the array need to have the same type. So that would mean: the same number of nesting levels for the ListArray, or the same keys for the struct in your example.
So with both nested lists and structs, that poses a problem. Now, we could take the "union" of nesting levels / struct keys, so that each geometry type can be stored in it, possibly not using certain nesting levels / keys. But that's a trade-off with memory usage.

@kylebarron
Member

I haven't responded up to now because I don't think I understand the Arrow spec quite well enough. You might be right that we're essentially saying the same thing... I'll try to read up a bit more and get back to you.

@jorisvandenbossche
Contributor Author

jorisvandenbossche commented Apr 27, 2020

Some additional points that might help in clarifying:

  • One of the core concepts of Arrow is that it is a columnar layout, which means all values in a single column (here: all geometries of one column) are stored in one contiguous array. While maybe obvious, I am re-iterating this point to emphasize the following: although the above sketches of the nested list vs struct of flat lists showed the "encoding" of a single geometry, in the actual Arrow memory those are stored together for many geometries.
    For the "nested list", this means that the interleaved values array is one contiguous array for the interleaved x,y coordinates for all coordinates of all geometries in the column. Similarly, for the "struct of lists" sketch, the way that Arrow stores such structs in memory is by storing one contiguous array for all the values of one key in the struct (so for a struct with keys "type" and "positions", it actually stores two separate arrays in memory, one array for all values of the "type" key and one array for "position").
  • The above means that in both cases, the actual interleaved coordinate values are stored in a single flat array for all the geometries of a column. And I think this is the core / most important aspect of either idea: we store all coordinates in a single array. It is then a question of how to indicate where each geometry starts, where each geometry-part starts, where each ring starts, etc. The details here might differ a bit between the nested list and the struct (and those are important details to flesh out, of course!), but both use this single flat array of coordinates (across geometries, not limited to a single geometry).
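A toy sketch of that struct-of-arrays idea, with plain Python lists standing in for Arrow's per-field child arrays:

```python
def to_columnar(rows):
    # A StructArray is stored column-wise: one contiguous child array
    # per struct key, rather than one record object per row.
    return {key: [row[key] for row in rows] for key in rows[0]}
```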

@kylebarron
Member

Just a quick recap of our offline chat, and @jorisvandenbossche can correct me if I'm wrong on anything

  • The "nested lists" and "list of structs" formats are quite alike, with differences of what the offset indices mean

  • With nested lists, you can only have one geometry type per column, so points, lines, and polygons would have to be separate tables or separate columns with null values

  • With a list of structs, all data for the entire column of each key in the struct is stored contiguously. So if the struct is:

     {positions: Array, offsets: Array, type: 1|2|3}

    Then you have a flat array with all the positions for the entire table, followed by a flat array with all the offsets for the entire table, followed by types for the entire table.

  • From the deck.gl perspective, it's ideal to have all data of a single geometry type in a contiguous buffer, so that Point and Line geometries could potentially be uploaded to the GPU without any intermediate processing.

  • There's a possibility of using a Union type to do this. To essentially represent a struct something like:

     {
     	type: Integer,
     	size: Integer,
     	pointPositions: Array,
     	linePositions: Array,
     	lineStartIndices: Array,
     	polygonPositions: Array,
     	polygonStartIndices: Array,
     	polygonRingStartIndices: Array
     }

    But where the positions array is split internally into an array for each geometry type. Since deck.gl processes each geometry type independently and in different ways, separating these position arrays in the Arrow memory aids in working with the geometries without an extra copy to iterate over the dataset splitting by type.

    However from here it says that Union types aren't supported in Parquet. So to keep Parquet support open, we could store it as a plain struct, but then we have to deal with null values in the middle of the pointPositions array for geometries that are not Points, which is not ideal.

  • WKB is a good format to start out with, but should we support it forever as a geometry encoding, or just to start? Multiple formats are harder to support for implementations, but a specialized internal data structure such as struct arrays might be difficult to support for applications used to GeoJSON. A possible benefit of the nested list approach is that it mimics the GeoJSON structure closely, and thus shouldn't be a huge switch for existing applications that expect GeoJSON, while also allowing access to the raw flat array of coordinates.

@jorisvandenbossche
Contributor Author

I made a notebook exploring a bit more in detail the different memory representations (nested list vs struct) with a small example: https://nbviewer.jupyter.org/gist/jorisvandenbossche/dc4e98cf5c9fdbb64769716d046d0edf
(feedback on how this can be illustrated / explained more clearly is certainly welcome, as I find it not that easy to explain)

@kylebarron thanks for that write-up!

The "nested lists" and "list of structs" formats are quite alike, with differences of what the offset indices mean

Indeed, see the notebook I linked to above where I tried to make those differences in offsets a bit more clear / visual.

There's a possibility of using a Union type to do this [store mixed geometries as separate contiguous arrays]. ... However from here it says that Union types aren't supported in Parquet.

Hmm, yes, that might be a reason to not use Unions, at least not directly (otherwise it would limit mixed geometries to Feather format and Arrow IPC).

Now, I think the "fallback" of mimicking the Union with a plain StructArray, as you mention, might actually still be possible. In general, then you indeed have to deal with nulls. But since the coordinates array is actually a ListArray, the nulls are not necessarily stored in the physical flat array, but only through the offset arrays.
If you look at the first example layout at https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout, there are nulls (and empty lists) in the example array, but the actual values are still stored in a contiguous values array.
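A small sketch of that behavior (plain lists standing in for Arrow's buffers, with validity as a list instead of a bitmap): null and empty slots consume nothing in the values buffer, because their offsets simply repeat.

```python
def decode_list_array(values, offsets, validity):
    # Variable-size list layout: slot i spans values[offsets[i]:offsets[i+1]].
    # A null slot (validity 0) and an empty list both repeat the offset,
    # so the contiguous values buffer is untouched.
    out = []
    for i in range(len(offsets) - 1):
        if not validity[i]:
            out.append(None)
        else:
            out.append(values[offsets[i]:offsets[i + 1]])
    return out
```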

This would actually work for both the ListArray or StructArray layout, since the coordinates are stored as a single list in the StructArray geometry layout as well.

@kylebarron
Member

cc'ing @trxcllnt who has contributed to the C++ and JS arrow bindings and is working on https://github.com/rapidsai/cuspatial, geospatial support for a CUDA GPU dataframe using Arrow with python bindings. Comments for how they're currently laying out polygons are here and here.

@jorisvandenbossche
Contributor Author

@kylebarron thanks for making this link! That's interesting ;)

@trxcllnt if I am reading the comments in cuspatial source code correctly, the memory layout for polygons you are using is very similar to what has been discussed here? (separate arrays for coordinates, geometry (feature) indices and ring indices).
The discussion here is a bit long and maybe hard to find what was actually proposed, but I think the top post still gives a good idea (except the example is using MultiPolygons (which need an extra array of indices) instead of Polygons).

The "SoA" mentioned (Structure of Arrays) is the same idea that Arrow's StructArray is using (and in the end ListArray is similar as well).

The main difference seems to be that in cuspatial also the x and y coordinates are stored in separate arrays, while the idea here was to keep those interleaved and only store the indices in separate arrays.

And I am also not fully clear on the exact interpretation of the feature and ring indices in cuspatial based on those links.
From here, there is an example for fpos, rpos, x and y:

            cudf.Series([1, 2], index=['nyc', 'dc']), # ring positions of
                    # two polygons each with one ring
            cudf.Series([4, 8]), # positions of last vertex in each ring
            # polygon coordinates, x and y. Note [-10, -10] and [0, 0] repeat
            # the start/end coordinate of the two polygons.
            cudf.Series([-10, 5, 5, -10, -10, 0, 10, 10, 0, 0]),
            cudf.Series([-10, -10, 5, 5, -10, 0, 0, 10, 10, 0]),

But the cumulative number of vertices in each ring doesn't take into account that the ring needs to be closed? (The ring indices indicate that each ring is of length 4, but in the coordinates array, each ring has 5 elements.) Wouldn't this mean you can't directly use the ring offsets to index into the coordinates array?

@trxcllnt

@jorisvandenbossche cuSpatial is undergoing a massive overhaul at the moment, so the behavior described by the comment you referenced may change soon :-). But yes we've taken the Struct of Arrays approach in many places because global memory reads are often more optimized on the GPU.

We were discussing using a segmented prefix sum layout (Arrow's List) for some of the trajectory APIs, but the List column type is only just now landing in libcudf, so it wasn't available when the algo was first implemented.

I believe all the feature/ring/geometry terminology here is inherited from GDAL. You can check out their docs here, or look at polygon_shapefile_reader.cpp to see the relationship between the GDAL types more clearly.

@achapkowski

Why not use WKB?

@kylebarron
Member

From the top post

  • The coordinates are accessible as a single flat array of floats. This means they can be directly interpreted as values without needing a WKB decoder. And this can also be beneficial for high-performance processing of those coordinates. For example, this is the format that is currently used by @jonmmease in the datashader visualization library to ingest geometry data, where it is efficient to iterate over the coords/offset arrays.

And I laid out a couple more arguments in #3 (comment)

I've been reading the WKB spec (surprisingly hard to find good documentation, but this is helpful). And I feel that the main issue I have with WKB is:

  1. Coordinates are not contiguous.
  2. It's impossible to know array offsets without scanning the entire binary object.
    For example with Polygons, it's helpful to have metadata at the beginning describing the total number of rings, but it doesn't directly tell you the byte offsets where those rings start. In order to find the start byte of the nth ring, you must read the lengths of each of the first n-1 rings and sum them up.
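To illustrate the scanning cost, a hedged sketch (assuming a little-endian, 2D WKB Polygon; the function name is invented) of locating the byte offset of the n-th ring's coordinates:

```python
import struct

def nth_ring_byte_offset(wkb, n):
    # WKB Polygon: byte order (1 byte), geometry type (4), ring count (4),
    # then per ring: point count (4) + 16 bytes per 2D point. WKB stores
    # counts, not offsets, so we must walk all earlier rings.
    assert wkb[0] == 1                            # little-endian
    geom_type, num_rings = struct.unpack_from("<II", wkb, 1)
    assert geom_type == 3 and n < num_rings       # 3 == Polygon
    offset = 9
    for _ in range(n):                            # scan the first n rings
        num_points = struct.unpack_from("<I", wkb, offset)[0]
        offset += 4 + num_points * 16
    return offset + 4                             # skip this ring's point count
```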

For an in-memory format such as Arrow, I think it's critically important for the geometry format to be fully usable without any further parsing.

With Arrow-native representations such as have been discussed in this issue, you can get all geometries for the entire table without a copy.

With something like a Union type (#4 (comment)), you can get all geometries of a single geometry type without a copy.

@achapkowski

My concern is that other API, products, etc... will have to write a special case to handle the geometry.

If you are going to re-write the geometry spec, why not get rid of polygon vs multi-polygon, etc.. and just simplify everything to Point, Line, Polygon and Multipoint?

@kylebarron
Member

My concern is that other API, products, etc... will have to write a special case to handle the geometry.

I think that's a valid concern, and this comes back to performance vs backwards compatibility.

Of course existing applications will have their own geometry needs, but to my knowledge few other than PostGIS work with WKB as their internal memory format, so whether geometries are stored in WKB or Arrow-native arrays, there's a conversion necessary. I can't find it right now, but @jorisvandenbossche made a prototype where loading geometries into a GeoDataFrame (and its GEOS-based memory format) from Arrow-native arrays was faster (significantly I believe) than from WKB.

The point being that if we choose the Arrow-native format, we can advise in writing performant adapters to the necessary memory format, which will be in most cases faster than WKB parsing.

Already in several libraries with a focus on high performance (datashader, cuSpatial, deck.gl) there are similar but slightly different Arrow-native geometry implementations. My goal is to prevent fracturing of the ecosystem for upcoming libraries.

Coming from the deck.gl perspective, it's especially attractive to have such a format as Arrow where points and lines can be uploaded to the GPU without any processing, and polygons need only tessellation, which would be faster on Arrow arrays than on WKB.

If you are going to re-write the geometry spec, why not get rid of polygon vs multi-polygon, etc.. and just simplify everything to Point, Line, Polygon and Multipoint?

That's essentially one of the things being proposed. In #4 (comment), there are Point, LineString, and Polygon geometry types, and Multi* are unnecessary because it can be inferred from the metadata.

@srenoes

srenoes commented Aug 6, 2020

The main problem for Arrow is that all values in the array need to have the same type. So that would mean: the same number of nesting levels for the ListArray, or the same keys for the struct in your example.
So with both nested lists and structs, that poses a problem. Now, we could take the "union" of nesting levels / struct keys, so that each geometry type can be stored in it, possibly not using certain nesting levels / keys. But that's a trade-off with memory usage.

Really? Then it is probably good to look at awkward array or its implementation, because that lib turns any nested dictionary into such a structure, as I understood. It works separately or in connection with numpy, or as a column in pandas. Also, it has some pretty extensive support for indexing. And something they advertise is "native numba support".
Of course it's another module, so I don't know if it would make sense to include it as another dependency. I thought earlier that it might be a way to increase the speed of iterfeatures in geopandas, though as it returns some iterator or generator I don't know if that would work out.
But maybe you can get some ideas from this? https://www.youtube.com/watch?v=WlnUF3LRBj4

@jorisvandenbossche
Contributor Author

@srenoes AFAIK the data types that awkward array supports are very similar to the ones supported by Arrow. I think the physical representation of nested lists or struct arrays is basically the same in both projects (and nested lists or struct arrays can be converted between both zero-copy). For example, awkward-array's RecordArray is similar to Arrow's StructArray and also requires the same keys in each record.

As mentioned in the discussion above (and awkward-array has the same), there is also a Union type to have different types for each record (but that is e.g. not supported in Parquet).

(and awkward array is certainly interesting for operations on such data structures / interfacing numba with such data, but the discussion for now is about the actual representation in memory)

@jnh5y

jnh5y commented Aug 10, 2020

Hi all, I work on LocationTech GeoMesa, and we've got an existing implementation of (Multi){Point,LineString,Polygon}s using the nested array approach in Scala for Apache Arrow, Parquet, and Orc.

I'm not sure I see the benefit of managing the ring (and other) indices in a (Multi)Polygon separately from the List<List<...>> support given by the various columnar data types.

As a separate note, when we needed to implement mixed 'Geometry' columns, at that point I fell back to using WKB. Using any of the WKB variants short-circuits most of the columnar compression that one may be able to get otherwise.

@pramsey

pramsey commented Aug 10, 2020

Not a joke: much of what has been said above is reflected in the hoary old shapefile specification. https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf

@jnh5y

jnh5y commented Aug 10, 2020

@pramsey Nice! It is a good reminder that all of this has happened before. Thanks for the link; I need to remember to read it later.

Do you think it'd make sense to push this kind of Arrow standard (and maybe other formats as well) into something like JTS/GEOS?

@pramsey

pramsey commented Aug 10, 2020

Depends a great deal on the dependency load... I can't see GEOS picking it up if it requires adding in some dependency. I can easily see a PgSQL FDW that reads from this into PostGIS. Note that spec'ing the geometry file is the easy part. You also need to spec the index file. I've long said that there's already a "cloud optimized vector" format, and that's a sorted shape file (use the shpsort utility from mapserver). All the pieces (SHP, SHX, DBF, QIX) have known offsets so you can random access things, the sorting means you can access blocks of spatially contiguous shapes, etc.

@jnh5y

jnh5y commented Aug 10, 2020

Good points. I suppose the reason I ask about pushing something like this into JTS/GEOS is to help get broad adoption.

For shapefiles, while I don't know the spec well, I get the sense that there's a 2 GB limit for some of the pieces, in addition to other issues with the format. I wonder what a 'big data' shapefile-esque format would / should look like...

@pramsey

pramsey commented Aug 10, 2020

Yes, there are some signed ints in there that limit offsets to 2GB. Would this be a "big data"-esque format? https://bjornharrtell.github.io/flatgeobuf/

@jnh5y

jnh5y commented Aug 10, 2020

Yes, I'll admit that I hadn't paid attention to flatgeobuf enough. That's heading in a good direction.

I suppose we have two ways to get to a 'big geo data' format. Bjorn's approach is to create/scale up the good ideas we have as geospatial experts. The other direction would be to geo-enable existing big-data formats.

As much as I'd like to see one format/approach, both already exist and will likely evolve. Being able to go between flatgeobufs and well-written Arrow/Parquet files may be the 'best of both worlds' we should hope to live in.

@kylebarron
Member

kylebarron commented Aug 10, 2020

Thanks both for your comments!

I'm not sure I see the benefit of managing the ring (and other) indices in a (Multi)Polygon separately from the List<List<...>> support given by the various columnar data types.

From above:

The main problem for Arrow is that all values in the array need to have the same type. So that would mean: the same number of nesting levels for the ListArray, or the same keys for the struct in your example.

I believe this means that you couldn't store both Polygons and MultiPolygons in the same column, where you need either 2 or 3 levels of nesting. When you manage the ring indices yourself, you don't need separate (Multi)* types. A MultiPolygon just has additional values in the indices array to designate it has multiple polygons.

Not a joke: much of what has been said above is reflected in the hoary old shapefile specification.

Great point. For example, the layout of Polygons is specified on page 12, and it defines a similar setup where indices arrays point into the main positions array.

As an aside, I originally suggested a struct of positions and indices arrays because we use that as an optional binary data format in deck.gl. Because shapefile geometries are in a similar format, we can parse them quite quickly to this binary format, avoiding ever creating many small GeoJSON objects.

Depends a great deal on the dependency load... I can't see GEOS picking it up if it requires adding in some dependency.

I think a dependency on the Arrow C++ library would be required.

I've long said that there's already a "cloud optimized vector" format, and that's a sorted shape file (use the shpsort utility from mapserver). All the pieces (SHP, SHX, DBF, QIX) have known offsets so you can random access things, the sorting means you can access blocks of spatially contiguous shapes, etc.

Having recently delved into the spec to make a JS shapefile parser, I was pleasantly surprised how amenable shapefiles are to random access streaming. I'd love to see a sort of "shapefile update" which replaces the .shx and .dbf with a binary columnar format like Parquet, which would make finding geometries based on attributes much faster than an ASCII row-based format like dbf.

Would this be a "big data"-esque format? bjornharrtell.github.io/flatgeobuf

I've been working with Flatgeobuf a lot recently, and it certainly has its strengths. It also overlaps a bit with our discussions here. The schema of each Flatgeobuf geometry is defined as follows:

table Geometry {
  ends: [uint];          // Array of end index in flat coordinates per geometry part
  xy: [double];          // Flat x and y coordinate array (flat pairs)
  z: [double];           // Flat z height array
  m: [double];           // Flat m measurement array
  t: [double];           // Flat t geodetic decimal year time array
  tm: [ulong];           // Flat tm time nanosecond measurement array
  type: GeometryType;    // Type of geometry (only relevant for elements in heterogeneous collection types)
  parts: [Geometry];     // Array of parts (for heterogeneous collection types)
}

It's similar to what I described earlier in this issue with a few differences:

  • I suggested interleaved positions with a separate stride argument. This means you can get xyz positions as a flat array without a copy, but the user needs to know how to interpret each dimension. When Flatgeobuf stores arrays as xy, z, m, and t separately, there's no confusion what each dimension beyond two represents.
  • You can't get all positions from a MultiPolygon as a single array without a copy; each inner Polygon is stored within parts, so only positions of each inner part can be read with zero copy.
  • Flatgeobuf is a row-oriented format, so it isn't optimized for filtering based on attributes. To do that, you'd need to parse every flatbuffer individually and consider the value of the given attribute. With Arrow, since it's a columnar format, you can have a very wide table with hundreds of attributes and one geometry column; then at runtime choose the one attribute column for filtering and only read the bytes of that column plus the geometry column.
  • Flatgeobuf is explicitly designed for random access. If you know the byte range of the feature you're interested in, you can easily read and parse only those bytes from object storage. I think with a columnar layout like Feather/Parquet, reading a single feature isn't particularly fast because you need to read the entire column, then select the row you want.
  • Since Arrow is a columnar format, you can get all geometries of the entire table as contiguous memory without a copy.
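The interleaved-with-stride trade-off mentioned in the first point above can be sketched in plain Python (the values here are made up for illustration):

```python
# Interleaved layout: one flat buffer; a stride tells you how to slice each
# dimension out, but the caller must know what each dimension means.
interleaved = [1.0, 2.0, 100.0,   # x, y, z of point 0
               3.0, 4.0, 200.0]   # x, y, z of point 1
stride = 3
zs = interleaved[2::stride]  # a copy in plain Python; a strided view in e.g. NumPy

# Flatgeobuf-style separated layout: no stride needed, and there is no
# ambiguity about what each extra dimension represents.
xy = [1.0, 2.0, 3.0, 4.0]
z = [100.0, 200.0]

print(zs == z)
```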

@jorisvandenbossche
Contributor Author

Hi all, I work on LocationTech GeoMesa, and we've got an existing implementation of (Multi){Point,LineString,Polygon}s using the nested array approach in Scala for Apache Arrow, Parquet, and Orc.

@jnh5y Thanks a lot for chiming in! When I saw GeoMesa mentioned in @kylebarron's tweet thread, I was wondering how GeoMesa stores/serializes the geometries. Is there some code comments / documentation about this?

I personally also think that the approach of using List<List<...>> should be sufficient for the data itself (if there are some global metadata about the geometry type stored in the array).
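As a minimal illustration of the List<List<...>> idea, here is a plain-Python sketch (standing in for Arrow buffers; names are illustrative) that flattens the GeoJSON-style nested coordinates from the top of this issue into one child array plus an offsets array per nesting level, which is exactly the information an Arrow variable-size list array stores:

```python
def flatten_level(nested):
    """Flatten one nesting level, returning (flat_items, offsets).

    This mirrors how an Arrow variable-size list array stores a list of
    lists: a child values array plus an offsets array with
    len(nested) + 1 entries.
    """
    offsets = [0]
    flat = []
    for item in nested:
        flat.extend(item)
        offsets.append(len(flat))
    return flat, offsets

# The MultiPolygon coordinates from the GeoJSON example above, as a
# length-1 geometry column.
multipolygon = [
    [[[40, 40], [20, 45], [45, 30], [40, 40]]],
    [[[20, 35], [10, 30], [10, 10], [30, 5], [45, 20], [20, 35]],
     [[30, 20], [20, 15], [20, 25], [30, 20]]],
]
geoms = [multipolygon]

parts, geom_offsets = flatten_level(geoms)   # geometries -> parts
rings, part_offsets = flatten_level(parts)   # parts -> rings
coords, ring_offsets = flatten_level(rings)  # rings -> coordinates

print(geom_offsets, part_offsets, ring_offsets, len(coords))
```

Each level is one contiguous buffer, so the coordinates of the whole column end up in a single flat array regardless of how many geometries it holds.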

@trxcllnt

Arrow does have a Union type (SparseUnion and DenseUnion), which we could use to represent columns of mixed geometries without resorting to packed struct-like geometry representations.
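For illustration, a dense-union-style layout can be sketched in plain Python (the type codes, child contents, and names here are made up; see the Arrow columnar format for the real buffer definitions):

```python
# One child array per geometry type present in the column; WKT strings
# stand in for real child geometry arrays.
points = ["POINT (0 0)", "POINT (1 1)"]
polygons = ["POLYGON ((0 0, 1 0, 0 1, 0 0))"]
children = {0: points, 1: polygons}

# Dense-union-style buffers: a type id per row, plus an offset into the
# child array selected by that type id.
type_ids = [0, 1, 0]   # row 0: point, row 1: polygon, row 2: point
offsets = [0, 0, 1]

def get(row):
    """Resolve one row of the mixed column to its geometry."""
    return children[type_ids[row]][offsets[row]]

print(get(0), get(1), get(2))
```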

@TomAugspurger

An earlier comment linked to https://ursalabs.org/blog/2020-feather-v2/, which mentions that while Arrow supports Unions, parquet might not:

Parquet has a smaller type system and while most of Arrow can be stored in Parquet, there are some things that cannot (like unions)

I don't know if that's still accurate.

@thomcom

thomcom commented Jul 23, 2021

Hey all, just wanted to share that I've merged a from_geopandas feature into RAPIDS cuSpatial that gives pretty solid compatibility with any GeoPandas dataframe, copying all of the geometry data to GPU and indexing it as referenced in this conversation, using offset arrays and such.

@thomcom

thomcom commented Jul 23, 2021

rapidsai/cuspatial#300

@jorisvandenbossche
Contributor Author

A note to all: with much delay, I started writing down some of the things discussed above about the format in a specification: #12. Feedback is very much appreciated!

(It certainly does not yet try to write down everything, such as mixed geometry / collection support through Unions, but starts with the core features.)


Hey all, just wanted to share that I've merged a from_geopandas feature into RAPIDS cuSpatial that gives pretty solid compatibility with any GeoPandas dataframe, copying all of the geometry data to GPU and indexing it as referenced in this conversation, using offset arrays and such.

Thanks @thomcom for the update, that's cool to see! I have a few questions about how you handle, for example, the case of mixed Polygons / MultiPolygons. I explored a bit the conversion code of cuspatial, the buffers it creates, and how this compares to the Arrow buffers (or at least how I now wrote it down in the above-mentioned PR). See https://notebooksharing.space/view/517f3172b12354804179f248247ab5ffd6573214e9f9810d13494533f1aefd8a#displayOptions= for a notebook exploring this (it should also allow inline comments / annotations through hypothes.is).

I think both are mostly compatible, but I mainly want to ensure that if GeoPandas would be able to emit such buffers itself (and create those in a more efficient way), cuspatial can use them.

@thomcom

thomcom commented May 12, 2022

Hey @jorisvandenbossche ! It is exciting to see this specification developing and gathering more interest. Sorry I missed this message from Nov, I'm happy to get together with you soon and talk about how we've implemented GeoArrow at RAPIDS and make sure we're still in alignment. We are in fact having a discussion with @harrism and @isVoid about a possible simplification of the nested LineString/MultiLineString structure. It'd be good to put heads together about it.

@thomcom

thomcom commented May 12, 2022

Following up on the notebook you shared above: that is very interesting! I think it would be straightforward to ensure that the MultiPolygons structure lines up with the geoarrow format. In fact, @harrism, the notebook linked above https://notebooksharing.space/view/517f3172b12354804179f248247ab5ffd6573214e9f9810d13494533f1aefd8a#displayOptions= shows that GeoArrow is doing exactly what you suggested: tracking Polygons as length-1 MultiPolygons.

@thomcom

thomcom commented Jun 28, 2022

Hey @jorisvandenbossche I'm curious where you're at with a geopandas implementation? I'm circling back around to fix my GeoArrow implementation bug and curious about what optimizations may now exist that I can plug into instead of doing it all dumbly and iteratively.

@thomcom

thomcom commented Jun 29, 2022

Actually, I'm looking now at dask-geopandas and I see that you pack the geometries into a buffer in WKB format. I'm going to try to send you a PR to geopandas soon with the syntax to convert all the geometries into one or more pa.Array objects.

@sendreams

@jnh5y I notice that GeoMesa supports encoding JTS geometries to geoarray; is there any way to decode the geoarray string back to a JTS geometry? Thanks

@jnh5y

jnh5y commented Sep 1, 2023

@jnh5y I notice that GeoMesa supports encoding JTS geometries to geoarray; is there any way to decode the geoarray string back to a JTS geometry? Thanks

Yes, for the formats that GeoMesa supports, it can read and write those formats.

GeoMesa's Arrow support was created before this repo, so its way of using Arrow may be slightly different than what is described here. @elahrvivaz may be able to answer any specific questions.

@sendreams

sendreams commented Sep 2, 2023

Yes, for the formats that GeoMesa supports, it can read and write those formats.

How does GeoMesa read the geoarray string?
I have read the testing code; it looks something like this:

try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
     PolygonFloatVector floats = new PolygonFloatVector("polys", allocator, null);
     PolygonVector doubles = new PolygonVector("polys", allocator, null)) {

    Field floatField = floats.getVector().getField();
    Field doubleField = doubles.getVector().getField();

    floats.set(0, (Polygon) wktReader.read(p0));
    floats.set(1, (Polygon) wktReader.read(p1));
}

This creates a PolygonVector instance, sets a geometry in each slot, and uses the underlying vector's toString() method to output the geoarray string:

floats.getVector().toString()

But how can I read that output geoarray string back? I can't find any code example. @jnh5y thanks

@jnh5y

jnh5y commented Sep 2, 2023

Ah! I think you want the get method(s) which are implemented on the AbstractVector classes. E.g. https://github.com/locationtech/geomesa/blob/main/geomesa-arrow/geomesa-arrow-jts/src/main/java/org/locationtech/geomesa/arrow/jts/impl/AbstractPolygonVector.java#L59-L86

The test uses those methods here to get the Polygon instances. The Java class is written to a string to compare it to the WKT representation above.

Let me know if that's not quite what you are looking for.

We do not have a good tutorial on how to use those classes directly. Those classes are used as library code in the geomesa-arrow-gt module which implements the GeoTools DataStore API using Arrow files to store SimpleFeatures.

@sendreams

@jnh5y thanks for your reply
I finally wrote some Java code to deserialize these geoarray strings:

package com.ware4u.test.geoarray;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.locationtech.jts.geom.*;

import java.util.ArrayList;
import java.util.List;

public class PolygonDeserializer {

    private static final GeometryFactory geometryFactory = new GeometryFactory();

    public static Geometry deserialize(String value) throws JsonProcessingException {
        ObjectMapper objectMapper = new ObjectMapper();
        int idx = value.indexOf("[[[[");
        if (idx >= 0) {  // four levels of nesting: MultiPolygon
            List<List<List<List<Double>>>> c2 = objectMapper.readValue(value,
                    objectMapper.getTypeFactory().constructCollectionType(List.class,
                            objectMapper.getTypeFactory().constructCollectionType(List.class,
                                    objectMapper.getTypeFactory().constructCollectionType(List.class,
                                            objectMapper.getTypeFactory().constructCollectionType(List.class, Double.class)))));
            List<Polygon> polygons = new ArrayList<>();
            for (List<List<List<Double>>> shell : c2) {
                polygons.add(createPolygon(shell));
            }
            return geometryFactory.createMultiPolygon(polygons.toArray(new Polygon[0]));
        } else {  // three levels of nesting: Polygon
            List<List<List<Double>>> c2 = objectMapper.readValue(value,
                    objectMapper.getTypeFactory().constructCollectionType(List.class,
                            objectMapper.getTypeFactory().constructCollectionType(List.class,
                                    objectMapper.getTypeFactory().constructCollectionType(List.class, Double.class))));
            return createPolygon(c2);
        }
    }

    private static Polygon createPolygon(List<List<List<Double>>> c2) {
        List<LinearRing> rings = new ArrayList<>();
        for (List<List<Double>> shell : c2) {
            rings.add(convert(shell));
        }
        // The first ring is the exterior shell; any remaining rings are holes.
        if (rings.size() == 1) {
            return geometryFactory.createPolygon(rings.get(0));
        }
        LinearRing[] holes = rings.subList(1, rings.size()).toArray(new LinearRing[0]);
        return geometryFactory.createPolygon(rings.get(0), holes);
    }

    private static LinearRing convert(List<List<Double>> shell) {
        List<Coordinate> coordinates = new ArrayList<>();
        for (List<Double> coord : shell) {
            coordinates.add(new Coordinate(coord.get(0), coord.get(1)));
        }
        return geometryFactory.createLinearRing(coordinates.toArray(new Coordinate[0]));
    }
}

This code has not been fully tested; usage looks like:

public void test3LevelDeserialize() throws JsonProcessingException {
    String value = "[[[121.4315650332303, 30.96909430594966], .....]]]";
    Geometry geo = PolygonDeserializer.deserialize(value);
    System.out.println(geo);
}

@jnh5y

jnh5y commented Sep 2, 2023

Hmm... by geoarray, do you mean GeoJSON? If you have GeoJSON but want GeoArrow, you could go from GeoJSON to JTS geometries and then from JTS to Arrow.

@trxcllnt

trxcllnt commented Sep 2, 2023

If you have GeoJSON, here are two examples implemented in JavaScript. They differ slightly in how they represent points in the schema. The first one uses a Struct, and the second uses a 2-element list:

  1. ArrowJS Nested List with Points Struct Layout
  2. geospatial.js (uses Uber's san_francisco_censustracts.geojson)
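The difference between those two point layouts can be sketched in plain Python (values and names are illustrative): a struct keeps one child array per coordinate field, while a 2-element (fixed-size) list keeps a single interleaved buffer:

```python
# Struct layout: one child array per field; point i is (xs[i], ys[i]).
xs = [40.0, 20.0, 45.0]
ys = [40.0, 45.0, 30.0]

def point_struct(i):
    return (xs[i], ys[i])

# Fixed-size-list layout: one interleaved buffer; point i is a slice of
# length 2, so no offsets array is needed.
interleaved = [40.0, 40.0, 20.0, 45.0, 45.0, 30.0]

def point_fixed_list(i):
    return tuple(interleaved[2 * i : 2 * i + 2])

print(all(point_struct(i) == point_fixed_list(i) for i in range(3)))
```

Both encode the same points; the struct keeps each dimension contiguous, while the fixed-size list keeps each point contiguous.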

@sendreams

public void test3LevelDeserialize() throws JsonProcessingException {
    String value = "[[[121.4315650332303, 30.96909430594966], [121.4315608215966, 30.96909332166395], [121.43155626528808, 30.969107261850347], [121.43153712689295, 30.96916130142165], [121.43153606545346, 30.969164296505063], [121.43153036322325, 30.969180398348694], [121.43152827801022, 30.96918642911354], [121.43152825600563, 30.969186492243967], [121.43152568873826, 30.969194038151873], [121.4315210969928, 30.969207543587125], [121.43150456063141, 30.96925615232736], [121.43150235592134, 30.969262634033765], [121.43149938731783, 30.969271357777014], [121.43149739637184, 30.96927721088314], [121.43149258456954, 30.969291354837065], [121.43149117728433, 30.96929549259525], [121.43148974275506, 30.969299708815562], [121.43148644302342, 30.969309405671122], [121.43147719664199, 30.969336587891288], [121.43147702583882, 30.969337092034177], [121.43147550538387, 30.969341558972673], [121.4314736621858, 30.96934697377145], [121.43146777421444, 30.9693642850755], [121.43146474273017, 30.969373200013866], [121.43146022641713, 30.969386475470156], [121.43145237683224, 30.96940954698603], [121.43149466677144, 30.969417340689304], [121.43155188615341, 30.969427885833102], [121.43156753058653, 30.969379617779538], [121.43157001206689, 30.969371957299735], [121.43157604592399, 30.96936200955099], [121.43157710427175, 30.96935888638958], [121.4315774856974, 30.969357757253217], [121.43158040401232, 30.96934914533622], [121.43158474952477, 30.969336318999726], [121.43158791618434, 30.96932697115971], [121.43159048241438, 30.96931939638803], [121.43159273009024, 30.96931276677221], [121.43159516324012, 30.969305585214002], [121.43159527640942, 30.969305252425595], [121.4316023893372, 30.96928425608491], [121.43161019071019, 30.969261228741743], [121.4316180769547, 30.96923795609094], [121.43161841227256, 30.9692369649401], [121.4316216407472, 30.969227441235862], [121.4316238947091, 30.96922078456264], [121.43162685493094, 30.96921204728488], [121.43163794554084, 30.96917931405804], [121.43163985160774, 30.969173688217737], [121.43164653593782, 30.96915396080006], [121.43165195549196, 30.96913796891147], [121.43165453219142, 30.96913036076902], [121.4316591523807, 30.969116301539348], [121.43165834438533, 30.969116106496163], [121.43164561109552, 30.969113131072], [121.43159221836538, 30.96910065775146], [121.4315650332303, 30.96909430594966]]]";

    org.locationtech.jts.geom.Geometry geo = PolygonDeserializer.deserialize(value);
    System.out.println(geo);
}

My point is to deserialize the Arrow string value into a JTS geometry. @jnh5y

@sendreams

If you have GeoJSON, here's two examples implemented in JavaScript. They differ slightly in how they represent points in the schema. The first one uses a Struct, and the second uses a 2-element list:

  1. ArrowJS Nested List with Points Struct Layout
  2. geospatial.js (uses Uber's san_francisco_censustracts.geojson)

Thanks for the info.

@paleolimbot
Contributor

Closing this since we now have the nested list encoding well documented in format.md! Feel free to open new issues for any unresolved parts of this discussion that I missed!
