A "nested list array" representation for geometries #4
This is interesting; thanks for writing this up. My comments here are generally in comparison with the geometry representation thoughts I wrote here.

Nested vs flat lists

The biggest contrast between my linked thoughts and your description is that I proposed a struct of flat lists and you propose nested lists. To make the difference concrete, let's take the above MultiPolygon example. With nested lists, that might look like

[
[
[[40, 40], [20, 45], [45, 30], [40, 40]]
],
[
[[20, 35], [10, 30], [10, 10], [30, 5], [45, 20], [20, 35]],
[[30, 20], [20, 15], [20, 25], [30, 20]]
]
]

With a struct of flat lists as I laid out, that would look like

{
type: 'Polygon',
positions: [40, 40, 20, 45, 45, 30, 40, 40, 20, 35, 10, 30, 10, 10, 30, 5, 45, 20, 20, 35, 30, 20, 20, 15, 20, 25, 30, 20],
size: 2,
polygonIndices: [0, 4, 14],
ringIndices: [0, 4, 10, 14]
}

Essentially, while nested lists might internally store positions in the same format as this flat list, storing purely as a flat list with metadata allows lower-level access. I think most disadvantages of the nested list format could be solved by wrapping it in a struct.

Multi-geometries for free

One thing I like about the flat lists is that multi-geometries are the same layout as their single-part geometry counterparts. With nested lists, multi-part geometries always have an extra nested list level. In contrast, with flat lists, the number of polygon geometries is defined exclusively by the polygonIndices array.

2D vs N-D
I would prefer that representations support N-D coordinates, so that the same memory format could support 1, 3, or 4 dimensions without any changes.

Flat array
I can't tell if this is true? If coordinates are stored as nested arrays as above, is it possible to extract the core flat array representation without processing?

Heterogeneous types
Even sticking with nested arrays, it wouldn't add much space to wrap it in a struct... Something like

{
// Or, to save space, type: 3
// https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
type: 'Polygon',
geometry: [[[...]]]
}

GeometryCollections
Yes, this is true. I'm not sure how important it would be to support them. I've never actually seen them in the wild.
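To make the lower-level access point above concrete, here is a minimal Python sketch (mine, not from this discussion) of slicing one ring directly out of the flat arrays, assuming the positions/ringIndices layout shown earlier:

positions = [40, 40, 20, 45, 45, 30, 40, 40,
             20, 35, 10, 30, 10, 10, 30, 5, 45, 20, 20, 35,
             30, 20, 20, 15, 20, 25, 30, 20]
size = 2                       # values per coordinate (x, y)
ring_indices = [0, 4, 10, 14]  # coordinate index where each ring starts

def ring(k):
    # Slice the interleaved x/y values of ring k without touching any nesting.
    start, stop = ring_indices[k] * size, ring_indices[k + 1] * size
    return positions[start:stop]

print(ring(1))  # exterior ring of the second polygon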
OK, that's interesting! Your flat lists could basically be seen as the different arrays that make up the internal representation of a ListArray, but put in a Struct? (Although the exact interpretation of the indices is different: in your example they all point into the coordinates, while for the ListArray, each level of offsets points into the next level rather than directly into the coordinates.)
That's something you could do with the ListArray as well, by representing a Polygon in the same way as the MultiPolygon above, but with only one sub-polygon. That would mean that you can't distinguish a Polygon from a MultiPolygon with 1 part, but that's maybe not a problem (it's in any case a way to store a mixed Polygons/MultiPolygons array). So I think(?) this is equivalent between the ListArray and your proposal.

2D vs N-D
Yes, my example used x, y (2D), but this can extend to any number of dimensions. We can store in metadata what the dimension is, but you could also infer it from the length of the fixed size list (in the case of points).

Flat array
Yes, that's how a ListArray is stored in memory in the Arrow spec (see the "values" in my memory layout example for the MultiPolygon).

Heterogeneous types
The main problem for Arrow is that all values in the array need to have the same type. So that would mean: the same number of nesting levels for the ListArray, or the same keys for the struct in your example.
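As a small illustration of that "values" point, a sketch assuming pyarrow's ListArray accessors: the flat coordinates fall out of the nested array essentially without copying.

import pyarrow as pa

# The MultiPolygon example as list<list<list<double>>>: one geometry,
# two polygons, three rings of interleaved x/y values.
geom = pa.array([[
    [[40.0, 40.0, 20.0, 45.0, 45.0, 30.0, 40.0, 40.0]],
    [[20.0, 35.0, 10.0, 30.0, 10.0, 10.0, 30.0, 5.0, 45.0, 20.0, 20.0, 35.0],
     [30.0, 20.0, 20.0, 15.0, 20.0, 25.0, 30.0, 20.0]],
]])

flat = geom.flatten().flatten().flatten()  # down to the contiguous values
print(flat.to_pylist()[:8])  # [40.0, 40.0, 20.0, 45.0, 45.0, 30.0, 40.0, 40.0]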
I haven't responded up to now because I don't think I understand the Arrow spec quite well enough. You might be right that we're essentially saying the same thing... I'll try to read up a bit more and get back to you.
Some additional points that might help in clarifying: …
I made a notebook exploring in a bit more detail the different memory representations (nested list vs struct) with a small example: https://nbviewer.jupyter.org/gist/jorisvandenbossche/dc4e98cf5c9fdbb64769716d046d0edf

@kylebarron thanks for that write-up!
Indeed, see the notebook I linked to above where I tried to make those differences in offsets a bit more clear / visual.
Hmm, yes, that might be a reason not to use Unions, at least not directly (otherwise it would limit mixed geometries to the Feather format and Arrow IPC). Now, I think the "fallback" of mimicking the Union with a plain StructArray, as you mention, might actually still be possible. In general, you then indeed have to deal with nulls. But since the coordinates array is actually a ListArray, the nulls are not necessarily stored in the physical flat array, but only through the offset arrays. This would actually work for both the ListArray and StructArray layouts, since the coordinates are stored as a single list in the StructArray geometry layout as well.
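A tiny sketch of that point, assuming pyarrow behavior: a null entry in a ListArray shows up only in the offsets and validity bitmap, never in the flat coordinate buffer.

import pyarrow as pa

# One geometry, a null, another geometry: the null costs nothing in values.
arr = pa.array([[1.0, 2.0], None, [3.0, 4.0]], type=pa.list_(pa.float64()))
print(arr.offsets.to_pylist())  # [0, 2, 2, 4] -- repeated offset marks the null
print(len(arr.values))          # 4: no coordinates stored for the null entry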
cc'ing @trxcllnt, who has contributed to the C++ and JS Arrow bindings and is working on https://github.com/rapidsai/cuspatial, geospatial support for a CUDA GPU dataframe using Arrow, with Python bindings. Comments on how they're currently laying out polygons are here and here.
@kylebarron thanks for making this link! That's interesting ;) @trxcllnt, if I am reading the comments correctly: the "SoA" mentioned (Structure of Arrays) is the same idea that Arrow's StructArray is using (and in the end a ListArray is similar as well). The main difference seems to be that in cuspatial the x and y coordinates are also stored in separate arrays, while the idea here was to keep those interleaved and only store the indices in separate arrays. I am also not fully clear on the exact interpretation of the feature and ring indices in cuspatial based on those links.
But then the cumulative number of vertices in each ring doesn't take into account that the ring needs to be closed? (The ring indices indicate that each ring is of length 4, but in the coordinates array, each ring has 5 elements.) Wouldn't this mean you can't directly use the ring offsets to index into the coordinates array?
@jorisvandenbossche cuSpatial is undergoing a massive overhaul at the moment, so the behavior described by the comment you referenced may change soon :-). But yes, we've taken the Struct of Arrays approach in many places because global memory reads are often more optimized on the GPU. We were discussing using a segmented prefix sum layout (Arrow's List) for some of the trajectory APIs, but the List column type is only just now landing in cuDF. I believe all the feature/ring/geometry terminology here is inherited from GDAL. You can check out their docs here, or check out the …
Why not use WKB?
From the top post: …
And I laid out a couple more arguments in #3 (comment)
With Arrow-native representations such as those discussed in this issue, you can get all geometries for the entire table without a copy. With something like a Union type (#4 (comment)), you can get all geometries of a single geometry type without a copy.
My concern is that other APIs, products, etc. will have to write a special case to handle the geometry. If you are going to rewrite the geometry spec, why not get rid of polygon vs multi-polygon, etc., and just simplify everything to Point, Line, Polygon, and Multipoint?
I think that's a valid concern, and this comes back to performance vs backwards compatibility. Of course existing applications will have their own geometry needs, but to my knowledge few other than PostGIS work with WKB as their internal memory format, so whether geometries are stored in WKB or Arrow-native arrays, a conversion is necessary either way. I can't find it right now, but @jorisvandenbossche made a prototype where loading geometries into a GeoDataFrame (and its GEOS-based memory format) from Arrow-native arrays was faster (significantly, I believe) than from WKB. The point being that if we choose the Arrow-native format, we can advise on writing performant adapters to the necessary memory format, which will in most cases be faster than WKB parsing. Already, several libraries with a focus on high performance (datashader, cuSpatial, deck.gl) have similar but slightly different Arrow-native geometry implementations. My goal is to prevent fracturing of the ecosystem for upcoming libraries. Coming from the deck.gl perspective, it's especially attractive to have a format like Arrow where points and lines can be uploaded to the GPU without any processing, and polygons need only tessellation, which would be faster on Arrow arrays than on WKB.
That's essentially one of the things being proposed. In #4 (comment), there are …
Really? Then it is probably good to look at awkward arrays or its implementation... because that library turns any nested dictionary into such a structure, as I understand it. It works separately or in connection with numpy, or as a column in pandas. It also has some pretty extensive support for indexing. And something they advertise is "native numba support".
@srenoes AFAIK the data types that awkward-array supports are very similar to the ones supported by Arrow. I think the physical representation for nested lists or struct arrays is basically the same in both projects (and nested lists or struct arrays can be converted between both zero-copy). For example, awkward-array's RecordArray is similar to Arrow's StructArray and also requires the same keys in each record. As mentioned in the discussion above (and awkward-array has the same), there is also a Union type to have different types for each record (but that is e.g. not supported in Parquet). (And awkward-array is certainly interesting for operations on such data structures / interfacing numba with such data, but the discussion for now is about the actual representation in memory.)
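For reference, a hedged sketch of that zero-copy conversion, assuming awkward-array's from_arrow/to_arrow helpers:

import awkward as ak
import pyarrow as pa

nested = pa.array([[[40.0, 40.0], [20.0, 45.0]]])  # list<list<double>>
ak_arr = ak.from_arrow(nested)   # view the Arrow buffers as an awkward array
back = ak.to_arrow(ak_arr)       # and convert back to an Arrow array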
Hi all, I work on LocationTech GeoMesa, and we've got an existing implementation of (Multi){Point,LineString,Polygon}s using the nested array approach in Scala for Apache Arrow, Parquet, and Orc. I'm not sure I see the benefit of managing the ring (and other) indices in a (Multi)Polygon separately from the List<List<...>> support given by the various columnar data types. As a separate note, when we needed to implement mixed 'Geometry' columns, at that point I did fall back to using WKB. Using any of the WKB variants short-circuits most of the columnar compression that one may be able to get otherwise.
Not a joke: much of what has been said above is reflected in the hoary old shapefile specification. https://www.esri.com/library/whitepapers/pdfs/shapefile.pdf
@pramsey Nice! It is a good reminder that all of this has happened before. Thanks for the link; I need to remember to read it later. Do you think it'd make sense to push this kind of Arrow (and maybe other formats as well) standard into something like JTS/GEOS?
Depends a great deal on the dependency load... I can't see GEOS picking it up if it requires adding in some dependency. I can easily see a PgSQL FDW that reads from this into PostGIS. Note that spec'ing the geometry file is the easy part. You also need to spec the index file. I've long said that there's already a "cloud optimized vector" format, and that's a sorted shapefile (use the shpsort utility from MapServer). All the pieces (SHP, SHX, DBF, QIX) have known offsets, so you can random access things, the sorting means you can access blocks of spatially contiguous shapes, etc.
Good points. I suppose I ask about pushing something like this to JTS/GEOS to help get broad adoption. For shapefiles, while I don't know the spec well, I get the sense that there's a 2 GB limit for some of the pieces, in addition to other issues with the format. I wonder what a 'big data' shapefile-esque format would / should look like...
Yes, there are some signed ints in there that limit offsets to 2GB. Would this be a "big data"-esque format? https://bjornharrtell.github.io/flatgeobuf/
Yes, I'll admit that I hadn't paid enough attention to flatgeobuf. That's heading in a good direction. I suppose we have two ways to get to a 'big geo data' format. Bjorn's approach is to create/scale up the good ideas we have as geospatial experts. The other direction would be to geo-enable existing big-data formats. As much as I'd like to see one format/approach, both already exist and will likely evolve. Being able to go between flatgeobufs and well-written Arrow/Parquet files may be the 'best of both worlds' we should hope to live in.
Thanks both for your comments!
From above:
I believe this means that you couldn't store both Polygons and MultiPolygons in the same column, where you need either 2 or 3 levels of nesting. When you manage the ring indices yourself, you don't need separate …
Great point. For example, the layout of Polygons is specified on page 12, and it defines a similar setup where indices arrays point into the main positions array. As an aside, I originally suggested a struct of positions and indices arrays because we use that as an optional binary data format in deck.gl. Because shapefile geometries are in a similar format, we can parse them quite quickly into this binary format, avoiding ever creating many small GeoJSON objects.
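For illustration, a rough Python sketch of that record layout (field offsets paraphrased from the spec; treat this as an assumption and verify against page 12 before relying on it):

import struct

def parse_polygon_record(buf: bytes):
    # Shapefile Polygon record contents: type, bbox, counts, part indices,
    # then one flat array of interleaved x/y points (all little-endian).
    shape_type = struct.unpack_from('<i', buf, 0)[0]         # 5 = Polygon
    bbox = struct.unpack_from('<4d', buf, 4)                 # xmin, ymin, xmax, ymax
    num_parts, num_points = struct.unpack_from('<2i', buf, 36)
    parts = struct.unpack_from('<%di' % num_parts, buf, 44)  # ring start indices
    xy = struct.unpack_from('<%dd' % (2 * num_points), buf, 44 + 4 * num_parts)
    return shape_type, bbox, list(parts), list(xy)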
I think a dependency on the Arrow C++ library would be required.
Having recently delved into the spec to make a JS shapefile parser, I was pleasantly surprised how amenable shapefiles are to random access streaming. I'd love to see a sort of "shapefile update" which replaces the …
I've been working with Flatgeobuf a lot recently, and it certainly has its strengths. It also overlaps a bit with our discussions here. The schema of each Flatgeobuf geometry is defined as follows:

table Geometry {
ends: [uint]; // Array of end index in flat coordinates per geometry part
xy: [double]; // Flat x and y coordinate array (flat pairs)
z: [double]; // Flat z height array
m: [double]; // Flat m measurement array
t: [double]; // Flat t geodetic decimal year time array
tm: [ulong]; // Flat tm time nanosecond measurement array
type: GeometryType; // Type of geometry (only relevant for elements in heterogeneous collection types)
parts: [Geometry]; // Array of parts (for heterogeneous collection types)
}

It's similar to what I described earlier in this issue, with a few differences: …
@jnh5y Thanks a lot for chiming in! When I saw GeoMesa mentioned in @kylebarron's tweet thread, I was wondering how GeoMesa stores/serializes the geometries. Are there some code comments / documentation about this? I personally also think that the approach of using List<List<...>> should be sufficient for the data itself (if there is some global metadata about the geometry type stored for the array).
Arrow does have a Union type (SparseUnion and DenseUnion), which we could use to represent columns of mixed geometries without resorting to packed struct-like geometry representations.
An earlier comment linked to https://ursalabs.org/blog/2020-feather-v2/, which mentions that while Arrow supports Unions, Parquet might not. I don't know if that's still accurate.
Hey all, just wanted to share that I've merged a …
A note to all: with much delay, I started writing down some of the things discussed above about the format in a specification: #12. Feedback is very much appreciated! (It certainly does not yet try to write down everything, such as mixed geometry / collections support through Unions, but starts with the core features.)
Thanks @thomcom for the update, that's cool to see! I have a few questions about how you handle, for example, the case of mixed Polygons / MultiPolygons. I explored a bit the conversion code of cuspatial, the buffers it creates, and how this compares to the Arrow buffers (or at least how I now wrote it down in the above-mentioned PR). See https://notebooksharing.space/view/517f3172b12354804179f248247ab5ffd6573214e9f9810d13494533f1aefd8a#displayOptions= for a notebook exploring this (it should also allow inline comments / annotations through hypothes.is). I think both are mostly compatible, but I mainly want to ensure that if GeoPandas would be able to emit such buffers itself (and create those in a more efficient way), cuspatial can use them.
Hey @jorisvandenbossche! It is exciting to see this specification developing and gathering more interest. Sorry I missed this message from Nov; I'm happy to get together with you soon and talk about how we've implemented GeoArrow at RAPIDS and make sure we're still in alignment. We are in fact having a discussion with @harrism and @isVoid about a possible simplification of the nested LineString/MultiLineString structure. It'd be good to put heads together about it.
Following up on the notebook that you shared above: that is very interesting! I think it would be straightforward to ensure that the MultiPolygons structure lines up with the geoarrow format. In fact @harrism, the notebook linked above (https://notebooksharing.space/view/517f3172b12354804179f248247ab5ffd6573214e9f9810d13494533f1aefd8a#displayOptions=) shows that GeoArrow is doing exactly what you suggested: tracking Polygons as length-1 MultiPolygons.
Hey @jorisvandenbossche, I'm curious where you're at with a …
Actually, I'm looking now at …
@jnh5y I notice the GeoMesa support encodes JTS geometries to geoarray; is there any way to decode a geoarray string to a JTS geometry? Thanks
Yes, for the formats that GeoMesa supports, it can read and write those formats. GeoMesa's Arrow support was created before this repo, so its way of using Arrow may be slightly different from what is described here. @elahrvivaz may be able to answer any specific questions.
How does GeoMesa read the geoarray string? It creates a PolygonVector instance and sets a geometry on each slot, and using the 'vector' property's toString() method can output the geoarray string. But how do you read the output geoarray string back? I can't find any code example. @jnh5y thanks
Ah! I think you want the … The test uses those methods here to get the … Let me know if that's not quite what you are looking for. We do not have a good tutorial on how to use those classes directly. Those classes are used as library code in the geomesa-arrow-gt module, which implements the GeoTools DataStore API using Arrow files to store SimpleFeatures.
@jnh5y thanks for your reply

import com.fasterxml.jackson.core.JsonProcessingException;
import java.util.ArrayList;

public class PolygonDeserializer {
    // ...
}
Hmm... by geoarray, do you mean GeoJSON? If you have GeoJSON but want GeoArrow, you could go from GeoJSON to JTS Geometries and then go from JTS to Arrow.
If you have GeoJSON, here are two examples implemented in JavaScript. They differ slightly in how they represent points in the schema. The first one uses a Struct, and the second uses a 2-element list: …
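For a rough idea of the difference, a sketch of those two point encodings in pyarrow (assumed API; the JS examples themselves are the links above):

import pyarrow as pa

# Point as a struct of named x/y fields...
struct_points = pa.array(
    [{'x': 40.0, 'y': 40.0}, {'x': 20.0, 'y': 45.0}],
    type=pa.struct([('x', pa.float64()), ('y', pa.float64())]))

# ...versus point as a 2-element fixed-size list of interleaved values.
list_points = pa.array(
    [[40.0, 40.0], [20.0, 45.0]],
    type=pa.list_(pa.float64(), 2))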
My point is to deserialize the Arrow string value to a JTS geometry. @jnh5y

Thanks for your info.
Closing this since we now have the nested list encoding well documented in format.md! Feel free to open new issues for any unresolved parts of this discussion that I missed!
One possible memory layout (cf. #3) is an array of "nested lists", which is natively supported in Apache Arrow.
It is the format that spatialpandas started using recently (in combination with datashader for visualization). In this issue, I try to provide a clear description of what this format exactly is, and we can discuss potential advantages/disadvantages.
The idea here is similar to how eg GeoJSON stores the coordinates or how WKT displays the coordinates. Using an example from the wiki page for MultiPolygon:
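MULTIPOLYGON (((40 40, 20 45, 45 30, 40 40)), ((20 35, 10 30, 10 10, 30 5, 45 20, 20 35), (30 20, 20 15, 20 25, 30 20)))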
The MultiPolygon's coordinates are represented as a nested list: a list of geometry parts (polygons), each consisting of a list of rings (at least one exterior ring, potentially additional interior rings), and each ring consisting of a list of coordinates.
An array of such nested lists can be represented efficiently in memory in a contiguous buffer for the coordinates, with offsets determining where the lists start.
In the Apache Arrow standard memory layout ("Arrow Columnar Format"), such data can be stored as what is called a "Variable-size List Array" (https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout).
In the case of an array of MultiPolygons, you would get a ListArray with 3 levels of nesting.
So something like this (the single multipolygon from above):
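[
  [
    [[40, 40], [20, 45], [45, 30], [40, 40]]
  ],
  [
    [[20, 35], [10, 30], [10, 10], [30, 5], [45, 20], [20, 35]],
    [[30, 20], [20, 15], [20, 25], [30, 20]]
  ]
]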
gets actually stored in memory as:
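(offsets and values reconstructed from the example; the ring offsets count interleaved doubles)

geometry offsets: [0, 2]
polygon offsets:  [0, 1, 3]
ring offsets:     [0, 8, 20, 28]
values: [40.0, 40.0, 20.0, 45.0, 45.0, 30.0, 40.0, 40.0,
         20.0, 35.0, 10.0, 30.0, 10.0, 10.0, 30.0, 5.0, 45.0, 20.0, 20.0, 35.0,
         30.0, 20.0, 20.0, 15.0, 20.0, 25.0, 30.0, 20.0]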
with other MultiPolygons appended to those arrays.
In Apache Arrow this is a list<list<list<double>>> List Type. The innermost nested arrays store the x/y coordinates in interleaved ordering, and each additional nesting level is represented by an array of offsets.

This is an example for MultiPolygons (where 3 levels of nesting are required). Similarly, all other geometry types can be stored as well, but requiring a varying (but predefined) number of levels of nesting.
For example, an array of LineStrings can be represented as a ListArray with a single level of nesting. Each element in the array is 1 list of coordinates (with x, y, x, y, ... coordinates of one linestring interleaved), requiring only one array of offsets keeping track of the number of coordinates of each linestring.
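As a sketch (assumed pyarrow usage), a LineString array with exactly that single offsets array over one flat coordinate buffer:

import pyarrow as pa

lines = pa.array(
    [[40.0, 40.0, 20.0, 45.0, 45.0, 30.0],  # 3 points, interleaved x/y
     [20.0, 35.0, 10.0, 30.0]],             # 2 points
    type=pa.list_(pa.float64()))
print(lines.offsets.to_pylist())  # [0, 6, 10]: coordinate count per linestring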
More specifically, the ListArray for each geometry type has the following nesting levels:

- MultiPoint / LineString / LinearRing: one level of nesting.
- MultiLineString / Polygon: two levels of nesting.
- MultiPolygon: three levels of nesting.

Point arrays can be represented as a Fixed Size List array (since we know that each entry will have 2 coordinates), which means they are effectively stored as the x/y coordinates in a single interleaved array.

Potential advantages of this kind of format:
- It is natively supported in the Apache Arrow memory layout, which means it can also be efficiently stored in formats like Parquet and Feather, or be used in systems that rely on Apache Arrow for data transfer. To be clear, WKB can also be natively stored in Apache Arrow, as a variable sized binary array.
- The coordinates are accessible as a single flat array of floats. This means they can be directly interpreted as values without needing a WKB decoder, which can also be beneficial for high-performance processing of those coordinates. For example, this is the format that is currently used by @jonmmease in the datashader visualization library to ingest geometry data, where it is efficient to iterate over the coords/offset arrays.

There are also some clear limitations / constraints:
- The ListArray itself does not store any information on the geometry types. This means that 1) such an array can only store homogeneous geometry types (no mixed geometry types in one column) and 2) the geometry type needs to be stored in metadata in order to correctly interpret the list array (see the sketch after this list).
- Since it doesn't support mixed geometries, it also does not support GeometryCollection objects.
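A sketch of the metadata point, assuming pyarrow field metadata (the key name "geometry_type" is illustrative only, not part of any spec):

import pyarrow as pa

# The type alone is just list<list<list<double>>>; the geometry type must
# travel alongside it, e.g. as field-level metadata.
field = pa.field('geometry',
                 pa.list_(pa.list_(pa.list_(pa.float64()))),
                 metadata={'geometry_type': 'MultiPolygon'})
schema = pa.schema([field])
print(schema.field('geometry').metadata)  # {b'geometry_type': b'MultiPolygon'}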