Create geometries with different number of points in a vectorized way #138

TonyXiang8787 · 2020-04-27T18:53:03Z

I have found this open-source project by looking for a vectorized geospatial python library. I am impressed about how the library can handle geospatial processing with full vectorization. However, I could not figure out how to create geometries with different number of points in a vectorized way. A very simple example.

I can create two polygons in one call, if the polygons have the same number of points.

pygeos.polygons(
    [
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    ]
)

However, if I put two polygons with different number of points, it will not work.

pygeos.polygons(
    [
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    ]
)

I will get a type error:
TypeError: One of the arguments is of incorrect type. Please provide only Geometry objects.

Is this a design choice? Or does it have to do with the limitations of ufunc in numpy? Or maybe I have not found the correct usage of this library. In that case I apologise for not being able to find the relevant information in the documentation.

The text was updated successfully, but these errors were encountered:

caspervdw · 2020-04-28T17:40:55Z

Hi @TonyXiang8787, thanks for your question. The fact that your example gives a TypeError is caused by that the functions expects an ndarray. What you supply can't be supported as it is not 'rectangular'. So it fails. This is not by design, this is just because you can't do it differently if you want to use numpy directly.

The issue you are raising is in fact the reverse of #128 : how can you deal with vectorizing over arrays of geometries with different numbers of elements? We just reached a conclusion for the API of extracting the points, I guess we could also implement the reverse of that.

Keep in mind that vectorizing over coordinate input only makes sense if coordinates are already in a numpy ndarray. In your example, the input coordinates are in a (nested) list so you will always need a python for loop to loop over individual coordinate lists. You'll never get the speedup because you need to extract the data from a 'slow' datastructure. Given your example, I'd say you can't be faster than:

np.array([pygeos.polygons(x) for x in
    [
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    ]
])

brendan-ward · 2020-04-28T17:47:16Z

There is some related discussion of this problem space (as I understand it) over here in relation to serializing / deserializing geometry coordinates through the Arrow in-memory format (or associated file formats).

TonyXiang8787 · 2020-04-28T18:05:33Z

@caspervdw, thank you for your answer. I am aware of that the vectorization only makes sense if coordinates are already a numpy array. Maybe I have oversimplified the example in the description. In our code I actually tried with a numpy object array of numpy numeric arrays (each of them representing one polygon). It still does not work. As long as the input is heterogeneous, I cannot create an array of geometries in one go.

For example, this is what I also tried.

input_data = np.array([
    np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]), 
    np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
], dtype=np.object)

pygeos.polygons(input_data)

This will get the same error.
TypeError: One of the arguments is of incorrect type. Please provide only Geometry objects.

jorisvandenbossche · 2020-04-28T18:05:54Z

The fact that your example gives a TypeError is caused by that the functions expects an ndarray

It still seems to be raising a "wrong" error, though. It is giving the error that is raised in get_geom (that's the only place where we raise this error message), which seems not correct.

There is some related discussion of this problem space (as I understand it) over here i

There is a pygeos draft PR about it as well: #93 (although that is for now only about the conversion in the other direction)

jorisvandenbossche · 2020-04-28T18:10:15Z

It still seems to be raising a "wrong" error, though.

OK, looking at the implementation of polygons now, this is because we first create linearrings before creating polygons. But we only do the linearring step if it is a float array, so we are actually passing the numpy arrays of arrays to lib.polygons_without_holes, and that function is expecting geometries, hence the error message.
I think we should improve the error messaging though.

TonyXiang8787 · 2020-04-28T18:14:40Z

@jorisvandenbossche, I have read the other discussion here.

Actually in our project we use the concept of compressed sparse matrix.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html

So all the coordinates of a collection of polygons are stored in a N*2 array. There is another indptr specifying for each polygon, the begin and end index at the flat coordinates array. In this way we have two homogeneous array in continuous memory which we can easily pass to any C/C++ routines.

jorisvandenbossche · 2020-04-28T18:17:20Z

I suppose we could in principle have a linearrings-like ufunc that expects an object dtype array of arrays, instead of a 2D float array.
The question is whether that will be much faster as simply writing that in python though.

I think the use case of converting an array of arrays / list of lists (like the input_data in the example above) to polygons is a good use case, and it would be nice to provide a user-friendly interface for this this case (even if it is just falling back to a python loop for creating the linearrings)

So all the coordinates of a collection of polygons are stored in a N*2 array. There is another indptr specifying for each polygon, the begin and end position at the flat coordinates array. In this way we have two homogeneous array in memory which we can easily pass to any C/C++ routines.

@TonyXiang8787 ah, that sounds very similar to the structure discussed in geoarrow/geoarrow#4 / prototyped in #93
(is it an open source project? Would be interested in knowing a bit more about the exact use case / get feedback on geoarrow/geoarrow#4)

bergkvist · 2020-07-19T05:42:27Z

I need something similar. Essentially, I have a long list points (around 300,000), which are grouped together by an id like this:

 lat | lon | sequence_number | group
-----+-----+-----------------+-------
 8.0 | 8.1 |       1         |  1
 4.0 | 4.1 |       1         |  2
 3.0 | 3.1 |       2         |  2
 2.0 | 2.1 |       3         |  2
...

I transform it into the following using np.transpose, np.unique and np.split:

coordinates == [np.array([8.0, 8.1]), np.array([[4.0, 4.1], [3.0, 3.1], [2.0, 2.1]]), ...]
coordinate_count = np.array([1, 3, ...])

But since I now have a mix of points and line strings of different lengths, pygeos basically won't help me at this point (unless I'm misunderstanding something)

It seems like my best shot is to avoid pygeos alltogether, and serialize it to a GeoJSON structure manually:

def first_value(rows):
    first_column = list(zip(*rows))[0]
    # Prevent NumPy from broadcasting into multidimensional array
    return np.array([*first_column, None])[:-1]

is_point = (coordinate_count == 1)
geometry = pd.Series(
    [ dict(zip(['type', 'coordinates'], row))
        for row in np.transpose([
            np.where(is_point, 'Point', 'LineString'),
            np.where(is_point, first_value(coordinates), coordinates)
        ])
    ]
)

Before using GeoPandas to load it back. It seems like there should be a more efficient way of doing this.

caspervdw · 2020-07-19T12:24:44Z

This lack of fast IO is (in my opinion) an important thing for pygeos to improve. I think the main problem here is that there is no generally accepted binary format for exchanging data on lists/arrays of geometries. Unless I am missing something @jorisvandenbossche

Of course we have GeoJSON, but it’s not binary, and it also forces you to use WGS84.

If we make up our mind about the format, I think we just need to write a serializer / deserializer in C / C++/ Cython for it.

bergkvist · 2020-07-23T03:20:10Z

Based on my previous post:

 lat | lon | sequence_number | group
-----+-----+-----------------+-------
 8.0 | 8.1 |       1         |  1
 4.0 | 4.1 |       1         |  2
 3.0 | 3.1 |       2         |  2
 2.0 | 2.1 |       3         |  2
...

Let's assume this is a regular pandas dataframe (df), which is primarily sorted by group, and secondarily by sequence number.

Then we have:

lat = df.lat.values
lon = df.lon.values
group = df.group.values

unique_group, unique_group_index, coord_count = np.unique(group, return_index=True, return_counts=True)

# Let's assume a custom function exists:
geometry_array = points_or_linestrings_from_xy(lat, lon, unique_group_index, coord_count)

Which could be (partially) implemented using Cython with the geos_c library:

geos_c.pxd (Definition file)

# geos_c.pxd
cdef extern from "<geos_c.h>":
    ctypedef struct GEOSContextHandle_t:
        pass
    ctypedef void* GEOSCoordSeq
    ctypedef void* GEOSGeom
    
    GEOSContextHandle_t GEOS_init_r()
    void GEOS_finish_r(GEOSContextHandle_t ctx)
    
    GEOSCoordSeq GEOSCoordSeq_create_r(GEOSContextHandle_t ctx, int size, int dims)
    int GEOSCoordSeq_setXY_r(GEOSContextHandle_t ctx, GEOSCoordSeq seq, int idx, double x, double y)

    GEOSGeom GEOSGeom_createLineString_r(GEOSContextHandle_t ctx, GEOSCoordSeq seq)
    GEOSGeom GEOSGeom_createPoint_r(GEOSContextHandle_t ctx, GEOSCoordSeq seq)

%%cython -lgeos_c
from geos_c cimport *

cpdef points_or_linestrings_from_xy(double[:] x, double[:] y, long[:] uniq_index, long[:] coord_count):
    assert len(x) == len(y)
    assert len(uniq_index) == len(coord_count)
    cdef long point_count = len(x)
    cdef long geometry_count = len(uniq_index)
    cdef GEOSContextHandle_t ctx = GEOS_init_r()
    
    # Specifying type information to improve performance:
    cdef GEOSCoordSeq seq
    cdef GEOSGeom geometry
    cdef long i
    cdef long j
    
    for i in range(geometry_count):
        seq = GEOSCoordSeq_create_r(ctx, coord_count[i], 2)    # 13% (of total execution time)
        for j in range(coord_count[i]):                        # 9.2%
            GEOSCoordSeq_setXY_r(ctx, seq, j, x[i+j], y[i+j])  # 51%
        if coord_count[i] == 1:                                # 9.5%
            geometry = GEOSGeom_createPoint_r(ctx, seq)        # 7.7%
        else:
            geometry = GEOSGeom_createLineString_r(ctx, seq)   # 6.0%
        # Now, I would somehow need to convert geometry to a PyObject
        # And put it in a list/array before returning it from the function
    GEOS_finish_r(ctx)
    # return result (array of PyObject*)

The percentages are from line_profiler. The dataset used consists of ~1.5M points, resulting in ~300K geometries (150K Points, 150K LineStrings). The above code executes in ~0.5s. Although I'm not converting to PyObjects or returning anything yet.

I guess to convert the geometry to PyObjects, it might be possible to use:
PyObject *GeometryObject_FromGEOS(PyTypeObject *type, GEOSGeometry *ptr);

caspervdw · 2021-05-19T18:14:31Z

This issue has been addressed in our latest release 0.10.

vincentsarago · 2021-05-19T19:08:10Z

Hi @caspervdw, I've tried the new version of pygeos and I'm still getting the TypeError error here https://github.com/developmentseed/cogeo-mosaic/blob/master/cogeo_mosaic/mosaic.py#L138-L140

Maybe I miss understand how we can achieve this. Could you share an example of polygons creation with geometries having different number of points ?

🙏

[edit]

example:

import pygeos
geoms = [
    [
        [-109.69028456213142, 82.24871468242942],
        [-109.7014033541393, 82.24884898520948],
        [-109.71749119045133, 82.25167594364217],
        [-110.40506161174704, 82.69798257544574],
        [-110.40408507545881, 82.69906485415044],
        [-109.21345885544275, 82.72341014721215],
        [-109.69028456213142, 82.24871468242942]
    ],
    [
        [-107.80092067196571, 80.0680659326775],
        [-107.81331170172177, 80.07647778722159],
        [-107.85021140725738, 80.14913924022926],
        [-107.84970679612512, 80.15073158009774],
        [-105.2618840578165, 80.11676529443353],
        [-105.26423736574617, 80.1126824941773],
        [-107.72691465067477, 80.06927690341031],
        [-107.80092067196571, 80.0680659326775]
    ]
]
    
dataset_geoms = pygeos.polygons(geoms)

vincentsarago · 2021-05-19T19:26:04Z

alright I figured it (with @kylebarron help)

you first need to create the linear rings

import pygeos
geoms = [
    pygeos.linearrings([
        [-109.69028456213142, 82.24871468242942],
        [-109.7014033541393, 82.24884898520948],
        [-109.71749119045133, 82.25167594364217],
        [-110.40506161174704, 82.69798257544574],
        [-110.40408507545881, 82.69906485415044],
        [-109.21345885544275, 82.72341014721215],
        [-109.69028456213142, 82.24871468242942]
    ]),
    pygeos.linearrings([
        [-107.80092067196571, 80.0680659326775],
        [-107.81331170172177, 80.07647778722159],
        [-107.85021140725738, 80.14913924022926],
        [-107.84970679612512, 80.15073158009774],
        [-105.2618840578165, 80.11676529443353],
        [-105.26423736574617, 80.1126824941773],
        [-107.72691465067477, 80.06927690341031],
        [-107.80092067196571, 80.0680659326775]
    ])
]
    
dataset_geoms = pygeos.polygons(geoms)

caspervdw · 2021-05-20T06:52:49Z

That's right, the direct route from coordinates + indices to polygons is not available. This is not possible for the generic case for polygons with holes (you would have to specify in which hole in which polygon the coordinate would have to go).

Hence the two-step procedure.
It is conceivable to implement this specifically for the polygons without holes case (by far the most common one) by propagating indices to the internal linearrings constructor over here:

pygeos/pygeos/creation.py

Line 223 in 3fa59c7

geometries = linearrings(geometries)

vincentsarago mentioned this issue Jun 21, 2020

Can't create mosaic when geometry have different number of points ? developmentseed/cogeo-mosaic#81

Closed

bergkvist mentioned this issue Jul 19, 2020

Feature Request: geopandas.points_or_linestrings_from_xy geopandas/geopandas#1526

Open

bergkvist mentioned this issue Jul 29, 2020

Integration of PyGEOS and Shapely shapely/shapely#782

Closed

caspervdw mentioned this issue Sep 20, 2020

Mix with use of cython? #51

Closed

caspervdw mentioned this issue Nov 10, 2020

RFC: Bulk geometry creation #241

Closed

caspervdw mentioned this issue Mar 22, 2021

Create simple geometries from coordinates and indices #322

Merged

caspervdw closed this as completed May 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create geometries with different number of points in a vectorized way #138

Create geometries with different number of points in a vectorized way #138

TonyXiang8787 commented Apr 27, 2020

caspervdw commented Apr 28, 2020

brendan-ward commented Apr 28, 2020

TonyXiang8787 commented Apr 28, 2020 •

edited

Loading

jorisvandenbossche commented Apr 28, 2020

jorisvandenbossche commented Apr 28, 2020

TonyXiang8787 commented Apr 28, 2020 •

edited

Loading

jorisvandenbossche commented Apr 28, 2020

bergkvist commented Jul 19, 2020

caspervdw commented Jul 19, 2020

bergkvist commented Jul 23, 2020 •

edited

Loading

caspervdw commented May 19, 2021

vincentsarago commented May 19, 2021 •

edited

Loading

vincentsarago commented May 19, 2021

caspervdw commented May 20, 2021

Create geometries with different number of points in a vectorized way #138

Create geometries with different number of points in a vectorized way #138

Comments

TonyXiang8787 commented Apr 27, 2020

caspervdw commented Apr 28, 2020

brendan-ward commented Apr 28, 2020

TonyXiang8787 commented Apr 28, 2020 • edited Loading

jorisvandenbossche commented Apr 28, 2020

jorisvandenbossche commented Apr 28, 2020

TonyXiang8787 commented Apr 28, 2020 • edited Loading

jorisvandenbossche commented Apr 28, 2020

bergkvist commented Jul 19, 2020

caspervdw commented Jul 19, 2020

bergkvist commented Jul 23, 2020 • edited Loading

caspervdw commented May 19, 2021

vincentsarago commented May 19, 2021 • edited Loading

vincentsarago commented May 19, 2021

caspervdw commented May 20, 2021

TonyXiang8787 commented Apr 28, 2020 •

edited

Loading

TonyXiang8787 commented Apr 28, 2020 •

edited

Loading

bergkvist commented Jul 23, 2020 •

edited

Loading

vincentsarago commented May 19, 2021 •

edited

Loading