Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create geometries with different number of points in a vectorized way #138

Closed
TonyXiang8787 opened this issue Apr 27, 2020 · 14 comments
Closed

Comments

@TonyXiang8787
Copy link

I have found this open-source project by looking for a vectorized geospatial python library. I am impressed about how the library can handle geospatial processing with full vectorization. However, I could not figure out how to create geometries with different number of points in a vectorized way. A very simple example.

I can create two polygons in one call, if the polygons have the same number of points.

pygeos.polygons(
    [
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    ]
)

However, if I put two polygons with different number of points, it will not work.

pygeos.polygons(
    [
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    ]
)

I will get a type error:
TypeError: One of the arguments is of incorrect type. Please provide only Geometry objects.

Is this a design choice? Or does it have to do with the limitations of ufunc in numpy? Or maybe I have not found the correct usage of this library. In that case I apologise for not being able to find the relevant information in the documentation.

@caspervdw
Copy link
Member

Hi @TonyXiang8787, thanks for your question. The fact that your example gives a TypeError is caused by that the functions expects an ndarray. What you supply can't be supported as it is not 'rectangular'. So it fails. This is not by design, this is just because you can't do it differently if you want to use numpy directly.

The issue you are raising is in fact the reverse of #128 : how can you deal with vectorizing over arrays of geometries with different numbers of elements? We just reached a conclusion for the API of extracting the points, I guess we could also implement the reverse of that.

Keep in mind that vectorizing over coordinate input only makes sense if coordinates are already in a numpy ndarray. In your example, the input coordinates are in a (nested) list so you will always need a python for loop to loop over individual coordinate lists. You'll never get the speedup because you need to extract the data from a 'slow' datastructure. Given your example, I'd say you can't be faster than:

np.array([pygeos.polygons(x) for x in
    [
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    ]
])

@brendan-ward
Copy link
Contributor

There is some related discussion of this problem space (as I understand it) over here in relation to serializing / deserializing geometry coordinates through the Arrow in-memory format (or associated file formats).

@TonyXiang8787
Copy link
Author

TonyXiang8787 commented Apr 28, 2020

@caspervdw, thank you for your answer. I am aware of that the vectorization only makes sense if coordinates are already a numpy array. Maybe I have oversimplified the example in the description. In our code I actually tried with a numpy object array of numpy numeric arrays (each of them representing one polygon). It still does not work. As long as the input is heterogeneous, I cannot create an array of geometries in one go.

For example, this is what I also tried.

input_data = np.array([
    np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]), 
    np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
], dtype=np.object)

pygeos.polygons(input_data)

This will get the same error.
TypeError: One of the arguments is of incorrect type. Please provide only Geometry objects.

@jorisvandenbossche
Copy link
Member

The fact that your example gives a TypeError is caused by that the functions expects an ndarray

It still seems to be raising a "wrong" error, though. It is giving the error that is raised in get_geom (that's the only place where we raise this error message), which seems not correct.

There is some related discussion of this problem space (as I understand it) over here i

There is a pygeos draft PR about it as well: #93 (although that is for now only about the conversion in the other direction)

@jorisvandenbossche
Copy link
Member

It still seems to be raising a "wrong" error, though.

OK, looking at the implementation of polygons now, this is because we first create linearrings before creating polygons. But we only do the linearring step if it is a float array, so we are actually passing the numpy arrays of arrays to lib.polygons_without_holes, and that function is expecting geometries, hence the error message.
I think we should improve the error messaging though.

@TonyXiang8787
Copy link
Author

TonyXiang8787 commented Apr 28, 2020

@jorisvandenbossche, I have read the other discussion here.

Actually in our project we use the concept of compressed sparse matrix.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html

So all the coordinates of a collection of polygons are stored in a N*2 array. There is another indptr specifying for each polygon, the begin and end index at the flat coordinates array. In this way we have two homogeneous array in continuous memory which we can easily pass to any C/C++ routines.

@jorisvandenbossche
Copy link
Member

I suppose we could in principle have a linearrings-like ufunc that expects an object dtype array of arrays, instead of a 2D float array.
The question is whether that will be much faster as simply writing that in python though.

I think the use case of converting an array of arrays / list of lists (like the input_data in the example above) to polygons is a good use case, and it would be nice to provide a user-friendly interface for this this case (even if it is just falling back to a python loop for creating the linearrings)

So all the coordinates of a collection of polygons are stored in a N*2 array. There is another indptr specifying for each polygon, the begin and end position at the flat coordinates array. In this way we have two homogeneous array in memory which we can easily pass to any C/C++ routines.

@TonyXiang8787 ah, that sounds very similar to the structure discussed in geoarrow/geoarrow#4 / prototyped in #93
(is it an open source project? Would be interested in knowing a bit more about the exact use case / get feedback on geoarrow/geoarrow#4)

@bergkvist
Copy link

I need something similar. Essentially, I have a long list points (around 300,000), which are grouped together by an id like this:

 lat | lon | sequence_number | group
-----+-----+-----------------+-------
 8.0 | 8.1 |       1         |  1
 4.0 | 4.1 |       1         |  2
 3.0 | 3.1 |       2         |  2
 2.0 | 2.1 |       3         |  2
...

I transform it into the following using np.transpose, np.unique and np.split:

coordinates == [np.array([8.0, 8.1]), np.array([[4.0, 4.1], [3.0, 3.1], [2.0, 2.1]]), ...]
coordinate_count = np.array([1, 3, ...])

But since I now have a mix of points and line strings of different lengths, pygeos basically won't help me at this point (unless I'm misunderstanding something)

It seems like my best shot is to avoid pygeos alltogether, and serialize it to a GeoJSON structure manually:

def first_value(rows):
    first_column = list(zip(*rows))[0]
    # Prevent NumPy from broadcasting into multidimensional array
    return np.array([*first_column, None])[:-1]

is_point = (coordinate_count == 1)
geometry = pd.Series(
    [ dict(zip(['type', 'coordinates'], row))
        for row in np.transpose([
            np.where(is_point, 'Point', 'LineString'),
            np.where(is_point, first_value(coordinates), coordinates)
        ])
    ]
)

Before using GeoPandas to load it back. It seems like there should be a more efficient way of doing this.

@caspervdw
Copy link
Member

This lack of fast IO is (in my opinion) an important thing for pygeos to improve. I think the main problem here is that there is no generally accepted binary format for exchanging data on lists/arrays of geometries. Unless I am missing something @jorisvandenbossche

Of course we have GeoJSON, but it’s not binary, and it also forces you to use WGS84.

If we make up our mind about the format, I think we just need to write a serializer / deserializer in C / C++/ Cython for it.

@bergkvist
Copy link

bergkvist commented Jul 23, 2020

Based on my previous post:

 lat | lon | sequence_number | group
-----+-----+-----------------+-------
 8.0 | 8.1 |       1         |  1
 4.0 | 4.1 |       1         |  2
 3.0 | 3.1 |       2         |  2
 2.0 | 2.1 |       3         |  2
...

Let's assume this is a regular pandas dataframe (df), which is primarily sorted by group, and secondarily by sequence number.

Then we have:

lat = df.lat.values
lon = df.lon.values
group = df.group.values

unique_group, unique_group_index, coord_count = np.unique(group, return_index=True, return_counts=True)

# Let's assume a custom function exists:
geometry_array = points_or_linestrings_from_xy(lat, lon, unique_group_index, coord_count)

Which could be (partially) implemented using Cython with the geos_c library:

geos_c.pxd (Definition file)
# geos_c.pxd
cdef extern from "<geos_c.h>":
    ctypedef struct GEOSContextHandle_t:
        pass
    ctypedef void* GEOSCoordSeq
    ctypedef void* GEOSGeom
    
    GEOSContextHandle_t GEOS_init_r()
    void GEOS_finish_r(GEOSContextHandle_t ctx)
    
    GEOSCoordSeq GEOSCoordSeq_create_r(GEOSContextHandle_t ctx, int size, int dims)
    int GEOSCoordSeq_setXY_r(GEOSContextHandle_t ctx, GEOSCoordSeq seq, int idx, double x, double y)

    GEOSGeom GEOSGeom_createLineString_r(GEOSContextHandle_t ctx, GEOSCoordSeq seq)
    GEOSGeom GEOSGeom_createPoint_r(GEOSContextHandle_t ctx, GEOSCoordSeq seq)
%%cython -lgeos_c
from geos_c cimport *

cpdef points_or_linestrings_from_xy(double[:] x, double[:] y, long[:] uniq_index, long[:] coord_count):
    assert len(x) == len(y)
    assert len(uniq_index) == len(coord_count)
    cdef long point_count = len(x)
    cdef long geometry_count = len(uniq_index)
    cdef GEOSContextHandle_t ctx = GEOS_init_r()
    
    # Specifying type information to improve performance:
    cdef GEOSCoordSeq seq
    cdef GEOSGeom geometry
    cdef long i
    cdef long j
    
    for i in range(geometry_count):
        seq = GEOSCoordSeq_create_r(ctx, coord_count[i], 2)    # 13% (of total execution time)
        for j in range(coord_count[i]):                        # 9.2%
            GEOSCoordSeq_setXY_r(ctx, seq, j, x[i+j], y[i+j])  # 51%
        if coord_count[i] == 1:                                # 9.5%
            geometry = GEOSGeom_createPoint_r(ctx, seq)        # 7.7%
        else:
            geometry = GEOSGeom_createLineString_r(ctx, seq)   # 6.0%
        # Now, I would somehow need to convert geometry to a PyObject
        # And put it in a list/array before returning it from the function
    GEOS_finish_r(ctx)
    # return result (array of PyObject*)

The percentages are from line_profiler. The dataset used consists of ~1.5M points, resulting in ~300K geometries (150K Points, 150K LineStrings). The above code executes in ~0.5s. Although I'm not converting to PyObjects or returning anything yet.

I guess to convert the geometry to PyObjects, it might be possible to use:
PyObject *GeometryObject_FromGEOS(PyTypeObject *type, GEOSGeometry *ptr);

@caspervdw
Copy link
Member

This issue has been addressed in our latest release 0.10.

@vincentsarago
Copy link

vincentsarago commented May 19, 2021

Hi @caspervdw, I've tried the new version of pygeos and I'm still getting the TypeError error here https://github.com/developmentseed/cogeo-mosaic/blob/master/cogeo_mosaic/mosaic.py#L138-L140

Maybe I miss understand how we can achieve this. Could you share an example of polygons creation with geometries having different number of points ?

🙏

[edit]

example:

import pygeos
geoms = [
    [
        [-109.69028456213142, 82.24871468242942],
        [-109.7014033541393, 82.24884898520948],
        [-109.71749119045133, 82.25167594364217],
        [-110.40506161174704, 82.69798257544574],
        [-110.40408507545881, 82.69906485415044],
        [-109.21345885544275, 82.72341014721215],
        [-109.69028456213142, 82.24871468242942]
    ],
    [
        [-107.80092067196571, 80.0680659326775],
        [-107.81331170172177, 80.07647778722159],
        [-107.85021140725738, 80.14913924022926],
        [-107.84970679612512, 80.15073158009774],
        [-105.2618840578165, 80.11676529443353],
        [-105.26423736574617, 80.1126824941773],
        [-107.72691465067477, 80.06927690341031],
        [-107.80092067196571, 80.0680659326775]
    ]
]
    
dataset_geoms = pygeos.polygons(geoms)

@vincentsarago
Copy link

alright I figured it (with @kylebarron help)

you first need to create the linear rings

import pygeos
geoms = [
    pygeos.linearrings([
        [-109.69028456213142, 82.24871468242942],
        [-109.7014033541393, 82.24884898520948],
        [-109.71749119045133, 82.25167594364217],
        [-110.40506161174704, 82.69798257544574],
        [-110.40408507545881, 82.69906485415044],
        [-109.21345885544275, 82.72341014721215],
        [-109.69028456213142, 82.24871468242942]
    ]),
    pygeos.linearrings([
        [-107.80092067196571, 80.0680659326775],
        [-107.81331170172177, 80.07647778722159],
        [-107.85021140725738, 80.14913924022926],
        [-107.84970679612512, 80.15073158009774],
        [-105.2618840578165, 80.11676529443353],
        [-105.26423736574617, 80.1126824941773],
        [-107.72691465067477, 80.06927690341031],
        [-107.80092067196571, 80.0680659326775]
    ])
]
    
dataset_geoms = pygeos.polygons(geoms)

@caspervdw
Copy link
Member

That's right, the direct route from coordinates + indices to polygons is not available. This is not possible for the generic case for polygons with holes (you would have to specify in which hole in which polygon the coordinate would have to go).

Hence the two-step procedure.
It is conceivable to implement this specifically for the polygons without holes case (by far the most common one) by propagating indices to the internal linearrings constructor over here:

geometries = linearrings(geometries)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants