-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create geometries with different number of points in a vectorized way #138
Comments
Hi @TonyXiang8787, thanks for your question. The fact that your example gives a TypeError is caused by that the functions expects an ndarray. What you supply can't be supported as it is not 'rectangular'. So it fails. This is not by design, this is just because you can't do it differently if you want to use numpy directly. The issue you are raising is in fact the reverse of #128 : how can you deal with vectorizing over arrays of geometries with different numbers of elements? We just reached a conclusion for the API of extracting the points, I guess we could also implement the reverse of that. Keep in mind that vectorizing over coordinate input only makes sense if coordinates are already in a numpy ndarray. In your example, the input coordinates are in a (nested) list so you will always need a python for loop to loop over individual coordinate lists. You'll never get the speedup because you need to extract the data from a 'slow' datastructure. Given your example, I'd say you can't be faster than:
|
There is some related discussion of this problem space (as I understand it) over here in relation to serializing / deserializing geometry coordinates through the Arrow in-memory format (or associated file formats). |
@caspervdw, thank you for your answer. I am aware of that the vectorization only makes sense if coordinates are already a numpy array. Maybe I have oversimplified the example in the description. In our code I actually tried with a numpy object array of numpy numeric arrays (each of them representing one polygon). It still does not work. As long as the input is heterogeneous, I cannot create an array of geometries in one go. For example, this is what I also tried. input_data = np.array([
np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]),
np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
], dtype=np.object)
pygeos.polygons(input_data) This will get the same error. |
It still seems to be raising a "wrong" error, though. It is giving the error that is raised in
There is a pygeos draft PR about it as well: #93 (although that is for now only about the conversion in the other direction) |
OK, looking at the implementation of |
@jorisvandenbossche, I have read the other discussion here. Actually in our project we use the concept of compressed sparse matrix. So all the coordinates of a collection of polygons are stored in a N*2 array. There is another |
I suppose we could in principle have a I think the use case of converting an array of arrays / list of lists (like the
@TonyXiang8787 ah, that sounds very similar to the structure discussed in geoarrow/geoarrow#4 / prototyped in #93 |
I need something similar. Essentially, I have a long list points (around 300,000), which are grouped together by an id like this:
I transform it into the following using coordinates == [np.array([8.0, 8.1]), np.array([[4.0, 4.1], [3.0, 3.1], [2.0, 2.1]]), ...]
coordinate_count = np.array([1, 3, ...]) But since I now have a mix of points and line strings of different lengths, pygeos basically won't help me at this point (unless I'm misunderstanding something) It seems like my best shot is to avoid pygeos alltogether, and serialize it to a GeoJSON structure manually: def first_value(rows):
first_column = list(zip(*rows))[0]
# Prevent NumPy from broadcasting into multidimensional array
return np.array([*first_column, None])[:-1]
is_point = (coordinate_count == 1)
geometry = pd.Series(
[ dict(zip(['type', 'coordinates'], row))
for row in np.transpose([
np.where(is_point, 'Point', 'LineString'),
np.where(is_point, first_value(coordinates), coordinates)
])
]
) Before using GeoPandas to load it back. It seems like there should be a more efficient way of doing this. |
This lack of fast IO is (in my opinion) an important thing for pygeos to improve. I think the main problem here is that there is no generally accepted binary format for exchanging data on lists/arrays of geometries. Unless I am missing something @jorisvandenbossche Of course we have GeoJSON, but it’s not binary, and it also forces you to use WGS84. If we make up our mind about the format, I think we just need to write a serializer / deserializer in C / C++/ Cython for it. |
Based on my previous post:
Let's assume this is a regular pandas dataframe (df), which is primarily sorted by group, and secondarily by sequence number. Then we have: lat = df.lat.values
lon = df.lon.values
group = df.group.values
unique_group, unique_group_index, coord_count = np.unique(group, return_index=True, return_counts=True)
# Let's assume a custom function exists:
geometry_array = points_or_linestrings_from_xy(lat, lon, unique_group_index, coord_count) Which could be (partially) implemented using Cython with the geos_c library: geos_c.pxd (Definition file)# geos_c.pxd
cdef extern from "<geos_c.h>":
ctypedef struct GEOSContextHandle_t:
pass
ctypedef void* GEOSCoordSeq
ctypedef void* GEOSGeom
GEOSContextHandle_t GEOS_init_r()
void GEOS_finish_r(GEOSContextHandle_t ctx)
GEOSCoordSeq GEOSCoordSeq_create_r(GEOSContextHandle_t ctx, int size, int dims)
int GEOSCoordSeq_setXY_r(GEOSContextHandle_t ctx, GEOSCoordSeq seq, int idx, double x, double y)
GEOSGeom GEOSGeom_createLineString_r(GEOSContextHandle_t ctx, GEOSCoordSeq seq)
GEOSGeom GEOSGeom_createPoint_r(GEOSContextHandle_t ctx, GEOSCoordSeq seq) %%cython -lgeos_c
from geos_c cimport *
cpdef points_or_linestrings_from_xy(double[:] x, double[:] y, long[:] uniq_index, long[:] coord_count):
assert len(x) == len(y)
assert len(uniq_index) == len(coord_count)
cdef long point_count = len(x)
cdef long geometry_count = len(uniq_index)
cdef GEOSContextHandle_t ctx = GEOS_init_r()
# Specifying type information to improve performance:
cdef GEOSCoordSeq seq
cdef GEOSGeom geometry
cdef long i
cdef long j
for i in range(geometry_count):
seq = GEOSCoordSeq_create_r(ctx, coord_count[i], 2) # 13% (of total execution time)
for j in range(coord_count[i]): # 9.2%
GEOSCoordSeq_setXY_r(ctx, seq, j, x[i+j], y[i+j]) # 51%
if coord_count[i] == 1: # 9.5%
geometry = GEOSGeom_createPoint_r(ctx, seq) # 7.7%
else:
geometry = GEOSGeom_createLineString_r(ctx, seq) # 6.0%
# Now, I would somehow need to convert geometry to a PyObject
# And put it in a list/array before returning it from the function
GEOS_finish_r(ctx)
# return result (array of PyObject*) The percentages are from line_profiler. The dataset used consists of ~1.5M points, resulting in ~300K geometries (150K Points, 150K LineStrings). The above code executes in ~0.5s. Although I'm not converting to PyObjects or returning anything yet. I guess to convert the geometry to PyObjects, it might be possible to use: |
This issue has been addressed in our latest release 0.10. |
Hi @caspervdw, I've tried the new version of pygeos and I'm still getting the Maybe I miss understand how we can achieve this. Could you share an example of 🙏 [edit] example: import pygeos
geoms = [
[
[-109.69028456213142, 82.24871468242942],
[-109.7014033541393, 82.24884898520948],
[-109.71749119045133, 82.25167594364217],
[-110.40506161174704, 82.69798257544574],
[-110.40408507545881, 82.69906485415044],
[-109.21345885544275, 82.72341014721215],
[-109.69028456213142, 82.24871468242942]
],
[
[-107.80092067196571, 80.0680659326775],
[-107.81331170172177, 80.07647778722159],
[-107.85021140725738, 80.14913924022926],
[-107.84970679612512, 80.15073158009774],
[-105.2618840578165, 80.11676529443353],
[-105.26423736574617, 80.1126824941773],
[-107.72691465067477, 80.06927690341031],
[-107.80092067196571, 80.0680659326775]
]
]
dataset_geoms = pygeos.polygons(geoms) |
alright I figured it (with @kylebarron help) you first need to create the linear rings import pygeos
geoms = [
pygeos.linearrings([
[-109.69028456213142, 82.24871468242942],
[-109.7014033541393, 82.24884898520948],
[-109.71749119045133, 82.25167594364217],
[-110.40506161174704, 82.69798257544574],
[-110.40408507545881, 82.69906485415044],
[-109.21345885544275, 82.72341014721215],
[-109.69028456213142, 82.24871468242942]
]),
pygeos.linearrings([
[-107.80092067196571, 80.0680659326775],
[-107.81331170172177, 80.07647778722159],
[-107.85021140725738, 80.14913924022926],
[-107.84970679612512, 80.15073158009774],
[-105.2618840578165, 80.11676529443353],
[-105.26423736574617, 80.1126824941773],
[-107.72691465067477, 80.06927690341031],
[-107.80092067196571, 80.0680659326775]
])
]
dataset_geoms = pygeos.polygons(geoms) |
That's right, the direct route from coordinates + indices to polygons is not available. This is not possible for the generic case for polygons with holes (you would have to specify in which hole in which polygon the coordinate would have to go). Hence the two-step procedure. Line 223 in 3fa59c7
|
I have found this open-source project by looking for a vectorized geospatial python library. I am impressed about how the library can handle geospatial processing with full vectorization. However, I could not figure out how to create geometries with different number of points in a vectorized way. A very simple example.
I can create two polygons in one call, if the polygons have the same number of points.
However, if I put two polygons with different number of points, it will not work.
I will get a type error:
TypeError: One of the arguments is of incorrect type. Please provide only Geometry objects.
Is this a design choice? Or does it have to do with the limitations of
ufunc
innumpy
? Or maybe I have not found the correct usage of this library. In that case I apologise for not being able to find the relevant information in the documentation.The text was updated successfully, but these errors were encountered: