[FEA] Use pyarrow to read and serialize geoPandas geometry #582

Closed
thomcom opened this issue Jun 30, 2022 · 0 comments · Fixed by #583
Labels
feature request New feature or request Needs Triage Need team to review and classify


thomcom commented Jun 30, 2022

Is your feature request related to a problem? Please describe.
In #575, I'm correcting an error in my GeoArrow implementation. The pyarrow specification for polygons and linestrings is:

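import pyarrow as pa

# Each row is a list of polygons; each polygon is a list of rings;
# each ring is a list of fixed-size (x, y) vertices.
polygon_type = pa.list_(
    pa.field("polygons", pa.list_(
        pa.field("rings", pa.list_(
            pa.field("vertices", pa.list_(
                pa.field("xy", pa.float64(), nullable=False), 2),
            nullable=False)), nullable=False))))

# Each row is a list of lines; each line is a list of fixed-size (x, y) vertices.
linestring_type = pa.list_(
    pa.field("lines", pa.list_(
        pa.field("offsets", pa.list_(
            pa.field("xy", pa.float64(), nullable=False), 2),
        nullable=False)), nullable=False))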

When I implemented Arrow from scratch in #300, in a spasm of reasoning I forgot to use the prefix-scan method for indexing the above types.

The first step in correcting it, given where I am with pyarrow now, is to change the input of our GeoArrow types to use arrow directly.

Describe the solution you'd like

cuspatial/io/geopandas_adapter.py implements the arrow spec when reading from GeoSeries objects in a GeoPandas dataframe.

_load_geometry_offsets computes the sizes and prefix sum/scan buffers for each of the four arrow lists and allocates memory for them.

_read_geometries iterates over the geometries again, copying individual coordinates and offsets into the correct position in each buffer.
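
The two-pass approach described above can be sketched in plain Python as follows. This is not the code in geopandas_adapter.py; `load_offsets` and `read_coords` are hypothetical stand-ins that only handle a single list level (vertices per linestring), to show the shape of the work being replaced.

```python
from itertools import accumulate


def load_offsets(geoms):
    """First pass: count vertices per geometry, then prefix-scan the counts."""
    sizes = [len(g) for g in geoms]
    # A leading 0 makes offsets[i]..offsets[i + 1] the range for geometry i.
    return [0, *accumulate(sizes)]


def read_coords(geoms, offsets):
    """Second pass: copy each coordinate into its slot in a preallocated buffer."""
    xy = [0.0] * (2 * offsets[-1])
    for geom, start in zip(geoms, offsets):
        for j, (x, y) in enumerate(geom):
            xy[2 * (start + j)] = x
            xy[2 * (start + j) + 1] = y
    return xy


# Illustrative input: two linestrings with 3 and 2 vertices.
geoms = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    [(3.0, 3.0), (4.0, 4.0)],
]
offsets = load_offsets(geoms)
xy = read_coords(geoms, offsets)
```

Every geometry is visited twice, and the bookkeeping has to be repeated for each nesting level, which is the part pyarrow can do on its own.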

Instead, specify the GeoArrow format types as per above in geoarrow.py, then use those formats and densely packed tuple buffers from GeoSeries[i].__geo_interface__['coordinates'] to create the geoarrowbuffers.py objects.

This will break the unit tests for LineStrings and MultiLineStrings, which have been implemented incorrectly and led to the original issue #575. A separate PR will fix them.

@thomcom thomcom added feature request New feature or request Needs Triage Need team to review and classify labels Jun 30, 2022
@rapids-bot rapids-bot bot closed this as completed in #583 Jul 7, 2022
rapids-bot bot pushed a commit that referenced this issue Jul 7, 2022
…583)

This closes #582.
This PR removes the input based on an iterative Shapely reader and `cudf` buffers. Now the input is stored directly in a pyarrow `UnionArray`. The next PR will remove most of the interior functionality that is not necessary.

Authors:
  - H. Thomson Comer (https://github.com/thomcom)

Approvers:
  - Michael Wang (https://github.com/isVoid)

URL: #583