Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create pygeoarrow and use it for cuSpatial feature storage and i/o #583

Merged

Conversation

thomcom
Copy link
Contributor

@thomcom thomcom commented Jun 30, 2022

This closes #582.
This PR removes the input based on an iterative Shapely reader and cudf buffers. Now the input is stored directly in a pyarrow UnionArray. The next PR will remove most of the interior functionality that is not necessary.

@github-actions github-actions bot added the Python Related to Python code label Jun 30, 2022
@thomcom thomcom marked this pull request as ready for review July 1, 2022 21:42
@thomcom thomcom requested a review from a team as a code owner July 1, 2022 21:42
@thomcom thomcom requested review from vyasr, trxcllnt and isVoid July 1, 2022 21:42
@thomcom
Copy link
Contributor Author

thomcom commented Jul 1, 2022

The next PR will have tons more - than + I think.

@thomcom thomcom added bug Something isn't working 3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jul 1, 2022
@thomcom
Copy link
Contributor Author

thomcom commented Jul 1, 2022

This PR is messy because the old wrapper and pyarrow are co-existing.

@thomcom thomcom removed the bug Something isn't working label Jul 1, 2022
python/cuspatial/cuspatial/geometry/pygeoarrow.py Outdated Show resolved Hide resolved
python/cuspatial/cuspatial/geometry/geoarrowbuffers.py Outdated Show resolved Hide resolved
python/cuspatial/cuspatial/geometry/geoarrowbuffers.py Outdated Show resolved Hide resolved
python/cuspatial/cuspatial/geometry/geoarrowbuffers.py Outdated Show resolved Hide resolved
python/cuspatial/cuspatial/geometry/geoarrowbuffers.py Outdated Show resolved Hide resolved
python/cuspatial/cuspatial/geometry/geoarrowbuffers.py Outdated Show resolved Hide resolved
python/cuspatial/cuspatial/geometry/geocolumn.py Outdated Show resolved Hide resolved
python/cuspatial/cuspatial/geometry/geocolumn.py Outdated Show resolved Hide resolved
python/cuspatial/cuspatial/tests/test_from_geopandas.py Outdated Show resolved Hide resolved
Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is my first review of geoarrow PRs, so bear with me if I have missed some background knowledge. I have a high level question for GeoColumn. Why is GeoColumn a NumericalColumn? It seems to me no single existing cudf column type is sufficient to represent what GeoColumn expresses. It seems a bit awkward to piggy back on existing methods defined for NumericalColumn.

However, each "pure geometry" arrays can be represented as a ListColumn in cuDF. My intuition is that we can use ListColumn for each of the pure geometry arrays. The indexing logic introduced here for GeoColumn can continue serve as the top level indexing (similar to UnionArray indexing).

@thomcom
Copy link
Contributor Author

thomcom commented Jul 7, 2022

Thanks for your comments! I'll cover each and hopefully we can merge this. This is a partial contribution, and many of your concerns will disappear in the next two.

GeoColumn was a NumericalColumn because it enables it to be used with existing cudf algorithms that focus on the other dtypes. For example, one can call groupby on a GeoDataframe on a non-geometry column, which will then produce meaningful results. See test/test_geodataframe.py for a simple example. I assume that this will work even better when the type is updated to a ListColumn.

Thanks for the reminder, I'll be sure to create an issue for Refactor from NumericalColumn to ListColumn.

thomcom and others added 8 commits July 7, 2022 10:20
Co-authored-by: Michael Wang <isVoid@users.noreply.github.com>
Co-authored-by: Michael Wang <isVoid@users.noreply.github.com>
Co-authored-by: Michael Wang <isVoid@users.noreply.github.com>
Co-authored-by: Michael Wang <isVoid@users.noreply.github.com>
Co-authored-by: Michael Wang <isVoid@users.noreply.github.com>
@thomcom thomcom requested a review from isVoid July 7, 2022 15:42
Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, as requested I only went pretty broad-strokes this time and I see no red flags. I see there are internal calls to assert_eqs that can hopefully be replaced in future PRs.

And just to be clear, groupby is only promised at a indexed_frame level in cudf and is very much a separate implementation unrelated to column. see:
https://github.com/rapidsai/cudf/blob/branch-22.08/python/cudf/cudf/core/groupby/groupby.py

It totally makes sense to implement geoseries and geodataframe inheriting from series and dataframe to make use of the compute APIs for non-geometric columns. But geocolumn is too eccentric that I think it's best to implement an underlying UnionColumn type in cuspatial and geocolumn can inherit from it.

@thomcom
Copy link
Contributor Author

thomcom commented Jul 7, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 62e67f5 into rapidsai:branch-22.08 Jul 7, 2022
@thomcom thomcom deleted the fea-use-pyarrow-for-geopandas branch July 29, 2022 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Related to Python code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Use pyarrow to read and serialize geoPandas geometry
3 participants