Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow using cx indexer without spatial index #54

Merged
merged 6 commits into from
Jan 26, 2021
Merged

Conversation

philippjfr
Copy link
Member

The problem with using cx on a DaskGeoDataFrame was that it would use cx on each partition which in turn would spatially index that partition because it was using the sindex attribute. It seems as though these spatial indexing wouldn't persist either so every time you did some indexing it would create a completely new sindex slowing down the indexing massively. I now only use the sindex if it already exists. I would like there to be a more explicit API on GeometryArrays, GeoSeries, GeoDataFrames, DaskGeoSeries and DaskGeoDataFrame that says "create an spatial index" rather than using the implicit behavior of creating a spatial index when sindex is accessed.

@jlstevens
Copy link
Collaborator

I would like there to be a more explicit API on GeometryArrays, GeoSeries, GeoDataFrames, DaskGeoSeries and DaskGeoDataFrame that says "create an spatial index" rather than using the implicit behavior of creating a spatial index when sindex is accessed.

I agree with this sentiment.

@philippjfr
Copy link
Member Author

philippjfr commented Jan 21, 2021

So I'm guessing that the fact that the implicitly created _sindex doesn't persist on the partitions is expected behavior for dask. I'm guessing dask makes (shallow) copies of the partition dataframes internally precisely to avoid you from performing any inplace modifications of the internal state.

@philippjfr
Copy link
Member Author

Added explicit build_sindex methods to all relevant types.

@jonmmease
Copy link
Collaborator

Taking a look... I'm pretty sure that the partition-level spatial indexes used to persist, but it also makes sense that this isn't something Dask would want a library to rely on 🙂

spatialpandas/dask.py Outdated Show resolved Hide resolved
@jonmmease
Copy link
Collaborator

These look like reasonable updates to me. And overall, it might just be easier to remove all implicit spatial index initialization, and just have methods that need one raise an error if it hasn't been initialized yet.

Added explicit build_sindex methods to all relevant types.

Did you push these changes? I didn't see them in the diff just now.

For DaskGeoDataFrame, I think we'd also want the option to initialize the partition-level spatial indexex with an option to the read_parquet_dask method.

@philippjfr
Copy link
Member Author

Sorry pushed now.

@jonmmease
Copy link
Collaborator

The build_sindex methods look good. Have you been able to confirm that spatial index in the resulting Dask data structures persists? It's not obvious to me why the sindex would persist in the return value, but not the input value since these both reference the same partition-level GeoDataFrames.

@philippjfr
Copy link
Member Author

philippjfr commented Jan 22, 2021

The build_sindex methods look good. Have you been able to confirm that spatial index in the resulting Dask data structures persists?

I have yes. It's also very unclear to me but here's the test cases I'm using (tested before and after this PR respectively):

df.cx[-19986136.0: -18986136.0, 445.5396728515625:10000].compute()

p = df.cx_partitions[-19986136.0:-18986136.0, 445.5396728515625:10000]

def report_sindex(df):
    print(df.geometry.array._sindex)
    return df

p.map_partitions(report_sindex, meta=p._meta).persist()
None
df = df.build_sindex().persist()

def report_sindex(df):
    print(df.geometry.array._sindex)
    return df

df.map_partitions(report_sindex, meta=df._meta).persist()
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x12ca72f40>
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x12d154a90>
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x12d29d1c0>
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x12d29d760>
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x130bdd3a0>
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x12d35d880>
<spatialpandas.spatialindex.rtree.HilbertRtree object at 0x12d1f1070>

@jonmmease
Copy link
Collaborator

Ok, looks good!

@philippjfr
Copy link
Member Author

Any opinion whether I should release this as spatialpandas 0.3.7 or 0.4.0?

@jonmmease
Copy link
Collaborator

Any opinion whether I should release this as spatialpandas 0.3.7 or 0.4.0?

Not a strong one. I'd lean toward 0.4.0 since it fixes a bug by adding to the API.

@philippjfr philippjfr merged commit 6bd3311 into master Jan 26, 2021
@ianthomas23 ianthomas23 deleted the cx_without_sindex branch September 8, 2023 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants