Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SpatialPandas design and features #1

Open
31 of 60 tasks
jonmmease opened this issue Oct 28, 2019 · 33 comments
Open
31 of 60 tasks

SpatialPandas design and features #1

jonmmease opened this issue Oct 28, 2019 · 33 comments
Labels
help wanted Extra attention is needed
Milestone

Comments

@jonmmease
Copy link
Collaborator

jonmmease commented Oct 28, 2019

spatialpandas

Pandas and Dask extensions for vectorized spatial and geometric operations.

This proposal is a plan towards extracting the functionality of the spatial/geometric utilities developed in Datashader into this separate, general-purpose library. Some of the functionality here has now (as of mid-2020) been implemented as marked below, but much remains to be done!

Goals

This project has several goals:

  1. Provide a set of pandas ExtensionArrays for storing columns of discrete geometric objects (Points, Polygons, etc.). Unlike the GeoPandas GeoSeries, there will be a separate extension array for each geometry type (PointsArray, PolygonsArray, etc.), and the underlying representation for the entire array will be stored in a single contiguous ragged array that is suitable for fast processing using numba.
  2. (partially done; see below) Provide a collection of vectorized geometric operations on these extension arrays, including most of the same operations that are included in shapely/geopandas, but implemented in numba rather than relying on the GEOS and libspatialindex C++ libraries. These vectorized Numba implementations will support CPU thread-based parallelization, and will also provide the foundation for future GPU support.
  3. Provide Dask DataFrame extensions for spatially partitioning DataFrames containing geometry columns. Also provide Dask DataFrame/Series implementations of the same geometric operations, but that also take advantage of spatial partitioning. This would effectively replace the Datashader SpatialPointsFrame and would support all geometry types, not only points.
  4. Provide round-trip serialization of pandas/dask data structures containing geometry columns to/from parquet. Writing a Dask DataFrame to a partitioned parquet file would optionally include Hilbert R-tree packing to optimize the partitions for later spatial access on read.
  5. Fast import/export to/from shapely and geopandas. This may rely on the pygeos library to interface directly with the GEOS objects, rather than calling shapely methods.

These features will make it very efficient for Datashader to process large geometry collections. They will also make it more efficient for HoloViews to perform linked selection on large spatially distributed datasets.

Non-goals

spatialpandas will be focused on geometry only, not geography. As such:

  • No built-in support for loading data from geography-specific file formats
  • No dependency on GDAL/fiona
  • No coordinate reference frame logic

Features

Extension Arrays

The spatialpandas.geometry package will contain geometry classes and pandas extension arrays for each geometry type

  • Point/PointArray: single point / array of points, one point per element. Maps to shapely Point class.
  • MultiPoint/MultiPointArray: multiple points / array of points, multiple points per element. Maps to shapely Points class.
  • MultiLines/MultiLinesArray: One or more lines / array of lines, multiple lines per element. Maps to shapely LineString and MultiLineString classes.
  • Ring/RingArray: Single closed ring / array of rings, one per element. Maps to shapely LinearRing class.
  • Polygon/PolygonArray: One or more polygons, each with zero or more holes / array of polygons with holes, multiple per element. Maps to shapely Polygon/MultiPolygon classes.

Spatial Index

  • The spatialpandas.spatialindex module will contain a vectorized and parallel numba implementation of a Hilbert-RTree.

  • Each extension array has an sindex property that holds a lazily generated spatial index.

Extension Array Geometric Operations

The extension arrays above will have methods/properties for shapely-compatible geometric operations. These are implemented as parallel vectorized numba functions. Supportable operations include:

  • area
  • length
  • bounds
  • boundary
  • buffer
  • centroid
  • convex_hull
  • covers
  • contains
  • crosses
  • difference
  • disjoint
  • distance
  • envelope
  • exterior
  • hausdorff_distance
  • interpolate
  • intersection
  • intersects_bounds
  • minimum_rotated_rectangle
  • overlaps
  • project
  • simplify
  • union
  • unary_union
  • affine_transform
  • rotate
  • scale
  • skew
  • translate
  • sjoin (spatial join) (see tools/sjoin) (partial, see intersection and spatial join support for Point arrays #21)

Only a minimal subset of these will be implemented by the core developers, but others can be added relatively easily by users and other contributors by starting with one of the implemented methods as an example, then adding code from other published libraries (but Numba-ized and Dask-ified if possible!).

Pandas accessors

Custom pandas Series accessor is included to expose these geometric operations at the Series level. E.g.

  • df.column.spatial.area returns a pandas Series with the same index as df, containing area values.
  • df.column.spatial.cx is a geopandas-style spatial indexer for filtering a series spatially using a spatial index.

Custom pandas DataFrame accessor is included to track the current "active" geometry column for a DataFrame, and provide DataFrame level operations. E.g.

  • df.spatial.cx will filter a dataframe spatially based on the current active geometry column.

Dask accessors

A custom Dask Series accessor is included to expose geometric operations on a Dask geometry Series. E.g.

  • ddf.column.spatial.area returns a dask Series with the same index as ddf, containing area values.

  • The accessor also holds a spatial index of bounding box of the shapes in each partition. This allows spatial operations (e.g. cx, spatial join) to skip processing entire partitions that will not contain relevant geometries.

  • A Custom dask DataFrame accessor is included that is exactly analogous to the pandas version.

Conversions

  • Fast conversions to and from geopandas will be provided, potentially relying on pygeos.

Parquet support

  • read/to parquet for Pandas DataFrames will be able to rely on the standard pandas parquet functions with extension array support.

  • Special read/to parquet functions will be provided for Dask DataFrames.

  • to_parquet will add extra metadata to the parquet file to specify the geometric bounds of each partition. There will also be the option for to_parquet to use Hilbert R-tree packing to optimize the partitions for later spatial access on read.

  • read_parquet will read this partition bounding-box metadata and use it to pre-populate the spatial accessor for each geometry column with a partition-level spatial index.

Compatibility

  • We would aim to use consistent naming between geopandas and spatialpandas whenever possible. Since spatialpandas will rely on a DataFrame accessor rather than a DataFrame subclass, spatial properties and methods are under the accessor rather than on the DataFrame.

For example, geodf.cx becomes spdf.spatial.cx.

  • Eventually, a subclass compatibility layer could likely be added that would simply dispatch certain class methods/properties to the spatial accessor.

Testing

  • The test suite will rely heavily on property testing (potentially with hypothesis) to compare the spatialpandas results to geopandas/shapely results.
@jonmmease
Copy link
Collaborator Author

@brendancol
Copy link

@jonmmease this sounds awesome, and while it's a serious amount of work, at least GEOS already exists to be used as reference while implementing.

@brendancol
Copy link

brendancol commented Oct 29, 2019

Also, one anecdotal comment...

Two or three years ago, I attempted to implement an RTree using Numba and had trouble with numba optimizing trees with varying numbers of nodes in each level. It kept falling back to python mode and I gave up after the RTree couldn't beat a "brute force" query of the rectangles using numba in nopython mode.

3 thoughts:

  1. My issue could have been user error (i.e. my error).
  2. Numba may now be able to handle varying number of nodes in each level without falling back to python mode.
  3. Numba may still have this limitation and @sklam may be a good person to comment on this thread.

@sklam
Copy link

sklam commented Oct 29, 2019

@brendancol, can you point me to the old RTree implementation, or any current code that needs to be numba-ified?

@brendancol
Copy link

brendancol commented Oct 29, 2019

@brendancol, can you point me to the old RTree implementation, or any current code that needs to be numba-ified?

So its been a long time since I looked at this...and certainly don't want to waste your time. The associated code is here: https://github.com/parietal-io/py-rbush

We were probably running tests, profiling, and then just stopped mid-stream...

To boil down the approach, the RTree was using numba classes with deferred types. A simple question you may be able to help answer is:

"Since 2017, have there been significant developments in numba JITed classes and deferred types?"

@sklam Maybe we could think up a simple example (like the BTree example) which uses a varying number of nodes per level to help reproduce this issue and / or demonstrate an approach.

@jonmmease sorry to take over this thread...we can move this numba discussion somewhere else...

@jonmmease
Copy link
Collaborator Author

@brendancol, no problem. I'm actually playing around with a numba RTree implementation right now.

For the use cases I have in mind, I don't think there's a need to iteratively update the RTree after creation. In this case, I think I could get away with an array-based implementation of the binary tree (something like https://www.geeksforgeeks.org/binary-tree-array-implementation/). At least that's what I'm trying at the moment. No performance to report yet 🙂

@brendancol
Copy link

@jonmmease Awesome! I think part of my problem was probably "thinking by analogy"...Myself and @chbrandt were doing a port of https://github.com/mourner/rbush

@sklam
Copy link

sklam commented Oct 29, 2019

Numba has support for a typed version of list, dict and heapq now. It would make things easier.

@jbednar
Copy link
Member

jbednar commented Oct 29, 2019

Jon, this sounds like a fabulous proposal to me! Some musings:

  • I think it would be good to lead the proposal (and thus any docs, blogs, home pages, etc.) with the goals that distinguish this project from other existing libraries. Otherwise, people can be confused and ask "why not just use PyGEOS", etc., until they reach the non-goals section. To my mind, those distinguishing and unique features are, in order:
    1. This project is explicitly independent of any assumptions about what space these spatial data structures and operations cover, in contrast to Shapely (which is "deeply rooted in the conventions of the geographic information systems (GIS) world" (even though it is often used outside of it)). To be useful for Datashader and other non-GIS applications, this project would need to have no such ties.
    2. As such, this project does not have any dependencies on the traditional geographic stack (GDAL, fiona, etc.), being built solely on top of general-purpose SciPy and PyData libraries and tools, which means that it should be easily installable using Conda or pip without any tricky binary-compatibility issues.
    3. All code included is pure Python compiled to machine code with Numba and able to run distributed and/or out of core using Dask (or Numba where appropriate), to allow it to be fully scalable for even the largest problems.
  • At the same time, we need to make it clear that the traditional geo stack will always support much more functionality than this library ever will; this should be very clearly stated and acknowledged.
  • As such, it is crucial that this library support fast and convenient conversion into data structures that can be used with the traditional stack, because it doesn't matter how fast its own operations are if users are forever having to spend code-development time filling in even the most trivial functions already available elsewhere.
  • We need these basic capabilities for Datashader, and are willing to create the initial library and maintain it for Datashading purposes, but it should be clear that the Datashader maintainers are responsible only for the most basic function, and that any expansion of the supported functionality beyond that would be for the broader community to contribute. E.g. it seems to me that the initial library should support (and thus validate and test) only a small fraction of the geometric options listed, unless e.g. we could validate our tool against the existing tools' existing test suites. Most of the utilities involved are quite straightforward to implement and don't add up to much code, but they are well outside the purview of Datashader's maintainers, and so I'm reluctant for us to own the responsibility for documenting them, testing them, and tracking down any issues found in the tests.

@jonmmease
Copy link
Collaborator Author

@brendancol, here's the numba R-tree implementation I came up with today https://github.com/jonmmease/spatialpandas/blob/master/spatialpandas/spatialindex/rtree.py

I'll make up some benchmarks at some point, but what I'm seeing is that it matches brute force up to around 1000 elements, and then does significantly better the larger the inputs size gets. Several hundred times faster at a million elements. It also matches or is a bit faster than the rtree package, and its much faster to create the initial index than rtree. All of that is for a single lookup. What's really exciting is that these intersection calculations can be called from within numba functions. So I think it should be possible to write a really rast numba implementation of a spatial join on top of this.

@brendancol
Copy link

brendancol commented Oct 29, 2019

@brendancol, here's the numba R-tree implementation I came up with today https://github.com/jonmmease/spatialpandas/blob/master/spatialpandas/spatialindex/rtree.py

I'll make up some benchmarks at some point, but what I'm seeing is that it matches brute force up to around 1000 elements, and then does significantly better the larger the inputs size gets. Several hundred times faster at a million elements. It also matches or is a bit faster than the rtree package, and its much faster to create the initial index than rtree. All of that is for a single lookup. What's really exciting is that these intersection calculations can be called from within numba functions. So I think it should be possible to write a really rast numba implementation of a spatial join on top of this.

@jonmmease Sick!!! So excited to use this! Also another advantage to consider...libspatialindex isn't threadsafe and I've seen a bunch of seg faults related to rtree and multiprocessing. This will be super helpful for the Python GIS community!

Question: This has value outside of pandas. New standalone RTree library?

@jonmmease
Copy link
Collaborator Author

Question: This has value outside of pandas. New standalone RTree library?

Maybe eventually, sure. I'll probably want to wait to make sure it actually handles the usecases here before splitting it off though.

@jorisvandenbossche
Copy link

@jonmmease thanks a lot for writing up this proposal!

To repeat what I said in the datashader issue: I think there is certainly value in having a good, reusable implementation of geometry data support with a ragged array representation.
Below already some comments on the goals (from a geopandas developer, so somewhat critical :-)), will try write down more comments on the rest of the discussion later.

About the goals:

Goals 1 (ragged-array based extension arrays) and 2 (numba-based algorithms) are IMO distinct goals for a project like this.
But goal 3 (dask extension to have spatially partitioned DataFrames) and 4 (round-trip serialization with parquet making use of spatial index) are perfectly valid (and realistic) goals for GeoPandas as well (it's also what was written in the SBIR project, and more or less what the POC in https://github.com/mrocklin/dask-geopandas/ is doing (partially)).
And moreover, I think implementations of those, could be rather ignorant of the actual representation of the geometries under the hood.

So I would love to see or explore if we can build common infrastructure for this, instead of each of the projects doing this themselves.

One possible way (but need to think this through more) is to make the dask extensions, IO functionality, etc to work with different kinds of GeometryArrays (if they can rely on the public methods of those, that might be possible). But will think about this aspect some more.

@jbednar
Copy link
Member

jbednar commented Oct 30, 2019

Just to clarify, goal 4 is only in service of the other goals -- anything we work on needs to have some sort of persistence/serialization format, so that it's practical to work with, and the goal is not to provide one (if at all possible), just for there to be one. So it would be ideal if there is such a format available that both GeoPandas and this non-GeoPandas code can both use.

In any case, I'm very excited about the potential for having algorithms, persistence formats, etc. that can work with both GEOS and a Python-native, Numba-izable object, which would be amazing if we can achieve that.

@jonmmease
Copy link
Collaborator Author

But goal 3 (dask extension to have spatially partitioned DataFrames) and 4 (round-trip serialization with parquet making use of spatial index) are perfectly valid (and realistic) goals for GeoPandas as well...

Thanks, yeah that certainly makes sense. Is there consensus toward what would make sense as a parquet serialization encoding for geopandas?

(it's also what was written in the SBIR project, and more or less what the POC in https://github.com/mrocklin/dask-geopandas/ is doing (partially)). And moreover, I think implementations of those, could be rather ignorant of the actual representation of the geometries under the hood... So I would love to see or explore if we can build common infrastructure for this, instead of each of the projects doing this themselves.

ok, great. I'll take a look at dask-geopandas to get a sense of how things are structured there. At the Dask level, I think the main thing that is needed is for the Dask DataFrame geometry columns to have metadata containing the bounding box of each partition.

It might be nice if there was some general concept of ExtensionArray metadata that the Dask-level ExtensionArray would have access to for each of it's partitions. Ideally, there would be a standard way to store this metadata in the parquet partitions so that it would automatically be persisted when using the standard dask parquet functions.

@jorisvandenbossche
Copy link

Is there consensus toward what would make sense as a parquet serialization encoding for geopandas?

For geopandas we are currently working on parquet support, using WKB (geopandas/geopandas#1180). Which is different that nested lists, but I think we good metadata it should be possible to support different formats.

ok, great. I'll take a look at dask-geopandas to get a sense of how things are structured there. At the Dask level, I think the main thing that is needed is for the Dask DataFrame geometry columns to have metadata containing the bounding box of each partition.

Know that the dask-geopandas repo is outdated (so probably won't work with current dask), and was only a small proof of concept.
But I think the main question here is how extend dask with the spatial partitioning information. I was under the assumption that that would require a subclass (see also dask/dask#5438). Currently dask dataframes use the index for partition bounds, but a subclass could override that with spatial bounds. Whether it is possible to add this as metadata only to the geometry columns and have this information available in the algorithms (which would avoid the need of a subclass), that I don't know.

@jbednar
Copy link
Member

jbednar commented Jul 16, 2020

@jonmmease, I've updated your initial comment on this issue to put checkboxes on all of the features that I think have now been implemented, and unticked checkboxes on all those not yet implemented. If you get a chance, it would be great if you could skim through that list and check my guesses, to see if I got any wrong that you can update.

@jbednar jbednar changed the title spatialpandas proposal SpatialPandas design and features Jul 16, 2020
@jonmmease
Copy link
Collaborator Author

Looks good, I updated a couple check boxes:

I didn't make any changes for this above, but I didn't end up using the data frame accessor paradigm. It turned out to be cleaner to use subclasses the way GeoPandas does. In the beginning, didn't realize how much support panda and dask.dataframe already had for supporting custom subclasses. So this ended up being easier, and resulted in a more GeoPandas compatible API.

@benbovy
Copy link

benbovy commented Jul 27, 2020

This project looks very interesting!

On a related topic, @willirath and I started working on xoak, which extends xarray with spatial indexes (currently based on scikit-learn's BallTree, as a proof-of-concept).

From what I've read here, we have many goals in common, i.e., domain-agnostic features and package dependencies (only PyData core libraries like Numba and Dask), support parallel/distributed indexing with Dask (see xarray-contrib/xoak#1 (comment) for a concrete use case).

The rtree index implemented in spatialpandas thus has certainly a lot of value outside of Pandas, and I'd be very excited to contribute to a standalone library that could be reused by xarray as well. Supporting such index is part of xarray's development roadmap, that we are going to tackle very soon. More specifically, it would be great to have a more flexible implementation like in scikit-learn where BallTree and KdTree are built on top of a common, static binary tree, but here using Numba rather than Cython (although I don't know if it would be easy to do this with Numba).

@jonmmease
Copy link
Collaborator Author

Hi @benbovy, I'm not so involved with this project any more, but I think it would be great to pull out the Hilbert RTree implementation into a standalone package if you would find it useful! Have you looked through the implementation, and does the approach make sense to you? We don't have any specific documentation for it at this point, but it's well tested against RTree and I think the API and code are fairly well documented through docstrings and inline comments.

I won't be able to help out much with actually standing up a new package for it, but I'm happy to answer any questions that come up.

@jbednar
Copy link
Member

jbednar commented Jul 27, 2020

I'm the de facto owner of spatialpandas right now, but only reluctantly, in that it fills a key unmet need for our HoloViz.org visualization toolchains but I think it really should have a life outside of that purpose and I'd love someone else to run with it from here. In the meantime, I would be very happy to see any of its code disappear into any well-maintained library that we could then simply import, assuming that such a library only depends on the core PyData libraries (xarray, pandas, numpy, Numba, Dask, etc.).

@jbednar jbednar added this to the Wishlist milestone Jul 27, 2020
@jbednar jbednar added the help wanted Extra attention is needed label Jul 27, 2020
@benbovy
Copy link

benbovy commented Jul 29, 2020

I tend to agree with @jorisvandenbossche's comment

Goals 1 (ragged-array based extension arrays) and 2 (numba-based algorithms) are IMO distinct goals for a project like this.

I'm wondering if something like Awkward wouldn't help eventually. I've been watching the SciPy 2020 presentation and was pretty amazed, especially when seeing how it integrates with numba.

Awkward is still in heavy development and I don't know if how it will evolve in the future, but it has much potential to be reused for goal 1 without the need to maintain independent implementations of numpy/numba compatible ragged-arrays, IMO. I think it could even provide a common basis for supporting spatial structures in both (geo)pandas and xarray (see, e.g., scikit-hep/awkward#350 and scikit-hep/awkward#27 (comment)).

Regarding goal 2, I'd love to see flexible numba implementation of spatial indexes in a well-maintained, separate library. I think that some people coming from the xarray side (including me) would be willing to contribute on a regular basis. Here too Awkward's numba integration might help if we need to deal with more complex (dynamic) data structures than simple arrays.

@benbovy
Copy link

benbovy commented Jul 29, 2020

Have you looked through the implementation, and does the approach make sense to you?

I had a brief look at it. I haven't worked extensively on the implementation of such algorithms yet, but overall the code seems already pretty clear and well documented to me, thanks for that @jonmmease!

I won't be able to help out much with actually standing up a new package for it, but I'm happy to answer any questions that come up.

Will do if needed, thanks!

@brendancol
Copy link

@jonmmease @benbovy I really like the roadmap here! I'm a datashader contributor and maintainer of xarray-spatial

Xarray-Spatial is looking to add a numba-ized version of GDAL's polygonize and we targeting spatialpandas.PolygonArray as the return type. My hope is that I'll get up-to-speed with spatialpandas internals and then can actually help with contributions.

Keep me in mind for as a contributor, happy to chat features / do zooms.

@jbednar
Copy link
Member

jbednar commented Aug 8, 2020

@benbovy , Awkward is really cool! We looked closely into it when developing spatialpandas, including discussions with Jim Pivarski in January. @jonmmease probably remembers the details better, but my own (sketchy) recollection was that it was very promising but not quite ready to do what we needed at the time. Note that spatialpandas already does rely on Numba, so I don't think there's necessarily any special benefit to using Awkward for that particular purpose.

@jonmmease
Copy link
Collaborator Author

We looked closely into it when developing spatialpandas, including discussions with Jim Pivarski in January. @jonmmease probably remembers the details better...

I think it was mostly a matter of what was already supported in Awkward as of last fall when I started working on spatialpandas. In particular we were looking for native Arrow/Parquet serialization, which it looks like was recently added to Akward (scikit-hep/awkward#343).

I didn't get far enough into researching Awkward to understand whether it possible to use a schema to define the number of allowed nesting levels. There would also be the question of how to distinguish geometry types that share identical representations (e.g. MultiLine and Polygon).

Most of the complexity in spatialpandas that isn't directly related to geometric algorithms is in the geometry extension array implementations, so it would definitely be a win in terms of reduced code complexity of Awkward could be adopted.

@jorisvandenbossche
Copy link

jorisvandenbossche commented May 7, 2021

Sorry to hijack this thread, but thought to give a heads up here since several potentially interested people might follow this one (also cc @brl0).

During the upcoming Dask Summit, we have a 2-hour workshop scheduled about scaling geospatial vector data on Thursday May 20th at 11-13:00 UTC (https://summit.dask.org/schedule/presentation/22/scaling-geospatial-vector-data/).

First of all: you're of course all welcome to attend/participate!
But my specific question: it would be interesting to have someone from spatialpandas (or familiar more familiar with spatialpandas) give a brief presentation about spatialpandas. There is quite some potential overlap with dask-geopandas (in goals and implementation), and I think it would be interesting to discuss potential synergies / better compatibility between both.

We can also follow-up on geopandas/community#4 (general issue about the workshop).

@jonmmease
Copy link
Collaborator Author

@jorisvandenbossche. I don't know what @jbednar's current thinking is regarding spatialpandas, but I don't plan to personally work on it any further.

That said, I'd be happy to summarize what's been done in case there are any ideas worth learning from / stealing. I think the main idea that would be relevant to the Dask discussion would be the use of a space filling curve and R-tree to make it possible to use the regular Dask DataFrame partition index as a spatial index as well (rather than creating a whole new Dask data structure).

Does that sound good?

@jbednar
Copy link
Member

jbednar commented May 12, 2021

I think spatialpandas is great, but at Anaconda we are only planning to take it forward if we get specific funding to do so or if it's required for a specific project. Otherwise we just plan to fix bugs in it when that's required. I'd hope @brl0 would be able to push it forward. We are also looking to Awkward for possible implementations of ragged arrays that Datashader can use.

I'd love for any elements of SpatialPandas that are useful to GeoPandas or Dask-GeoPandas to migrate to one of those projects, but note that for our purposes as maintainers of general-purpose software we cannot depend on fiona or gdal, and thus as far as I know cannot simply delete SpatialPandas in favor of Dask-GeoPandas even if Dask-GeoPandas subsumed SpatialPandas functionality.

I'll try to attend the workshop, but @jonmmease knows a lot more about the internals, so I would just discuss such scoping and roadmap issues.

@jorisvandenbossche
Copy link

@jonmmease indeed, the spatial (re)partitioning concepts of spatialpandas are most relevant I think for the discussion, so a brief overview of the items you list would be great. Thanks a lot in advance!

note that for our purposes as maintainers of general-purpose software we cannot depend on fiona or gdal

Fiona/gdal is not a hard requirement for GeoPandas, you only need it if you want to use it for IO (this has always been the case, but it's only since the last release we actually stopped importing fiona on geopandas import (which created this seemingly hard dependency), but only import it when being used).
The only "non-Python" required dependency is GEOS, which is a simple C++ library without any external dependencies (I don't think it's a harder dependency as eg numpy).

@jbednar
Copy link
Member

jbednar commented May 14, 2021

For us "dependency" is defined both in terms of "will the software be usable by us for what we want to do, without that package" and "does either the pip or the conda package depend on that"? If GeoPandas can make both of those dependencies go away, I'd love to see some reunification here!

@jonmmease
Copy link
Collaborator Author

One thing that @jorisvandenbossche and I talked about a while back is whether it would be practical for GeoPandas to define an interface for the "Geometry Column" so that something like the spatialpandas contiguous memory representation could be used as an alternative to the default array of GEOS/Shapely objects representation.

I believe this functionality, combined with dask-geopandas and the removal of fiona/gdal as hard dependencies, is everything that would be needed for Datashader to be able to switch from spatialpandas to GeoPandas. And maybe in that scenario, spatialpandas shrinks down to only consist of the implementation of the Geometry Column interface.

@jbednar
Copy link
Member

jbednar commented May 14, 2021

And then we can sneak that alternative Geometry Column interface as an option in GeoPandas somewhere and our work here is done! Oh, did I say that out loud? :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

6 participants