Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GeometryIndex #7

Merged
merged 28 commits into from
Nov 23, 2022
Merged

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Nov 18, 2022

This new index replaces ShapelySTRTreeIndex and provides the same functionality in addition to all the basic features of xarray.indexes.PandasIndex. The STRTree is not fully initialized until we use it.

After trying both, I found that encapsulating PandasIndex was a bit easier and cleaner than inheriting from it, i.e., we can define an alternative __init__ signature (crs) that works well with the .from_variables() class method (note: Index.__init__() is rarely used directly).

@codecov
Copy link

codecov bot commented Nov 18, 2022

Codecov Report

Base: 24.39% // Head: 97.72% // Increases project coverage by +73.33% 🎉

Coverage data is based on head (6d770d7) compared to base (5198341).
Patch coverage: 97.72% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff             @@
##             main       #7       +/-   ##
===========================================
+ Coverage   24.39%   97.72%   +73.33%     
===========================================
  Files           1        1               
  Lines          41      132       +91     
===========================================
+ Hits           10      129      +119     
+ Misses         31        3       -28     
Impacted Files Coverage Δ
xvec/index.py 97.72% <97.72%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Remove Index API that is meant to be only used via Xarray internals.
@benbovy
Copy link
Member Author

benbovy commented Nov 18, 2022

I think this is ready if someone wants to have a look (@martinfleis @keewis).

Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is superb!

What about calling the index GeometryIndex instead of GeoVectorIndex? Would that be ambiguous? We can later add something like GeographyIndex when S2 becomes ready.

Later, we should also make a decorator fetching the docstring for isel, create_variables etc. from the original PandasIndex.

pyproject.toml Outdated Show resolved Hide resolved
xvec/__init__.py Show resolved Hide resolved
xvec/index.py Outdated Show resolved Hide resolved
xvec/index.py Outdated Show resolved Hide resolved
xvec/index.py Outdated
*,
options: Mapping[str, Any],
):
# TODO: try getting CRS from coordinate attrs or GeometryArray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# TODO: try getting CRS from coordinate attrs or GeometryArray
# TODO: try getting CRS from coordinate attrs or GeometryArray or SRID

Just so we don't forget about this option as well. It is not common but when loading geoms from PostGIS, this may be useful.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can somehow get xarray to extract the CRS from the GeometryArray and store it on the object? That would probably have to be done in a hook that would be defined here... some sort of array type → hook callable mapping where the hooks return either the full Variable object or a data, attrs tuple, maybe?

Also, that would require some thought into how to store the crs. Not sure if we could copy whatever rioxarray / odc-geo are doing, since those seem to have one global crs per DataArray / Dataset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it would be nice to extract the CRS from the GeometryArray, but for dimension coordinates Xarray currently converts the GeometryArray into a plain numpy array, therefore losing the CRS info. We need to fix that when refactoring IndexVariable in Xarray.

Also, that would require some thought into how to store the crs. Not sure if we could copy whatever rioxarray / odc-geo are doing, since those seem to have one global crs per DataArray / Dataset?

With a CRSIndex (corteva/rioxarray#588) or RasterIndex built from three coordinates (2D spatial + spatial_ref), having one global crs per DataArray / Dataset rioxarray may not be required anymore and the conventions regarding the coordinate names may even be relaxed.

While for rasters it makes sense to have a scalar coordinate holding the CRS info to avoid duplicating it, for vector data the coordinate <-> index relationship is 1:1 so we could add the CRS info directly as attribute(s) of the vector coordinate.

It may be a good idea to add CRS info (a serializable version) as a coordinate attribute, if not present. So basic CRS info (identifier) would be accessible from the DataArray / Dataset repr, and the full CRS object would be accessible from the index object. E.g.,

>>> ds.geom_coord.attrs["crs"]
'EPSG:4267'
>>> ds.xindexes["geom_coord"].crs
<pyproj.crs.CRS>

xvec/index.py Show resolved Hide resolved
xvec/index.py Outdated
Comment on lines 131 to 134
# check for possible CRS of geometry labels
# (by default assume same CRS than the index)
if hasattr(label_array, "crs") and label_array.crs != self.crs:
raise ValueError("conflicting CRS for input geometries")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also need a change if crs becomes optional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we do when label CRS is not None and index CRS is None? Raise or ignore the label CRS?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In geopandas we warn that the two CRS are not matching but let the operation happen. I'd do the same.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for being consistent with geopandas. Is it for all operations that could depend on a CRS?

(Note: it might be tricky to set the right stacklevel for the warnings here since the index API is called in various places of Xarray's internals, but not a big deal I think).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When two CRS are checked against each other like in overlay or sjoin, if one gdf has a CRS and the other does not, it always warns and executes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I wonder if we shouldn't have a special case for .sel(). I find it convenient to just pass one or more (list/array) shapely geometries as input labels without an explicit CRS. A warning may be annoying.

xvec/index.py Outdated Show resolved Hide resolved
xvec/index.py Show resolved Hide resolved
xvec/index.py Outdated Show resolved Hide resolved
benbovy and others added 3 commits November 22, 2022 13:25
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
@benbovy
Copy link
Member Author

benbovy commented Nov 22, 2022

Thanks @martinfleis for the review! I agree CRS could be optional.

What about calling the index GeometryIndex instead of GeoVectorIndex? Would that be ambiguous? We can later add something like GeographyIndex when S2 becomes ready.

Yes I agree it would make sense to rename it if later we add another index for geographic (S2) features, although alternatively we might want to reuse the same index class for both shapely/GEOS and S2 and select the right backend depending on the coordinate elements and/or the CRS?

Should we pick a name that refer to both vector and georeferenced (even if optional) geometrical features?

@martinfleis
Copy link
Member

alternatively we might want to reuse the same index class for both shapely/GEOS and S2 and select the right backend depending on the coordinate elements and/or the CRS?

I'd like to have the behaviour of this the same as we will have in geopandas which is unfortunately not been discussed yet... If geopandas implements S2 under the geometry property, then we should also have one class to deal with both. If we decide to use something like geography property instead, we should have two classes here to be consistent. I am not sure if something on that is in geopandas/community#10 but there was no decision yet.

Should we pick a name that refer to both vector and georeferenced (even if optional) geometrical features?

to understand properly, by "vector" you mean an numpy array of shapely geoms and georeferenced the same with assigned CRS? I am not sure I follow the distinction here.

@benbovy
Copy link
Member Author

benbovy commented Nov 22, 2022

Ah I better understand your point now. I'm +1 for consistency with geopandas and for using GeometryIndex. Depending on what will be decided, we could still rename the index class later (with a temporary alias).

@benbovy benbovy changed the title Add GeoVectorIndex Add GeometryIndex Nov 22, 2022
Try being consistent with geopandas.
@benbovy
Copy link
Member Author

benbovy commented Nov 22, 2022

In the last commit I took and adapted some utility functions from geopandas.

For concat this is similar than in geopandas (i.e., raises if different non-empty CRS are found, otherwise maybe warns). For join and reindex_like it maybe warns (no error) but I'm not sure if this is what we really want for alignment. Shouldn't we do like for concat?.

Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For concat this is similar than in geopandas (i.e., raises if different non-empty CRS are found, otherwise maybe warns). For join and reindex_like it maybe warns (no error) but I'm not sure if this is what we really want for alignment. Shouldn't we do like for concat?.

I think that in all these cases, we should raise. I know that geopandas only warns in some cases, but all these are more similar to concat than sjoin. I even remember discussion that not even sjoin should allow that and raise but we never proposed a change there.

Apart from this and the notes in code I'd merge this and make further changes in follow-up PRs.

xvec/index.py Outdated Show resolved Hide resolved
xvec/index.py Outdated

# check for possible CRS of geometry labels
# (by default assume same CRS than the index)
label_crs = getattr(label_array, "crs", None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this situation when label_array has crs attribute actually happen? Since we cast all inputs to the numpy.array above, even if we have something like a GeometryArray, we would strip it of CRS. Codecov marking l. 193 as untested seems to prove that.

I think we need to check for crs of labels not label_array.

Suggested change
label_crs = getattr(label_array, "crs", None)
label_crs = getattr(labels, "crs", None)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it depends (note: labels is a dict and label is the extracted value).

  • If label is a GeometryArray then yes we need to check label.crs
  • If label is a xarray.Variable or xarray.DataArray wrapping a GeometryArray (i.e., the ._data attribute), we need to check label_array.crs

The latter is not supported yet in Xarray, though, and we still need to figure out a more general way to extract CRS from various input types. So I suggest to remove the CRS check here and address that in a follow-up PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me. Let's move this to an issue to keep track.

xvec/index.py Outdated
Comment on lines 260 to 262
def _repr_inline_(self, max_width: int):
srs = _format_crs(self.crs, max_width=max_width)
return f"{self.__class__.__name__} (crs={srs})"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _repr_inline_(self, max_width: int):
srs = _format_crs(self.crs, max_width=max_width)
return f"{self.__class__.__name__} (crs={srs})"
def _repr_inline_(self, max_width: int):
if max_width is None:
max_width = 50
srs = _format_crs(self.crs, max_width=max_width)
return f"{self.__class__.__name__} (crs={srs})"

Somehow this is raising a TypeError. Cell 6 in the Quickstart notebook:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/IPython/core/formatters.py:342, in BaseFormatter.__call__(self, obj)
    340     method = get_real_method(obj, self.print_method)
    341     if method is not None:
--> 342         return method()
    343     return None
    344 else:

File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/common.py:167, in AbstractArray._repr_html_(self)
    165 if OPTIONS["display_style"] == "text":
    166     return f"<pre>{escape(repr(self))}</pre>"
--> 167 return formatting_html.array_repr(self)

File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting_html.py:318, in array_repr(arr)
    316 if hasattr(arr, "xindexes"):
    317     indexes = _get_indexes_dict(arr.xindexes)
--> 318     sections.append(index_section(indexes))
    320 sections.append(attr_section(arr.attrs))
    322 return _obj_repr(arr, header_components, sections)

File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting_html.py:195, in _mapping_section(mapping, name, details_func, max_items_collapse, expand_option_name, enabled)
    188 expanded = _get_boolean_with_default(
    189     expand_option_name, n_items < max_items_collapse
    190 )
    191 collapsed = not expanded
    193 return collapsible_section(
    194     name,
--> 195     details=details_func(mapping),
    196     n_items=n_items,
    197     enabled=enabled,
    198     collapsed=collapsed,
    199 )

File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting_html.py:155, in summarize_indexes(indexes)
    154 def summarize_indexes(indexes):
--> 155     indexes_li = "".join(
    156         f"<li class='xr-var-item'>{summarize_index(v, i)}</li>"
    157         for v, i in indexes.items()
    158     )
    159     return f"<ul class='xr-var-list'>{indexes_li}</ul>"

File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting_html.py:156, in <genexpr>(.0)
    154 def summarize_indexes(indexes):
    155     indexes_li = "".join(
--> 156         f"<li class='xr-var-item'>{summarize_index(v, i)}</li>"
    157         for v, i in indexes.items()
    158     )
    159     return f"<ul class='xr-var-list'>{indexes_li}</ul>"

File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting_html.py:139, in summarize_index(coord_names, index)
    136 name = "<br>".join([escape(str(n)) for n in coord_names])
    138 index_id = f"index-{uuid.uuid4()}"
--> 139 preview = escape(inline_index_repr(index))
    140 details = short_index_repr_html(index)
    142 data_icon = _icon("icon-database")

File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting.py:413, in inline_index_repr(index, max_width)
    411 def inline_index_repr(index, max_width=None):
    412     if hasattr(index, "_repr_inline_"):
--> 413         repr_ = index._repr_inline_(max_width=max_width)
    414     else:
    415         # fallback for the `pandas.Index` subclasses from
    416         # `Indexes.get_pandas_indexes` / `xr_obj.indexes`
    417         repr_ = repr(index)

File ~/Git/xvec/xvec/index.py:261, in GeometryIndex._repr_inline_(self, max_width)
    260 def _repr_inline_(self, max_width: int):
--> 261     srs = _format_crs(self.crs, max_width=max_width)
    262     return f"{self.__class__.__name__} (crs={srs})"

File ~/Git/xvec/xvec/index.py:21, in _format_crs(crs, max_width)
     18 else:
     19     srs = "None"
---> 21 return srs if len(srs) <= max_width else " ".join([srs[:max_width], "..."])

TypeError: '<=' not supported between instances of 'int' and 'NoneType'

This bit seems to override our default with None

File ~/mambaforge/envs/xvec_dev/lib/python3.10/site-packages/xarray/core/formatting.py:413, in inline_index_repr(index, max_width)
    411 def inline_index_repr(index, max_width=None):
    412     if hasattr(index, "_repr_inline_"):
--> 413         repr_ = index._repr_inline_(max_width=max_width)

We need to catch that and override in that case. Either here like in the code suggestion above or in _format_crs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, we can fix it here but I think this should rather be fixed in xarray.core.formatting.inline_index_repr? cc @keewis.

Let's address later the extraction of CRS info from various input types
in a more general way?
Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go! Thanks a lot!

@benbovy
Copy link
Member Author

benbovy commented Nov 23, 2022

Just adding a few more tests before merging

@benbovy
Copy link
Member Author

benbovy commented Nov 23, 2022

Good to go now :). Thanks again for the helpful comments and suggestions!

@martinfleis martinfleis merged commit 2f1b032 into xarray-contrib:main Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Suggestions for ShapelySTRTreeIndex
3 participants