Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regridding to a list of polygons (spatial averaging) #24

Merged
merged 41 commits into from
Nov 25, 2020

Conversation

aulemahal
Copy link
Collaborator

@aulemahal aulemahal commented Sep 10, 2020

This PR adds support for a list of polygons as arguments of Regridder. When used in combination with a conservative method, the regridding output is equivalent to a spatial averaging of the input data on each polygon.

Uses shapely for the polygon implementation. The list of polygon is translated as a Mesh with multiple elements of various number of nodes. In the regridding process, the same behaviour than having locstream as output is used, thus regridded data has a locations dim that matches the order of the polygon list.

The Regridder will flatten the list of polygons (MultiPolygons object will be expanded) and will ignore potential holes in polygons. It makes its job easier, but is problematic for the end user. I thus created the SpatialAverager subclass that takes care of theses caveats. It divides the list of polygons into exteriors (remembering the "owners" of flattenned out MultiPolygons) and holes and then creates the weights using 2 Regridder calls. Weights are merged (flattenned polygons re-combined and holes subtracted) and normalized (requiring a conversion from COO sparse matrices to CSC). The SpatialAverager is then instantiated with these merged weights.

A new notebook was added and the "current limitations" page was update to highlight that this is not a efficient and elegant implementation of meshes, but a step forward.

Closes #11

Previous comment:

STILL A DRAFT. Before going forward, I wanted to discuss the following : Using GeoPandas we could make the use of this even more elegant. When ds_out is GeoDataFrame, the output could have the same index along "locations", instead of a simple integer one. Moreover, this would allow the implementation of a `spatial_averaging` method that could manage the holes and MutliPolygon cases. Right now, when a MultiPolygon is encountered it is flattened out, split into its polygon parts. And holes are simply ignored. The downside would be all new dependencies (fiona and pyproj), it could be made optional tho?

@aulemahal aulemahal marked this pull request as draft September 10, 2020 22:02
@aulemahal
Copy link
Collaborator Author

@andersy005 @huard @raphaeldussin I'd like to have your input on adding geopandas + fiona + pyproj to the dependencies, either by default or as optional dependencies in order to have a cleaner PR. Accepting GeoDataframes would be more elegant than lists of shapely polygons, and would allow for a quite straightforward spatial averaging method : xesmf.spatial_averages(ds, geo_df) -> pd.Series with the same index as geo_df.

@raphaeldussin
Copy link
Contributor

I'm all for using the libs that will make the code more elegant and thus easier to maintain!
Unsure about making the requirements optional or by default. Are we expecting potential conflicts between these
and the "core" dependencies?

I'm not super familiar with the Mesh part of ESMF, is the regridding possible from grid -> mesh and mesh -> grid ?
It looks like you can do mesh -> locstream, unsure about locstream -> mesh

I think we should try to allow all the possible in/out type combination that are supported by ESMF.

What do you think?

@huard
Copy link
Contributor

huard commented Sep 14, 2020

The trade-off here is having to worry about these additional dependencies. In my experience, these are not "light" dependencies. I agree they would make the user-interface a lot friendlier. However, my concern is that they'll complicate the job of maintainers.

Could we think about a compromise? For example, we could have a notebook that shows how to interface geopandas with xESMF. We run the notebook as part of the test suite, but geopandas is not part of the installation hard requirements.

@andersy005
Copy link
Member

Another alternative is to make these additional packages soft dependencies by moving the imports of these additional packages inside the functions themselves

i.e.

def func(ds):
  try:
    from shapely.geometry import MultiPolygon
  except ImportError as exc:
    message = f"shapely package is missing... Please install it via pip or conda: conda install shapely..."
    raise exc(message)
  # Do actual work here
  ...

rather than

from shapely.geometry import MultiPolygon
def func(ds):
  # Do actual work here
  ...

@aulemahal
Copy link
Collaborator Author

Thanks for the replies!
I think I will try to implement a "spatial_averaging" method without geopandas first, and make it optional if I really need it.

The last commits add the polylist_in option and we now have possibility to regrid grid or locstream to/from list of polygons, with the same method restrictions as for ESMF. (No restrictions for grid <-> Mesh, some for locstream <-> Mesh).

However, I was not able to make the nearests, bilinear and patch methods work with lists including polygons of more than 4 nodes. The documentation of ESMF says these should work, but I get a rc=506 error:
PET0 ESMCI_Array.C:1319 ESMCI::DD::setupSeqIndexFactorLo Invalid argument - factorIndexList contains seqIndex outside Array bounds
The very last commit uses a complex if statement to raise more meaningful error messages. May be the number of nodes is not the problem, but in my few tests that was the only difference that made the code break.
(P.S. : "nearest_s2d" works with polygons as input, "nearest_d2s" with polygons as output)

@aulemahal
Copy link
Collaborator Author

The SpatialAverager is ready for review! I edited the root comment to explain what this PR does.

I finally didn't need geopandas for the implmentation itself. However, I added it as a doc requirement and used it in my example notebook.

@aulemahal aulemahal marked this pull request as ready for review September 23, 2020 22:05
@aulemahal
Copy link
Collaborator Author

aulemahal commented Sep 23, 2020

Caveats of this PR:

  1. For changes in the Regridder obj. Some regridding methods involving Meshes that ESMF supports do not work here. The conditional statements in the init are quite complicated but here is the summary:
  • With a list of polygons as input:
    • "bilinear" and "patch" never work, but should.
    • "nearest_d2s" only works when polygons have 4 or fewer nodes.
  • With a list of polygons as output:
    • 'bilinear', 'patch' and 'nearest_d2s' only work when polygons have 4 or fewer nodes.

In all cases, the error is as said above: rc=506 error, PET0 ESMCI_Array.C:1319 ESMCI::DD::setupSeqIndexFactorLo Invalid argument - factorIndexList contains seqIndex outside Array bounds.

  1. For the SpatialAverager. Spatial averaging should doable with a list of polygons as input. But the complexity of managing all the mutli-part geometries and holes cases was beyond the time I can put on this PR. I leave it for the future, if a need emerges.

Copy link
Contributor

@huard huard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First batch of comments.
I'll take some more time to try it out.

doc/limitations.rst Outdated Show resolved Hide resolved
xesmf/backend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/util.py Outdated Show resolved Hide resolved
xesmf/util.py Outdated Show resolved Hide resolved
@huard
Copy link
Contributor

huard commented Oct 1, 2020

My main concern is to see the Regridder class getting bloated by all these options for inputs and outputs. If later we implement support for meshes, the logic will become even messier. Also, I think we're abusing the intent for ds_in and ds_out by passing polygons instead of datasets as the name implies.
Question: Is there a use case for polygon_in (that's not already covered by locstreams)? If not, we could decide to not support this and simplify the API.

My suggestion (for discussion) would be to

  1. Create a BaseRegridder class only accepting ESMF objects (Grids, Mesh, LocStream).
  2. Refactor Regridder to use BaseRegridder, including the current logic to convert datasets to either Grids or LocStream.
  3. Refactor SpatialAverager to use BaseRegridder and hold all the polygon handling logic. Note that the weights computation could probably overload _compute_weights instead of living in the __init__.

@mathause You might be interested to review this PR and provide your view on the implementation and the API.

Copy link
Contributor

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar enough with ESMF to comment on the implementation but I would agree that the interface feels overloaded & using polylist_in/out seems unwieldy. @huard's suggestion seems reasonable to me.

setup.py Outdated
@@ -16,7 +16,7 @@
if on_rtd:
INSTALL_REQUIRES = []
else:
INSTALL_REQUIRES = ['esmpy>=8.0.0', 'xarray', 'numpy', 'scipy']
INSTALL_REQUIRES = ['esmpy>=8.0.0', 'xarray', 'numpy', 'scipy', 'shapely']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could shapely be made an optional dependency?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could. I thought it was easy enough to install, but I don't know of all possible edge cases!

doc/limitations.rst Outdated Show resolved Hide resolved
doc/limitations.rst Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/frontend.py Outdated Show resolved Hide resolved
xesmf/smm.py Outdated Show resolved Hide resolved
@raphaeldussin
Copy link
Contributor

@mathause locstream is the name used in ESMF, for better or worse :)

@huard
Copy link
Contributor

huard commented Oct 2, 2020

We should probably define the ESMF jargon in the docs though. Otherwise, there is a risk that users just don't know that a feature exists because it's called something else.

@huard
Copy link
Contributor

huard commented Oct 13, 2020

@mathause I had thought about it, and agree it would be nice to make the link with regionmask. Could you please open a new issue and propose ideas to leverage the strengths of xesmf and regionmask?

@aulemahal
Copy link
Collaborator Author

I had a flash this afternoon and tried a modified version of the Mesh.from_polygons method. Using pre-allocated structured numpy arrays instead of dynamically growing lists, I was able to accelerate the mesh construction from polygons by a large factor.

For the notebook example, time went from 1.7 s to 0.9s.
For the whole dataset (245 polygons, simplified to 0.02°, ~100 000 nodes), time went from 274 s to 35 s !

However, I am using a numpy >=1.16 method in order to convert structured arrays to unstructured ones. I also tried a version with 2D arrays instead of structured ones and the speed up was nowhere near what I get here.

@huard
Copy link
Contributor

huard commented Oct 21, 2020

@bekozi This PR is inspired by what you've done in OCGIS to compute averages over polygons. Would you mind taking a look ?

@aulemahal aulemahal mentioned this pull request Nov 24, 2020
@huard
Copy link
Contributor

huard commented Nov 25, 2020

@raphaeldussin Review reminder ; )

There is another PR in the pipeline that started from this one. Merging this would make it easier to review.

@@ -443,12 +547,30 @@ def esmf_regrid_finalize(regrid):
regrid.destroy()
regrid.srcfield.destroy()
regrid.dstfield.destroy()
regrid.srcfield.grid.destroy()
regrid.dstfield.grid.destroy()
# regrid.srcfield.grid.destroy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the reason for commenting out these lines?
ESMF is prone to memory leaks if objects are not destroyed properly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new structure, we have the BaseRegridder object that takes ESMF objects as input. Thus, destroying them after the weight computation could cause issues for a user who created those themself.
More specifically, I commented out these lines because SpatialAverager uses the same grid in two different BaseRegridder instances, so I instead of recreating it, I decided to reuse the same object.
A partial solution could be to destroy the grids where appropriate in Regridder and SpatialAverager.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok let's see if this causes problems down the line. I will put an issue in to keep track of the potential problem


# double check
assert regrid.finalized
assert regrid.srcfield.finalized
assert regrid.dstfield.finalized
assert regrid.srcfield.grid.finalized
assert regrid.dstfield.grid.finalized
# assert regrid.srcfield.grid.finalized
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

filename=None,
reuse_weights=False,
extrap_method=None,
extrap_dist_exponent=None,
extrap_num_src_pnts=None,
add_nans=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no docstring for this option. Is it similar than the fix I introduced in PR #19 ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woups, I don't exactly remember, but this might be a leftover from my first BaseRegridder draft. I'll remove it. NaN's are indeed added to the weights the same way as done in #19, only the conditional statement has changed.

@raphaeldussin
Copy link
Contributor

just a few minor things, also linting failed and so did readthedocs so I could not read the new notebook

@aulemahal
Copy link
Collaborator Author

I can't understand the linting bug... I update pre-commit, did autoupdate and it finds nothing to do on my machine: the prenotebook hook passes. Moreover, what the pre-commit run suggests in the linting action seems ridiculous? I don't think there is a limit on notebook line lengths?

raphaeldussin added a commit that referenced this pull request Nov 25, 2020
fails in PR #24 for no obvious reason
@aulemahal
Copy link
Collaborator Author

aulemahal commented Nov 25, 2020

Voilà!
As said elsewhere, the prenotebook hook needs nodejs to prettify the notebooks. It does nothing (silently) when nodejs isn't installed. Which it wasn't on my machine (and @huard 's I presume), but is on the machines used for the checks.

@raphaeldussin
Copy link
Contributor

just a few more nitpicks: I would put the tutorial under intermediate, not beginners.
Id add to the imports:

import warnings
warnings.filterwarnings("ignore")

and remove last empty cell (or should we remove keep-empty in the precommit yaml?)

@raphaeldussin
Copy link
Contributor

I'm happy with the PR, great job @aulemahal
@huard I let you press the green button, I got a turkey to go take care of ;)

@huard huard merged commit 70fa826 into pangeo-data:master Nov 25, 2020
@huard huard deleted the fix-pd-11 branch November 25, 2020 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Mesh support and a spatial averaging method
7 participants