NaN threshold for conservative method #39

slevang · 2024-05-21T21:58:12Z

Nice package! In testing it out where I've previously used xesmf, I noticed two features lacking from the conservative method:

1. The regridded dataset can't be constructed lazily if dask-backed, due to this line.
2. There is no ability to keep target cells where the input points are partially NaN, as noted in Adding NaN thresholding #32.

Number 1 is easy; number 2 is trickier. I added a naive implementation of xesmf's nan_threshold capability here for discussion. As noted in the previous issue, doing this 100% correctly would require tracking the NaN fraction as we reduce over each dimension, which I'm not doing here. Because the reduction is sequential, the nan_threshold value doesn't translate directly to the total fraction of NaN cells. It would also get complicated for isolated NaNs in the temporal dimension.
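
To make the sequential-reduction caveat concrete, here is a toy NumPy sketch. It is not the code in this PR: the helper `reduce_with_threshold`, the equal weights, and the masking convention (output is NaN when the NaN fraction of its inputs exceeds the threshold) are all assumptions chosen for illustration. It shows a block whose total NaN fraction is above the threshold but which still survives when the threshold is applied one dimension at a time.

```python
import numpy as np


def reduce_with_threshold(values, axis, nan_threshold):
    """Equal-weight mean along one axis, skipping NaNs (hypothetical helper).

    Assumed convention: the output is NaN when the NaN fraction of its
    inputs exceeds nan_threshold, or when no valid input remains.
    """
    isnan = np.isnan(values)
    nan_frac = isnan.mean(axis=axis)
    valid_count = (~isnan).sum(axis=axis)
    total = np.where(isnan, 0.0, values).sum(axis=axis)
    mean = total / np.maximum(valid_count, 1)
    masked = (nan_frac > nan_threshold) | (valid_count == 0)
    return np.where(masked, np.nan, mean)


nan = np.nan
# One 4x4 block of source cells that maps onto a single target cell.
block = np.array(
    [
        [1.0, 1.0, nan, nan],
        [1.0, 1.0, nan, nan],
        [nan, nan, nan, nan],
        [nan, nan, nan, nan],
    ]
)
threshold = 0.5

# Sequential: reduce along the first axis, then along the remaining one,
# applying the threshold at each pass.
after_first_pass = reduce_with_threshold(block, axis=0, nan_threshold=threshold)
sequential = float(
    reduce_with_threshold(after_first_pass, axis=0, nan_threshold=threshold)
)

# Joint: the NaN fraction over the whole block at once.
total_nan_frac = float(np.isnan(block).mean())

print(after_first_pass)  # first two columns stay valid, last two are masked
print(sequential)        # still a valid number after the second pass
print(total_nan_frac)    # fraction of NaN cells in the whole block
```

Here each pass sees a NaN fraction of at most 0.5 among the columns it keeps, so the cell is retained even though three quarters of the underlying source cells are NaN.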

I'm not sure any of this matters much for a dataset where you have consistent NaNs, e.g. SST. Here's an example of the new functionality used on the MUR dataset. Note this is a 33TB array, but we can now generate the (lazily) regridded dataset instantaneously.

import xarray as xr
import xarray_regrid

sst = xr.open_zarr("https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1").analysed_sst
grid = xarray_regrid.Grid(
    north=90,
    east=180,
    south=-90,
    west=-180,
    resolution_lat=1,
    resolution_lon=1,
)
target = xarray_regrid.create_regridding_dataset(grid)
target = target.rename(latitude="lat", longitude="lon")

ds0p0 = sst.regrid.conservative(target, nan_threshold=0)
ds0p5 = sst.regrid.conservative(target, nan_threshold=0.5)
ds1p0 = sst.regrid.conservative(target, nan_threshold=1)

BSchilperoort

Thank you so much for contributing this! I'll try to give it a full review and run it next week, but here are already some comments.

Did you compare the performance to xESMF to ensure that the masking has the same results here as there? I can also try that later.

Lastly, would you be able to write some unit tests to test the nan masking? And update the changelog? If anything is unclear please let me know.

slevang · 2024-05-26T01:26:59Z

Thanks, and yes for sure, hopefully next week I will find time to add tests and benchmarks.

slevang · 2024-05-26T18:26:25Z

Did some quick profiling on a ~4GB array of 1/4deg global data coarsening to 1deg. Dask array on a 32 CPU node. Results:

This PR, skipna=False: 32s
This PR, skipna=True: 64s
main: 96s

So adding skipna forces roughly one additional pass through the array for the weight renormalization. This PR is faster than main because the current code has the np.any(np.isnan()) check, which forces computation, plus the separately calculated isnan array; together these force 3 passes through the data. If I cut out the logic branch that checks for NaNs on main and go straight to the einsum, we recover the ~32s run above.
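
For readers unfamiliar with the renormalization step, here is a minimal NumPy sketch of the idea. It is an illustration under stated assumptions, not the routine in this PR: the function name `renormalized_weighted_mean` is hypothetical, and a dense (n_target, n_source) weight matrix stands in for the real regridding weights. NaNs are filled with zero, the weights are applied once to the data and once to the valid-data mask, and the two results are divided.

```python
import numpy as np


def renormalized_weighted_mean(data, weights):
    """Apply (n_target, n_source) weights to 1-D data while skipping NaNs.

    Hypothetical helper: NaN inputs contribute nothing to the weighted sum,
    and each output is rescaled by the weight that landed on valid inputs.
    """
    notnull = ~np.isnan(data)
    filled = np.where(notnull, data, 0.0)

    weighted_sum = weights @ filled                 # pass over the data
    valid_weight = weights @ notnull.astype(float)  # extra pass over the mask

    # Outputs with no valid input at all stay NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(valid_weight > 0, weighted_sum / valid_weight, np.nan)


data = np.array([1.0, 3.0, np.nan, 5.0])
# Two target cells, each averaging two neighbouring source cells.
weights = np.array(
    [
        [0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5],
    ]
)

result = renormalized_weighted_mean(data, weights)
print(result)  # second cell is computed from its one valid input only
```

Nothing in this sketch branches on the data values, which is the property that lets the same pattern stay lazy on a dask-backed array; an `if np.any(np.isnan(...))` check, by contrast, needs a concrete boolean and so triggers computation.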

slevang

Finally found some time to get back to this.

I did manage to come up with an implementation that should accurately track the fraction of NaN points across dimensions (see the tests added). The problem currently is that there is a major performance penalty because we are tracking the nans independently across all dimensions (such as an additional time dim) which causes the einsum ops to have much larger dimensionality.

Also did some general reworking to use xr.dot, consolidated some things like the different DataArray/Dataset pathways, and knocked out some other small improvements to conservative that have been noted in the issues.

slevang · 2024-07-14T16:08:14Z

Made the modification to take notnull.any(non_regrid_dims), which leaves us at about a 3x performance penalty for skipna=True in the benchmarks I've run. I think this should maybe be a configurable arg, though, for cases where you want to track NaNs very carefully throughout the dataset.
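
A small xarray sketch of what collapsing the mask over the non-regridded dimensions means, and what it gives up. The array, its dimension names, and the variable names are made up for illustration; only `notnull` and `any` are real xarray methods.

```python
import numpy as np
import xarray as xr

# Hypothetical (time, lat, lon) field; "time" is not being regridded.
values = np.ones((3, 2, 2))
values[0, 0, 0] = np.nan  # isolated NaN at a single time step
values[:, 1, 1] = np.nan  # cell that is NaN at every time step
da = xr.DataArray(values, dims=("time", "lat", "lon"))

non_regrid_dims = ["time"]

# Tracking validity everywhere keeps the full dimensionality.
full_mask = da.notnull()

# Collapsing over the non-regridded dims leaves a 2-D mask: a cell counts
# as valid if it is valid at any point along "time".
collapsed_mask = da.notnull().any(non_regrid_dims)

print(full_mask.shape)        # mask carries the time dimension
print(collapsed_mask.shape)   # mask only has the regridded dimensions
print(collapsed_mask.values)  # only the always-NaN cell is marked invalid
```

The collapsed mask is much cheaper to carry through the weight products, but the isolated NaN at the first time step is no longer visible in it, which is the case where a configurable option would matter.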

BSchilperoort · 2024-07-22T11:41:14Z

tests/test_regrid.py

+    xr.testing.assert_allclose(da_coarsen, da_regrid)
+
+
+@pytest.mark.skip(reason="requires xesmf")


You could use pytest.mark.skipif here, and run the test automatically if xesmf is available.
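
A minimal sketch of that suggestion, assuming availability is detected with `importlib.util.find_spec`. The flag name `XESMF_AVAILABLE` and the test name and body are placeholders, not the test in this PR.

```python
import importlib.util

import pytest

# True when xesmf can be imported in the current environment.
XESMF_AVAILABLE = importlib.util.find_spec("xesmf") is not None


@pytest.mark.skipif(not XESMF_AVAILABLE, reason="requires xesmf")
def test_conservative_matches_xesmf():
    # Placeholder body: imported here so that test collection still works
    # on machines without xesmf installed.
    import xesmf

    assert hasattr(xesmf, "Regridder")
```

With this marker the test is reported as skipped where xesmf is missing and runs automatically wherever it is installed.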

BSchilperoort

Thanks for all the changes you've made! I've invited you to this repository so you don't have to wait for me to have the CI run.

@slevang

* add nan_threshold option
* track nan frac across dims, use xr.dot, consolidate ds/da paths, tests
* fixes and cleanup plus initial notebook cell
* speed up by only tracking the max of nonnull points over non_regrid_dims
* fix tests for newer dependency versions
* Improve typing of call_on_dataset
* Fix typing in updated conservative routines
* Apply code formatting
* Ensure hashable is a valid input for coordinate identifier
* Make tests & typing pass
* Make `create_regridding_dataset` a method of `Grid` #38
* Update notebooks and dependencies
* Allow xesmf test to run if it's available
* Add @slevang to the contributors list
* Update changelog
* Ignore linter in test
* Update readme with badges and "why use..." text

---------
Co-authored-by: Sam Levang <slevang@salientpredictions.com>

BSchilperoort · 2024-09-04T08:06:10Z

Merged as part of #41

add nan_threshold option

d850bde

slevang marked this pull request as ready for review May 22, 2024 15:15

BSchilperoort reviewed May 24, 2024

slevang added 2 commits July 13, 2024 15:53

track nan frac across dims, use xr.dot, consolidate ds/da paths, tests

4e09eaa

fixes and cleanup plus initial notebook cell

a3f7c99

slevang commented Jul 14, 2024

slevang requested a review from BSchilperoort July 14, 2024 15:19

speed up by only tracking the max of nonnull points over non_regrid_dims

72b023a

fix tests for newer dependency versions

e6a8f8a

BSchilperoort reviewed Jul 22, 2024

BSchilperoort approved these changes Jul 22, 2024

This was linked to issues Jul 22, 2024

Conservative: detect/handle non-regularly spaced intervals #6

Closed

Dealing with 1x1 grids #25

Closed

Adding NaN thresholding #32

Closed

This was referenced Sep 3, 2024

Nan threshold for conservative regridding (continuation of #39) #41

Merged

Improve performance of conservative routine #42

Closed

BSchilperoort closed this Sep 4, 2024

BSchilperoort mentioned this pull request Sep 5, 2024

Transfering xarray-regrid to this organization xarray-contrib/xarray-contrib#14

Closed
