For nearest-neighbor remapping, ensure results are independent of processor count if there are equidistant source points #276

Open
2 tasks done
billsacks opened this issue Jul 31, 2024 Discussed in #261 · 8 comments
Labels: bug (Something isn't working), source: discussions, who: NCAR (Originates from NCAR)

Comments

@billsacks
Member

billsacks commented Jul 31, 2024

For nearest-neighbor remapping, when several source points are equidistant from a destination point, there is currently some logic that breaks the tie by arbitrarily using the point with the smallest ID. But, according to @oehmke, that logic isn't applied in the multi-processor case, because currently the IDs aren't sent between processors. As a result, nearest-neighbor mapping can give different results with different processor counts whenever there are equidistant source points. @oehmke proposes adding a send of the IDs so that the multi-processor case can break ties using the ID, just as the single-processor case does.
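To make the proposal concrete, here is a minimal sketch of the idea. It is not ESMF's code: the function names, the tuple layout `(id, lon, lat)`, and the use of planar distance are all assumptions made for illustration. Each simulated "processor" reports its local best candidate, and the candidates are then combined either with or without the IDs.

```python
from math import hypot

def local_nearest(dst, points):
    """Closest point held by one 'processor', as (distance, id).
    Comparing tuples means exact distance ties go to the smallest id."""
    return min((hypot(x - dst[0], y - dst[1]), pid) for pid, x, y in points)

def reduce_with_ids(candidates):
    """Combine per-processor candidates using (distance, id) ordering."""
    return min(candidates)[1]

def reduce_distance_only(candidates):
    """Hypothetical reduction that compares distances only:
    the first candidate at the minimum distance wins."""
    best = candidates[0]
    for cand in candidates[1:]:
        if cand[0] < best[0]:
            best = cand
    return best[1]

def remap(dst, partitions, reducer):
    """Nearest-neighbor lookup for one destination point over partitioned sources."""
    return reducer([local_nearest(dst, part) for part in partitions if part])

# Four half-degree pixel centers (id, lon, lat), all equidistant from (30, 0).
SOURCE = [(1, 29.75, -0.25), (2, 29.75, 0.25), (3, 30.25, -0.25), (4, 30.25, 0.25)]
DST = (30.0, 0.0)

one_proc = [SOURCE]
two_procs = [[SOURCE[3], SOURCE[1]], [SOURCE[0], SOURCE[2]]]

# With IDs in the reduction, the answer does not depend on the partitioning.
print(remap(DST, one_proc, reduce_with_ids), remap(DST, two_procs, reduce_with_ids))  # 1 1
# Without IDs, the answer depends on which processor's candidate comes first.
print(remap(DST, one_proc, reduce_distance_only), remap(DST, two_procs, reduce_distance_only))  # 1 2
```

The sketch only handles exact floating-point ties. A real implementation may also need a tolerance for points that are equidistant mathematically but not bit-for-bit.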

Discussed in https://github.com/orgs/esmf-org/discussions/261

Originally posted by samsrabin July 10, 2024

Requirements

Affiliation(s)

NSF-NCAR

ESMF Version

No response

Issue

In CTSM, we use ESMF to read some input files. One particular pair of input files, specifying crop sowing window start and end dates, is at half-degree resolution. We tell ESMF to do nearest-neighbor[1] spatial interpolation as necessary to match the simulation grid.

When I do a run at 10°x15° resolution, some of the simulation gridcell centers are located exactly at the "corners" of four half-degree input pixels, meaning that those four neighbors are equally near. It doesn't matter to me which of those ESMF chooses as the "nearest neighbor," as long as it's consistent.

Unfortunately, it's not: At least one gridcell has a different "nearest neighbor" chosen depending on how many processors the job is split across.

As an example, I've made a figure based on two cases that are identical in setup except that Case 1 used 128 processors and Case 2 used 64. Due to this issue, a certain crop in the gridcell centered at latitude 0, longitude 30°E[2] gets a sowing window of days 7-82 in Case 1 and 336-46 in Case 2.

The white/gray/black in this figure represents the half-degree sowing window files. Gray pixels match the values in Case 1, black pixels match Case 2, and white pixels match neither. The red lines intersect at the center of the 10°x15° CTSM gridcell.
[Image: screenshot_1104]
It looks like Case 1 reads from the pixel to the southwest, whereas Case 2 reads from the pixel to the northwest.
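As a rough way to see which simulation gridcell centers are exposed to this kind of tie, the sketch below counts how many input pixels are equally near a given center. It is a hypothetical helper, not part of CTSM or ESMF, and it assumes that pixel edges fall on integer multiples of the pixel size and that longitude and latitude can be treated as planar coordinates (on a sphere, distances to the northern and southern neighbors may differ away from the equator).

```python
def equidistant_neighbor_count(lon, lat, pixel_size=0.5, tol=1e-9):
    """Number of input pixels tied for nearest to (lon, lat): 1, 2, or 4.

    Assumes pixel edges lie at integer multiples of pixel_size and uses
    planar lon/lat geometry (both are simplifying assumptions).
    """
    def on_edge(value):
        q = value / pixel_size
        return abs(q - round(q)) < tol

    # Sitting on a pixel edge doubles the number of tied neighbors in that direction.
    return (2 if on_edge(lon) else 1) * (2 if on_edge(lat) else 1)

print(equidistant_neighbor_count(30.0, 0.0))    # 4: on a pixel corner
print(equidistant_neighbor_count(30.0, 0.25))   # 2: on a vertical edge only
print(equidistant_neighbor_count(30.25, 0.25))  # 1: at a pixel center
```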

Some notes:

  • I'm not 100% certain this is an ESMF issue as opposed to something weird that CTSM is doing, but I'm at the point where I've done all the troubleshooting I can within CTSM.
  • This reproduces every time, over dozens of tests.

Tagging @ekluzek, @billsacks, and @briandobbins, who have expressed interest in this. By the way, I think I mentioned to y'all that I was having an ERP test pass but the equivalent PEM test fail—this is why! The read of sowing windows only happens at the very beginning of the test, so changing processor count halfway through makes no difference.

Autotag

@oehmke

Footnotes

  1. It needs to be nearest-neighbor because dates are modulo: interpolating between Jan. 2 (day 2) and Dec. 31 (day 365) should give Jan. 1 (day 1), not July 2-3 (day (2+365)/2 = 183.5), and that's not something ESMF can do, to my knowledge.

  2. There are other crops in this gridcell that also get different sowing windows. There are no crops in any other gridcell that get different sowing windows, but that doesn't necessarily mean different "nearest" neighbors are getting chosen. That might be happening, just with input pixels that don't differ.
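The arithmetic in footnote 1 can be sketched as follows. This is only an illustration of why day-of-year values need circular rather than linear averaging. The function names are made up for this example, and a fixed 365-day year is assumed.

```python
import math

def linear_mean_day(a, b):
    """Ordinary midpoint of two day-of-year values."""
    return (a + b) / 2

def circular_mean_day(a, b, year_length=365):
    """Midpoint of two day-of-year values on a circle of year_length days.

    The result lies in [0, year_length), where 0 is equivalent to the last
    day of the year.
    """
    angles = [2 * math.pi * day / year_length for day in (a, b)]
    s = sum(math.sin(t) for t in angles)
    c = sum(math.cos(t) for t in angles)
    return (math.atan2(s, c) * year_length / (2 * math.pi)) % year_length

print(linear_mean_day(2, 365))              # 183.5, i.e. early July
print(round(circular_mean_day(2, 365), 6))  # approximately 1.0, i.e. Jan. 1
```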

@samsrabin

Following up: Is this something that's on the roadmap to be in the ESMF version used in the CESM3 release? No worries if not, but in that case I'll need to make some of my tooling more robust and official.

@oehmke
Contributor

oehmke commented Aug 30, 2024

Yep, it's on the roadmap for ESMF 8.8.0, which is what we're targeting for CESM3. I'm hoping to get it done soon-ish, so we can make sure that it works for a while before the release.

@samsrabin

Excellent, thanks!

@anntsay

anntsay commented Dec 11, 2024

The CESM person who needs this may have another workaround, so punting to 8.9.0 for now.

@samsrabin

> The CESM person who needs this may have another workaround, so punting to 8.9.0 for now.

@anntsay Do you mean me, with the tweaked input files? If not, what is the workaround you're referring to?

@billsacks
Member Author

Ann's comment came from my brief/vague verbal comment. Yes, @samsrabin , I was referring to you here. My understanding was that you have a workaround that can/will be applied in the upcoming CESM3 release code, so the lack of a fix in ESMF won't be a hold-up for CESM3... but actually, I just read back through the comments here and see that we had given the message that the more robust fix would make it into 8.8, which now no longer looks like it will be possible. Sorry: that earlier comment came before we pushed the 8.8 release timing earlier by a couple of months.

So, @samsrabin , can you let us know how much of a problem it will be for you that the ESMF release that will be used in CESM3 likely won't have this more robust fix in place?

@samsrabin

Very little problem at all, I just need to make the scripts I used for the tweaking more robust. Thanks!

@billsacks
Member Author

Thanks @samsrabin , and sorry about this. There were different forces pulling in different directions regarding the ESMF 8.8 release timing. It's still possible that ESMF 8.9 will be ready in time for CESM3, but that will depend largely on the CESM3 timing, so at this point we can't count on it.


4 participants