-
Thanks for this summary, Bill. As a very naive question: what advantages, if any, does the immediate model-specific abort approach offer? My initial inclination is that if our components are all run via ESMF, I'm good with ESMF doing the error handling when things go wrong, especially if that gives us consistent behavior across a wider range of compilers, support for some kind of trace with NVHPC, and less duplicate code. I feel that's a pretty strong combination. But I'd like to hear whether folks feel the custom abort approach is valuable for other reasons (e.g., faster).
-
The issue that this PR solved was that some errors were written to the ESMF PET logs, others were written to the component logs, and still others were written to the cesm log. The change made here will write errors both to the PET log and to the cesm and/or component log. I think that it's a good change and would recommend that we stick with it until
-
@jedwards4b - please correct me if I'm wrong, but my understanding of #532 is that it folded together two behavior changes:
It is point (2) that I'm focused on here. My impression of @ekluzek's original issue (#527) was that he suggested a way to do (1) without (2). To be clear, I'm not arguing that we should necessarily back out (2), but I do think this should at least be a point of discussion, with some conversation around what we might be losing with this approach. To @briandobbins's point:
My impression is that the main advantage of the new approach is that it simplifies the code, since the return-up-the-call-tree approach requires a lot of extra error-handling code. I do feel like that's a significant point. And I'm just realizing now that if that error-handling code were accidentally left out somewhere, the old approach could fail to abort when it should, which is perhaps an even more significant point. @jedwards4b, since you've probably given this more thought, I'd also be interested to hear if you think there are other advantages to this approach. Basically, I do think both approaches have their pros and cons; I just think it's worth having a little more discussion of their merits before changing the approach.
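To make the trade-off concrete, here is a minimal sketch of the two styles. It is written in Python for brevity rather than in the Fortran that CMEPS actually uses, and every name in it (`read_field_rc`, `init_checked`, `ModelAbort`, and so on) is hypothetical and not taken from CMEPS, ESMF, or CESM_share. The raised exception only stands in for an in-place abort.

```python
# Hypothetical sketch: none of these names are CMEPS, ESMF, or CESM_share APIs.
SUCCESS, FAILURE = 0, 1


class ModelAbort(Exception):
    """Stands in for an immediate, in-place abort of the model."""


def read_field_rc(ok):
    # Return-code style: report failure to the caller instead of stopping.
    return SUCCESS if ok else FAILURE


def init_checked(ok):
    rc = read_field_rc(ok)
    if rc != SUCCESS:  # the extra check that every caller has to carry
        return rc
    return SUCCESS


def init_unchecked(ok):
    read_field_rc(ok)  # bug: the return code is dropped, so the failure is lost
    return SUCCESS


def read_field_abort(ok):
    # Abort-in-place style: stop right where the error is detected.
    if not ok:
        raise ModelAbort("read_field failed")


def init_abort(ok):
    read_field_abort(ok)  # no per-call check needed
    return SUCCESS
```

In this sketch `init_unchecked(False)` reports success even though the read failed, which is the "fails to abort when it should" case, while `init_abort(False)` cannot silently continue.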
-
This discussion follows from #532 (from @jedwards4b), and particularly from this question in #527 (from @ekluzek)
I discussed this with the ESMF Core Team today (@anntsay, @oehmke, @danrosen25, @theurich, @uturuncoglu), and there were some points raised in this discussion that felt worth passing along. The short summary is that the new approach is probably fine as long as you can reliably get a backtrace, but the ESMF team has concerns about how reliably users will get a backtrace from shr_abort_abort, and so want to be sure that relevant team leads are aware of this potential loss in behavior. So, @briandobbins, looping you in on this so you can be aware for CESM; and @DeniseWorthen, can you confirm that relevant people in UFS are comfortable with this, such as Jun or others, especially on the bolded point (1) below? More details below:
There is general support on the ESMF team for models to decide they simply want to abort with a backtrace rather than passing return codes up the call stack, but at the same time a recognition that it may be hard or impossible to get this backtrace reliably on all compilers. There were some concerns raised about having additional dependence in CMEPS on code that is (I believe) basically duplicated in multiple places (I think shr_sys_mod and shr_abort_mod are accessed from https://github.com/ESCOMP/CESM_share in CESM, but from copies in the CDEPS repo for UFS). The team felt that this functionality would ideally reside in ESMF, and in fact, @danrosen25 has done some work down that path, but hit a roadblock due to challenges getting this working reliably across all compilers in various situations.
In terms of reasons to return up the call stack rather than aborting in place, three were raised, in order of importance (most important first):
#ifdefs in https://github.com/ESCOMP/CESM_share/blob/2bb87b80c47a7e6b1aad7acfe45d26b7b9e8bb5e/src/shr_abort_mod.F90#L78 and the comment here: https://github.com/ESCOMP/CESM_share/blob/2bb87b80c47a7e6b1aad7acfe45d26b7b9e8bb5e/src/shr_abort_mod.F90#L69-L72.) To some degree it is up to each model development team whether they are comfortable losing this backtrace information for some compilers, but there was also some concern raised on the ESMF team that team members might have a harder time helping with future problems if this backtrace information isn't available.
So again, to summarize, the ESMF team is mainly concerned about the possible loss of backtrace information with some compilers. If there were a mechanism to reliably get this backtrace information on all compilers, then that would be better, and could ideally be included in ESMF itself as an option. Without a reliable backtrace mechanism across all compilers, the ESMF team's recommendation would be to maintain the return-up-the-call-stack approach. But the decision on this can be up to the relevant CESM and UFS team leads / members, as long as they are aware of this downside and the implication that the ESMF team might have a harder time helping with some issues in the future.
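As a rough illustration of the backtrace concern, the sketch below shows an abort helper whose output depends on whether a backtrace provider exists. It is Python, not the Fortran in shr_abort_mod, and `get_backtrace`, `abort_message`, and the `provider` argument are hypothetical names: `provider` merely stands in for whatever compiler-specific routine a build may or may not supply.

```python
import traceback


def get_backtrace(provider=None):
    """Return backtrace text, or None when none can be produced.

    `provider` is hypothetical: it stands in for a compiler-specific
    backtrace routine that may be missing from a given build.
    """
    if provider is None:
        return None
    try:
        return provider()
    except Exception:
        # A provider that fails is treated the same as no provider.
        return None


def abort_message(msg, provider=None):
    # Build the text an abort would log; the message itself is always kept.
    bt = get_backtrace(provider)
    if bt is None:
        return f"ABORT: {msg}\n(no backtrace available with this compiler)"
    return f"ABORT: {msg}\n{bt}"


def python_stack():
    # Example provider that uses Python's own call stack.
    return "".join(traceback.format_stack())
```

Calling `abort_message("bad field", python_stack)` includes the call stack, while `abort_message("bad field")` carries only the message. The second case is the one where someone helping to debug has only the error text to work from.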