Thoughts on the recent change to abort when an error is encountered rather than returning up the call stack #534

billsacks · 2025-01-30T00:40:49Z

billsacks
Jan 30, 2025
Maintainer

This discussion follows from #532 (from @jedwards4b), and particularly from this question in #527 (from @ekluzek):

@billsacks can you speak to the need for ESMF NUOPC to return all the way up the stack versus calling an explicit abort?

I discussed this with the ESMF Core Team today (@anntsay, @oehmke, @danrosen25, @theurich, @uturuncoglu), and there were some points raised in this discussion that felt worth passing along. The short summary is that the new approach is probably fine as long as you can reliably get a backtrace, but the ESMF team has concerns about how reliably users will get a backtrace from shr_abort_abort, and so want to be sure that relevant team leads are aware of this potential loss in behavior. So, @briandobbins, looping you in on this so you can be aware for CESM; and @DeniseWorthen, can you confirm that relevant people in UFS are comfortable with this, such as Jun or others, especially on the bolded point (1) below? More details below:

There is general support on the ESMF team for models to decide they simply want to abort with a backtrace rather than passing return codes up the call stack, but at the same time a recognition that it may be hard or impossible to get this backtrace reliably on all compilers. There were some concerns raised about having additional dependence in CMEPS on code that is (I believe) basically duplicated in multiple places (I think shr_sys_mod and shr_abort_mod are accessed from https://github.com/ESCOMP/CESM_share in CESM, but from copies in the CDEPS repo for UFS). The team felt that this functionality would ideally reside in ESMF, and in fact, @danrosen25 has done some work down that path, but hit a roadblock due to challenges getting this working reliably across all compilers in various situations.

In terms of reasons to return up the call stack rather than aborting in place, three were raised, in order of importance (most important first):

1. It appears that the current shr_abort implementation only provides backtraces for the IBM, GNU and Intel compilers, using compiler-specific extensions; NOT for NAG, NVHPC or other compilers. (See the #ifdefs in https://github.com/ESCOMP/CESM_share/blob/2bb87b80c47a7e6b1aad7acfe45d26b7b9e8bb5e/src/shr_abort_mod.F90#L78 and the comment here: https://github.com/ESCOMP/CESM_share/blob/2bb87b80c47a7e6b1aad7acfe45d26b7b9e8bb5e/src/shr_abort_mod.F90#L69-L72.) To some degree it is up to each model development team whether they are comfortable losing this backtrace information for some compilers, but there was also some concern raised on the ESMF team that team members might have a harder time helping with future problems if this backtrace information isn't available.
2. Returning up the call stack gives a mechanism for providing more meaningful error messages at various levels of the call stack, which can sometimes give more insight into the real source of the error. There isn’t a ton of that, but there are some places where that happens and there have been thoughts about building it out. One aspect of this that is most useful is some extra NUOPC-related information, including the instance and the phase you're in. In @danrosen25's words: "Specifically, when you return from a NUOPC component with an rc=ERROR (i.e. return from Advertise, Advance, etc) the NUOPC system logs the component that errored. You'll lose this in the shared_abort because that stack trace will only print that you are in a NUOPC phase but not with the component. The core file will have this information but it's not always reasonable to dump all core files."
3. In principle, returning up the call stack allows for cleaning up resources, but in practice not much is done in this respect, since there generally isn't much reason to do this cleanup when you're about to abort, so this isn't a very important reason.
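To make the second reason concrete, here is a minimal sketch of the return-up-the-call-stack pattern, in which each level adds its own context to the message trail. It uses plain Fortran stand-ins rather than ESMF: the module name, `found_error`, `read_field`, `run_phase`, the `trace` buffer and the `RC_OK`/`RC_FAIL` codes are all hypothetical names invented for illustration. In a real NUOPC cap the check at each level would normally go through ESMF's own logging (for example `ESMF_LogFoundError`), which is not reproduced here.

```fortran
! Sketch only: plain Fortran stand-ins, not the ESMF/NUOPC API.
module rc_propagation_sketch
  implicit none
  private
  public :: RC_OK, RC_FAIL, trace_count, trace, run_phase

  integer, parameter :: RC_OK = 0, RC_FAIL = 1

  ! Messages collected on the way up, innermost first (hypothetical log).
  integer :: trace_count = 0
  character(len=80) :: trace(8)

contains

  ! Record one message if rc signals an error; returns .true. in that case.
  logical function found_error(rc, msg)
    integer, intent(in) :: rc
    character(len=*), intent(in) :: msg
    found_error = (rc /= RC_OK)
    if (found_error .and. trace_count < size(trace)) then
      trace_count = trace_count + 1
      trace(trace_count) = msg
    end if
  end function found_error

  ! Innermost routine: detects the problem and sets rc.
  subroutine read_field(value, rc)
    real, intent(in) :: value
    integer, intent(out) :: rc
    rc = RC_OK
    if (value < 0.0) rc = RC_FAIL
    if (found_error(rc, 'read_field: negative value')) return
  end subroutine read_field

  ! Component level: adds which component and phase were active.
  subroutine run_phase(component, value, rc)
    character(len=*), intent(in) :: component
    real, intent(in) :: value
    integer, intent(out) :: rc
    call read_field(value, rc)
    if (found_error(rc, 'run_phase: component=' // component // &
                        ' phase=Advance')) return
  end subroutine run_phase

end module rc_propagation_sketch
```

Calling `run_phase('ATM', -1.0, rc)` leaves two entries in `trace`: the low-level cause, followed by the component and phase. An abort inside `read_field` would only ever produce the first of those, which is the information loss described in point 2.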

So again, to summarize, the ESMF team is mainly concerned about the possible loss of backtrace information with some compilers. If there were a mechanism to reliably get this backtrace information on all compilers, then that would be better, and could ideally be included in ESMF itself as an option. Without a reliable backtrace mechanism across all compilers, the ESMF team's recommendation would be to maintain the return-up-the-call-stack approach. But the decision on this can be up to the relevant CESM and UFS team leads / members, as long as they are aware of this downside and the implication that the ESMF team might have a harder time helping with some issues in the future.

briandobbins · 2025-01-30T02:49:30Z

briandobbins
Jan 30, 2025
Collaborator

Thanks for this summary, Bill - as a very naive question, what advantages, if any, are offered by the immediate model-specific abort approach?

My initial inclination is that if our components are all run via ESMF, I'm good with ESMF doing the error handling when things go wrong, especially if that gives us consistent behavior across a wider range of compilers, support for some kind of trace with NVHPC, and less duplicate code. I feel that's a pretty strong combination. But I'd like to hear if folks feel the custom abort approach is valuable for other reasons (eg, faster)?

0 replies

jedwards4b · 2025-01-30T12:39:38Z

jedwards4b
Jan 30, 2025
Maintainer

The issue that this PR solved was that some errors were written to the ESMF PET logs, others were written to the component logs, and still others were written to the cesm log. The change made here will write errors both to the PET log and to the cesm and/or component log. I think that it's a good change and would recommend that we stick with it until someone comes up with an example in which it can be shown to be inferior to the previous approach.

0 replies

billsacks · 2025-01-30T14:21:40Z

billsacks
Jan 30, 2025
Maintainer Author

@jedwards4b - please correct me if I'm wrong, but my understanding of #532 is that it folded together two behavior changes:

1. As you say, it makes it so errors are written to both the PET log and to the cesm and/or component log. I think we all agree that's a good change.
2. In addition, it removes the behavior of returning up the call stack (writing a message at each level of the call stack), instead aborting immediately.

It is point (2) that I'm focused on here. My impression of @ekluzek 's original issue (#527 ) was that he suggested a way to do (1) without (2). To be clear, I'm not arguing that we should necessarily back out (2), but that this should at least be a point of discussion, with some conversation around what we might be losing with this approach.

To @briandobbins 's point:

I'd like to hear if folks feel the custom abort approach is valuable for other reasons (eg, faster)?

My impression is that the main advantage of the new approach is that it simplifies code, since the return-up-the-call-tree approach requires a lot of extra error-handling code. I do feel like that's a significant point. And I'm just realizing now that if that error-handling code was accidentally left out somewhere, then I think the old approach could fail to abort when it should, which is a perhaps even more significant point. @jedwards4b since you've probably given this more thought, I'd also be interested to hear if you think there are other advantages to this approach. Basically, I do think both approaches have their pros and cons; I just think it's worth having a little more discussion of their merits before changing the approach.
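The "accidentally left out" hazard can be illustrated with a small sketch. This is not CMEPS code: the module and the routines `failing_step`, `caller_with_check` and `caller_without_check` are hypothetical, and the integer codes stand in for ESMF return codes.

```fortran
! Sketch only: shows how a forgotten rc check hides a failure.
module missed_check_sketch
  implicit none
  private
  public :: RC_OK, RC_FAIL, caller_with_check, caller_without_check

  integer, parameter :: RC_OK = 0, RC_FAIL = 1

contains

  ! A lower-level step that always reports failure.
  subroutine failing_step(rc)
    integer, intent(out) :: rc
    rc = RC_FAIL
  end subroutine failing_step

  ! Correct pattern: check rc right after the call and return.
  subroutine caller_with_check(steps_run, rc)
    integer, intent(out) :: steps_run, rc
    steps_run = 0
    call failing_step(rc)
    if (rc /= RC_OK) return        ! failure propagates to the caller
    steps_run = steps_run + 1
    rc = RC_OK
  end subroutine caller_with_check

  ! Same routine with the check forgotten.
  subroutine caller_without_check(steps_run, rc)
    integer, intent(out) :: steps_run, rc
    steps_run = 0
    call failing_step(rc)
    steps_run = steps_run + 1      ! keeps running after the failure
    rc = RC_OK                     ! and overwrites the failure code
  end subroutine caller_without_check

end module missed_check_sketch
```

With the check in place, the caller sees `RC_FAIL` and no further work is done. Without it, the routine carries on and finally reports success, so the run does not stop where it should. An abort inside `failing_step` cannot be skipped this way, which is the robustness argument for the new approach.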

4 replies

jedwards4b Jan 30, 2025
Maintainer

I will look into doing 1 without 2 as opposed to the current approach.

jedwards4b Jan 30, 2025
Maintainer

I propose replacing
call shr_sys_abort(errmsg, rc, file, line)
with

call shr_log_write('ERROR', errmsg, rc=rc, line=line, file=file)
return

where shr_log_write is a new routine in shr_log_mod.F90:

  subroutine shr_log_write(error_type, string, rc, line, file)
    ! Write an error message to the ESMF PET log and to the log unit(s).
    ! Unlike shr_sys_abort, this does not stop the run: the caller is
    ! expected to return so that rc travels back up the call stack.
    use esmf, only : ESMF_LogWrite, ESMF_LOGMSG_ERROR, ESMF_FAILURE, ESMF_SUCCESS
    use, intrinsic :: iso_fortran_env, only : output_unit, error_unit

    !----- arguments -----
    character(len=*)    , intent(in) :: error_type  ! type of error, e.g. 'ERROR'
    character(len=*)    , intent(in) :: string      ! error message string
    integer(shr_kind_in), intent(inout), optional :: rc    ! error code; set to ESMF_FAILURE on return
    integer(shr_kind_in), intent(in)   , optional :: line  ! source line of the caller
    character(len=*)    , intent(in)   , optional :: file  ! source file of the caller

    !----- local -----
    character(len=shr_kind_cx) :: local_string  ! message, with rc appended when it is an error code
    character(len=16)          :: rc_string     ! rc formatted as text
    integer, allocatable       :: log_units(:)  ! units that receive the message
    integer                    :: i
    !-------------------------------------------------------------------------------

    local_string = trim(string)
    if (present(rc)) then
       if (rc /= ESMF_SUCCESS) then
          ! Format rc into a separate buffer: an internal write must not use
          ! the same variable as both the file and an output item.
          write(rc_string, '(i0)') rc
          local_string = trim(local_string) // ' rc=' // trim(rc_string)
       endif
       rc = ESMF_FAILURE
    endif

    call ESMF_LogWrite(trim(local_string), ESMF_LOGMSG_ERROR, line=line, file=file)
    if (shr_log_unit == output_unit .or. shr_log_unit == error_unit) then
       ! If the log unit number is standard output or standard error, just
       ! print to that.
       allocate(log_units(1), source=[shr_log_unit])
    else
       ! Otherwise print the same message to both the log unit and standard
       ! error.
       allocate(log_units(2), source=[error_unit, shr_log_unit])
    end if

    do i = 1, size(log_units)
       write(log_units(i),*) trim(error_type), ": ", trim(local_string)
       flush(log_units(i))
    end do

  end subroutine shr_log_write

I would like to do this in a consistent manner: if the method is to return the error up the stack, then that should be followed in all cases. However, some of the routines, for example those in shr_drydep_mod.F90, don't provide a return code to go back up the stack.

ekluzek Jan 30, 2025
Maintainer

I like this, @jedwards4b. It keeps it available across CESM in a share subroutine that can be used everywhere. That was an upgrade to my original idea.

Totally agree to do this consistently throughout the code in CMEPS. A possible modernization effort in subcomponents would be to adopt this programming pattern as well, though it might depend on the model: CTSM doesn't have the error handling built in to go up the stack for most of its error handling. The NUOPC cap should probably be changed to this pattern, though.

I didn't realize that shr_sys_abort doesn't have the backtrace for NAG, which I think is really important. In practice it's been rare for me to see NAG without some type of backtrace. But this probably means the times I haven't seen it are because of this issue. Although I don't know how I could ever see it based on what you say here. The NVHPC compiler is also important, but we seem to have tons of problems with it.

ekluzek Jan 30, 2025
Maintainer

The one suggestion I do have, @jedwards4b, is to have a new subroutine for this, so that it has error in the subroutine name. Maybe shr_log_errorWrite?
