Timer Initialize pair issue with MAPL 2.6.4 using ifort19 in GCHP #779

Closed
lizziel opened this issue Mar 26, 2021 · 24 comments

@lizziel
Contributor

lizziel commented Mar 26, 2021

I am running GCHP using MAPL v2.6.4 and am getting a run-time error when using ifort19.0.5 with OpenMPI 4.0.2. Strangely I have not encountered any run issues when using GNU fortran compilers.

The traceback is as follows:

pe=00000 FAIL at line=00168    BaseProfiler.F90                         <Timer Initialize likely does not find its pair>
pe=00000 FAIL at line=01851    MAPL_Generic.F90                         <status=1>
pe=00000 FAIL at line=00629    MAPL_CapGridComp.F90                     <status=1>
pe=00000 FAIL at line=00931    MAPL_CapGridComp.F90                     <status=1>
pe=00000 FAIL at line=00245    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00211    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00154    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00000 FAIL at line=00030    GCHPctm.F90                              <status=1>

The error is caught in subroutine stop_name (see code here).

Have you seen this type of error before and do you know a fix?
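For reading tracebacks like the one above, a small helper can pull out the line that carries a real message from the `<status=1>` lines that only propagate the failure up the call chain. This is a sketch in Python rather than Fortran; `parse_fail_lines` and `origin` are hypothetical helpers, and the pattern assumes every line follows the `pe=… FAIL at line=… file <message>` layout shown in this report.

```python
import re

# Matches lines such as:
#   pe=00000 FAIL at line=00168    BaseProfiler.F90    <message>
FAIL_LINE = re.compile(
    r"pe=(?P<pe>\d+)\s+FAIL at line=(?P<line>\d+)\s+(?P<file>\S+)\s+<(?P<msg>.*)>\s*$"
)


def parse_fail_lines(text):
    """Return (pe, line, file, message) tuples in the order they appear."""
    records = []
    for raw in text.splitlines():
        match = FAIL_LINE.search(raw)
        if match:
            records.append(
                (int(match["pe"]), int(match["line"]), match["file"], match["msg"])
            )
    return records


def origin(records):
    """First record whose message is not a bare 'status=N' propagation."""
    for record in records:
        if not record[3].startswith("status="):
            return record
    return None


SAMPLE = """\
pe=00000 FAIL at line=00168    BaseProfiler.F90                         <Timer Initialize likely does not find its pair>
pe=00000 FAIL at line=01851    MAPL_Generic.F90                         <status=1>
pe=00000 FAIL at line=00629    MAPL_CapGridComp.F90                     <status=1>
"""

if __name__ == "__main__":
    print(origin(parse_fail_lines(SAMPLE)))
```
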

@mathomp4
Member

@lizziel You are stumping @tclune and myself. But it is interesting that it's hitting the same code as we are looking at in #762.

Is this reproducible for you? We tried looking through the call stack and we don't see any compiler dependence throughout it!

@lizziel
Contributor Author

lizziel commented Mar 29, 2021

Yes this is reproducible. I rebuilt from scratch with the same libraries and got exactly the same error. I then tried with ifort 18.0.5. That gives me a different error, specifically when loading the logging yaml file here. Are the libraries I am using all expected to be compatible with ifort18?

forrtl: severe (189): LHS and RHS of an assignment statement have incompatible types
Image              PC                Routine            Line        Source
gchp               0000000002DE89C6  Unknown               Unknown  Unknown
libMAPL.shared.so  00002B04BD5AB957  fy_lexer_mp_pop_t     Unknown  Unknown
libMAPL.shared.so  00002B04BD5A9857  fy_lexer_mp_get_t     Unknown  Unknown
libMAPL.shared.so  00002B04BD574C29  fy_parser_mp_top_     Unknown  Unknown
libMAPL.shared.so  00002B04BD57294E  fy_parser_mp_load     Unknown  Unknown
libMAPL.shared.so  00002B04BD4D8431  pfl_loggermanager     Unknown  Unknown
gchp               00000000018F5C16  mapl_applications         108  ApplicationSupport.F90
gchp               00000000018F5F0C  mapl_applications          39  ApplicationSupport.F90
gchp               0000000000F4D61A  mapl_capmod_mp_ne         106  MAPL_Cap.F90
gchp               000000000053A60C  MAIN__                     29  GCHPctm.F90
gchp               00000000005393DE  Unknown               Unknown  Unknown
libc-2.17.so       00002B04C0651555  __libc_start_main     Unknown  Unknown
gchp               00000000005392E9  Unknown               Unknown  Unknown

For the ifort18 run I then used the logger work-around I used for pFlogger issue 54 (comment out setting the logging file in cap_options). Doing that fixes the new problem, but the run then crashes. The traceback points to the same source as the issue with ifort19. With ifort18 the symptom is a seg fault rather than the graceful failure in BaseProfiler.F90 seen with ifort19.

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
gchp               0000000002E07D6D  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B069F01F630  Unknown               Unknown  Unknown
libMAPL.profiler.  00002B069BC91DAE  mapl_baseprofiler     Unknown  Unknown
gchp               0000000001577312  mapl_genericmod_m        1851  MAPL_Generic.F90

I have been doing RELEASE for all my builds. I'll try DEBUG and see if that gives more information.

@tclune
Collaborator

tclune commented Mar 30, 2021

Returning to this issue with a bit more clarity. In my first discussion with @mathomp4, I was trying to see how the start() and stop() operations on the Initialize timer could be imbalanced. It is very clear from the code that, as far as GenericWrapper() is concerned, the timers are properly handled. What I missed in that discussion is that this implies some procedure deeper in the tree must have started a timer without a corresponding stop(). The error diagnostic could in theory be improved to report which timer it was expecting in the stop.

The usual culprit at this point would then be a condition in some lower layer that causes an early return before some timer is stopped. Unfortunately the call stack implies that this early return is not reported and is thus possibly a "normal" situation. OTOH, since this has not been reported for GEOS and seems to work for GCHP with GFortran, it's a fairly unusual thing.

I recommend editing line 168 of profiler/BaseProfiler.F90:

_ASSERT(.false., "Timer "//name// " likely does not find its pair")

To instead have:

_ASSERT(.false., "Mismatched stop timer.  Found "//name//" but expected "//node%get_name())

This should very quickly narrow down where the real problem is happening. And if this works, we'll fix this and the other messages in that layer more permanently.
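To make the failure mode concrete, here is a minimal sketch in Python, not MAPL's Fortran. `TimerStack`, `run_child`, `run_initialize`, and the timer name `ChildRun` are hypothetical; only the idea that timers nest and the wording of the proposed message come from the discussion above. It shows how a lower layer that returns early without stopping its own timer makes the outer stop fail, and how naming both timers points at the culprit.

```python
class TimerStack:
    """Toy model of nested profiler timers (hypothetical, not MAPL's API)."""

    def __init__(self):
        self._active = []  # names of running timers, innermost last

    def start(self, name):
        self._active.append(name)

    def stop(self, name):
        # The innermost running timer must be the one being stopped.
        expected = self._active[-1] if self._active else "<none>"
        if name != expected:
            raise RuntimeError(
                f"Mismatched stop timer.  Found {name} but expected {expected}"
            )
        self._active.pop()


def run_child(timers, bail_out_early):
    timers.start("ChildRun")
    if bail_out_early:
        return  # bug: the early return skips the matching stop
    timers.stop("ChildRun")


def run_initialize(timers, bail_out_early):
    timers.start("Initialize")
    run_child(timers, bail_out_early)
    timers.stop("Initialize")  # fails if the child left its timer running


def demo():
    run_initialize(TimerStack(), bail_out_early=False)  # balanced: no error
    try:
        run_initialize(TimerStack(), bail_out_early=True)
    except RuntimeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(demo())
```

With the original message, the failing case would only name `Initialize`, which is the timer that was handled correctly; the revised wording also names the timer that was left running.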

@lizziel
Contributor Author

lizziel commented Mar 31, 2021

Thanks @tclune. I'll try that and see if it fixes it. In the meantime yesterday I switched MAPL to v2.6.3 and that resolved the problem. I still get the LHS and RHS of an assignment statement have incompatible types error when using ifort18 however. Have you tested with ifort18, or is the minimum expected to be 2019?

@tclune
Collaborator

tclune commented Mar 31, 2021

I have not used 18 in a while (Intel makes it hard to sustain older compilers across the mandatory OS X upgrades on my laptop). That particular error sounds familiar though, and I don't think I ever came up with a workaround. If you can remind me where you are getting that error message, I can take a fresh look and suggest some possible variants.

@tclune
Collaborator

tclune commented Mar 31, 2021

I should add - we're not doing anything these days that should make Intel 18 obsolete. We're just fighting against random compiler defects.

@lizziel
Contributor Author

lizziel commented Mar 31, 2021

Here is the traceback for the ifort18 issue in MAPL v2.6.3 and v2.6.4. The last version used in GCHP was v2.2.7 so this could have come in any of the versions since.

forrtl: severe (189): LHS and RHS of an assignment statement have incompatible types
Image              PC                Routine            Line        Source
gchp               0000000002DE89C6  Unknown               Unknown  Unknown
libMAPL.shared.so  00002B04BD5AB957  fy_lexer_mp_pop_t     Unknown  Unknown
libMAPL.shared.so  00002B04BD5A9857  fy_lexer_mp_get_t     Unknown  Unknown
libMAPL.shared.so  00002B04BD574C29  fy_parser_mp_top_     Unknown  Unknown
libMAPL.shared.so  00002B04BD57294E  fy_parser_mp_load     Unknown  Unknown
libMAPL.shared.so  00002B04BD4D8431  pfl_loggermanager     Unknown  Unknown
gchp               00000000018F5C16  mapl_applications         108  ApplicationSupport.F90
gchp               00000000018F5F0C  mapl_applications          39  ApplicationSupport.F90
gchp               0000000000F4D61A  mapl_capmod_mp_ne         106  MAPL_Cap.F90
gchp               000000000053A60C  MAIN__                     29  GCHPctm.F90
gchp               00000000005393DE  Unknown               Unknown  Unknown
libc-2.17.so       00002B04C0651555  __libc_start_main     Unknown  Unknown
gchp               00000000005392E9  Unknown               Unknown  Unknown

@lizziel
Contributor Author

lizziel commented Mar 31, 2021

Here's a quick link to the location in ApplicationSupport.F90.

@tclune
Collaborator

tclune commented Mar 31, 2021

And this is with the default logging.yaml file? (Sounding more and more familiar.)

@lizziel lizziel mentioned this issue Mar 31, 2021
5 tasks
@lizziel
Contributor Author

lizziel commented Mar 31, 2021

Not quite default. It is the file you gave to Seb a while back, although I had to change propagate from 0 to false: https://github.com/geoschem/geos-chem/blob/dev/run/GCHP/logging.yaml.

My understanding is the only difference from the one you use in GEOSgcm is this.
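One plausible reason the `0` to `false` change was needed is scalar typing: under the YAML 1.2 core schema a plain `0` resolves to an integer and `false` to a boolean, so a reader that expects a boolean for `propagate` can reject the former. The Python sketch below illustrates that rule for decimal integers and booleans only. `resolve_scalar` and `read_propagate` are hypothetical helpers, not yaFyaml's or pFlogger's actual code.

```python
import re

# Boolean spellings accepted by the YAML 1.2 core schema.
_BOOL = {"true": True, "True": True, "TRUE": True,
         "false": False, "False": False, "FALSE": False}
# Decimal integers only; octal and hex forms are left out of this sketch.
_INT = re.compile(r"[-+]?[0-9]+")


def resolve_scalar(text):
    """Resolve a plain scalar to bool or int; anything else stays a string."""
    if text in _BOOL:
        return _BOOL[text]
    if _INT.fullmatch(text):
        return int(text)
    return text


def read_propagate(text):
    """Accept only a boolean, as a strictly typed reader might."""
    value = resolve_scalar(text)
    if not isinstance(value, bool):
        raise TypeError(
            f"propagate must be a boolean, got {type(value).__name__}: {text!r}"
        )
    return value


if __name__ == "__main__":
    print(read_propagate("false"))
    try:
        read_propagate("0")
    except TypeError as err:
        print(err)
```
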

@tclune
Collaborator

tclune commented Apr 1, 2021

I was able to parse your yaml file with a standalone driver using the latest yaFyaml (main branch) and ifort 18.0.5 on Linux.

There is a separate problem with this yaFyaml because I accidentally made it require pFUnit, which was not my intent. I'll do a hotfix for this shortly.

The conclusion is that I either entered a workaround for the problem you described, or it is a bit harder to reproduce.

@tclune
Collaborator

tclune commented Apr 1, 2021

Could not find anything in the commit history. Indeed the relevant file has not been touched in 7 months. But Liam already has an unrelated PR that fixes the pFUnit issue. So I'll roll out a new release and request that you try it in your code.

@lizziel
Contributor Author

lizziel commented Apr 1, 2021

Thanks @tclune. I'll look for that release.

@lizziel
Contributor Author

lizziel commented Apr 1, 2021

Ah, just saw it's already released. I'll try it and let you know how it goes.

@lizziel
Contributor Author

lizziel commented Apr 1, 2021

Unfortunately that doesn't fix it. Would changing anything in logging.yaml provide more information? Compiling with -DCMAKE_BUILD_TYPE=Debug does not.

@tclune
Collaborator

tclune commented Apr 1, 2021

I only use -DCMAKE_BUILD_TYPE=Debug in my own work.

I'm going to attach my reproducer (reproducer.tar.gz). It would be interesting if you could try it in your environment. You'll need to copy your yaml file into the build dir and run driver.x.

@lizziel
Contributor Author

lizziel commented Apr 1, 2021

I haven't gotten that to work yet. We do not use yaFyaml as an external library and I'm taking a break from messing around with cmake. But I did isolate the issue to this line: https://github.com/geoschem/yaFyaml/blob/7f16059ebc95083dd1e77954296798c745f9a287/src/Lexer.F90#L316

@tclune
Collaborator

tclune commented Apr 1, 2021

Yes - definitely a compiler bug, but I can't attempt a workaround unless I can reproduce it.

You don't have to use cmake. You could just add my small program within the project that is using yafyaml or even make it a subroutine that gets called from the top of an existing program. Just want to see if it is a difference in your environment vs something to do with the state of the code when it gets down in there.

@lizziel
Contributor Author

lizziel commented Apr 2, 2021

Gotcha. I added this to the very top of main GCHPctm.F90.

   type (Parser)  :: p
   type (Configuration) :: cfg
   p = Parser('Core')
   cfg = p%load(FileStream('logging.yaml'))

This also trips the error, but with a better traceback, although it points to the same line I had already identified as the issue:

forrtl: severe (189): LHS and RHS of an assignment statement have incompatible types
Image              PC                Routine            Line        Source
gchp               0000000008B38A36  Unknown               Unknown  Unknown
libMAPL.shared.so  00002B8D5EDE3324  fy_lexer_mp_pop_t         316  Lexer.F90
libMAPL.shared.so  00002B8D5EDDE2B0  fy_lexer_mp_get_t         183  Lexer.F90
libMAPL.shared.so  00002B8D5ED3E80B  fy_parser_mp_top_         107  Parser.F90
libMAPL.shared.so  00002B8D5ED3CDA2  Unknown               Unknown  Unknown
gchp               000000000053FF59  MAIN__                     32  GCHPctm.F90
gchp               000000000053BCDE  Unknown               Unknown  Unknown
libc-2.17.so       00002B8D62421555  __libc_start_main     Unknown  Unknown
gchp               000000000053BBE9  Unknown               Unknown  Unknown

I don't think I mentioned what libraries I'm using yet. Here they are in case it makes a difference.

    gFTL-Shared v1.2.0 (includes gFTL v1.3.1)
    pFlogger v1.5.0
    pFUnit v4.2.0
    yaFyaml v0.5.1
    ESMA_cmake v3.0.6
    ecbuild geos/v1.0.5

I found an Intel forum discussion on this error being encountered in ifort18 here.

@tclune
Collaborator

tclune commented Apr 2, 2021

Unfortunately, I cannot follow your ifort18 link. I'm in a spiral where it keeps making me register.

The only versions of libraries that can possibly matter are gFTL and gFTL-shared. I'll try again with the tags you mention for those when I get onto discover.

@tclune
Collaborator

tclune commented Apr 2, 2021

OK, I rebuilt using the specified versions of gFTL, gFTL-Shared and the latest yaFyaml with 18.0.5. I was able to process your yaml file.

For now, the one obvious workaround to try is to replace the failing line with

allocate(token, source=tokens%at(1))

And if that does not work, let's try to limit the advanced syntax to see if that helps. Replace the entire procedure with:

  function pop_token(this) result(token)
    class(AbstractToken), allocatable :: token
    class(Lexer), intent(inout) :: this

    class(AbstractToken), pointer :: p_token

    p_token => this%processed_tokens%at(1)
    token = p_token

  end function pop_token

@lizziel
Contributor Author

lizziel commented Apr 2, 2021

No luck with either. I'm fine with requiring ifort19+ for Intel compilers unless you want to keep pursuing this. The link should work unless the VPN you have is interfering.

@tclune
Collaborator

tclune commented Apr 2, 2021

OK - if you can move to 19 that would be best. I hate inflicting compiler version creep on others. These days I can more often work around compiler defects, but still. Currently the latest GFE stack works with both 19 and 21 (don't ask about Intel's numbering here). GEOS uses 19 and I develop using 21.

@tclune tclune closed this as completed Apr 2, 2021
@lizziel
Contributor Author

lizziel commented Apr 2, 2021

Sounds good. At this point ifort18 is more than four years old (released Jan 2017), so it seems reasonable to discontinue support, especially when gfortran is an option.
