
Run time error for mpi-serial case on cheyenne_intel when created with aux_clm create_test #1793

Closed
ekluzek opened this issue Aug 4, 2017 · 15 comments

Comments


ekluzek commented Aug 4, 2017

I saw this before, but thought it might be a system problem, and maybe it still is, so I'm also having CISL look into it. The earlier time I saw it was before cheyenne was taken down, and I thought that work might have fixed it. When I run create_test for aux_clm on cheyenne, several of the mpi-serial tests first fail in the build, and then I have to build and run again. One of them still fails:

ERS_D_Ld7_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropGs.cheyenne_intel.clm-decStart1851_noinitial

It gives the following runtime error in the cesm.log.

 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Variable not found
 NetCDF: Variable not found
 NetCDF: Variable not found
forrtl: severe (184): FASTMEM allocation is requested but the libmemkind library is not linked into the executable.
Image              PC                Routine            Line        Source             
cesm.exe           00000000043DAB60  Unknown               Unknown  Unknown
cesm.exe           0000000003A0ABDC  shr_strconvert_mo          78  shr_strconvert_mod.F90
cesm.exe           000000000389EE82  shr_log_mod_mp_sh          78  shr_log_mod.F90
cesm.exe           0000000000A5605B  glcbehaviormodini         421  glcBehaviorMod.F90
cesm.exe           0000000000A4FFB7  glcbehaviormod_mp         302  glcBehaviorMod.F90
cesm.exe           0000000000A4F885  glcbehaviormod_mp         204  glcBehaviorMod.F90
cesm.exe           00000000008A2EEE  clm_initializemod         151  clm_initializeMod.F90
cesm.exe           0000000000832B69  lnd_comp_mct_mp_l         198  lnd_comp_mct.F90
cesm.exe           000000000044B241  component_mod_mp_         227  component_mod.F90
cesm.exe           0000000000416422  cesm_comp_mod_mp_        1179  cesm_comp_mod.F90
cesm.exe           00000000004420EC  MAIN__                     63  cesm_driver.F90
cesm.exe           000000000040519E  Unknown               Unknown  Unknown
libc-2.19.so       00002AAAAFA44B25  __libc_start_main     Unknown  Unknown
cesm.exe           00000000004050A9  Unknown               Unknown  Unknown

What worked before was to redo the test case from scratch, so I'm trying that now.

ekluzek added this to the cesm2 milestone Aug 4, 2017
ekluzek self-assigned this Aug 4, 2017

ekluzek commented Aug 4, 2017

Yep, redoing it from scratch allows it to work.

The case that fails is in:

/glade/p/work/erik/clm_chkimpexpndepunits/cime/scripts/ERS_D_Ld7_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropGs.cheyenne_intel.clm-decStart1851_noinitial.GC.clm4_5_16_r253intel

And the case that works is in:
/glade/p/work/erik/clm_chkimpexpndepunits/cime/scripts/ERS_D_Ld7_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropGs.cheyenne_intel.clm-decStart1851_noinitial.GC.20170804_112738_122q5k


ekluzek commented Aug 4, 2017

The CISL ticket for this is:

https://cislcustomersupport.ucar.edu/evj/ExtraView/27622147

billsacks (Member) commented:

I seem to remember seeing issues similar to this when building on the share queue. I've gone back to building on the login nodes and that seems to clear up issues like this.

billsacks (Member) commented:

I'd say this is almost certainly a system problem. If you agree, then we should close this cime issue.


gold2718 commented Sep 3, 2017

Was there ever a resolution to the CISL ticket above? I cannot seem to either search for that issue number or go to the URL (I get a 'timed-out' message even if I am logged into the system).


ekluzek commented Sep 3, 2017

Dick Valent tried to do some work on it, but didn't figure anything out. He closed it because he didn't hear back from me, and then I reopened it. But he's closed it a second time now. I figured I'd let him know it's still a problem, but I'm not sure if I should reopen the ticket.


gold2718 commented Sep 3, 2017

Tests that pass on a do-over can point to either a race condition or a system problem. What is the evidence for a system problem over a race condition?

billsacks (Member) commented:

@ekluzek says this happens all the time for him on cheyenne and/or yellowstone. It doesn't happen for me when I run the clm test suite, albeit with a slightly different CLM version, though (I think) the same cime version. That makes me no longer suspect a one-time system problem. I wonder if there could be something in Erik's environment that is making this behave badly for him?


ekluzek commented Sep 28, 2017

After thinking about this and looking at the shared build directory structure, I suspect I may have the reason I see this. I often send out both cheyenne and yellowstone tests with the same test id. Since cheyenne and yellowstone have shared file-systems, but slightly different compiler configurations, there's likely a race condition between the two builds that either lets it work or makes it fail. So the workaround I'm going to use is to add an identifier for the machine to my test submissions.

A more robust change to the system (if I can show that this is indeed the problem) would be to have the shared build add an identifier for the machine as well as the compiler to the shared build directories. Having shared file-systems across several machines is a common situation, so this isn't a problem unique to NCAR. And it's not obvious to users that the two builds may conflict (many/most wouldn't even know a shared build is being done). But I'm willing to hear opinions from others on this @mvertens @jedwards4b @rljacob @gold2718 @jgfouca @fischer-ncar. The change I'm proposing is fundamentally pretty simple: the shared build would have a subdirectory named (machine)_(compiler) rather than just a subdirectory named by (compiler). I haven't looked into how hard that would be to do, but my guess is it can't be too hard.
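As a concrete sketch of the workaround (just illustrative Python for building the id strings, not anything that exists in CIME, and the tag is made up), something like this would give each submission a machine-specific test id:

import datetime

def machine_test_id(machine, tag="clm_r253"):
    # Embed the machine name (plus a timestamp) in the test id so that suites
    # submitted to cheyenne and yellowstone at roughly the same time can never
    # share a sharedlibroot.$TESTID tree on the common file system.
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    return "{}_{}_{}".format(tag, machine, stamp)

for mach in ("cheyenne", "yellowstone"):
    print(machine_test_id(mach))
# e.g. clm_r253_cheyenne_20170928_101500
#      clm_r253_yellowstone_20170928_101500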

jedwards4b (Contributor) commented:

It would be easy to add the machine to the shared library path, but in any case it seems a very risky practice to submit two test suites with the same test-id.

billsacks (Member) commented:

I agree with @jedwards4b . I have no problem adding the machine to the sharedlib path if it is indeed easy. But at the same time, we should very strongly discourage people from ever submitting multiple runs of create_test with the same testid. See also the discussion in #582 - though it looks like we never added anything to the create_test documentation saying that testids need to be unique.


billsacks commented Sep 28, 2017

Just opened #1933 which I'll take (addressing the documentation of testids).

gold2718 commented:

My only comment is a concern about adding to the length of the test name. Didn't we just have an issue (#1914) where long test names were causing problems?


ekluzek commented Sep 28, 2017

@gold2718 what I'm proposing would only affect the shared build directory structure. The test names already have the machine/compiler combination in them (which is part of why I didn't think they would interfere), so the test-name directories won't change in length at all. But under the shared build you have directories that look like:

$CIME_OUTPUT_ROOT/sharedlibroot.$TESTID/intel/mpi-serial/debug/nothreads/mct|pio|gptl

What I'm proposing is that this would change to:

$CIME_OUTPUT_ROOT/sharedlibroot.$TESTID/cheyenne_intel/mpi-serial/debug/nothreads/mct|pio|gptl

There's nothing under those directories that has the $TESTID in it, so the paths are relatively short, and adding the machine name to them won't make much of a difference.
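For illustration, here's a rough Python sketch of the naming change I have in mind. The layout is the one shown above, but the function name and arguments are made up for this sketch; it's not how CIME actually assembles the path:

import os

def sharedlib_build_dir(cime_output_root, test_id, machine, compiler,
                        mpilib="mpi-serial", debug=True, threaded=False,
                        include_machine=True):
    # include_machine=False reproduces today's layout (compiler-only subdirectory);
    # include_machine=True is the proposed (machine)_(compiler) layout.
    # The mct/pio/gptl subdirectories would hang off this path as they do today.
    tag = "{}_{}".format(machine, compiler) if include_machine else compiler
    return os.path.join(cime_output_root,
                        "sharedlibroot.{}".format(test_id),
                        tag, mpilib,
                        "debug" if debug else "nodebug",
                        "threads" if threaded else "nothreads")

print(sharedlib_build_dir("$CIME_OUTPUT_ROOT", "$TESTID", "cheyenne", "intel",
                          include_machine=False))
# -> $CIME_OUTPUT_ROOT/sharedlibroot.$TESTID/intel/mpi-serial/debug/nothreads

print(sharedlib_build_dir("$CIME_OUTPUT_ROOT", "$TESTID", "cheyenne", "intel"))
# -> $CIME_OUTPUT_ROOT/sharedlibroot.$TESTID/cheyenne_intel/mpi-serial/debug/nothreads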

billsacks (Member) commented:

(From skimming back through this issue, @mvertens @jedwards4b and I felt this could be closed as a wontfix.)
