
Run time error for mpi-serial case on cheyenne_intel when created with aux_clm create_test #1793

Closed
ekluzek opened this issue Aug 4, 2017 · 15 comments

Comments


ekluzek commented Aug 4, 2017

I saw this before, but thought it might be a system problem, and maybe it still is, so I'm also having CISL look into it. The earlier time I saw it was before cheyenne was taken down, and I thought that work might have fixed it. When I run create_test for aux_clm on cheyenne, several of the mpi-serial tests first fail in the build, and then I have to build and run again. One of them still fails:

ERS_D_Ld7_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropGs.cheyenne_intel.clm-decStart1851_noinitial

It gives the following runtime error in the cesm.log.

 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Variable not found
 NetCDF: Variable not found
 NetCDF: Variable not found
forrtl: severe (184): FASTMEM allocation is requested but the libmemkind library is not linked into the executable.
Image              PC                Routine            Line        Source             
cesm.exe           00000000043DAB60  Unknown               Unknown  Unknown
cesm.exe           0000000003A0ABDC  shr_strconvert_mo          78  shr_strconvert_mod.F90
cesm.exe           000000000389EE82  shr_log_mod_mp_sh          78  shr_log_mod.F90
cesm.exe           0000000000A5605B  glcbehaviormodini         421  glcBehaviorMod.F90
cesm.exe           0000000000A4FFB7  glcbehaviormod_mp         302  glcBehaviorMod.F90
cesm.exe           0000000000A4F885  glcbehaviormod_mp         204  glcBehaviorMod.F90
cesm.exe           00000000008A2EEE  clm_initializemod         151  clm_initializeMod.F90
cesm.exe           0000000000832B69  lnd_comp_mct_mp_l         198  lnd_comp_mct.F90
cesm.exe           000000000044B241  component_mod_mp_         227  component_mod.F90
cesm.exe           0000000000416422  cesm_comp_mod_mp_        1179  cesm_comp_mod.F90
cesm.exe           00000000004420EC  MAIN__                     63  cesm_driver.F90
cesm.exe           000000000040519E  Unknown               Unknown  Unknown
libc-2.19.so       00002AAAAFA44B25  __libc_start_main     Unknown  Unknown
cesm.exe           00000000004050A9  Unknown               Unknown  Unknown

What worked before was to redo the test case from scratch, so I'm trying that now.

ekluzek added this to the cesm2 milestone Aug 4, 2017
ekluzek self-assigned this Aug 4, 2017

ekluzek commented Aug 4, 2017

Yep, redoing it from scratch allows it to work.

The case that fails is in:

/glade/p/work/erik/clm_chkimpexpndepunits/cime/scripts/ERS_D_Ld7_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropGs.cheyenne_intel.clm-decStart1851_noinitial.GC.clm4_5_16_r253intel

And the case that works is in:
/glade/p/work/erik/clm_chkimpexpndepunits/cime/scripts/ERS_D_Ld7_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropGs.cheyenne_intel.clm-decStart1851_noinitial.GC.20170804_112738_122q5k


ekluzek commented Aug 4, 2017

The CISL ticket for this is:

https://cislcustomersupport.ucar.edu/evj/ExtraView/27622147

billsacks (Member) commented:

I seem to remember seeing issues similar to this when building on the share queue. I've gone back to building on the login nodes and that seems to clear up issues like this.

billsacks (Member) commented:

I'd say this is almost certainly a system problem. If you agree, then we should close this cime issue.


gold2718 commented Sep 3, 2017

Was there ever a resolution to the CISL ticket above? I cannot seem to either search for that issue number or go to the URL (I get a 'timed-out' message even if I am logged into the system).


ekluzek commented Sep 3, 2017

Dick Valent tried to do some work on it, but didn't figure anything out. He closed it because he didn't hear back from me, and then I reopened it. But he's closed it a second time now. I figured I'd let him know it's still a problem, but I'm not sure if I should reopen the ticket.


gold2718 commented Sep 3, 2017

Tests that pass on a do-over can point to either a race condition or a system problem. What is the evidence for a system problem over a race condition?

billsacks (Member) commented:

@ekluzek says this happens all the time for him on cheyenne and/or yellowstone. It doesn't happen for me when I run the clm test suite, albeit with a slightly different CLM version, though (I think) the same cime version. That makes me no longer suspect a one-time system problem. I wonder if there could be something in Erik's environment that is making this behave badly for him?


ekluzek commented Sep 28, 2017

After thinking about this and looking at the shared build directory structure, I suspect I may have the reason I see this. I often send out both cheyenne and yellowstone tests with the same test id. Since cheyenne and yellowstone have shared file-systems, but slightly different compiler configurations, there's likely a race condition between the two builds that either lets it work or makes it fail. So the workaround I'm going to use is to add an identifier for the machine to my test submissions.

A more robust change to the system (if I can show that this is indeed the problem) would be to have the shared build add an identifier for the machine as well as the compiler to the shared build directories. Having shared file-systems across several machines is a common situation, so this isn't a problem unique to NCAR. And it's not obvious to users that the two builds may conflict (many/most wouldn't even know a shared build is being done). But I'm willing to hear opinions from others on this @mvertens @jedwards4b @rljacob @gold2718 @jgfouca @fischer-ncar. The change I'm proposing is fundamentally pretty simple: the shared build would have a subdirectory named (machine)_(compiler) rather than just a subdirectory named by (compiler). I haven't looked into how hard that would be to do, but my guess is it can't be too hard.
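As a concrete sketch of the workaround (just illustrative Python for building the id strings, not anything that exists in CIME, and the tag is made up), something like this would give each submission a machine-specific test id:

import datetime

def machine_test_id(machine, tag="clm_r253"):
    # Embed the machine name (plus a timestamp) in the test id so that suites
    # submitted to cheyenne and yellowstone at roughly the same time can never
    # share a sharedlibroot.$TESTID tree on the common file system.
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    return "{}_{}_{}".format(tag, machine, stamp)

for mach in ("cheyenne", "yellowstone"):
    print(machine_test_id(mach))
# e.g. clm_r253_cheyenne_20170928_101500
#      clm_r253_yellowstone_20170928_101500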

jedwards4b (Contributor) commented:

It would be easy to add the machine to the shared library path, but in any case it seems a very risky practice to submit two test suites with the same test-id.

billsacks (Member) commented:

I agree with @jedwards4b . I have no problem adding the machine to the sharedlib path if it is indeed easy. But at the same time, we should very strongly discourage people from ever submitting multiple runs of create_test with the same testid. See also the discussion in #582 - though it looks like we never added anything to the create_test documentation saying that testids need to be unique.


billsacks commented Sep 28, 2017

Just opened #1933 which I'll take (addressing the documentation of testids).

gold2718 commented:

My only comment is a concern about adding to the length of the test name. Didn't we just have an issue (#1914) where long test names were causing problems?


ekluzek commented Sep 28, 2017

@gold2718 what I'm proposing would only affect the shared build directory structure. The test names already have the machine/compiler combination in them (which is part of why I didn't think they would interfere), so the test-name directories won't change in length at all. But under the shared build you have directories that look like:

$CIME_OUTPUT_ROOT/sharedlibroot.$TESTID/intel/mpi-serial/debug/nothreads/mct|pio|gptl

What I'm proposing is that this would change to:

$CIME_OUTPUT_ROOT/sharedlibroot.$TESTID/cheyenne_intel/mpi-serial/debug/nothreads/mct|pio|gptl

There's nothing under those directories that has the $TESTID in it, so the paths are relatively short, and adding the machine name to them won't make much of a difference.
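For illustration, here's a rough Python sketch of the naming change I have in mind. The layout is the one shown above, but the function name and arguments are made up for this sketch; it's not how CIME actually assembles the path:

import os

def sharedlib_build_dir(cime_output_root, test_id, machine, compiler,
                        mpilib="mpi-serial", debug=True, threaded=False,
                        include_machine=True):
    # include_machine=False reproduces today's layout (compiler-only subdirectory);
    # include_machine=True is the proposed (machine)_(compiler) layout.
    # The mct/pio/gptl subdirectories would hang off this path as they do today.
    tag = "{}_{}".format(machine, compiler) if include_machine else compiler
    return os.path.join(cime_output_root,
                        "sharedlibroot.{}".format(test_id),
                        tag, mpilib,
                        "debug" if debug else "nodebug",
                        "threads" if threaded else "nothreads")

print(sharedlib_build_dir("$CIME_OUTPUT_ROOT", "$TESTID", "cheyenne", "intel",
                          include_machine=False))
# -> $CIME_OUTPUT_ROOT/sharedlibroot.$TESTID/intel/mpi-serial/debug/nothreads

print(sharedlib_build_dir("$CIME_OUTPUT_ROOT", "$TESTID", "cheyenne", "intel"))
# -> $CIME_OUTPUT_ROOT/sharedlibroot.$TESTID/cheyenne_intel/mpi-serial/debug/nothreads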

billsacks (Member) commented:

(From skimming back through this issue, @mvertens @jedwards4b and I felt this could be closed as a wontfix.)
