
Figure out why PRE.*.ADESP tests run forever on desktops #1112

Closed
jgfouca opened this issue Feb 8, 2017 · 24 comments


jgfouca commented Feb 8, 2017

Once the issue is resolved, re-enable this test in the cime_developer suite.


gold2718 commented Feb 8, 2017

I'm not sure I have an appropriate machine on which to debug this, as it doesn't get stuck for me (run time on Hobart is 48 seconds).
Could you add some data, such as the compiler, and maybe some gists with run logs so I can at least see where things get stuck?


rljacob commented Feb 8, 2017

There was a tar file posted for you on Slack.


jgfouca commented Feb 8, 2017

@gold2718 what Rob said. I ran for an hour on melvin with 64 cores. Two case directories were produced in my ACME scratch area, which I tarred up for you to look at.


gold2718 commented Feb 8, 2017

Is Slack really part of our CIME development system? GitHub is the (public) site for posting issues, while Slack (in addition to taking up too much time for me to monitor) is private, leading to an incomplete and scattered record of any issue.


rljacob commented Feb 8, 2017

In this particular case, Slack provides an easy way to share a binary (tar) file. Can't do that here or in a gist. I guess Dropbox could be used instead.


gold2718 commented Feb 8, 2017

If I had to choose between Slack and Dropbox, I suppose Slack is better (Dropbox is so full of security holes, I refuse to use it out of self-defense).
On the other hand, a gist is just another git repo, so there is no problem adding a binary file. If you really hate binary, there is always base64 :)


gold2718 commented Feb 8, 2017

Now can anyone tell me how to find this tar file?


gold2718 commented Feb 8, 2017

Okay, I eventually found the file by scrolling back through a large number of messages. Is this really the best we can do? I got no notification and there is no organization. Even an FTP to Yellowstone (or any machine where we both have accounts) with a follow-up email would be better.


gold2718 commented Feb 8, 2017

The second run of the PRE test exercises the pause/resume functionality using the coupler. The run on Melvin quickly writes the first restart file (at model time 3600 s) but then seems to hang. I would expect the DESP component to run but see no indication of this. Before any ESP log output appears, the component finds and reads the rpointer.drv file and checks that the restart file named there exists. Is there anything weird about these simple filesystem operations in get_restart_filenames_a (desp_comp_mod.F90)? I can't prove that the hang is in the DESP module, but if I had access to a machine that exhibited the hang, I would put print statements at the top of get_restart_filenames_a and desp_comp_run in desp_comp_mod.F90.
Also, turning up the info_debug namelist parameter (e.g., info_debug=2 in user_nl_desp) would print some information.
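(For reference, a minimal self-contained sketch, not the actual CIME routine, of the kind of rpointer.drv read, file-existence check, and debug prints being suggested above; the program and variable names are illustrative only.)

    ! Standalone sketch: read the restart filename from rpointer.drv, check that
    ! the file exists, and print DEBUG markers so a hang can be localized.
    program desp_restart_check
      implicit none
      character(len=256) :: restart_file
      logical            :: file_found
      integer            :: ierr, unitn

      print *, 'DEBUG: entering restart-filename lookup'
      open(newunit=unitn, file='rpointer.drv', status='old', action='read', iostat=ierr)
      if (ierr /= 0) then
        print *, 'DEBUG: could not open rpointer.drv, iostat = ', ierr
        stop
      end if
      read(unitn, '(a)', iostat=ierr) restart_file
      close(unitn)
      print *, 'DEBUG: rpointer.drv names restart file: ', trim(restart_file)

      inquire(file=trim(restart_file), exist=file_found)
      print *, 'DEBUG: restart file exists = ', file_found
    end program desp_restart_check

If statements like these print their DEBUG lines but nothing further appears, the hang is after the filesystem access rather than in it.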


gold2718 commented Feb 8, 2017

BTW, I tried this test on Yellowstone with CIME_MODEL=acme and it crashes as soon as the first run starts. Is this a known issue? The test was:
execca ./create_test PRE.f19_f19.ADESP.caldera_intel
The acme log ends with:

   0:   max pend req (comp2io)  =           0
   0:   enable_hs (comp2io)     = T
   0:   enable_isend (comp2io)  = F
   0:   max pend req (io2comp)  =          64
   0:   enable_hs (io2comp)    = F
   0:   enable_isend (io2comp)  = T
   0:(seq_comm_setcomm)  initialize ID (  1 GLOBAL          ) pelist   =     0     0     1 ( npes =     1) ( nthreads =  1)
   0:Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in PMPI_Comm_create: Invalid group
   0:INFO: 0031-306  pm_atexit: pm_exit_value is 1.
INFO: 0031-251  task 0 exited: rc=1
INFO: 0031-639  Exit status from pm_respond = 0

jedwards4b commented:

When you test using CIME_MODEL=acme on caldera (Yellowstone), it's set up to run without a batch system, so you need to do:
DAV_CORES=16 execca ./create_test PRE_P16.f19_f19.ADESP.caldera_intel

CIME_MODEL=acme DAV_CORES=16 execca ./create_test PRE_P16.f19_f19.ADESP.caldera_intel
 
  Requesting 16 core(s) to caldera queue, 
  to submit ./create_test PRE_P16.f19_f19.ADESP.caldera_intel the usage is to be charged into CESM0005 
  running

  bsub -Is -q caldera -n16 -PCESM0005 -W24:00 "./create_test PRE_P16.f19_f19.ADESP.caldera_intel"

  please wait.. 

Job <482131> is submitted to queue <caldera>.
<<Waiting for dispatch ...>>
<<Starting on pronghorn04-ib>>
Using project from env ACCOUNT: P93300606
Created test in directory /glade/scratch/jedwards/PRE_P16.f19_f19.ADESP.caldera_intel.20170208_125951_nenyrc
RUNNING TESTS:
  PRE_P16.f19_f19.ADESP.caldera_intel
Starting CREATE_NEWCASE for test PRE_P16.f19_f19.ADESP.caldera_intel with 1 procs
Finished CREATE_NEWCASE for test PRE_P16.f19_f19.ADESP.caldera_intel in 2.864726 seconds (PASS)
Starting XML for test PRE_P16.f19_f19.ADESP.caldera_intel with 1 procs
Finished XML for test PRE_P16.f19_f19.ADESP.caldera_intel in 0.375090 seconds (PASS)
Starting SETUP for test PRE_P16.f19_f19.ADESP.caldera_intel with 1 procs
Finished SETUP for test PRE_P16.f19_f19.ADESP.caldera_intel in 4.291831 seconds (PASS)
Starting SHAREDLIB_BUILD for test PRE_P16.f19_f19.ADESP.caldera_intel with 1 procs
Finished SHAREDLIB_BUILD for test PRE_P16.f19_f19.ADESP.caldera_intel in 235.532066 seconds (PASS)
Starting MODEL_BUILD for test PRE_P16.f19_f19.ADESP.caldera_intel with 4 procs
Finished MODEL_BUILD for test PRE_P16.f19_f19.ADESP.caldera_intel in 36.418610 seconds (PASS)
Starting RUN for test PRE_P16.f19_f19.ADESP.caldera_intel with 16 procs
Finished RUN for test PRE_P16.f19_f19.ADESP.caldera_intel in 45.955772 seconds (PASS)
At test-scheduler close, state is:
PASS PRE_P16.f19_f19.ADESP.caldera_intel RUN
    Case dir: /glade/scratch/jedwards/PRE_P16.f19_f19.ADESP.caldera_intel.20170208_125951_nenyrc
test-scheduler took 326.32220912 seconds
yslogin2: ~/sandboxes/cesm2_0_alpha/cime/scripts
:) cat  /glade/scratch/jedwards/PRE_P16.f19_f19.ADESP.caldera_intel.20170208_125951_nenyrc/TestStatus
PASS PRE_P16.f19_f19.ADESP.caldera_intel CREATE_NEWCASE
PASS PRE_P16.f19_f19.ADESP.caldera_intel XML
PASS PRE_P16.f19_f19.ADESP.caldera_intel SETUP
PASS PRE_P16.f19_f19.ADESP.caldera_intel SHAREDLIB_BUILD time=220
PASS PRE_P16.f19_f19.ADESP.caldera_intel MODEL_BUILD time=35
PASS PRE_P16.f19_f19.ADESP.caldera_intel RUN time=39
PASS PRE_P16.f19_f19.ADESP.caldera_intel COMPARE_base_pr
PASS PRE_P16.f19_f19.ADESP.caldera_intel MEMLEAK insuffiencient data for memleak test


gold2718 commented Feb 8, 2017

@jgfouca, I can't seem to reproduce that sort of behavior around here. Is there any way you could add print (or logging) statements to the routines described above?


rljacob commented Feb 8, 2017

@jgfouca how does this test behave on compute001? That's a machine you could both be on.


jgfouca commented Feb 9, 2017

@rljacob trying it now


jgfouca commented Feb 9, 2017

@rljacob @gold2718 I let PRE.f19_f19.ADESP run for almost two hours on compute001 before killing it, so that could be a platform to debug on.


gold2718 commented Feb 9, 2017

Great, @jgfouca, what is this platform and how do I get on it?


rljacob commented Feb 9, 2017

It's a workstation at Argonne. I sent Steve an email.


jgfouca commented Feb 21, 2017

@gold2718 any luck getting this problem to reproduce on compute001?

gold2718 commented:

I haven't had a chance to try it out yet (I did get an account, though).
I do think I know what the issue is and am working on a fix (hint: it's not slow performance, it's a good old-fashioned MPI_Bcast hang).
When I have a fix, I will make sure I can reproduce the issue and demonstrate that the fix works there.
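(For context, a minimal illustration, not CIME code, of how an MPI_Bcast hang of this kind behaves: if the root rank never reaches a collective that the other ranks call, the callers simply block with no error, which looks exactly like a test that runs forever.)

    ! Minimal demonstration of an MPI broadcast hang: rank 0 (the root) skips the
    ! collective that every other rank posts, so those ranks block indefinitely.
    program bcast_hang_demo
      use mpi
      implicit none
      integer :: ierr, rank, val

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      val = 0
      if (rank /= 0) then
        ! Non-root ranks wait here for a broadcast that never comes ...
        call MPI_Bcast(val, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      end if
      ! ... so only rank 0 reaches this point; there is no crash or error message.

      print *, 'rank ', rank, ' got past the broadcast, val = ', val
      call MPI_Finalize(ierr)
    end program bcast_hang_demo

Run with, e.g., mpirun -np 4: rank 0 prints its line and the job then hangs, because the other three ranks are stuck in the broadcast.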


gold2718 commented Mar 3, 2017

I'm having trouble reproducing this on compute001 with the current ESMCI/master. If I try:
./create_test PRE.f19_f19.ADESP
I get all PASS with a run time of 291 seconds.
Oddly enough, if I try something that should be faster:
./create_test PRE.f45_g37.ADESP
I get a run time of 290 seconds. Still, this is not a hang.


jgfouca commented Mar 3, 2017

@gold2718, let me try on melvin. It's possible someone inadvertently fixed this problem.


gold2718 commented Mar 3, 2017

To see if it used to hang, I tried:

git checkout a51fb8c8fa980c11a175c255af8c2db9194964b9
 ./create_test PRE.f45_g37.ADESP

Got a run time of 276 seconds.
@jgfouca, can you think of anything else I can do to get a reproducer so I can make sure I 'fixed' it?


jgfouca commented Mar 3, 2017

@gold2718 it worked for me on melvin. Go ahead and re-add this test to cime_developer.


gold2718 commented Mar 4, 2017

@jgfouca, thanks, I will do that with my next round of pause/resume upgrades (and will run tests on compute001 as part of my suite).
Of course, I do worry about spontaneous fixes like this. Is the Second Law of Thermodynamics still in force?

ghost removed the in progress label Apr 17, 2017
jgfouca pushed a commit that referenced this issue Jun 2, 2017
In cam5_4_91 tag, a bug was fixed in mo_strato_rates.F90 regarding
gamma terms. In the current model, the gamma terms are multiplied
together but they needed to be added. This change should not
affect current F compsets.

[BFB] - Bit-For-Bit
jgfouca pushed a commit that referenced this issue Mar 13, 2018