
Figure out why PRE.*.ADESP tests run forever on desktops #1112

Closed
jgfouca opened this issue Feb 8, 2017 · 24 comments


jgfouca commented Feb 8, 2017

Once the issue is resolved, re-enable this test in the cime_developer suite.


gold2718 commented Feb 8, 2017

I'm not sure I have an appropriate machine on which to debug this, as it doesn't get stuck for me (run time on Hobart is 48 seconds).
Could you add some data, such as the compiler, and maybe some gists with run logs so I can at least see where things get stuck?


rljacob commented Feb 8, 2017

There was a tar file posted for you on Slack.


jgfouca commented Feb 8, 2017

@gold2718 what Rob said. I ran for an hour on melvin with 64 cores. Two case directories were produced in my ACME scratch area, which I tarred up for you to look at.


gold2718 commented Feb 8, 2017

Is Slack really part of our CIME development system? GitHub is the (public) site for posting issues, while Slack (in addition to taking up too much time for me to monitor) is private, leading to an incomplete and scattered record of any issue.


rljacob commented Feb 8, 2017

In this particular case, Slack provides an easy way to share a binary (tar) file. Can't do that here or in a gist. I guess Dropbox could be used instead.


gold2718 commented Feb 8, 2017

If I had to choose between Slack and Dropbox, I suppose Slack is better (Dropbox is so full of security holes, I refuse to use it out of self-defense).
On the other hand, a gist is just another git repo, so there is no problem adding a binary file. If you really hate binary, there is always base64 :)


gold2718 commented Feb 8, 2017

Now can anyone tell me how to find this tar file?


gold2718 commented Feb 8, 2017

Okay, I eventually found the file by scrolling back through a large number of messages. Is this really the best we can do? I got no notification and there is no organization. Even an FTP to Yellowstone (or any machine where we both have accounts) with a follow-up email would be better.


gold2718 commented Feb 8, 2017

The second run of the PRE test exercises the pause/resume functionality using the coupler. The run on Melvin quickly writes the first restart file (at model time 3600 s) but then seems to hang. I would expect the DESP component to run but see no indication of this. Before any ESP log output appears, the component finds and reads the rpointer.drv file and checks that the restart file named there exists. Is there anything weird about these simple filesystem operations in get_restart_filenames_a (desp_comp_mod.F90)? I can't prove that the hang is in the DESP module, but if I had access to a machine that exhibited the hang, I would put print statements at the top of get_restart_filenames_a and desp_comp_run in desp_comp_mod.F90.
Also, turning up the info_debug namelist parameter (e.g., info_debug=2 in user_nl_desp) would print some information.
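(For reference, a minimal self-contained sketch, not the actual CIME routine, of the kind of rpointer.drv read, file-existence check, and debug prints being suggested above; the program and variable names are illustrative only.)

    ! Standalone sketch: read the restart filename from rpointer.drv, check that
    ! the file exists, and print DEBUG markers so a hang can be localized.
    program desp_restart_check
      implicit none
      character(len=256) :: restart_file
      logical            :: file_found
      integer            :: ierr, unitn

      print *, 'DEBUG: entering restart-filename lookup'
      open(newunit=unitn, file='rpointer.drv', status='old', action='read', iostat=ierr)
      if (ierr /= 0) then
        print *, 'DEBUG: could not open rpointer.drv, iostat = ', ierr
        stop
      end if
      read(unitn, '(a)', iostat=ierr) restart_file
      close(unitn)
      print *, 'DEBUG: rpointer.drv names restart file: ', trim(restart_file)

      inquire(file=trim(restart_file), exist=file_found)
      print *, 'DEBUG: restart file exists = ', file_found
    end program desp_restart_check

If statements like these print their DEBUG lines but nothing further appears, the hang is after the filesystem access rather than in it.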


gold2718 commented Feb 8, 2017

BTW, I tried this test on Yellowstone with CIME_MODEL=acme and it crashes as soon as the first run starts. Is this a known issue? The test was:
execca ./create_test PRE.f19_f19.ADESP.caldera_intel
The acme log ends with:

   0:   max pend req (comp2io)  =           0
   0:   enable_hs (comp2io)     = T
   0:   enable_isend (comp2io)  = F
   0:   max pend req (io2comp)  =          64
   0:   enable_hs (io2comp)    = F
   0:   enable_isend (io2comp)  = T
   0:(seq_comm_setcomm)  initialize ID (  1 GLOBAL          ) pelist   =     0     0     1 ( npes =     1) ( nthreads =  1)
   0:Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in PMPI_Comm_create: Invalid group
   0:INFO: 0031-306  pm_atexit: pm_exit_value is 1.
INFO: 0031-251  task 0 exited: rc=1
INFO: 0031-639  Exit status from pm_respond = 0

jedwards4b commented:

When you test using CIME_MODEL=acme on caldera (Yellowstone), it's set up to run without a batch system, so you need to do:
DAV_CORES=16 execca ./create_test PRE_P16.f19_f19.ADESP.caldera_intel

CIME_MODEL=acme DAV_CORES=16 execca ./create_test PRE_P16.f19_f19.ADESP.caldera_intel
 
  Requesting 16 core(s) to caldera queue, 
  to submit ./create_test PRE_P16.f19_f19.ADESP.caldera_intel the usage is to be charged into CESM0005 
  running

  bsub -Is -q caldera -n16 -PCESM0005 -W24:00 "./create_test PRE_P16.f19_f19.ADESP.caldera_intel"

  please wait.. 

Job <482131> is submitted to queue <caldera>.
<<Waiting for dispatch ...>>
<<Starting on pronghorn04-ib>>
Using project from env ACCOUNT: P93300606
Created test in directory /glade/scratch/jedwards/PRE_P16.f19_f19.ADESP.caldera_intel.20170208_125951_nenyrc
RUNNING TESTS:
  PRE_P16.f19_f19.ADESP.caldera_intel
Starting CREATE_NEWCASE for test PRE_P16.f19_f19.ADESP.caldera_intel with 1 procs
Finished CREATE_NEWCASE for test PRE_P16.f19_f19.ADESP.caldera_intel in 2.864726 seconds (PASS)
Starting XML for test PRE_P16.f19_f19.ADESP.caldera_intel with 1 procs
Finished XML for test PRE_P16.f19_f19.ADESP.caldera_intel in 0.375090 seconds (PASS)
Starting SETUP for test PRE_P16.f19_f19.ADESP.caldera_intel with 1 procs
Finished SETUP for test PRE_P16.f19_f19.ADESP.caldera_intel in 4.291831 seconds (PASS)
Starting SHAREDLIB_BUILD for test PRE_P16.f19_f19.ADESP.caldera_intel with 1 procs
Finished SHAREDLIB_BUILD for test PRE_P16.f19_f19.ADESP.caldera_intel in 235.532066 seconds (PASS)
Starting MODEL_BUILD for test PRE_P16.f19_f19.ADESP.caldera_intel with 4 procs
Finished MODEL_BUILD for test PRE_P16.f19_f19.ADESP.caldera_intel in 36.418610 seconds (PASS)
Starting RUN for test PRE_P16.f19_f19.ADESP.caldera_intel with 16 procs
Finished RUN for test PRE_P16.f19_f19.ADESP.caldera_intel in 45.955772 seconds (PASS)
At test-scheduler close, state is:
PASS PRE_P16.f19_f19.ADESP.caldera_intel RUN
    Case dir: /glade/scratch/jedwards/PRE_P16.f19_f19.ADESP.caldera_intel.20170208_125951_nenyrc
test-scheduler took 326.32220912 seconds
yslogin2: ~/sandboxes/cesm2_0_alpha/cime/scripts
:) cat  /glade/scratch/jedwards/PRE_P16.f19_f19.ADESP.caldera_intel.20170208_125951_nenyrc/TestStatus
PASS PRE_P16.f19_f19.ADESP.caldera_intel CREATE_NEWCASE
PASS PRE_P16.f19_f19.ADESP.caldera_intel XML
PASS PRE_P16.f19_f19.ADESP.caldera_intel SETUP
PASS PRE_P16.f19_f19.ADESP.caldera_intel SHAREDLIB_BUILD time=220
PASS PRE_P16.f19_f19.ADESP.caldera_intel MODEL_BUILD time=35
PASS PRE_P16.f19_f19.ADESP.caldera_intel RUN time=39
PASS PRE_P16.f19_f19.ADESP.caldera_intel COMPARE_base_pr
PASS PRE_P16.f19_f19.ADESP.caldera_intel MEMLEAK insuffiencient data for memleak test


gold2718 commented Feb 8, 2017

@jgfouca, I can't seem to reproduce that sort of behavior around here. Is there any way you could add print (or logging) statements to the routines described above?


rljacob commented Feb 8, 2017

@jgfouca how does this test behave on compute001? That's a machine you could both be on.


jgfouca commented Feb 9, 2017

@rljacob trying it now


jgfouca commented Feb 9, 2017

@rljacob @gold2718 I let PRE.f19_f19.ADESP run for almost two hours on compute001 before killing it, so that could be a platform to debug on.


gold2718 commented Feb 9, 2017

Great, @jgfouca, what is this platform and how do I get on it?


rljacob commented Feb 9, 2017

It's a workstation at Argonne. I sent Steve an email.


jgfouca commented Feb 21, 2017

@gold2718 any luck getting this problem to reproduce on compute001?

gold2718 commented:

I haven't had a chance to try it out yet (I did get an account, though).
I do think I know what the issue is and am working on a fix (hint: it's not slow performance, it's a good old-fashioned MPI_Bcast hang).
When I have a fix, I will make sure I can reproduce the issue and demonstrate that the fix works there.
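(For context, a minimal illustration, not CIME code, of how an MPI_Bcast hang of this kind behaves: if the root rank never reaches a collective that the other ranks call, the callers simply block with no error, which looks exactly like a test that runs forever.)

    ! Minimal demonstration of an MPI broadcast hang: rank 0 (the root) skips the
    ! collective that every other rank posts, so those ranks block indefinitely.
    program bcast_hang_demo
      use mpi
      implicit none
      integer :: ierr, rank, val

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      val = 0
      if (rank /= 0) then
        ! Non-root ranks wait here for a broadcast that never comes ...
        call MPI_Bcast(val, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      end if
      ! ... so only rank 0 reaches this point; there is no crash or error message.

      print *, 'rank ', rank, ' got past the broadcast, val = ', val
      call MPI_Finalize(ierr)
    end program bcast_hang_demo

Run with, e.g., mpirun -np 4: rank 0 prints its line and the job then hangs, because the other three ranks are stuck in the broadcast.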


gold2718 commented Mar 3, 2017

I'm having trouble reproducing this on compute001 with the current ESMCI/master. If I try:
./create_test PRE.f19_f19.ADESP
I get all PASS with a run time of 291 seconds.
Oddly enough, if I try something that should be faster:
./create_test PRE.f45_g37.ADESP
I get a run time of 290 seconds. Still, this is not a hang.


jgfouca commented Mar 3, 2017

@gold2718, let me try on melvin. It's possible someone inadvertently fixed this problem.


gold2718 commented Mar 3, 2017

To see if it used to hang, I tried:

git checkout a51fb8c8fa980c11a175c255af8c2db9194964b9
 ./create_test PRE.f45_g37.ADESP

Got a run time of 276 seconds.
@jgfouca, can you think of anything else I can do to get a reproducer so I can make sure I 'fixed' it?


jgfouca commented Mar 3, 2017

@gold2718 it worked for me on melvin. Go ahead and re-add this test to cime_developer.


gold2718 commented Mar 4, 2017

@jgfouca, thanks, I will do that with my next round of pause/resume upgrades (and will run tests on compute001 as part of my suite).
Of course, I do worry about spontaneous fixes like this. Is the Second Law of Thermodynamics still in force?

ghost removed the in progress label Apr 17, 2017
jgfouca pushed a commit that referenced this issue Jun 2, 2017
In cam5_4_91 tag, a bug was fixed in mo_strato_rates.F90 regarding
gamma terms. In the current model, the gamma terms are multiplied
together but they needed to be added. This change should not
affect current F compsets.

[BFB] - Bit-For-Bit
jgfouca pushed a commit that referenced this issue Mar 13, 2018