cpld_control_p8 failure w/ GNU/OpenMPI on Cheyenne #1737
Comments
@ulmononian I'm not able to run a test using ecflow on this feature branch for Cheyenne. Is that expected?
Followup. I copied your run directory from See /glade/scratch/worthen/cpld_control_p8
sorry about that, @DeniseWorthen. i may have configured the new ecflow paths incorrectly. if you want to use ecflow, perhaps reverting to the develop ECFLOW_PATH (
this is very useful and very interesting. i had a suspicion it was related to the aerosol component based upon the initial error i was seeing, but this seems to point much more strongly to that. i'll take a look at your rundir and also try running
Something seems wrong with the ecflow install. "ran out of endpoints" seems more of a generic error. If ecflow can't start the server or connect to the server in any way, that's the error that shows.
@BrianCurtis-NOAA What about the
@BrianCurtis-NOAA @DeniseWorthen apologies for the ecflow issues. i actually only use rocoto or run the tests sequentially (which of course is an issue on cheyenne). the ecflow changes in rt.sh most likely need to be corrected, but it was not a priority for me yet as i was testing w/out ecflow. sorry for the inconvenience here.
I just ran this:
Output:
Then I ran:
(GUI shows up); then:
Note that you don't need all of this, but I wanted to make sure it's not interfering. This suffices:
@climbfuji thanks for this. should we just use the basic module use/module load (miniconda/ecflow) for each machine's stanza within rt.sh? still unsure how to set (if still needed)
@ulmononian My atm-aero test finally (!) ran. It gives me this
/glade/scratch/worthen/atmaero_control_p8
yes wow that test takes a long time to get going. mine also just finished and i got the same results (
You might need to check that there are no existing gocart*nc files before you start up. I think that creates a failure (?) but I'm not 100% sure. (Not sure if you're re-using a run directory or not.)
this was done in a fresh rt_#### directory, so i don't believe there were any existing gocart*.nc files present.
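The check suggested above can be sketched in shell. This is only a minimal sketch under stated assumptions: the function name `list_stale_gocart` is hypothetical, the `gocart*nc` pattern is taken from the comment above, and this thread does not confirm that leftover files actually cause the failure.

```shell
#!/bin/sh
# Hypothetical helper: list leftover gocart*nc files in a run directory.
# Prints one path per line; returns 0 if any were found, 1 if none.
list_stale_gocart() {
    rundir="${1:-.}"
    # The "gocart*nc" pattern comes from the comment above; it is an
    # assumption about how the aerosol output files are named.
    found=$(find "$rundir" -maxdepth 1 -type f -name 'gocart*nc' | sort)
    if [ -n "$found" ]; then
        printf '%s\n' "$found"
        return 0
    fi
    return 1
}
```

For example, `list_stale_gocart /path/to/rundir && echo "stale gocart files present"` would flag a reused run directory before the model is started.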
@jkbk2004 @mark-a-potts @climbfuji @DeniseWorthen :
a complete set of rt_gnu.conf RTs is running here: the exp. path on orion is here:
@jkbk2004 @mark-a-potts @climbfuji @DeniseWorthen: to follow-up on my previous comment: all but two RTs in rt_gnu.conf pass for orion.gnu tests: a further hint that it is a gocart/aerosols issue is the fact that
I contacted Gocart team at GEOS-ESM/GOCART#227
thank you, @jkbk2004!
@ulmononian should this issue stay open, since we're actively transitioning away from Cheyenne to Derecho, and none of us has the ability to test anymore on Cheyenne?
Closing this issue as RTs can no longer be run on Cheyenne. Cpld_control_p8_gnu will continue to run on Hera and Hercules going forward.
Description
cpld_control_p8 (S2SWA) fails on Cheyenne when built/run with spack-stack/1.3.1 (gnu-9.2/openmpi-4.1.1). Compilation is successful, but the model fails at seemingly random times. The err files generated when using non-debug esmf/8.3.0b09 and mapl/2.22.0 are not very enlightening (see Output section code block). Debug versions of esmf and mapl were then used, which illuminated a bit more (w/ the model out showing failure around write-time of a WW3 restart file); please see attached files in Output section for these debug logs.

The run directory can be found here: /glade/scratch/bcameron/FV3_RT/rt_64181, with a WM base at /glade/scratch/bcameron/rt_work/wmSS1.3.1_test/.

Perhaps noteworthy is that cpld_control_nowave_noaero_p8 runs successfully using the same stack: /glade/scratch/bcameron/FV3_RT/rt_58295. I also ran several of the other RT configurations in rt_gnu.conf and all are successful: /glade/scratch/bcameron/FV3_RT/rt_62118; however, Cheyenne keeps cancelling the full suite from running. Probably user error, in case someone else would like to run the full GNU suite.

The fork branch this was tested with is https://github.com/ulmononian/ufs-weather-model/tree/feature/spack_stack_ue, which is the branch associated with PR #1707. It contains updated Cheyenne lua files for both Intel and GNU, as well as an updated ufs_common to reflect the module versions contained in spack-stack/1.3.1.

@climbfuji @mark-a-potts
To Reproduce:
git clone --recursive -b feature/spack_stack_ue https://github.com/ulmononian/ufs-weather-model.git
cd ufs-weather-model/tests
./rt.sh -k -n cpld_control_p8
Then, check out the err and out logs in the run directory.
Additional context
#1707 (spack-stack merge) is being held up (primarily) by this issue.
Output
It goes on like this for a while and ends with this backtrace:
And the lines just before the model fails (a snippet from the debug out file):
Full err and out files produced when using esmf/mapl debug versions:
err.txt
out.txt
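
When triaging err and out files like the ones attached above, a quick first pass is to search them for common failure markers. The following is a minimal sketch, not part of rt.sh: the function name `scan_run_logs` is hypothetical, the file names err and out are assumed to match those in the run directory, and the default marker list is an assumption that should be adjusted to whatever the actual logs contain.

```shell
#!/bin/sh
# Hypothetical helper: print lines containing failure markers from the
# err and out files of a run directory, prefixed with file name and
# line number. Returns 0 if any marker was found, 1 otherwise.
scan_run_logs() {
    rundir="${1:-.}"
    # Default markers are assumptions, not taken from the logs in this issue.
    pattern="${2:-backtrace|segmentation fault|abort|error}"
    status=1
    for f in "$rundir/err" "$rundir/out"; do
        [ -f "$f" ] || continue
        # -i: ignore case, -n: show line numbers, -E: extended regex,
        # -H: always print the file name.
        if grep -inEH "$pattern" "$f"; then
            status=0
        fi
    done
    return "$status"
}
```

A call such as `scan_run_logs /path/to/rundir` prints matching lines from both files; a second argument overrides the marker pattern, for instance `scan_run_logs /path/to/rundir 'WW3|restart'`.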