cpld_control_p8 failure w/ GNU/OpenMPI on Cheyenne #1737

Closed
ulmononian opened this issue May 5, 2023 · 19 comments
@ulmononian
Collaborator

ulmononian commented May 5, 2023

Description

cpld_control_p8 (S2SWA) fails on Cheyenne when built and run with spack-stack/1.3.1 (gnu-9.2/openmpi-4.1.1). Compilation succeeds, but the model fails at seemingly random times. The err files generated with the non-debug esmf/8.3.0b09 and mapl/2.22.0 are not very enlightening (see the code block in the Output section). Debug versions of esmf and mapl were then used, which were a bit more illuminating (the model out file shows the failure occurring around the time a WW3 restart file is written); please see the attached files in the Output section for these debug logs.

The run directory can be found here /glade/scratch/bcameron/FV3_RT/rt_64181, with a WM base at /glade/scratch/bcameron/rt_work/wmSS1.3.1_test/.

Perhaps noteworthy is that cpld_control_nowave_noaero_p8 runs successfully with the same stack: /glade/scratch/bcameron/FV3_RT/rt_58295. I also ran several of the other RT configurations in rt_gnu.conf and all succeed: /glade/scratch/bcameron/FV3_RT/rt_62118; however, Cheyenne keeps cancelling the full suite before it completes. That is probably user error, but I note it in case someone else would like to run the full GNU suite.

The fork branch this was tested with is https://github.com/ulmononian/ufs-weather-model/tree/feature/spack_stack_ue, which is the branch associated with PR #1707. It contains updated Cheyenne lua files for both Intel and GNU, as well as an updated ufs_common to reflect the module versions contained in spack-stack/1.3.1.

@climbfuji @mark-a-potts

To Reproduce:

git clone --recursive -b feature/spack_stack_ue https://github.com/ulmononian/ufs-weather-model.git
cd ufs-weather-model/tests
./rt.sh -k -n cpld_control_p8

Then, check out the err and out logs in the run directory.
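For reference, a hedged sketch of how to pull those logs up afterwards (the rt_#### suffix changes every run; the err/out file names and the FV3_RT location are as referenced in this issue, with $USER standing in for the account that launched the tests):

   # locate the most recent regression-test run directory and tail its logs
   RUNDIR=$(ls -dt /glade/scratch/$USER/FV3_RT/rt_*/cpld_control_p8 | head -n 1)
   tail -n 50 "$RUNDIR"/err "$RUNDIR"/out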

Additional context

#1707 (spack-stack merge) is being held up (primarily) by this issue.

Output

[r14i5n19:60175:0:60175] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
==== backtrace (tid:  60175) ====
 0  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(ucs_handle_error+0xe4) [0x2b1b6d9311b4]
 1  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(+0x2a4dc) [0x2b1b6d9314dc]
 2  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(+0x2a74b) [0x2b1b6d93174b]
 3  /lib64/libpthread.so.0(+0x11c00) [0x2b1b578fcc00]
 4  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0x48) [0x2b1b6d926dd8]
 5  /glade/u/apps/ch/opt/ucx/1.11.0/lib/ucx/libuct_ib.so.0(+0x491f2) [0x2b1b6ddbc1f2]
 6  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x2b1b6d47711a]
 7  /glade/u/apps/ch/opt/openmpi/4.1.1/gnu/10.1.0/lib/openmpi/mca_osc_ucx.so(ompi_osc_ucx_lock+0x431) [0x2b1b75296f81]
 8  /glade/u/apps/ch/opt/openmpi/4.1.1/gnu/10.1.0/lib/libmpi.so.40(MPI_Win_lock+0xd9) [0x2b1b5825c5b9]
 9  /glade/u/apps/ch/opt/openmpi/4.1.1/gnu/10.1.0/lib/libmpi_mpifh.so.40(pmpi_win_lock_+0x30) [0x2b1b57fb3e80]
10  ./fv3.exe() [0x1a9c07e]

It goes on like this for a while and ends with this backtrace:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x2b1b578fcbff in pthread_mutex_setprioceiling
        at /usr/src/debug/glibc-2.22/nptl/pthread_mutex_setprioceiling.c:88
#1  0x2b1b6d926dd8 in ucs_list_del
        at /glade/work/vanderwb/build/openmpi/34030/ucx-1.11.0/src/ucs/datastruct/list.h:105
#2  0x2b1b6d926dd8 in ucs_arbiter_dispatch_nonempty
        at datastruct/arbiter.c:284
#3  0x2b1b6ddbc1f1 in ucs_arbiter_dispatch
        at /glade/work/vanderwb/build/openmpi/34030/ucx-1.11.0/src/ucs/datastruct/arbiter.h:386
#4  0x2b1b6ddbc1f1 in uct_dc_mlx5_iface_progress_pending
        at dc/dc_mlx5_ep.h:335
#5  0x2b1b6ddbc1f1 in uct_dc_mlx5_poll_tx
        at dc/dc_mlx5.c:254
#6  0x2b1b6ddbc1f1 in uct_dc_mlx5_iface_progress
        at dc/dc_mlx5.c:271
#7  0x2b1b6ddbc1f1 in uct_dc_mlx5_iface_progress_ll
        at dc/dc_mlx5.c:281
#8  0x2b1b6d477119 in ucs_callbackq_dispatch
        at /glade/work/vanderwb/build/openmpi/34030/ucx-1.11.0/src/ucs/datastruct/callbackq.h:211
#9  0x2b1b6d477119 in uct_worker_progress
        at /glade/work/vanderwb/build/openmpi/34030/ucx-1.11.0/src/uct/api/uct.h:2592
#10  0x2b1b6d477119 in ucp_worker_progress
        at core/ucp_worker.c:2635
#11  0x2b1b75296f80 in ???
#12  0x2b1b5825c5b8 in ???
#13  0x2b1b57fb3e7f in ???
#14  0x1a9c07d in __pfio_directoryservicemod_MOD_put_directory
        at /glade/work/epicufsrt/contrib/spack-stack/spack-stack-1.3.1/cache/build_stage/spack-stage-mapl-2.22.0-5z4kqhw26h2bbjtvp7tifxwimvhl4wgj/spack-src/pfio/DirectoryService.F90:537
#15  0x1a9e808 in __pfio_directoryservicemod_MOD_new_directoryservice
        at /glade/work/epicufsrt/contrib/spack-stack/spack-stack-1.3.1/cache/build_stage/spack-stage-mapl-2.22.0-5z4kqhw26h2bbjtvp7tifxwimvhl4wgj/spack-src/pfio/DirectoryService.F90:110
#16  0x198e657 in __mapl_servermanager_MOD_initialize
        at /glade/work/epicufsrt/contrib/spack-stack/spack-stack-1.3.1/cache/build_stage/spack-stage-mapl-2.22.0-5z4kqhw26h2bbjtvp7tifxwimvhl4wgj/spack-src/base/ServerManager.F90:121
#17  0x15aba46 in __mapl_capmod_MOD_initialize_io_clients_servers
        at /glade/work/epicufsrt/contrib/spack-stack/spack-stack-1.3.1/cache/build_stage/spack-stage-mapl-2.22.0-5z4kqhw26h2bbjtvp7tifxwimvhl4wgj/spack-src/gridcomps/Cap/MAPL_Cap.F90:199
#18  0x146530a in modeldatainitialize
        at /glade/scratch/bcameron/rt_work/wmSS1.3.1_test/GOCART/ESMF/UFS/Aerosol_Cap.F90:330

And the lines just before the model fails (a snippet from the debug out file):

PASS: fcstRUN phase 1, n_atmsteps =               29 time is         1.450486
UFS Aerosols: Advancing from 2021-03-22T11:48:00 to 2021-03-22T12:00:00

 Writing:      7 Slices to File:  gocart.inst_aod.20210322_1200z.nc4
WW3: writing restart file ufs.cpld.ww3.r.2021-03-22-43200
9 total processes killed (some possibly by mpirun during cleanup)

Full err and out files produced when using esmf/mapl debug versions:

err.txt

out.txt

ulmononian added the bug label May 5, 2023
@DeniseWorthen
Collaborator

DeniseWorthen commented May 5, 2023

@ulmononian I'm not able to run a test using ecflow on this feature branch for Cheyenne. Is that expected?

Checking if the server is already running on cheyenne4 and port 36124
/glade/work/jedipara/cheyenne/spack-stack/ecflow-5.8.4/bin/ecflow_client: error while loading shared libraries: libssl.so.1.1: cannot open shared object file: No such file or directory

Backing up check point and log files

OK starting ecFlow server...

Placing server into RESTART mode...
/glade/work/jedipara/cheyenne/spack-stack/ecflow-5.8.4/bin/ecflow_client: error while loading shared libraries: libssl.so.1.1: cannot open shared object file: No such file or directory
restart of server failed
+ set -e
+ ECFLOW_RUNNING=true
+ export ECF_PORT
+ export ECF_HOST
+ ecflow_client --load=/glade/work/worthen/ufs_dev/tests/ecflow_run/regtest_2932.def
[08:57:53 5.5.2023] ClientInvoker: Connection error: (Client::handle_connect: Ran out of end points: connection error( Connection refused ) for request( --load=/glade/work/worthen/ufs_dev/tests/ecflow_run/regtest_2932.def ) on cheyenne4:36124)
[08:58:03 5.5.2023] ClientInvoker: Connection error: (Client::handle_connect: Ran out of end points: connection error( Connection refused ) for request( --load=/glade/work/worthen/ufs_dev/tests/ecflow_run/regtest_2932.def ) on cheyenne4:36124)
[08:58:03 5.5.2023] Request( --load=/glade/work/worthen/ufs_dev/tests/ecflow_run/regtest_2932.def ), Failed to connect to cheyenne4:36124. After 2 attempts. Is the server running ?
ClientEnvironment:
[08:58:03 5.5.2023] Ecflow version(5.5.3) boost(1.74.0) compiler(gcc 7.5.0) protocol(JSON cereal 1.3.0) Compiled on Oct 13 2020 23:00:59
   ECF_HOST/ECF_PORT : host_vec_index_ = 0 host_vec_.size() = 1
   cheyenne4:36124
   ECF_NAME =
   ECF_PASS =
   ECF_RID =
   ECF_TRYNO = 1
   ECF_HOSTFILE =
   ECF_TIMEOUT = 86400
   ECF_ZOMBIE_TIMEOUT = 43200
   ECF_CONNECT_TIMEOUT = 0
   ECF_DENIED = 0
   NO_ECF = 0
   ECF_DEBUG_CLIENT = 0

@DeniseWorthen
Collaborator

DeniseWorthen commented May 5, 2023

Followup. I copied your run directory from /glade/scratch/bcameron/FV3_RT/rt_64181/cpld_control_p8/ and removed the Aerosol component. I compiled S2SW in GNU+Debug and it is running fine. I think the issue must be w/ the aerosol component. Have you tried running just the atmaero_control_p8 case?

See /glade/scratch/worthen/cpld_control_p8

@ulmononian
Collaborator Author

ulmononian commented May 5, 2023

@ulmononian I'm not able to run a test using ecflow on this feature branch for Cheyenne. Is that expected?

sorry about that, @DeniseWorthen. i may have configured the new ecflow paths incorrectly. if you want to use ecflow, perhaps reverting to the develop branch's ECFLOW_START (ECFLOW_START=/glade/p/ral/jntp/tools/miniconda3/4.8.3/envs/ufs-weather-model/bin/ecflow_start.sh) might work for now. i'll try to get this fixed soon.
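For clarity, a minimal sketch of that interim workaround, assuming ECFLOW_START is read as a shell variable by rt.sh (the path is the develop value quoted above):

   # temporary: point rt.sh back at the develop branch's ecFlow launcher on Cheyenne
   ECFLOW_START=/glade/p/ral/jntp/tools/miniconda3/4.8.3/envs/ufs-weather-model/bin/ecflow_start.sh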

@ulmononian
Collaborator Author

Followup. I copied your run directory from /glade/scratch/bcameron/FV3_RT/rt_64181/cpld_control_p8/ and removed the Aerosol component. I compiled S2SW in GNU+Debug and it is running fine. I think the issue must be w/ the aerosol component. Have you tried running just the atmaero_control_p8 case?

See /glade/scratch/worthen/cpld_control_p8

this is very useful and very interesting. i had a suspicion it was related to the aerosol component based upon the initial error i was seeing, and this seems to point much more strongly to that. i'll take a look at your rundir and also try running atmaero_control_p8 (i had not yet done that). thank you very much!

@BrianCurtis-NOAA
Collaborator

Something seems wrong with the ecflow install. "Ran out of end points" seems more of a generic error: if ecflow can't start the server or connect to the server in any way, that's the error that shows.
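A quick way to check whether the server is reachable at all (a hedged sketch; the host and port are simply the values from the log above):

   # ping the ecFlow server the client is trying to reach
   ecflow_client --ping --host=cheyenne4 --port=36124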

@DeniseWorthen
Collaborator

@BrianCurtis-NOAA What about the

error while loading shared libraries: libssl.so.1.1: cannot open shared object file: No such file or directory

@ulmononian
Collaborator Author

@BrianCurtis-NOAA @DeniseWorthen apologies for the ecflow issues. i actually only use rocoto or run the tests sequentially (which of course is an issue on cheyenne). the ecflow changes in rt.sh most likely need to be corrected, but that was not a priority for me yet since i was testing without ecflow. sorry for the inconvenience here.

@climbfuji
Collaborator

I just ran this:

   module purge
   export LMOD_TMOD_FIND_FIRST=yes
   module use /glade/work/jedipara/cheyenne/spack-stack/modulefiles/misc
   module load miniconda/3.9.12
   module load ecflow/5.8.4
   module load mysql/8.0.31

   module use /glade/work/epicufsrt/contrib/spack-stack/spack-stack-1.3.1/envs/unified-env/install/modulefiles/Core
   module load stack-intel/19.1.1.217
   module load stack-intel-mpi/2019.7.217
   module load stack-python/3.9.12

   module load jedi-fv3-env jedi-ewok-env soca-env

   export ECF_PORT=5907
   ecflow_start.sh -p $ECF_PORT
   ecflow_client --ping

Output:

ping server(localhost:5907) succeeded in 00:00:00.001142  ~1 milliseconds

Then I ran:

nice ecflow_ui &

(GUI shows up); then:

ecflow_stop.sh -p $ECF_PORT

Note that you don't need all of this, but I wanted to make sure it's not interfering. This suffices:

   module purge
   export LMOD_TMOD_FIND_FIRST=yes
   module use /glade/work/jedipara/cheyenne/spack-stack/modulefiles/misc
   module load miniconda/3.9.12
   module load ecflow/5.8.4

@ulmononian
Collaborator Author

ulmononian commented May 5, 2023

module purge
export LMOD_TMOD_FIND_FIRST=yes
module use /glade/work/jedipara/cheyenne/spack-stack/modulefiles/misc
module load miniconda/3.9.12
module load ecflow/5.8.4

@climbfuji thanks for this. should we just use the basic module use / module load (miniconda/ecflow) for each machine's stanza within rt.sh? i'm still unsure how to set ECFLOW_START and ECF_PORT (if they are still needed); you exported ECF_PORT, but i don't know how you got that value or how to set it properly for each machine (it appears in some of the ecflow modulefiles, but not all, and doesn't seem to correspond to how ECF_PORT is set in develop's rt.sh).
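For discussion, a rough sketch of what a Cheyenne stanza might look like, assuming the stock ecflow_start.sh convention of deriving the port from the user id (1500 + UID); this is an assumption for illustration, not how develop's rt.sh actually sets it:

   module use /glade/work/jedipara/cheyenne/spack-stack/modulefiles/misc
   module load miniconda/3.9.12
   module load ecflow/5.8.4
   export ECF_PORT=$(( 1500 + $(id -u) ))   # assumed default ecflow_start.sh scheme
   ECFLOW_START=ecflow_start.sh             # launcher now found via the module's PATH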

@DeniseWorthen
Collaborator

DeniseWorthen commented May 5, 2023

@ulmononian My atm-aero test finally (!) ran. It gives me this

#14  0x19607bb in __pfio_rdmareferencemod_MOD_fence
        at /glade/work/epicufsrt/contrib/spack-stack/spack-stack-1.3.1/cache/build_stage/spack-stage-mapl-2.22.0-5z4kqhw26h2bbjtvp7tifxwimvhl4wgj/spack-src/pfio/RDMAReference.F90:163
#15  0x18c324f in __pfio_baseservermod_MOD_receive_output_data

/glade/scratch/worthen/atmaero_control_p8

@ulmononian
Collaborator Author

ulmononian commented May 5, 2023

0x19607bb in __pfio_rdmareferencemod_MOD_fence

yes wow that test takes a long time to get going. mine also just finished and i got the same results (/glade/scratch/bcameron/FV3_RT/rt_16799/atmaero_control_p8). the "Writing: 7 Slices to File: gocart.inst_aod.20210322_1200z.nc4" line at the end of the out file is the same one that immediately preceded the WW3 restart file write i mentioned in the issue description (for the cpld_control_p8 run). definitely looks like a gocart/mapl issue at this point. thanks for your help here!!!

@DeniseWorthen
Collaborator

DeniseWorthen commented May 5, 2023

You might need to check that there are no existing gocart*nc files before you start up. I think that creates a failure (?) but I'm not 100% sure. (Not sure if you're re-using a run directory or not.)
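In case it helps, a hypothetical pre-run check along those lines (run from the test's run directory; the file glob is illustrative):

   # warn if GOCART output from a previous attempt is still present
   if compgen -G "gocart.*.nc4" > /dev/null; then
     echo "WARNING: existing gocart*.nc4 files found; remove them before re-running"
   fi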

@ulmononian
Collaborator Author

You might need to check that there are no existing gocart*nc files before you start up. I think that creates a failure (?) but I'm not 100% sure. (Not sure if you're re-using a run directory or not.)

this was done in a fresh rt_#### directory, so i don't believe there were any existing gocart*.nc files present.

@ulmononian
Collaborator Author

ulmononian commented May 10, 2023

@jkbk2004 @mark-a-potts @climbfuji @DeniseWorthen :

cpld_control_p8 was run on orion with gnu/10.2.0 and openmpi/4.0.4 to check whether this compiler/mpi combination might be inherently problematic when running the WM with aerosols. though these are not the exact versions used on cheyenne, they are close. the model failed at the same location as on cheyenne (see the attached out file). however, the err output on orion is neither identical to nor as descriptive as the err output on cheyenne (even with the mapl & esmf debug versions in use). i've attached a screenshot of the err file from orion below.

atmaero_control_p8 also fails using orion.gnu: /work/noaa/stmp/cbook/stmp/cbook/FV3_RT/rt_422829

a complete set of rt_gnu.conf RTs is running here: /work/noaa/stmp/cbook/stmp/cbook/FV3_RT/rt_6745

the exp. path on orion is here: /work/noaa/stmp/cbook/stmp/cbook/FV3_RT/rt_283017

Screen Shot 2023-05-09 at 4 02 55 PM
Screen Shot 2023-05-09 at 4 04 37 PM

@ulmononian
Collaborator Author

ulmononian commented May 10, 2023

@jkbk2004 @mark-a-potts @climbfuji @DeniseWorthen:

to follow up on my previous comment: all but two RTs in rt_gnu.conf pass for the orion.gnu tests: cpld_control_p8 and cpld_debug_p8 fail (test suite location: /work/noaa/stmp/cbook/stmp/cbook/FV3_RT/rt_215148). see the rocotostat output (attached image below).

a further hint that it is a gocart/aerosols issue is the fact that cpld_control_noaero_p8 passes: /work/noaa/stmp/cbook/stmp/cbook/FV3_RT/rt_249407/cpld_control_noaero_p8

Screen Shot 2023-05-09 at 10 04 41 PM

@jkbk2004
Collaborator

jkbk2004 commented May 10, 2023

I contacted Gocart team at GEOS-ESM/GOCART#227

@ulmononian
Collaborator Author

I contacted Gocart team at GEOS-ESM/GOCART#227

thank you, @jkbk2004!

@zach1221
Collaborator

@ulmononian should this issue stay open, since we're actively transitioning away from Cheyenne to Derecho, and none of us has the ability to test on Cheyenne anymore?

@zach1221
Collaborator

Closing this issue as RTs can no longer be run on Cheyenne. Cpld_control_p8_gnu will continue to run on Hera and Hercules going forward.

github-project-automation bot moved this from In Progress to Done in Backlog: platforms and RT on Oct 13, 2023