Problems with running fates_next_api/release-clm5.0 on izumi #1093

Open
ekluzek opened this issue Jul 31, 2020 · 17 comments
Assignees: ekluzek
Labels: branch tag: release (changes go on release branch as well as master); bug (something is working incorrectly); enhancement (new capability or improved behavior of existing capability); FATES API update (changes to the FATES version that also REQUIRE an API change in CTSM)

Comments


ekluzek commented Jul 31, 2020

Brief summary of bug

Jackie (@jkshuman) has recently had problems running fates_next_api on izumi.

General bug information

CTSM version you are using: release-clm5.0.30-143-gabcd5937

Does this bug cause significantly incorrect results in the model's science? No
Configurations affected: izumi_intel

Details of bug

The gptl timing library fails to build.

Important details of your setup / configuration so we can reproduce the bug

I think this is just because fates_next_api is using cime5.6.28 and needs to be updated to cime5.6.33.

Important output or errors that show the problem

Got another fail with gptl:

Finished creating component namelists
Building gptl with output to file /scratch/cluster/jkshuman/test2_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/gptl.bldlog.200730-162622
Calling /home/jkshuman/git/fates_next_api/cime/src/build_scripts/buildlib.gptl
ERROR: /home/jkshuman/git/fates_next_api/cime/src/build_scripts/buildlib.gptl FAILED, cat /scratch/cluster/jkshuman/test2_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/gptl.bldlog.200730-162622

My case script is here in case it is another obvious error.
path: /home/jkshuman/FATES_data/boreal/above_canada/case_izu_GLDAS_ABoVE_canada

ekluzek added the enhancement, bug, and FATES API update labels Jul 31, 2020
ekluzek self-assigned this Jul 31, 2020

ekluzek commented Jul 31, 2020

From looking at the ChangeLog for cime, I think this should drop in with no problems, and no change of answers.


ekluzek commented Jul 31, 2020

Here's another log message:

[jkshuman@izumi bld]$ cat gptl.bldlog.200729-182059
gmake --output-sync -f /home/jkshuman/git/fates_next_api/cime/src/share/timing/Makefile install -C /scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl MACFILE=/home/jkshuman/FATES_cases/Canada/test/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/Macros.make MODEL=gptl GPTL_DIR=/home/jkshuman/git/fates_next_api/cime/src/share/timing GPTL_LIBDIR=/scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl SHAREDPATH=/scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads
gmake: Entering directory '/scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl'
mpicc -c -I/home/jkshuman/git/fates_next_api/cime/src/share/timing -qno-opt-dynamic-align -fp-model precise -std=gnu99 -lifcore -O2 -debug minimal -DHAVE_NANOTIME -DBIT64 -DHAVE_VPRINTF -DHAVE_BACKTRACE -DHAVE_SLASHPROC -DHAVE_COMM_F2C -DHAVE_TIMES -DHAVE_GETTIMEOFDAY -DFORTRANUNDERSCORE -DCPRINTEL -DHAVE_MPI /home/jkshuman/git/fates_next_api/cime/src/share/timing/gptl.c
/home/jkshuman/git/fates_next_api/cime/src/share/timing/Makefile:57: recipe for target 'gptl.o' failed
gmake: Leaving directory '/scratch/cluster/jkshuman/testGLDAS_boreal_Izu_intel_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl'
gmake: *** [gptl.o] Error 127
ERROR: gmake: *** [gptl.o] Error 127
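
A note on the failure mode: gmake reports "Error 127" when the shell cannot find the command named in a recipe, so the likely proximate cause is that mpicc was not on PATH in the build environment. A minimal check one could run on izumi, assuming a standard module-based setup (these commands are illustrative, not from the thread):

which mpicc || echo "mpicc not on PATH"
module list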


jkshuman commented Aug 3, 2020

Thanks @ekluzek, I will follow up with you on where to get the updated cime, unless you want to post it here and I can update and test.


ekluzek commented Aug 3, 2020

@jkshuman use cime5.6.33 and see if it works.
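
A minimal sketch of what that update looks like, modeled on the Externals.cfg entry posted later in this thread (only the version line differs; assuming a tag entry rather than a branch):

[cime]
local_path = cime
protocol = git
repo_url = https://github.com/ESMCI/cime
tag = cime5.6.33
required = True

Then re-run ./manage_externals/checkout_externals cime to fetch it.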


ekluzek commented Aug 3, 2020

@jkshuman you must be watching now (or I just had a glitch Friday), as you now show up when I start typing your name. If I start typing someone's name and they don't show up as an option, it's usually because they aren't watching.


jkshuman commented Aug 3, 2020

@ekluzek still getting a fail on Izumi. Let me know if I am missing something:
Working in this clone directory: /home/jkshuman/git/fates_next_api
Updated cime in the Externals.cfg file to cime.5.6.33
Ran ./manage_externals/checkout_externals cime
(also tried running ./manage_externals/checkout_externals and got the same fail)
Inside the cime folder, git describe shows an update to: cime5.6.32-16-gb5d8cb94e

case build still fails on Izumi:
Calling /home/jkshuman/git/fates_next_api/cime/src/components/stub_comps/sesp/cime_config/buildnml
Calling /home/jkshuman/git/fates_next_api/cime/src/drivers/mct/cime_config/buildnml
Finished creating component namelists
Building gptl with output to file /scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/gptl.bldlog.200803-172712
Calling /home/jkshuman/git/fates_next_api/cime/src/build_scripts/buildlib.gptl
ERROR: /home/jkshuman/git/fates_next_api/cime/src/build_scripts/buildlib.gptl FAILED, cat /scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/gptl.bldlog.200803-172712


jkshuman commented Aug 3, 2020

error in that file:
gmake --output-sync -f /home/jkshuman/git/fates_next_api/cime/src/share/timing/Makefile install -C /scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl MACFILE=/home/jkshuman/FATES_cases/Canada/test/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/Macros.make MODEL=gptl GPTL_DIR=/home/jkshuman/git/fates_next_api/cime/src/share/timing GPTL_LIBDIR=/scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl SHAREDPATH=/scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads
gmake: Entering directory '/scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl'
mpicc -c -I/home/jkshuman/git/fates_next_api/cime/src/share/timing -qno-opt-dynamic-align -fp-model precise -std=gnu99 -O2 -debug minimal -DHAVE_NANOTIME -DBIT64 -DHAVE_VPRINTF -DHAVE_BACKTRACE -DHAVE_SLASHPROC -DHAVE_COMM_F2C -DHAVE_TIMES -DHAVE_GETTIMEOFDAY -DFORTRANUNDERSCORE -DCPRINTEL -DHAVE_MPI /home/jkshuman/git/fates_next_api/cime/src/share/timing/gptl.c
/home/jkshuman/git/fates_next_api/cime/src/share/timing/Makefile:57: recipe for target 'gptl.o' failed
gmake: Leaving directory '/scratch/cluster/jkshuman/cime_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/intel/mvapich2/nodebug/nothreads/gptl'
gmake: *** [gptl.o] Error 127
ERROR: gmake: *** [gptl.o] Error 127

ekluzek added the branch tag: release label Aug 4, 2020

ekluzek commented Aug 4, 2020

OK, I verified the same problem by testing fates_next_api with both the default cime and cime5.6.33...

SMS.f09_g17.I2000Clm50Fates.izumi_intel.clm-FatesColdDef

Then I also tried it on the release branch and saw the same problem (cime5.6.33 is the default in release-clm5.0.34).

I also tried the more generic test

SMS.f09_g17.I2000Clm50BgcCrop.izumi_intel.clm-default

and it fails as well.
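
For reference, tests like these are launched with CIME's create_test script from the top of a CTSM clone; a sketch, assuming the standard checkout layout:

./cime/scripts/create_test SMS.f09_g17.I2000Clm50Fates.izumi_intel.clm-FatesColdDef
./cime/scripts/create_test SMS.f09_g17.I2000Clm50BgcCrop.izumi_intel.clm-default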

ekluzek changed the title from "Problems with running fates_next_api on izumi" to "Problems with running fates_next_api/release-clm5.0 on izumi" Aug 4, 2020

ekluzek commented Aug 4, 2020

OK, it looks like the izumi updates went into the cime maint-5.6 branch, but they haven't been tagged yet. When I point cime at the latest maint-5.6 branch, it does seem to build.

This is the cime PR with the needed updates...

ESMCI/cime#3561


jkshuman commented Aug 4, 2020

@ekluzek the model builds and submits, but fails after the first time-step. (This same case was successful on Hobart.)
"Killed by signal 15" (i.e., SIGTERM)
Here is the error:
run command is mpiexec --machinefile /var/spool/torque/aux//337043.izumi.unified.ucar.edu -n 384 /scratch/cluster/jkshuman/cime_t3_Izu_GLDAS_boreal_GLDAS_0.25x_surf_0.125_6f568e42_e9f63270/bld/cesm.exe >> cesm.log.$LID 2>&1
2020-08-04 13:47:37 MODEL EXECUTION HAS FINISHED
check for resubmit
dout_s True
mach izumi
resubmit_num 23
-------------------- Post Job Clean Up --------------------
Running cleanipcs as jkshuman on i041.unified.ucar.edu
Killed by signal 15.
Terminated i035.unified.ucar.edu
Connection to i035.unified.ucar.edu closed by remote host.


jkshuman commented Aug 4, 2020

@ekluzek I tried a run that uses only one node and got the same fail. The two tests fail in a similar way: they complete the first time-step and then fail on resubmit. (The first run used 8 nodes with a monthly time-step; the second test used 1 node with a yearly time-step.) Same fail: "killed by signal 15"

The 1-node case (junk, no fire) will continue if I resubmit manually from inside the case. I did not test the 8-node case.
Manual resubmit works on this case: /scratch/cluster/jkshuman/junk_cime_nofire_4x5_6f568e42_e9f63270/run


jkshuman commented Aug 4, 2020

And this junk fire case running on 1 node was able to make it into year 2 automatically:
/scratch/cluster/jkshuman/junk_cime_izu_fire_4x5_6f568e42_e9f63270/run


jkshuman commented Aug 4, 2020

Note that the cime fix for Izumi works.
Fix: modify the Externals.cfg file to point at the maint-5.6 branch for access to the necessary Izumi updates:
[cime]
local_path = cime
protocol = git
repo_url = https://github.com/ESMCI/cime
branch = maint-5.6
required = True
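
After editing Externals.cfg, re-fetch cime and verify what was checked out (the same commands used earlier in this thread):

./manage_externals/checkout_externals cime
cd cime && git describe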

The resubmit problem is a different issue (and inconsistent, as not all of my test cases fail). @ekluzek should we close this and open a separate issue for the resubmit problem?

@jedwards4b (is this Jim Edwards?) suggested a workaround option:
./case.submit --resubmit-immediate
This option is functional on Izumi, and a test case is into year 3 (annual time-step) with it.

per Jim: "Are you aware of the resubmit immediate option to case.submit? It will submit all of your jobs at once from the login node with dependencies so that each job will complete before the next begins. This should be an effective workaround for the problem of compute nodes not resubmitting properly."


glemieux commented Aug 5, 2020

@ekluzek I just accidentally replicated the above error on my workstation while trying to build a single-site case. Last week, while helping @jkshuman track down the issue using my workstation, I had been able to successfully build and run with sci.1.40.1_api.13.0.1 and fates_next_api release-clm5.0.30-143-gabcd5937 (with cime5.6.28).

The trigger for the failure this time was that I was trying to build the case with a conda environment activated that I don't normally use during case builds. Perhaps that suggests it's an issue with the module versions loaded on izumi? I can provide my output from conda list if you think it'd be helpful.
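
One quick way to test that hypothesis, assuming a typical conda setup (illustrative commands, not from the thread): deactivate conda before building, so the MPI compiler wrapper resolves to the module toolchain rather than anything conda put on PATH, then rebuild from the case directory.

conda deactivate
which mpicc
./case.build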


jkshuman commented Aug 5, 2020

@glemieux this is interesting. I just did an overhaul of my conda environments, though I do not recall which conda environment was active (if any) when I ran these test cases.

ekluzek added the next label (this should get some attention in the next week or two; normally each Thursday SE meeting) Aug 11, 2020
billsacks removed the next label Aug 13, 2020

jkshuman commented Sep 4, 2020

Tested cime5.8.30 on Izumi with fates_main_api per @ekluzek's recommendation, and the simulation was successful. Thanks @ekluzek

ctsm commit fde33f56e9d65e7cebc79a7a2319d8b1e5959296 (HEAD -> fates_main_api, escomp_ctsm_repo/fates_main_api)
cime5.8.30
fates_main commit 61a751c37181f18162e23851fff495db62fc807a (HEAD -> master, tag: sci.1.41.0_api.13.0.1, origin/master, origin/HEAD)

path to output: /scratch/cluster/jkshuman/archive/t4_izumi_JKS_C3_main_4x5_fde33f56_6bfea0f8/lnd/hist


glemieux commented Sep 10, 2020

@ekluzek noted in the ctsm software meeting today that cime5.8.24 is the minimum necessary to alleviate this issue. For fates_main_api this should be taken care of by PR #1137, as it brings the branch up to cime5.8.28.
