Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update support for low-resolution Greenland tests #6445

Merged

Conversation

stephenprice
Copy link
Contributor

This PR updates and/or adds new testing support for IG and BG test
cases using an active, 20km Greenland ice sheet.

Updates & changes include:

  • updated SMS test support so that mali hist files are included in baseline comparisons
  • added a default PE layout for Chrys and pm-cpu for low-res, BG case with active 20km GIS
  • simplified existing ERS & SMS IG case tests to use same testdef files
  • updated all IG and BG tests to use v3 Icos ocean mesh
  • moved IG tests out of elm component dirs and into mali component dirs
  • all updates have been tested on pm-cpu and Chrys

Remove 'exclude_testing' attribute for MALI to allow MALI generation
and comparison of MALI hist files for SMS test. Add nl specs for SMS
and ERS tests so that hist.am files are NOT written out as part of
testing (allows hist files to be compared rather than hist.am files).
Added defulat PE layout for BG case using GIS 20km for
Chrys and PM-cpu. Added some additional explicit stub GLC specs
to various new PE layouts (so that these aren't accidentally
'found' and chosen for other layouts w/ an active GLC).
Add BGWCYCL1850 test with 20km GIS; move location of existing IGELM_MLI
tests and support from ./elm component dir to ./mpas-albany-landice dir;
simplify tests so that they all use the same user_nl_mali and shell
scripts.
@stephenprice stephenprice added mpas-albany-landice Testing Anything related to unit/system tests Chrysalis pm-cpu Perlmutter at NERSC (CPU-only nodes) labels May 24, 2024
@stephenprice
Copy link
Contributor Author

Updated tests associated with this PR can be run using:

./create_test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.IGELM_MLI.chrysalis_gnu.mali-gis20km --project e3sm -g -o -b gis20.BGELM_MLI.baseline.SMS

./create_test ERS.ne30pg2_r05_IcoswISC30E3r5_gis20.IGELM_MLI.chrysalis_gnu.mali-gis20km --project e3sm

./create_test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km --project e3sm -g -o -b gis20.BGWCYCL1850.baseline.SMS

For Perlmutter, replace "chysalis_gnu" with "pm-cpu_gnu".

Note that for SMS tests, new baselines should first be generated using "-g -o" flags before comparing using "-c" flag.

@jonbob
Copy link
Contributor

jonbob commented May 28, 2024

@stephenprice -- can you change the title/description to not include the branch name?

@stephenprice stephenprice changed the title Stephenprice/glc/update gis20km tests Update support for low-resolution Greenland tests. Jun 10, 2024
@stephenprice stephenprice changed the title Update support for low-resolution Greenland tests. Update support for low-resolution Greenland tests Jun 10, 2024
@stephenprice
Copy link
Contributor Author

Hi @jonbob. Apologies for the long delay. Just getting back to this now. Title changed to something sensible now (sorry for missing that first time around).

@rljacob
Copy link
Member

rljacob commented Jun 13, 2024

@matthewhoffman and @trhille please review.

@trhille
Copy link
Contributor

trhille commented Jun 14, 2024

Testing on Perlmutter:

  • SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.IGELM_MLI.pm-cpu_gnu.mali-gis20km
  • ERS.ne30pg2_r05_IcoswISC30E3r5_gis20.IGELM_MLI.pm-cpu_gnu.mali-gis20km
  • SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.pm-cpu_gnu.allactive-gis20km

Testing on Chrysalis:

  • SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.IGELM_MLI.chrysalis_gnu.mali-gis20km
  • ERS.ne30pg2_r05_IcoswISC30E3r5_gis20.IGELM_MLI.chrysalis_gnu.mali-gis20km
  • SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km

It looks like the ERS test is failing on Chrysalis, as there is no glc.log file for the restart. However, I'm not able to find any other indication of it failing, so maybe I'm missing something.

@stephenprice
Copy link
Contributor Author

stephenprice commented Jun 17, 2024

The ERS test that passes fine on PM-cpu is failing on Chyrs at the point it is supposed to be copying over the relevant history files from the restart case, for comparison to those already save for the 'base' case. The error, that's coming from /driver-mct/shr/seq_infodata_mod.F90, is:

0: ERROR: (seq_infodata_Check) : invalid continue restart case name =

(Edited) Jon Wolfe noted that this problem has already been reported and discussed (on ~May 31 infrastructure-devops slack thread) and is a result of the following PR: #6329. Flagging @rljacob as an FYI.

Copy link
Contributor

@trhille trhille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ERS test does not pass on Chrysalis, but it sounds like this is a known issue not specific to this PR, so I'm approving.

@mperego
Copy link
Contributor

mperego commented Jun 28, 2024

@stephenprice @jonbob Can we go ahead and merge this PR or do we need to address the issue with Chrysalis?

@jonbob jonbob added the BFB PR leaves answers BFB label Jul 1, 2024
Copy link
Contributor

@matthewhoffman matthewhoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephenprice , I added a couple small questions for you to my review.

Additionally, I tested the 3 tests on Chrysalis.

  • SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.IGELM_MLI.chrysalis_gnu.mali-gis20km passed
  • ERS.ne30pg2_r05_IcoswISC30E3r5_gis20.IGELM_MLI.chrysalis_gnu.mali-gis20km passed except for the final operation, which is documented in this PR to be caused by other issues.
  • SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km failed to compile and gave the error below.

@jonbob or @rljacob , is this error familiar to you? I'm guessing maybe I need to rebase in order to get this merge from June 13?
* bfc6631a0b Merge branch 'dqwu/mach/new-bebop-soft' (PR #6414)


Build error for third test:

~/e3sm-gis/E3SM-update-gis20km-tests/cime/scripts$ ./create_test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km --project e3sm -g -o -b gis20.BGWCYCL1850.baseline.SMS --wait
create_test will do up to 1 tasks simultaneously
create_test will use up to 160 cores simultaneously
Creating test directory /lcrc/group/e3sm/ac.mhoffman/scratch/chrys/SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km.G.20240702_133048_vsw84i
RUNNING TESTS:
  SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km
Starting CREATE_NEWCASE for test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km with 1 procs
Finished CREATE_NEWCASE for test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km in 7.389073 seconds (PASS)
Starting XML for test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km with 1 procs
Finished XML for test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km in 0.590582 seconds (PASS)
Starting SETUP for test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km with 1 procs
Finished SETUP for test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km in 0.596137 seconds (FAIL). [COMPLETED 1 of 1]
    Case dir: /lcrc/group/e3sm/ac.mhoffman/scratch/chrys/SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km.G.20240702_133048_vsw84i
    Errors were:
        ERROR: module command /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/lmod-8.3-5be73rg/lmod/lmod/libexec/lmod python purge  failed with message:
        /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/lua-5.3.5-xp4uqzv/bin/lua: error loading module 'posix.ctype' from file '/gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/lua-luaposix-33.4.0-oe44dcc/lib/lua/5.3/posix.so':
        	/lib64/libcrypt.so.1: version `XCRYPT_2.0' not found (required by /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/lua-luaposix-33.4.0-oe44dcc/lib/lua/5.3/posix.so)
        stack traceback:
        	[C]: in ?
        	[C]: in function 'require'
        	...lua-luaposix-33.4.0-oe44dcc/share/lua/5.3/posix/init.lua:29: in main chunk
        	[C]: in function 'require'
        	...x86_64/gcc-9.3.0/lmod-8.3-5be73rg/lmod/lmod/libexec/lmod:61: in main chunk
        	[C]: in ?

Waiting for tests to finish
FAIL SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km (phase SETUP)
    Case dir: /lcrc/group/e3sm/ac.mhoffman/scratch/chrys/SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km.G.20240702_133048_vsw84i
test-scheduler took 8.624069690704346 seconds

@@ -0,0 +1 @@
config_am_globalstats_enable = .false.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephenprice , was this necessary to get the tests to run? It seems like it would be nice to have globalStats enabled in the tests unless there is problem doing so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. This is a short-term workaround to get the mali hist files recognized by the testing framework. If the .am files are included in the outputs, the testing framework does the comparison on those rather than the full hist. files. This is some quirk with the current testing infrastructure, which seems to have very limited support for including / accounting for MPAS component outputs. I've been communicating with @jasonb5 about this and this was one of the suggested workarounds to allow MALI hist files to be included in the SMS comparison tests, for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks, that's fine then. For these short tests, if the full output files are being compared, it's not that important that the globalStats files exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to delete this file, as well as the user_nl_mali file in the same directory? The PR description indicates these tests were moved from the ELM dir to the MALI dir, so are these copies in the ELM dir needed at all anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Yes, I think that is what I intended to do here. Let me look at that and re-test to confirm that everything still works when I make that change. Assuming it does, I'll push that correction. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this, re-tested, and the 20km SMS tests still work correctly, so I will push this update in a minute. Thanks again for catching that.

@jonbob
Copy link
Contributor

jonbob commented Jul 2, 2024

@stephenprice -- I also had to add the "+SGLC." to some of the layouts in

cime_config/testmods_dirs/config_pes_tests.xml

to avoid getting

        ERROR: More than one PE layout matches given PE specs

for SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km

@jonbob
Copy link
Contributor

jonbob commented Jul 2, 2024

@matthewhoffman -- I haven't seen that error before and didn't run into it testing today. I've added a PR to bring in a new version of gnu on chrysalis which may be helpful

@jonbob
Copy link
Contributor

jonbob commented Jul 2, 2024

@stephenprice -- everything passed for me on pm-cpu except SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.pm-cpu_gnu.allactive-gis20km. That one gives me Kokkos errors -- did it work for you?

@stephenprice
Copy link
Contributor Author

@jonbob -- I did not re-test on pm-cpu, just Chrys. My SMS tests on Chyrs all passed, but the ERS failed as per before. But this was just from my branch so I probably did not have the recent gnu update for Chyrs. I can re-test on pm-cpu if you like. I'll also check w/ @mperego to see if that's expected (B cases w/ MALI have failed in the past due to Kokkos conflicts between EAM and MALI).

@jonbob
Copy link
Contributor

jonbob commented Jul 2, 2024

I'll test again on chrysalis once the gnu update is one master. But the fully-coupled test did pass on chrysalis -- just having issues on pm-cpu. I'm getting some help to see if the installation needs some tweaking there

@mperego
Copy link
Contributor

mperego commented Jul 2, 2024

@stephenprice @jonbob, what Kokkos error do you see? We were not expecting this to fail. I think that we are using the same version of Kokkos on Chrysalis and pm-cpu, so they should behave similarly.

@mperego
Copy link
Contributor

mperego commented Jul 2, 2024

@jonbob if you update gnu on Chrysalis we might need to re-install Albany and Trilinos. It might be better to do so after this PR is merged.

@jonbob
Copy link
Contributor

jonbob commented Jul 2, 2024

Thanks @mperego -- I got this:

261: terminate called after throwing an instance of 'std::runtime_error'
261:   what():  Kokkos failed to allocate memory for label "hi".  Allocation using MemorySpace named "Host" failed with the following error:  Allocation of size 136 B failed, likely due to insufficient memory.  (The allocation mechanism was standard malloc().)
261:
261:
261: Program received signal SIGABRT: Process abort signal.

@stephenprice
Copy link
Contributor Author

@mperego -- the error Jon is pointing to above is the same error I'm getting when trying to run a BG case on Chyrs (I think his report is from pm-cpu). To be clear, the circumstances for this are a branch from my fork that was merged w/ master yesterday afternoon (so I'm not sure if it's got the gnu compiler update yet or not).

@jonbob
Copy link
Contributor

jonbob commented Jul 2, 2024

@mperego -- the current landice_developer suite runs fine on chrysalis with the gnu update, so we may not require re-installation before that PR goes in. But I'm rerunning a BG test to make sure it plays OK with the atm using Kokkos as well

@mperego
Copy link
Contributor

mperego commented Jul 2, 2024

@mperego -- the error Jon is pointing to above is the same error I'm getting when trying to run a BG case on Chyrs (I think his report is from pm-cpu). To be clear, the circumstances for this are a branch from my fork that was merged w/ master yesterday afternoon (so I'm not sure if it's got the gnu compiler update yet or not).

@stephenprice I thought you tried the BG case before on Chrysalis and it was working fine. Is this error new?

@jewatkins
Copy link
Contributor

This recent change (yesterday afternoon) might be why kokkos errors are showing up: c7d7998#diff-f8a781c6b7da95fb77cc79868123181ff46fc639c8fa9de5e60c1dfa8c15a659L280
I think if Kokkos_ROOT is not set, e3sm will use it's version of kokkos which is different from the old trilinos/kokkos that was being used before.

That was probably removed by eamxx because they were seeing openmp errors (see #6473). But the trilinos/kokkos situation was never fixed.

@jewatkins
Copy link
Contributor

The new pre-installs with kokkos 4.0 have been ready to go:
Chrysalis: /lcrc/group/e3sm/soft/
pmcpu: /global/common/software/e3sm
and they would probably work using Kokkos_ROOT but I think we have to figure out a new way to use pre-installed kokkos when landice is on (and also landice+eamxx?). I think @bartgol @jgfouca have already discussed this some so I'll defer to them to see what they suggest.

@mperego
Copy link
Contributor

mperego commented Jul 3, 2024

Ooh.. damn it, we were so close to have this merged...
@jonbob @stephenprice can we simply disable the BG tests, but leave all the required plumbing in there, and we can re-enable it with a subsequent PR by @jewatkins?

Thanks @jewatkins for finding the culprit. I looked at PR #6437 but didn't check what went to master recently.

@bartgol
Copy link
Contributor

bartgol commented Jul 3, 2024

How big of a deal is to disable the BG test? Jim is off this week, and I'll be off the next 2 weeks, so I can't interact with him for a while. But I'm sure he can handle patching things up to get the correct kokkos found by e3sm.

That said, the runtime error is quite bizarre. I can't formulate any theory on what could be wrong, and it's prob best to get kokkos versioning sorted out first, to rule out incompatibilities of that sort.

@stephenprice
Copy link
Contributor Author

Hi all. Sorry for the delay in responding. Let me do a bit more testing today and I'll report back (now that the gnu fix for Chrys is on master). The BG test has not been merged yet -- it's part of this PR -- so it should not (yet) show up as failing or need to be disabled. We could either hold off merging it or merge it w/ it disabled for the time being. I will defer to @jonbob on that.

@mperego
Copy link
Contributor

mperego commented Jul 3, 2024

@stephenprice , thanks. What I was suggesting is to go ahead and merge this PR with BG disabled. @jewatkins is planning to update the Albany/Trilinos libraries to use Kokkos 4.2 and I think he can enable BG when he submit his PR.

@stephenprice
Copy link
Contributor Author

That sounds fine to me if it works for @jonbob. I'm confused though if we're talking about new lib. support needed for both PM-cpu and Chrys, or just one or the other.

@mperego
Copy link
Contributor

mperego commented Jul 3, 2024

New Trilinos/Albany libraries are needed on both Chrysalis and pm-cpu. The current libraries use an older version of Kokkos and need to be updated.

@jonbob
Copy link
Contributor

jonbob commented Jul 3, 2024

@stephenprice -- the BG test is failing on both chrysalis and pm-cpu, so we need a fix for both platforms. We can remove the BG test temporarily and merge it on Monday? Or leave it in and just know that it will fail until the libraries are updated. @stephenprice -- we also need a couple of small patches to config_pes_tests.xml. I can push them to your branch or just get them to you

jonbob added a commit that referenced this pull request Jul 9, 2024
)

Update support for low-resolution Greenland tests

This PR updates and/or adds new testing support for IG and BG test cases
using an active, 20km Greenland ice sheet.

Updates & changes include:
* updated SMS test support so that mali hist files are included in
  baseline comparisons
* added a default PE layout for Chrys and pm-cpu for low-res, BG case
  with active 20km GIS
* simplified existing ERS & SMS IG case tests to use same testdef files
* updated all IG and BG tests to use v3 Icos ocean mesh
* moved IG tests out of elm component dirs and into mali component dirs

[BFB] for all currently tested configurations
@jonbob
Copy link
Contributor

jonbob commented Jul 9, 2024

Passes the new landice_developer suite on chrysalis, with the exception of

SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km

which is expected to fail until PR #6473 is completed

Also passes:

  • ERP_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.allactive-pioroot1

merged to next

@jonbob jonbob merged commit 04842fe into E3SM-Project:master Jul 10, 2024
20 checks passed
@mperego
Copy link
Contributor

mperego commented Jul 10, 2024

Thanks @jonbob !

@jonbob
Copy link
Contributor

jonbob commented Jul 10, 2024

merged to master and baselines requested for new tests

@stephenprice stephenprice deleted the stephenprice/glc/update-gis20km-tests branch September 19, 2024 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BFB PR leaves answers BFB Chrysalis mpas-albany-landice pm-cpu Perlmutter at NERSC (CPU-only nodes) Testing Anything related to unit/system tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants