-
Notifications
You must be signed in to change notification settings - Fork 371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update support for low-resolution Greenland tests #6445
Update support for low-resolution Greenland tests #6445
Conversation
Remove 'exclude_testing' attribute for MALI to allow MALI generation and comparison of MALI hist files for SMS test. Add nl specs for SMS and ERS tests so that hist.am files are NOT written out as part of testing (allows hist files to be compared rather than hist.am files).
Added defulat PE layout for BG case using GIS 20km for Chrys and PM-cpu. Added some additional explicit stub GLC specs to various new PE layouts (so that these aren't accidentally 'found' and chosen for other layouts w/ an active GLC).
Add BGWCYCL1850 test with 20km GIS; move location of existing IGELM_MLI tests and support from ./elm component dir to ./mpas-albany-landice dir; simplify tests so that they all use the same user_nl_mali and shell scripts.
Updated tests associated with this PR can be run using:
For Perlmutter, replace "chysalis_gnu" with "pm-cpu_gnu". Note that for SMS tests, new baselines should first be generated using "-g -o" flags before comparing using "-c" flag. |
@stephenprice -- can you change the title/description to not include the branch name? |
Hi @jonbob. Apologies for the long delay. Just getting back to this now. Title changed to something sensible now (sorry for missing that first time around). |
@matthewhoffman and @trhille please review. |
Testing on Perlmutter:
Testing on Chrysalis:
It looks like the ERS test is failing on Chrysalis, as there is no glc.log file for the restart. However, I'm not able to find any other indication of it failing, so maybe I'm missing something. |
The ERS test that passes fine on PM-cpu is failing on Chyrs at the point it is supposed to be copying over the relevant history files from the restart case, for comparison to those already save for the 'base' case. The error, that's coming from
(Edited) Jon Wolfe noted that this problem has already been reported and discussed (on ~May 31 infrastructure-devops slack thread) and is a result of the following PR: #6329. Flagging @rljacob as an FYI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ERS test does not pass on Chrysalis, but it sounds like this is a known issue not specific to this PR, so I'm approving.
@stephenprice @jonbob Can we go ahead and merge this PR or do we need to address the issue with Chrysalis? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephenprice , I added a couple small questions for you to my review.
Additionally, I tested the 3 tests on Chrysalis.
SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.IGELM_MLI.chrysalis_gnu.mali-gis20km
passedERS.ne30pg2_r05_IcoswISC30E3r5_gis20.IGELM_MLI.chrysalis_gnu.mali-gis20km
passed except for the final operation, which is documented in this PR to be caused by other issues.SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km
failed to compile and gave the error below.
@jonbob or @rljacob , is this error familiar to you? I'm guessing maybe I need to rebase in order to get this merge from June 13?
* bfc6631a0b Merge branch 'dqwu/mach/new-bebop-soft' (PR #6414)
Build error for third test:
~/e3sm-gis/E3SM-update-gis20km-tests/cime/scripts$ ./create_test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km --project e3sm -g -o -b gis20.BGWCYCL1850.baseline.SMS --wait
create_test will do up to 1 tasks simultaneously
create_test will use up to 160 cores simultaneously
Creating test directory /lcrc/group/e3sm/ac.mhoffman/scratch/chrys/SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km.G.20240702_133048_vsw84i
RUNNING TESTS:
SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km
Starting CREATE_NEWCASE for test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km with 1 procs
Finished CREATE_NEWCASE for test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km in 7.389073 seconds (PASS)
Starting XML for test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km with 1 procs
Finished XML for test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km in 0.590582 seconds (PASS)
Starting SETUP for test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km with 1 procs
Finished SETUP for test SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km in 0.596137 seconds (FAIL). [COMPLETED 1 of 1]
Case dir: /lcrc/group/e3sm/ac.mhoffman/scratch/chrys/SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km.G.20240702_133048_vsw84i
Errors were:
ERROR: module command /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/lmod-8.3-5be73rg/lmod/lmod/libexec/lmod python purge failed with message:
/gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/lua-5.3.5-xp4uqzv/bin/lua: error loading module 'posix.ctype' from file '/gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/lua-luaposix-33.4.0-oe44dcc/lib/lua/5.3/posix.so':
/lib64/libcrypt.so.1: version `XCRYPT_2.0' not found (required by /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/lua-luaposix-33.4.0-oe44dcc/lib/lua/5.3/posix.so)
stack traceback:
[C]: in ?
[C]: in function 'require'
...lua-luaposix-33.4.0-oe44dcc/share/lua/5.3/posix/init.lua:29: in main chunk
[C]: in function 'require'
...x86_64/gcc-9.3.0/lmod-8.3-5be73rg/lmod/lmod/libexec/lmod:61: in main chunk
[C]: in ?
Waiting for tests to finish
FAIL SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km (phase SETUP)
Case dir: /lcrc/group/e3sm/ac.mhoffman/scratch/chrys/SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km.G.20240702_133048_vsw84i
test-scheduler took 8.624069690704346 seconds
@@ -0,0 +1 @@ | |||
config_am_globalstats_enable = .false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephenprice , was this necessary to get the tests to run? It seems like it would be nice to have globalStats enabled in the tests unless there is problem doing so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. This is a short-term workaround to get the mali hist files recognized by the testing framework. If the .am files are included in the outputs, the testing framework does the comparison on those rather than the full hist. files. This is some quirk with the current testing infrastructure, which seems to have very limited support for including / accounting for MPAS component outputs. I've been communicating with @jasonb5 about this and this was one of the suggested workarounds to allow MALI hist files to be included in the SMS comparison tests, for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thanks, that's fine then. For these short tests, if the full output files are being compared, it's not that important that the globalStats files exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to delete this file, as well as the user_nl_mali
file in the same directory? The PR description indicates these tests were moved from the ELM dir to the MALI dir, so are these copies in the ELM dir needed at all anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Yes, I think that is what I intended to do here. Let me look at that and re-test to confirm that everything still works when I make that change. Assuming it does, I'll push that correction. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this, re-tested, and the 20km SMS tests still work correctly, so I will push this update in a minute. Thanks again for catching that.
@stephenprice -- I also had to add the "+SGLC." to some of the layouts in
to avoid getting
for SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.chrysalis_gnu.allactive-gis20km |
@matthewhoffman -- I haven't seen that error before and didn't run into it testing today. I've added a PR to bring in a new version of gnu on chrysalis which may be helpful |
@stephenprice -- everything passed for me on pm-cpu except SMS.ne30pg2_r05_IcoswISC30E3r5_gis20.BGWCYCL1850.pm-cpu_gnu.allactive-gis20km. That one gives me Kokkos errors -- did it work for you? |
@jonbob -- I did not re-test on pm-cpu, just Chrys. My SMS tests on Chyrs all passed, but the ERS failed as per before. But this was just from my branch so I probably did not have the recent gnu update for Chyrs. I can re-test on pm-cpu if you like. I'll also check w/ @mperego to see if that's expected (B cases w/ MALI have failed in the past due to Kokkos conflicts between EAM and MALI). |
I'll test again on chrysalis once the gnu update is one master. But the fully-coupled test did pass on chrysalis -- just having issues on pm-cpu. I'm getting some help to see if the installation needs some tweaking there |
@stephenprice @jonbob, what Kokkos error do you see? We were not expecting this to fail. I think that we are using the same version of Kokkos on Chrysalis and pm-cpu, so they should behave similarly. |
@jonbob if you update gnu on Chrysalis we might need to re-install Albany and Trilinos. It might be better to do so after this PR is merged. |
Thanks @mperego -- I got this:
|
@mperego -- the error Jon is pointing to above is the same error I'm getting when trying to run a BG case on Chyrs (I think his report is from pm-cpu). To be clear, the circumstances for this are a branch from my fork that was merged w/ master yesterday afternoon (so I'm not sure if it's got the gnu compiler update yet or not). |
@mperego -- the current landice_developer suite runs fine on chrysalis with the gnu update, so we may not require re-installation before that PR goes in. But I'm rerunning a BG test to make sure it plays OK with the atm using Kokkos as well |
@stephenprice I thought you tried the BG case before on Chrysalis and it was working fine. Is this error new? |
This recent change (yesterday afternoon) might be why kokkos errors are showing up: c7d7998#diff-f8a781c6b7da95fb77cc79868123181ff46fc639c8fa9de5e60c1dfa8c15a659L280 That was probably removed by eamxx because they were seeing openmp errors (see #6473). But the trilinos/kokkos situation was never fixed. |
The new pre-installs with kokkos 4.0 have been ready to go: |
Ooh.. damn it, we were so close to have this merged... Thanks @jewatkins for finding the culprit. I looked at PR #6437 but didn't check what went to master recently. |
How big of a deal is to disable the BG test? Jim is off this week, and I'll be off the next 2 weeks, so I can't interact with him for a while. But I'm sure he can handle patching things up to get the correct kokkos found by e3sm. That said, the runtime error is quite bizarre. I can't formulate any theory on what could be wrong, and it's prob best to get kokkos versioning sorted out first, to rule out incompatibilities of that sort. |
Hi all. Sorry for the delay in responding. Let me do a bit more testing today and I'll report back (now that the gnu fix for Chrys is on master). The BG test has not been merged yet -- it's part of this PR -- so it should not (yet) show up as failing or need to be disabled. We could either hold off merging it or merge it w/ it disabled for the time being. I will defer to @jonbob on that. |
@stephenprice , thanks. What I was suggesting is to go ahead and merge this PR with BG disabled. @jewatkins is planning to update the Albany/Trilinos libraries to use Kokkos 4.2 and I think he can enable BG when he submit his PR. |
That sounds fine to me if it works for @jonbob. I'm confused though if we're talking about new lib. support needed for both PM-cpu and Chrys, or just one or the other. |
New Trilinos/Albany libraries are needed on both Chrysalis and pm-cpu. The current libraries use an older version of Kokkos and need to be updated. |
@stephenprice -- the BG test is failing on both chrysalis and pm-cpu, so we need a fix for both platforms. We can remove the BG test temporarily and merge it on Monday? Or leave it in and just know that it will fail until the libraries are updated. @stephenprice -- we also need a couple of small patches to config_pes_tests.xml. I can push them to your branch or just get them to you |
) Update support for low-resolution Greenland tests This PR updates and/or adds new testing support for IG and BG test cases using an active, 20km Greenland ice sheet. Updates & changes include: * updated SMS test support so that mali hist files are included in baseline comparisons * added a default PE layout for Chrys and pm-cpu for low-res, BG case with active 20km GIS * simplified existing ERS & SMS IG case tests to use same testdef files * updated all IG and BG tests to use v3 Icos ocean mesh * moved IG tests out of elm component dirs and into mali component dirs [BFB] for all currently tested configurations
Passes the new landice_developer suite on chrysalis, with the exception of
which is expected to fail until PR #6473 is completed Also passes:
merged to next |
Thanks @jonbob ! |
merged to master and baselines requested for new tests |
This PR updates and/or adds new testing support for IG and BG test
cases using an active, 20km Greenland ice sheet.
Updates & changes include: