Software module updates in hpc-stack for Hera (intel, gnu) #1468
Updating the location of a fresh installation of the hpc-stack modulefiles and miniconda3/4.12.0 version
Updating locations of newly installed hpc-stack modules and miniconda3/4.12.0
Updating miniconda3 module installation location with python3.9
@natalie-perlin I'm confused about whether this PR changes baselines or not. You report that a small subset of the RTs pass, but you've also checked the box saying that one or more tests have changed results. Can you run the entire intel and gnu baselines on a single platform with your PR and verify whether any test changes results? |
|
@natalie-perlin two baselines fail with intel: control_fhzero and control_CubedSphereGrid_parallel. |
@natalie-perlin gnu baselines are reproduced ok. |
Hi Jong,
Do you have the output of these two failed tests to look at?
Natalie.
*--*
*Natalie Perlin, Ph.D.*
*Sr. Systems Engineer*
*RedLine Performance Solutions, LLC*
***@***.*** ***@***.***>*
*(541) 231-2320*
|
/scratch1/NCEPDEV/stmp2/Jong.Kim/FV3_RT/rt_9126
-Jong
--
Jong Kim
Science and Technology Corp at NOAA-EPIC
***@***.***
Cell Phone: 630) 484-5053
|
Looking at the tests that appear to be failing:
1) control_CubedSphereGrid_parallel
Directory:
/scratch1/NCEPDEV/stmp2/Jong.Kim/FV3_RT/rt_9126/control_CubedSphereGrid_parallel/
The file ./out reports that the model run ended successfully (exit code 0):
Job 37471637 (not serial) finished for user Jong.Kim in partition hera with exit code 0:0
The file ./compare_ncfile.log complains about the missing numpy module:
Traceback (most recent call last):
  File "/scratch2/NCEPDEV/marine/Jong.Kim/UFS-RT/rt-1468-intel/tests/compare_ncfile.py", line 3, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
To use the miniconda-installed Python modules, a virtual environment needs to be activated (the regional_workflow environment installed along with miniconda3/4.12.0 contains numpy).
2) control_fhzero
Directory: /scratch1/NCEPDEV/stmp2/Jong.Kim/FV3_RT/rt_9126/control_fhzero/
The file ./out reports that the job finished successfully:
Job 37471638 (not serial) finished for user Jong.Kim in partition hera with exit code 0:0
The file ./compare_ncfile.log complains about the missing numpy Python module, as in test case 1.
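The triage above (model exit code is 0, but the comparison step cannot import numpy) could be automated roughly as follows. This is only a sketch: the function name `triage` and the returned messages are hypothetical, and it assumes every ./out file contains the same "exit code N:N" wording quoted in this comment.

```python
import re

# Matches the scheduler summary quoted above, e.g. "exit code 0:0".
# Assumption: all ./out files use this wording.
EXIT_RE = re.compile(r"exit code\s+(\d+):(\d+)")
# Matches the Python error reported in compare_ncfile.log.
MISSING_RE = re.compile(r"ModuleNotFoundError: No module named '([^']+)'")


def triage(out_text, compare_log_text):
    """Classify a failed regression test from the text of its two logs."""
    exit_match = EXIT_RE.search(out_text)
    if exit_match is None:
        return "no exit code found in ./out"
    code, signal = exit_match.group(1), exit_match.group(2)
    if (code, signal) != ("0", "0"):
        return f"model run failed with exit code {code}:{signal}"
    missing = MISSING_RE.search(compare_log_text)
    if missing:
        return f"model run ok; comparison failed: missing Python module {missing.group(1)}"
    return "model run ok; inspect comparison log"


# Sample inputs copied from the logs quoted in this comment.
sample_out = (
    "Job 37471637 (not serial) finished for user Jong.Kim "
    "in partition hera with exit code 0:0"
)
sample_log = (
    "Traceback (most recent call last):\n"
    "    import numpy as np\n"
    "ModuleNotFoundError: No module named 'numpy'\n"
)
print(triage(sample_out, sample_log))
```

A result that names a missing module points at the Python environment rather than at a change in model results, which is the distinction made for both tests here.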
|
Keeping the existing miniconda3/3.7.3 (not updating)
Keeping the existing miniconda3/3.7.3 (not updating to miniconda3/4.12.0)
Reverting to the existing path of miniconda3 and python3
I am not familiar with the scotch package. |
@DusanJovic-NOAA I had found this issue #1440 and someone suggested it was because the GNU on hera is 9.x but on cheyenne is 10.x. Is that a reasonable guess? It seemed odd to me that it would result in being 5x slower w/ the older gnu. But if true, then gnu 10.x would be preferable. |
I ran the control_p8 test using gnu 9.2 and openmpi 3.1.4 and it finished in 267 seconds. Looks like just switching from mpich to openmpi speeds up model execution almost 3x. Changing the optimization level from -O2 to -O3 speeds up the control_p8 test a little bit more; it now runs in 254 seconds. |
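For scale, the -O2 to -O3 gain can be worked out from the two timings reported above. This is a small sketch only; the helper name `percent_reduction` is made up for illustration, and the mpich timing is not computed because it is not given in this thread.

```python
def percent_reduction(before_s, after_s):
    """Return how much shorter after_s is than before_s, in percent."""
    return 100.0 * (before_s - after_s) / before_s


# Timings for control_p8 with gnu 9.2 + openmpi, as reported in the comment above.
o2_seconds = 267.0  # -O2
o3_seconds = 254.0  # -O3

reduction = percent_reduction(o2_seconds, o3_seconds)
speedup = o2_seconds / o3_seconds
print(f"-O3 run time is {reduction:.1f}% shorter than -O2 (speedup {speedup:.2f}x)")
```

The improvement from -O3 is a few percent, small next to the roughly 3x gain attributed to the MPI switch.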
@DeniseWorthen , @DusanJovic-NOAA - thank you for your comments and testing! @jkbk2004 - should we switch the compiler to gnu/10.2? |
Hera system admins didn't agree to install gnu10.2 since they prefer gnu12, but @ulmononian was able to install gnu10.2 through spack. We are achieving the goal with gnu9.2/openmpi3.1.4. We can test gnu10.2. |
@jkbk2004 @natalie-perlin gnu/9.2.0 is natively installed on Hera through RDHPCS. as jong mentions, the Hera team does not want to install an intermediate gnu (e.g., they will not install gnu/10.1 or gnu/10.2), only gnu/12. i would suggest moving forward with gnu/9.2.0-openmpi/3.1.4 given @DusanJovic-NOAA's report on the rt results, rather than relying on a non-native install of gnu on Hera (e.g. installed directly from a tarball or spack), which would be the only way to utilize anything newer than gnu/9.2.0. |
@ulmononian @jkbk2004 @DusanJovic-NOAA - |
@natalie-perlin noted. i also installed gnu/10.1.0 (via spack). will the ufs-wm CM accept compiler installations performed in this way (i.e., not by RDHPCS) for use in the official ufs-wm modulefiles? @jkbk2004 |
cpld_control_p8 fails with gnu/9.2.0 and openmpi/3.1.4:
I opened a Hera helpdesk ticket. |
reverting back to using gnu/9.2.0 + mpich/3.3.2
reverting back to using gnu/9.2 + mpich/3.3.2
@natalie-perlin your branch is 2 commits behind. Please, sync up. Then can you make a pr to @DeniseWorthen #1486? |
@jkbk2004 - synced with the develop. |
It would look much more transparent of the work that has been done on updating the hpc-stack locations (Issue-1465) when the changes from this PR on Hera compilers go from |
If this PR-1468 needs to be tested before it is merged, in order to work alongside another PR, it could be checked out in the following way:
when the
|
@jkbk2004 @BrianCurtis-NOAA Before combining PRs, we need to know that this PR has been tested and verified. I don't see that this has happened. Only a limited subset of tests appears to have been run, and that subset appears to have passed, yet the box that says "regression test results change" has been checked. |
@DeniseWorthen @BrianCurtis-NOAA I was able to test this pr ok on hera for both intel and gnu. @natalie-perlin can you go ahead to directly create a pr to #1486 branch? |
@jkbk2004 Then the checkmark that says "one or more regression tests change" should not be checked. |
@DeniseWorthen - unchecked the "regression test results change" |
…reading for cpld_bmark control and restart (was #1483); Software module updates in hpc-stack for Hera (intel, gnu) (was #1468) (#1486)
* update CMEPS submodule
* bmark cpld tests use esmf-managed threading by default
* remove version w/o esmf-managed threading
* update hera hpc stack locations: intel/gnu
Co-authored-by: Brian Curtis <brian.curtis@noaa.gov>
Co-authored-by: jkbk2004 <jong.kim@noaa.gov>
Co-authored-by: zach1221 <99902696+zach1221@users.noreply.github.com>
This pr was merged thru #1486. @DusanJovic-NOAA we will create another pr to add gnu/openmpi feature on hera. |
UPD. 07 Nov 2022:
By request, limiting the updates in the present PR-1468 to hpc-stack only. The miniconda3 updates are not included.
UPD2. 08 Nov 2022:
By request, added updates for the Hera system to use hpc-stack built with the gnu/9.2.0 compiler and mpich/3.3.2, installed in EPIC-managed space.
UPD3. 09 Nov 2022:
By request, changed the module to use a different gnu-mpi combination: the gnu/9.2.0 compiler and openmpi/3.1.4, installed in EPIC-managed space.
UPD4. 14 Nov 2022:
By request from @jkbk2004, reverting the hera_gnu option to use the gnu/9.2 compilers + mpich/3.3.2.
PR Checklist
This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.
This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR
An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes contained in the PR. The Issue and the dependent sub-component PR are specified below.
Issue 1465
Results for one or more of the regression tests change and the reasons for the changes are understood and explained below (see Testing section).
New or updated input data is required by this PR. If checked, please work with the code managers to update input data sets on all platforms.
Instructions: All subsequent sections of text should be filled in as appropriate.
Description
The updates were made to list new locations of the updated hpc-stack libraries and an updated miniconda3/4.12.0.
Updated files are the modulefiles:
and Python environment variables for the Hera system:
./tests/rt.sh
Issue(s) addressed
Link the issues to be closed with this PR, whether in this repository, or in another repository.
This PR addresses one of the issues in Issue-1465
Testing
The following regression tests have been run on Hera and reported as passed (OK):
Dependencies