CRTM issues encountered on new platforms (Derecho, Hercules, and Gaea C5) #916
Comments
I assume that the problems you are seeing on Derecho are a bug in UPP; it's probably worth opening an issue in that repository as well.
Thanks, @mkavulich! I have opened issue #789 in the UPP repository.
@MichaelLueken - I'll take care of verifying that the fix coefficients are installed in the current locations on these platforms.
Thanks, @natalie-perlin!
@natalie-perlin - Out of curiosity, while running the GSI regression tests on Cheyenne, did you encounter any issues with the CRTM? The fact that the post is failing in the CRTM's forward model only on Derecho suggests that something extra might need to be done while compiling the CRTM on the machine. If you had to make changes to the CRTM build on Cheyenne to allow the GSI regression tests to run there, then the same changes will likely be needed to allow the post to work on Derecho.
@MichaelLueken - |
@natalie-perlin - I'll attempt running the fundamental tests on Hercules, Gaea C5, and Derecho using my fork's
|
@MichaelLueken - testing for Derecho now |
@natalie-perlin - I can confirm that the SRW App successfully runs using the newly added CRTM coefficients on Hercules:
and on Gaea C5:
I'm finding that Derecho is still failing in the post with the following error message in the CRTM_Forward_Module:
|
@MichaelLueken -
The tests failed in all cases.
@natalie-perlin - In the function, both It is unclear to me why the CRTM is failing due to an invalid attempt to assign into a pointer that is not associated, since neither are pointers. At this point, I would recommend attempting to rebuild the CRTM library. I will also reach out to Ben Johnson, who I believe is still the code manager for the CRTM at JCSDA, to see if he has encountered this type of issue before. I will CC you as well so that you are kept in the loop.
@natalie-perlin -
Unfortunately, I don't have access to the epicufsrt account on Cheyenne/Derecho. If you would like, please let me know once the modifications have been made, and I will look over the changes before you rebuild the CRTM.
@MichaelLueken - this suggested fix did not seem to make a difference. Similarly, crtm/2.4.0 downloaded from JCSDA/crtm, whether built with or without the "Opt" fix, resulted in the same error. In summary, things tested:
|
@natalie-perlin - Hopefully Ben can think of something else to try. It didn't look like the issue was with Post_Process_RTSolution, since the failure appears to be happening before any calls to that subroutine. For as nicely documented and cleanly coded as the CRTM is, it is prone to compiler issues.
To circumvent this issue and to allow the ufs-weather-model and UPP hashes to be brought up-to-date, the use of the |
The CRTM issue that was affecting Derecho has been rectified. The WE2E tests now successfully run through to completion on the machine while using the However, a new issue has occurred (related to CRTM). On Hera, using GNU-built executables, the inline post WE2E test fails because the post is unable to read in the CRTM coefficient files. It fails with the following error messages:
I have opened issue #2537 in the |
Expected behavior
Updating the UFS-WM hash to the version associated with PR #1823 is causing the SRW App to either not run or fail on Derecho, Hercules, and Gaea C5. This hash updated the UPP to 520cc23, which requires changing the postxconfig-NT-fv3lam.txt post configuration file to postxconfig-NT-fv3lam_rrfs.txt (postxconfig-NT-fv3lam.txt was removed from the UPP repository). The new postxconfig-NT-fv3lam_rrfs.txt file includes simulated radiances, which means that the CRTM needs to be run and CRTM coefficients need to be made available.
Changes which were made to use the updated hashes must successfully run on all of the new platforms (Derecho, Hercules, Gaea C5).
Current behavior
On Hercules and Gaea C5, while the path that would normally contain the CRTM coefficients is present, there are no fix files available:
On Derecho, both inline and offline post are failing in the CRTM with the following error message:
Machines affected
Derecho, Hercules, and Gaea C5
Steps To Reproduce
feature/upp_2d_decomp, git clone -b feature/upp_2d_decomp git@github.com:MichaelLueken/ufs-srweather-app.git
develop_hercules and develop_gaea_c5 branches into my branch
ush/machine/hercules|gaea_c5.yaml files:
/work/noaa/epic/role-epic/contrib/hercules/hpc-stack/intel-2022.2.1/intel-oneapi-compilers-2022.2.1/intel-oneapi-mpi-2021.7.1/crtm/2.4.0/fix
/lustre/f2/dev/role.epic/contrib/C5/hpc-stack/intel-classic-2023.1.0/intel-classic-2023.1.0/cray-mpich-8.1.25/crtm/2.4.0/fix
./run_WE2E_tests.py -t fundamental -m derecho|hercules|gaea_c5 -a NRAL0032|epic
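As a quick way to confirm the "path present, but no fix files" symptom described under Current behavior, a minimal Python sketch along the following lines could report whether each CRTM fix directory exists and whether it contains any files. The helper names count_fix_files and report are hypothetical and are not part of the SRW App, UPP, or CRTM; the two paths are the ones quoted in the steps above and would need adjusting for other installations.

```python
from pathlib import Path

# Fix directories quoted in the steps above (assumed to be the intended
# CRTM coefficient locations; adjust for your own installation).
CRTM_FIX_DIRS = {
    "hercules": "/work/noaa/epic/role-epic/contrib/hercules/hpc-stack/"
                "intel-2022.2.1/intel-oneapi-compilers-2022.2.1/"
                "intel-oneapi-mpi-2021.7.1/crtm/2.4.0/fix",
    "gaea_c5": "/lustre/f2/dev/role.epic/contrib/C5/hpc-stack/"
               "intel-classic-2023.1.0/intel-classic-2023.1.0/"
               "cray-mpich-8.1.25/crtm/2.4.0/fix",
}


def count_fix_files(fix_dir):
    """Return the number of regular files under fix_dir, or None if the
    directory does not exist. (Hypothetical helper for illustration.)"""
    path = Path(fix_dir)
    if not path.is_dir():
        return None
    return sum(1 for entry in path.rglob("*") if entry.is_file())


def report(fix_dirs):
    """Build one status line per machine. (Hypothetical helper.)"""
    lines = []
    for machine, fix_dir in fix_dirs.items():
        count = count_fix_files(fix_dir)
        if count is None:
            status = "directory missing"
        elif count == 0:
            status = "directory present but empty"
        else:
            status = f"{count} file(s) found"
        lines.append(f"{machine}: {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report(CRTM_FIX_DIRS)))
```

Running this on each platform before launching the WE2E tests would distinguish a missing directory from an empty one. It does not verify that the files present are the coefficients the post actually needs.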
Detailed Description of Fix (optional)