CRTM issues encountered on new platforms (Derecho, Hercules, and Gaea C5) #916

Open
MichaelLueken opened this issue Sep 21, 2023 · 16 comments
Labels: bug (Something isn't working), EPIC Support Requested

Comments

@MichaelLueken
Collaborator

Expected behavior

Updating the UFS-WM hash to the version associated with PR #1823 causes the SRW App to either not run or fail on Derecho, Hercules, and Gaea C5. This hash updates the UPP to 520cc23, which requires changing the post configuration file from postxconfig-NT-fv3lam.txt to postxconfig-NT-fv3lam_rrfs.txt (postxconfig-NT-fv3lam.txt was removed from the UPP repository). The new postxconfig-NT-fv3lam_rrfs.txt file includes simulated radiances, which means that the CRTM must be run and the CRTM coefficients must be available.

Changes made to use the updated hashes must run successfully on all of the new platforms (Derecho, Hercules, Gaea C5).

Current behavior

On Hercules and Gaea C5, the path that would normally contain the CRTM coefficients is present, but no fix files are available in it:

FileNotFoundError: 
USE_CRTM has been set, but the external CRTM fix file directory:
CRTM_DIR = /work/noaa/epic/role-epic/contrib/hercules/hpc-stack/intel-2022.2.1/intel-oneapi-compilers-2022.2.1/intel-oneapi-mpi-2021.7.1/crtm/2.4.0/fix
could not be found.
FileNotFoundError: 
USE_CRTM has been set, but the external CRTM fix file directory:
CRTM_DIR = /lustre/f2/dev/role.epic/contrib/C5/hpc-stack/intel-classic-2023.1.0/intel-classic-2023.1.0/cray-mpich-8.1.25/crtm/2.4.0/fix
could not be found.
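A minimal sketch of a pre-flight check for the situation above, where the install path exists but the fix files are missing. The function name check_crtm_fix_dir and the "*.bin" pattern are assumptions made for illustration; this is not the SRW App's own check.

```python
from pathlib import Path


def check_crtm_fix_dir(crtm_dir, pattern="*.bin"):
    """Return the sorted coefficient file names found directly in crtm_dir.

    Hypothetical helper: raises FileNotFoundError if the directory is
    missing, or if it exists but holds no files matching `pattern`
    (assumed here to be "*.bin" coefficient files).
    """
    fix_dir = Path(crtm_dir)
    if not fix_dir.is_dir():
        raise FileNotFoundError(f"CRTM fix directory not found: {fix_dir}")
    files = sorted(p.name for p in fix_dir.glob(pattern) if p.is_file())
    if not files:
        raise FileNotFoundError(
            f"CRTM fix directory exists but holds no {pattern} files: {fix_dir}"
        )
    return files
```

Calling check_crtm_fix_dir on the CRTM_DIR value before launching the tests would distinguish a missing directory from an empty one.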

On Derecho, both inline and offline post are failing in the CRTM with the following error message:

forrtl: severe (122): invalid attempt to assign into a pointer that is not associated
Image              PC                Routine            Line        Source
ufs_model          00000000042221E5  crtm_forward_modu         356  CRTM_Forward_Module.f90
libiomp5.so        000014F8970FB053  __kmp_invoke_micr     Unknown  Unknown
libiomp5.so        000014F897069A64  __kmp_fork_call       Unknown  Unknown
libiomp5.so        000014F897023223  __kmpc_fork_call      Unknown  Unknown
ufs_model          0000000004221C83  crtm_forward_modu         353  CRTM_Forward_Module.f90
ufs_model          0000000003E8123B  calrad_wcloud_           1725  CALRAD_WCLOUD_newcrtm.f

Machines affected

Derecho, Hercules, and Gaea C5

Steps To Reproduce

  1. Clone my branch, feature/upp_2d_decomp: git clone -b feature/upp_2d_decomp git@github.com:MichaelLueken/ufs-srweather-app.git
  2. No changes are necessary to run on Derecho - compile and move on to step 5 below. Please follow steps 3 and 4 for Hercules and Gaea C5.
  3. Merge Natalie's develop_hercules and develop_gaea_c5 branches into my branch
  4. Add the paths for CRTM_DIR into the ush/machine/hercules|gaea_c5.yaml files:
    • Hercules - /work/noaa/epic/role-epic/contrib/hercules/hpc-stack/intel-2022.2.1/intel-oneapi-compilers-2022.2.1/intel-oneapi-mpi-2021.7.1/crtm/2.4.0/fix
    • Gaea C5 - /lustre/f2/dev/role.epic/contrib/C5/hpc-stack/intel-classic-2023.1.0/intel-classic-2023.1.0/cray-mpich-8.1.25/crtm/2.4.0/fix
  5. Run the fundamental WE2E test suite, ./run_WE2E_tests.py -t fundamental -m derecho|hercules|gaea_c5 -a NRAL0032|epic
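In step 5, the "|" separates alternatives: pick one machine and one account per run. A small sketch that builds the command for a single machine follows. The helper name we2e_command is hypothetical, and pairing NRAL0032 with derecho in the usage line is an assumption, since the step lists the accounts only as alternatives.

```python
import shlex


def we2e_command(machine, account, suite="fundamental"):
    """Build the argument list for one WE2E run (hypothetical helper).

    Only the -t, -m, and -a flags shown in step 5 are used.
    """
    return ["./run_WE2E_tests.py", "-t", suite, "-m", machine, "-a", account]


# Assumed pairing of account and machine, for illustration only.
print(shlex.join(we2e_command("derecho", "NRAL0032")))
```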

Detailed Description of Fix (optional)

  • Please add the CRTM coefficient files to the EPIC maintained locations noted above for Hercules and Gaea C5.
  • Additional work will likely be required to allow the CRTM to run properly on Derecho.
@MichaelLueken added the bug and EPIC Support Requested labels Sep 21, 2023
@mkavulich
Collaborator

I assume that the problems you are seeing on Derecho are a bug in UPP; it's probably worth opening an issue in that repository as well.

@MichaelLueken
Collaborator Author

Thanks, @mkavulich! I have opened issue #789 in the UPP repository.

@natalie-perlin
Collaborator

@MichaelLueken - I'll take care of verifying fix coefficients are installed in current locations on these platforms.

@MichaelLueken
Collaborator Author

Thanks, @natalie-perlin!

@MichaelLueken
Collaborator Author

@natalie-perlin - Out of curiosity, while running the GSI regression tests on Cheyenne, did you encounter any issues with the CRTM?

The fact that the post is failing in the CRTM's forward model only on Derecho suggests that something extra might need to be done when compiling the CRTM on that machine. If you had to change the CRTM build on Cheyenne to allow the GSI regression tests to run there, then the same changes will likely be needed to allow the post to work on Derecho.

@natalie-perlin
Collaborator

@MichaelLueken -
The CRTM fix files are now in the correct location on Hercules, Gaea C5, and Derecho.
What is the best way to test that the issues are resolved?

@MichaelLueken
Collaborator Author

@natalie-perlin - I'll attempt running the fundamental tests on Hercules, Gaea C5, and Derecho using my fork's feature/upp_2d_decomp branch. If you would like to try testing as well, you should be able to clone my branch that contains all the necessary changes for Hercules, Gaea C5, and Derecho by using the following command:

git clone -b feature/upp_2d_decomp git@github.com:MichaelLueken/ufs-srweather-app.git

@natalie-perlin
Collaborator

@MichaelLueken - testing for Derecho now

@natalie-perlin self-assigned this Oct 2, 2023
@MichaelLueken
Collaborator Author

@natalie-perlin - I can confirm that the SRW App successfully runs using the newly added CRTM coefficients on Hercules:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              19.09
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              17.79
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE              21.28
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              26.60
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              46.05
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              27.23
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              44.40
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             202.44

and on Gaea C5:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              20.28
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              29.75
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE              24.86
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              30.16
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              41.27
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              34.28
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              52.63
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             233.23

I'm finding that Derecho is still failing in the post, with the following error message in the CRTM_Forward_Module:

forrtl: severe (122): invalid attempt to assign into a pointer that is not associated

@natalie-perlin
Collaborator

@MichaelLueken -
Yes, the experiments are still failing on Derecho. I've done the following in an attempt to correct the issue:

  • Reinstalled CRTM fix files by downloading them from
    http://ftp.ssec.wisc.edu/pub/s4/CRTM/fix_REL-2.4.0_emc.tgz and unpacking into a single (flat) ./fix/ directory
  • Tarred up the directory on Hercules/Orion with the CRTM fix coefficients that are used in successful runs of the fundamental tests on Hercules, /work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/crtm/2.4.0/fix/, and copied and untarred them in the corresponding location for Derecho, /glade/work/epicufsrt/contrib/derecho/hpc-stack/intel-classic-2023.0.0/intel-classic-2023.0.0/cray-mpich-8.1.25/crtm/2.4.0/fix

In all cases, the tests failed.
Any ideas on what the next step could be to debug the issue?

@MichaelLueken
Collaborator Author

@natalie-perlin -
With the error message in the logs, I don't think that the issue is with the coefficient files on Derecho. Looking at line 356 in CRTM_Forward_Module.f90, I see the following:
Opt = Default_Options

In the function, both Default_Options and Opt are declared as derived-type variables, not pointers:
TYPE(CRTM_Options_type) :: Default_Options, Opt

It is unclear to me why the CRTM is failing due to an invalid attempt to assign into a pointer that is not associated, since neither variable is a pointer.

At this point, I would recommend attempting to rebuild the CRTM library. I will also reach out to Ben Johnson, who I believe is still the code manager for the CRTM at JCSDA, to see if he has encountered this type of issue before. I will CC you as well so that you are kept in the loop.

@MichaelLueken
Collaborator Author

@natalie-perlin -
It looks like Ben's suggestion would be to go into CRTM_Forward_Module.f90 and make the following changes:

  • On line 950, add Opt - CALL Post_Process_RTSolution(Opt,RTSolution(ln,m), &
  • On line 968, add Opt - CALL Post_Process_RTSolution(Opt,RTSolution(ln,m), &
  • On line 1012, add Opt - SUBROUTINE Post_Process_RTSolution(Opt,rts, &
  • On line 1016, add - TYPE(CRTM_Options_Type), INTENT(IN) :: Opt

Unfortunately, I don't have access to the epicufsrt account on Cheyenne/Derecho. If you would like, please let me know once the modifications have been made, then I will look over the changes before you rebuild the CRTM.

@natalie-perlin
Collaborator

@MichaelLueken - this suggested fix did not seem to make a difference. Similarly, crtm/2.4.0 downloaded from JCSDA/crtm, built with or without the "Opt" fix, resulted in the same error. In summary, the configurations tested were:

  • current crtm/2.4.0 from NOAA-EMC/crtm, with the fix implemented;
  • crtm/2.4.0 downloaded from JCSDA/crtm and built without the fix;
  • crtm/2.4.0 from JCSDA/crtm with the fix.

@MichaelLueken
Collaborator Author

@natalie-perlin - Hopefully Ben can think of something else to try. It didn't look like the issue was with Post_Process_RTSolution, since the failure appears to be happening before any calls to that subroutine. For as nicely documented and cleanly coded as the CRTM is, it is prone to compiler issues.

@MichaelLueken
Collaborator Author

To circumvent this issue and to allow the ufs-weather-model and UPP hashes to be brought up to date, the postxconfig-NT-fv3lam.txt file found in ufs-weather-model/tests/parm will be used in lieu of postxconfig-NT-fv3lam_rrfs.txt. Once the SRW App transitions to spack-stack on Derecho, hopefully the CRTM issue will be corrected and we can move forward with using the postxconfig-NT-fv3lam_rrfs.txt file from UPP.

@MichaelLueken
Collaborator Author

The CRTM issue that was affecting Derecho has been rectified. The WE2E tests now successfully run through to completion on the machine while using the postxconfig-NT-rrfs.txt post configuration file, setting USE_CRTM to true, and providing the path to the CRTM coefficient location (CRTM_DIR).

However, a new issue has occurred (related to CRTM). On Hera, using GNU-built executables, the inline post WE2E test fails because the post is unable to read in the CRTM coefficient files. It fails with the following error messages:

 Check_Binary_File(FAILURE) : Data file needs to be byte-swapped.
 Open_Binary_File(FAILURE) : Error checking imgr_g15.SpcCoeff.bin file byte order
 SpcCoeff_ReadFile(Binary)(FAILURE) : Error opening imgr_g15.SpcCoeff.bin
 CRTM_SpcCoeff_Load(FAILURE) : Error reading SpcCoeff file #1, imgr_g15.SpcCoeff.bin; Process ID: 0
 CRTM_Init(FAILURE) : Error loading SpcCoeff data; Process ID: 0
 ERROR*** crtm_init error_status=      3

I have opened issue #2537 in the ufs-weather-model repository detailing this issue.
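For anyone wanting to inspect a coefficient file by hand, here is a rough sketch that guesses the byte order from the first four bytes. It assumes the file is a Fortran sequential unformatted file with 4-byte record-length markers; that assumption does not come from this thread, and this is not the logic of the CRTM's own Check_Binary_File. Both function names are hypothetical.

```python
import struct
from pathlib import Path


def guess_record_byte_order(first_bytes, file_size):
    """Guess endianness from a leading 4-byte record-length marker.

    Assumption: the file starts with a record laid out as
    marker(4 bytes) + payload + marker(4 bytes). The marker is read in
    both byte orders, and the reading that fits inside the file wins.
    """
    if len(first_bytes) < 4:
        raise ValueError("need at least 4 bytes")
    little = struct.unpack("<i", first_bytes[:4])[0]
    big = struct.unpack(">i", first_bytes[:4])[0]

    def plausible(n):
        # The payload must be non-empty and fit between the two markers.
        return 0 < n <= file_size - 8

    if plausible(little) and not plausible(big):
        return "little"
    if plausible(big) and not plausible(little):
        return "big"
    return "unknown"


def guess_file_byte_order(path):
    """Apply the guess to a file on disk (hypothetical helper)."""
    p = Path(path)
    with p.open("rb") as f:
        head = f.read(4)
    return guess_record_byte_order(head, p.stat().st_size)
```

If the guessed order differs from the native order of the machine running the post, that would be consistent with the "needs to be byte-swapped" message, though it does not by itself identify the cause.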

3 participants