Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Update hash of ufs weather model and input.nml/diag_table files #663

Merged
merged 6 commits into from
Mar 21, 2023

Conversation

chan-hoo
Copy link
Collaborator

@chan-hoo chan-hoo commented Mar 9, 2023

DESCRIPTION OF CHANGES:

  • Update the hash of the ufs weather model with the latest one as of 3/9/23.
  • Based on ufs-community/ufs-weather-model@74ec1da, the input namelist file and diag tables are updated to avoid failure.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • WE2E tests on Hera and WCOSS2:
    community_ensemble_2mems_stoch
    grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional
    grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
    grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
    grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8
    grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
    grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
    grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km
    MET_verification

  • Sample script test for AQM on Hera

  • hera.intel

  • orion.intel

  • cheyenne.intel

  • cheyenne.gnu

  • gaea.intel

  • jet.intel

  • wcoss2.intel

  • NOAA Cloud (indicate which platform)

  • Jenkins

  • fundamental test suite

  • comprehensive tests (specify which if a subset was used)

ISSUE:

Fixes issue mentioned in #661

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo Thank you very much for taking care of this! I'd suggest that the UPP be updated to the version used with the weather model. I have noted the version associated with the updated weather model hash. Once the UPP hash has been updated, I'll give my approval.

Externals.cfg Show resolved Hide resolved
Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo Thank you very much for updating the UPP hash as well! I will now approve these changes.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Mar 10, 2023
@MichaelLueken
Copy link
Collaborator

@chan-hoo The Cheyenne GNU test failed in the Jenkins tests. All other tests successfully passed. The failure was in the run_post task. I'm currently going through the various UPP commits to try and see when the failure began. Would it be possible for you to try rerunning the WE2E tests that you previously ran on Hera using the GNU compiler? You should be able to build the SRW by using ./devbuild.sh -p=hera -c=gnu. While submitting the WE2E tests, please make sure to use
compiler="gnu". I'd like to see if this is an issue only on Cheyenne GNU, or if it is an issue with GNU overall. Thanks!

@chan-hoo
Copy link
Collaborator Author

@MichaelLueken, a gnu test on hera failed too. I'll test with the latest hash of upp again.

@chan-hoo
Copy link
Collaborator Author

@MichaelLueken, the latest hash of upp failed too. It works well with intel though. I have no idea what happens with gnu. I think we need a help from others. Do you know who we can contact to resolve this gnu issue on run_post?

@MichaelLueken
Copy link
Collaborator

@chan-hoo Unfortunately, I'm not sure who is in charge of the UPP. During my daily stand-up this morning, I'll see if Jong might know who to contact about this issue with post and GNU compilers.

@MichaelLueken
Copy link
Collaborator

@chan-hoo No one in EPIC's CM team seems to know who to contact about the issue. Testing on Cheyenne shows that the same GNU-based double free or corruption error messages are being seen in hash e88dbea, which is the next hash following 2b2c84a, which is the hash currently used in the SRW App. Looking at PR #1529 in the weather model, the addition of the new winter weather diagnostics were placed in different areas of the various diag_table files. Could this be what is causing the problem? Once my current test wraps up, I want to try and test the current UPP hash, and I will also try changing the location of the winter weather diagnostics in the diag_table files. If nothing pans out, I will open an issue in the UPP repo.

@chan-hoo
Copy link
Collaborator Author

@MichaelLueken, based on ufs-community/ufs-weather-model@74ec1da, we might need to update the upp control file. I'll test this and then let you know.

@chan-hoo
Copy link
Collaborator Author

It didn't work. I got the same error (double free or corruption).

@mark-a-potts
Copy link
Collaborator

It didn't work. I got the same error (double free or corruption).

I have no idea if you are seeing the same sort of issue, but I ran into a double free issue with the UFS-WM that was happening because it was linking both the shared and static ESMF libraries during the build. You might want to check the link command for whatever executable is failing to see if something similar is happening here.

@chan-hoo
Copy link
Collaborator Author

@mark-a-potts, thank you for your comments. I hope the upp developers would fix this issue soon. :) @MichaelLueken, I tested it with the original hash (2b2c84a). It worked well. I think we'd be better not to update the hash of UPP in this PR. What do you think? If you agree, could you run the fundamental we2e tests again?

@MichaelLueken
Copy link
Collaborator

@chan-hoo I will go ahead and resubmit the Jenkins tests using the original hash. I'll continue to see what is happening with UPP and make them aware of the issue we are encountering.

@MichaelLueken
Copy link
Collaborator

@mark-a-potts Thanks for the suggestion! Unfortunately, looking at the linking step for upp.x for both 2b2c84a (the version of UPP currently used in develop that works with GNU) and b37f8ab (the version of UPP associated with the weather model at the updated hash), the linking is identical and ESMF isn't used. I have a few more potential avenues to visit, but this is a bit of a head scratcher.

@MichaelLueken
Copy link
Collaborator

@chan-hoo I was finally able to find a way to get the run_post tasks to run properly using the b37f8ab hash in UPP. As part of the ufs-weather-model's update in PR #1529, there were updates to parm/postxconfig-NT-fv3lam.txt, parm/postxconfig-NT.txt, and parm/postxconfig-NT_FH00.txt.

In scripts/exregional_run_post.sh, when I replaced:

post_config_fp="${PARMdir}/upp/postxconfig-NT-fv3lam.txt"

with:

post_config_fp="/glade/scratch/mlueken/ufs-srweather-app/sorc/ufs-weather-model/tests/parm/postxconfig-NT-fv3lam.txt"

in other words, replacing the postxconfig-NT-fv3lam.txt file from UPP with the version in the weather model, the GNU tests run without issue.

I'll open an issue in the UPP repo to see if something can be done in UPP develop to address the need to point to a parameter file in a different component to allow the post to run using GNU.

@chan-hoo
Copy link
Collaborator Author

@MichaelLueken, great! Thank you for finding it out. I tried adding the new parameters in the above PR you mentioned to the existing upp control file 'postxconfig-NT-fv3lam.txt' but my run failed. Maybe I missed something. BTW the fundamental tests failed again. Can you let me know which tests failed?

@MichaelLueken
Copy link
Collaborator

@chan-hoo The tests failed on Jet. The tests were launched before the machine went down for maintenance and timed out once Jet returned to production. Once Jet is recovered from the cooling issue over night, I will resubmit the Jenkins tests one more time. I'll also do a few more tests to make sure that the replacing of the postxconfig-NT-fv3lam.txt is all that is needed to make the updated UPP work (I have made several changes to various parts, including the upp control file, so I will see if just changing the control file will correct the issue, or if an additional change I made is also required to get run_post to work correctly).

@MichaelLueken
Copy link
Collaborator

@chan-hoo After backing out of all modifications except for pointing to the weather model's postxconfig-NT-fv3lam.txt rather than the version in UPP/parm, the run_post task is passing using UPP hash b37f8ab.

From this comment in the weather model's PR #1529, it looks like the issue is that a new UPP variable snow liquid ratio (SDEN) wasn't added to the diag_table files. So the UPP team needed to provide a temporary solution, adding control files to the weather model that removed SDEN. So, until SDEN is added to the diag_table files, we won't be able to point to the UPP parm files (as it so happens, SDEN was added to the control files in the PR that followed the current version that works).

Copy link
Collaborator

@BenjaminBlake-NOAA BenjaminBlake-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me, approving now.

Copy link
Collaborator

@RatkoVasic-NOAA RatkoVasic-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@MichaelLueken
Copy link
Collaborator

@chan-hoo Following discussions from last Friday's code management meeting, we should move forward with updating the UPP hash as well. There are two variables that can control what post control file is used - USE_CUSTOM_POST_CONFIG_FILE and CUSTOM_POST_CONFIG_FP.

Setting USE_CUSTOM_POST_CONFIG_FILE to true in ush/config_default.yaml, then setting CUSTOM_POST_CONFIG_FP to point to the modified post control file in the weather model will allow the GNU tests to run (at least on Cheyenne).

I can open a PR in your fork to get the necessary changes in, if you would like for me to do so.

@chan-hoo
Copy link
Collaborator Author

@MichaelLueken, please do that. I didn't attend the code managers' meeting due to the conflict with the EMC internal meeting, so I don't know the detail.

@MichaelLueken
Copy link
Collaborator

@chan-hoo I will open the PR in your fork once I complete the fundamental tests for Intel on Cheyenne (need to make sure that the update doesn't adversely affect Intel).

Update UPP hash and point to updated post control file in weather model via ush/config_defaults.yaml
@MichaelLueken
Copy link
Collaborator

@RatkoVasic-NOAA and @BenjaminBlake-NOAA - A few additional changes were merged into @chan-hoo's work to update the UPP hash and update the ush/config_defaults.yaml file to point to a post control file that isn't within the UPP/parm directory (instead, pointing to the ufs-weather-model/tests/parm, which holds a version of the file that works with UPP, ufs-weather-model, and the SRW). All of the Jenkins tests successfully ran to completion following this update. I'd like for you to take a moment to review these changes one last time to make sure that you are still okay with these modifications.

Thank you very much for your time!

Copy link
Collaborator

@RatkoVasic-NOAA RatkoVasic-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved

Copy link
Collaborator

@BenjaminBlake-NOAA BenjaminBlake-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@MichaelLueken MichaelLueken merged commit e317d02 into ufs-community:develop Mar 21, 2023
@chan-hoo chan-hoo deleted the bugfix/input branch April 5, 2023 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parm/input.nml.FV3 need to be updated for the recent GFS physic update
5 participants