Many WE2E tests do not run successfully with top of develop branch #673
Comments
## DESCRIPTION OF CHANGES:
A couple of fixes to get the workflow running on Cheyenne.
- Remove `module purge` from load_modules_run_task.sh. This no longer causes failures on Cheyenne due to intervening PR #650, but it should be removed anyway as it can cause future issues.
- Fixing the number of processors used in the mpirun command for the weather model on Cheyenne. I am honestly not sure how this was ever working, but this change fixes nearly all of the runtime failures currently seen on Cheyenne.

## TESTS CONDUCTED:
### Cheyenne
Ran a set of WE2E tests on Cheyenne, chosen mostly at random to save core hours (I did ensure that a variety of domains were run so that several different MPI layouts were tested). Most tasks succeed, and all failures (aside from one walltime issue) are also tests that fail on Hera with the current develop branch. See issue #673 for more details.

**Successful tests:**
- grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
- grid_GSD_HRRR_AK_50km_ics_RAP_lbcs_RAP_suite_GSD_SAR
- grid_RRFS_CONUS_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_HRRR
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
- grid_RRFS_CONUS_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta

**Unsuccessful tests:**
- All gfdlmp tests (grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional, grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp)
- grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16
- GST_release_public_v1 - Hit walltime limit

### Hera, Jet, and Orion
Ran the same set of tests on Hera, Jet, and Orion, with similar results. On Hera the GST successfully completed (though was close to reaching the walltime limit). On Jet, a few tests (grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR, grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR, grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta) failed due to missing initial and/or lateral boundary conditions. On Orion, even more tests failed due to missing ICs and LBCs (grid_GSD_HRRR_AK_50km_ics_RAP_lbcs_RAP_suite_GSD_SAR, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR, grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp, grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16).

**To summarize, the only test failures were those that were also seen in develop, and mostly due to missing input files on those platforms.**

## DEPENDENCIES:
This will need to be merged prior to ufs-community/ufs-srweather-app#206

## ISSUE:
#663 has technically already been resolved, but this will fully address that specific issue.
@mkavulich I was able to reproduce this issue. All of the end to end tests you listed as failing above failed for me as well, with the same tasks failing. My record of this can be found on Hera:
Moving forward, standardizing a suite of tests is crucial to establish a testing framework. Knowing which of these tests we can expect to pass and which we can expect to fail is imperative in setting up a robust continuous integration environment.
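One way to make "expected to pass" versus "expected to fail" concrete is to compare each run against a recorded baseline of known failures. The following is only a minimal sketch of that idea, not part of the workflow: the function name `compare_to_baseline`, the bucket names, and the example outcomes are all hypothetical, and only the test names themselves are taken from this thread.

```python
from typing import Dict, List, Set


def compare_to_baseline(results: Dict[str, bool],
                        known_failures: Set[str]) -> Dict[str, List[str]]:
    """Sort each test into a bucket relative to a known-failure baseline.

    results maps a test name to True (passed) or False (failed).
    """
    report: Dict[str, List[str]] = {
        "passed": [],         # passed and not in the baseline
        "known_failure": [],  # failed and listed in the baseline
        "regression": [],     # failed but NOT listed in the baseline
        "fixed": [],          # passed although listed in the baseline
    }
    for name in sorted(results):
        if results[name]:
            bucket = "fixed" if name in known_failures else "passed"
        else:
            bucket = "known_failure" if name in known_failures else "regression"
        report[bucket].append(name)
    return report


# Hypothetical baseline and outcomes, for illustration only (not real results).
KNOWN_FAILURES = {"grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16"}
example_results = {
    "grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16": True,
    "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16": False,
    "grid_RRFS_CONUS_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta": False,
}
report = compare_to_baseline(example_results, KNOWN_FAILURES)
```

In a CI setting, only a non-empty `regression` bucket would need to fail the build, while `fixed` entries would signal that the baseline can be trimmed.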
Just to chime in with a bit of information about tests on Jet: there appears to be no HRRR LBCS data on Jet, so the tests configured to use it cannot run.
With the latest version of ufs-srweather-app (ufs-community/ufs-srweather-app@89a0793, RW: 2718b62), the test list has changed, as has the list of failing tests. I have documented the new list of failures below (all tests on Hera), along with the current successful tests for completeness' sake. Full test logs are available on Hera here: /scratch2/BMC/det/kavulich/workdir/update_static_data_locations/expt_dirs
Successful tests (58)
Failures (18)
@mkavulich for the last failure noted ( &fms2_io_nml |
@mkavulich I noted here that 9 of the 18 tests involving the |
@mkavulich, PR #706 will fix the failures on
, and I expect that the failures on the following two tests will be resolved once PRs #704 and #706 are merged
## DESCRIPTION OF CHANGES:
Several paths in the machine-specific files point to locations in user paths or old locations of static data. This PR updates paths of static data in regional_workflow/ush/machine/ to point to the official, centralized locations on Cheyenne, Hera, and Jet.

## TESTS CONDUCTED:
Ran the following suite of end-to-end tests on Cheyenne and Jet prior to the latest ufs-weather-model hash update. All passed. This list of tests was chosen because all of these tests are known to succeed on all tested platforms, and it covers a variety of input and boundary condition types.
- grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
- grid_RRFS_CONUS_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_HRRR
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
- grid_RRFS_CONUS_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta

On Hera, I ran tests with the latest SRW hash, which included the updated weather model. Because of this, many tests could not be generated due to using old, removed CCPP suites (see issue #668). To get around this issue, I tested with the fixes from #697 incorporated into my branch. With those extra commits, all "get_extrn_ics" and "get_extrn_lbcs" tasks completed successfully, which indicates that all data is in its correct place.

## ISSUE (optional):
Will resolve a few of the issues in #673; many remain, however.
Using the latest regional_workflow hash (which has not yet been added to the SRW develop branch), the vast majority of tests now pass on Hera.
There are four tests that still fail on Hera:
I will close this issue and open individual issues for the remaining failed tests.
Description
Because I was unsure about failures I was seeing in testing for a PR, I ran the entire suite of tests on Hera. I found that with the top of the develop branch, a large number of tests fail on Hera (as well as other platforms). These tests were run with ufs-srweather-app hash 4120a9bf and compiled with default Intel compilers unless otherwise noted. The Hera tests can be found on disk at
/scratch2/BMC/det/kavulich/workdir/run_all_tests/expt_dirs
Hera failures
Here are the tests that are failing on Hera with the 4120a9bf develop hash, including a brief description of the failure. There is an impressively wide variety of failure modes.
invalid reference to variable in NAMELIST input
FATAL from PE 4: compute_qs: saturation vapor pressure table overflow, nbad= 1
*** FATAL ERROR: packing type 40000 not supported ***
FATAL from PE 0: NetCDF: One or more variable sizes violate format constraints: set_netcdf_mode
Will provide more details on other platforms later as I have time.
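With this many distinct failure modes, grouping tests by the error message found in their logs can make triage easier. The sketch below is one possible approach, assuming the log text for each test has already been read into a string; the signature strings come from the error messages listed above, while the labels, function names, and the `unclassified` bucket are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

# (label, pattern) pairs; labels are made up, patterns come from the errors above.
FAILURE_SIGNATURES = [
    ("namelist_invalid_reference",
     re.compile(r"invalid reference to variable in NAMELIST input")),
    ("qs_table_overflow",
     re.compile(r"compute_qs: saturation vapor pressure table overflow")),
    ("unsupported_packing_type",
     re.compile(r"packing type \d+ not supported")),
    ("netcdf_format_constraints",
     re.compile(r"NetCDF: One or more variable sizes violate format constraints")),
]


def classify_log(text: str) -> Optional[str]:
    """Return the label of the first matching signature, or None if none match."""
    for label, pattern in FAILURE_SIGNATURES:
        if pattern.search(text):
            return label
    return None


def group_by_failure(logs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group test names by the failure signature found in each test's log text."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for test_name in sorted(logs):
        label = classify_log(logs[test_name]) or "unclassified"
        groups[label].append(test_name)
    return dict(groups)
```

Calling `group_by_failure` on a mapping of test name to log contents yields one list of tests per failure mode, plus an `unclassified` list for logs that match none of the known signatures.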
Steps to Reproduce