Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release/public-v2.2.0] Fix crontab bug for Cheyenne and Derecho, update PR template for new platforms #939

Conversation

mkavulich
Copy link
Collaborator

Note: this is identical to #934 except it contains an additional fix for removing old crontab entries on Cheyenne and Derecho

DESCRIPTION OF CHANGES:

The option to create an experiment with the option USE_CRON_TO_RELAUNCH=True is currently broken on Cheyenne and Derecho due to some bad python logic. This PR fixes that issue.

I also took the opportunity to update the PR template to include the new supported platforms (Derecho, Hercules, and Gaea C5)

Type of change

  • Bug fix (non-breaking change which fixes an issue)

TESTS CONDUCTED:

Ran WE2E fundamental tests with the option --launch=cron on three platforms. Previously failing on Cheyenne an Derecho, these tasks all succeed except for the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 test on Cheyenne: this is a pre-existing failure (see Issue #933)

  • hera.intel
  • cheyenne.intel
  • derecho.intel

DEPENDENCIES:

None

DOCUMENTATION:

None

ISSUE:

Fixes #932

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

…e for new platforms (ufs-community#934)

The option to create an experiment with the option USE_CRON_TO_RELAUNCH=True is currently broken on Cheyenne and Derecho due to some bad python logic. This fixes that issue.

Also took the opportunity to update the PR template to include the new supported platforms (Derecho, Hercules, and Gaea C5)
Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkavulich - Similar to PR #934, these changes look good to me. I also went ahead and ran the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 test on Derecho using the cron and the test successfully ran and ultimately passed. After the job passed, I checked the crontab and the job had been removed.

Approving this work now.

@MichaelLueken
Copy link
Collaborator

@mkavulich - Will you be including the additional fix for removing old crontab entries on Cheyenne and Derecho in a subsequent PR to develop? I ask because when I ran the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 test on Derecho for PR #934, the job was successfully removed from the crontab once it completed. Was this an issue only for Cheyenne? Thanks.

Copy link
Collaborator

@RatkoVasic-NOAA RatkoVasic-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@MichaelLueken
Copy link
Collaborator

Given that the Jenkins tests successfully passed for PR #934 and the only additional modifications were made for Cheyenne and Derecho, which aren't supported via Jenkins, I have completed running the WE2E coverage tests on Derecho and all tests successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_IndianOcean_6km                                     COMPLETE              21.88
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              35.51
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              42.35
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              26.94
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              16.71
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              38.67
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              22.65
pregen_grid_orog_sfc_climo                                         COMPLETE              13.40
specify_template_filenames                                         COMPLETE              13.75
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             231.86

I will now move forward with merging this work.

@MichaelLueken MichaelLueken merged commit d94b6e4 into ufs-community:release/public-v2.2.0 Oct 11, 2023
@mkavulich
Copy link
Collaborator Author

@MichaelLueken turns out there was a bad assumption in the crontab script: Derecho does not suffer from the same problem as Cheyenne where a different crontab command is needed when run from the cron job. That is why this error was giving me confusing results, and working on Derecho but not Cheyenne. This PR is probably fine for the release branch (the crontab command on Derecho is actually the same as the full /usr/bin/crontab specified by the special logic), but I will include the correct fix in the develop branch eventually.

mkavulich added a commit to mkavulich/ufs-srweather-app that referenced this pull request Dec 8, 2023
mkavulich added a commit to mkavulich/ufs-srweather-app that referenced this pull request Dec 8, 2023
MichaelLueken pushed a commit that referenced this pull request Jan 11, 2024
…d new winter weather verification test with staged data (#997)

New test
* The new test MET_ensemble_verification_winter_wx is added. This test will exercise a number of yet-untested capabilities in the workflow, including a 10-member ensemble, snowfall verification with staged data (so can be run on all platforms, not just Jet and Hera), and several SPP settings.
* As part of this new test, snowfall observations will now be staged on all tier-1 platforms, as well as netCDF GFS data and other observation types, all for the date 2022020300

Resolved issues
* Incorrect octal notation causing ensemble vx to fail #966 Resolved: In several locations, an explicit conversion is done to ENSMEM_INDX to ensure it is a base-10 integer, to avoid problems with bash interpreting numbers with a leading zero as octal.
* Should "EXPT_SUBDIR" be a mandatory variable? #978 resolved: per discussion in a recent SRW code management meeting, give EXPT_SUBDIR a default value "experiment" to avoid unnecessary complications and work for users. Additionally, the default behavior if an experiment directory already exists is changed to "quit" rather than "delete"
* Issue mentioned in this discussion; the setting fhzero=6 is removed from the weather model namelist for CCPP suite FV3_GFS_v17_p8, which allows precipitation and other accumulations to be made every hour rather than 6 hours (SRW output is always hourly, so this makes sense). Also, update diag_table.FV3_GFS_v17_p8 so that all output files will be hourly
* Per discussion in [release/public-v2.2.0] Fix crontab bug for Cheyenne and Derecho, update PR template for new platforms #939, remove an unnecessary special case in get_crontab_contents.py for Derecho

Other fixes
* Some old files for unsupported/removed CCPP suites are removed
* Add some missing task dependencies for retrieving verification obs

General improvements
* Many improvements to verification obs-pulling task
   * NDAS observations are now retrieved for forecast hour zero, and a better obs file is retrieved for major obs times (00z, 06z, 12z, 18z) per EMC guidance
   * Better in-line comments/documentation
   * Standardize order and messaging for file-on-disk checks across all observation types
* Added explanatory comments for reflectivity field in diag_table files
* Update diag_table.FV3_GFS_v17_p8 so that all output files will be hourly
* Simplify task dependencies that rely on staged verification observations; these "get_obs" tasks should always be run (they check that the data exists before trying to retrieve it), so no need to make the dependency conditional
* Add a check in monitor_jobs.py to ensure the yaml file does not contain duplicate experiment directories
* Make sure the key in the experiment dictionary used by is unique by appending the current date/time to the exptdir name; additionally, set this key as the WORKFLOW_ID variable (so that it could be used in the workflow if necessary).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release This PR/issue is related to a release branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants