Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Transition the var_defns bash file to YAML. #1098

Merged
merged 43 commits into from
Jul 26, 2024

Conversation

christinaholtNOAA
Copy link
Collaborator

@christinaholtNOAA christinaholtNOAA commented Jun 18, 2024

DESCRIPTION OF CHANGES:

Use YAML for the configuration language at run time.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

I could use some help on updating documentation. Just pointers to the most important pieces. Let's also talk about whether that could be separated into a follow-on PR.

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • hercules.intel
  • cheyenne.intel
  • cheyenne.gnu
  • derecho.intel
  • gaea.intel
  • gaeac5.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests -- all of them, but only on Hera

DEPENDENCIES:

n/a

DOCUMENTATION:

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

# USHdir
#
# workflow:
# EXPTDIR
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pattern that shows up regularly. I did my best to add a doc block to the top of each j-job and ex-script to list the variables that are used in the script and the sections they come from.

The run-time variables may be from Rocoto jobs or from the NCO preamble.

source_config_for_task "" ${GLOBAL_VAR_DEFNS_FP}
for sect in user nco workflow ; do
source_yaml ${GLOBAL_VAR_DEFNS_FP} ${sect}
done
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the pattern I've added to explicitly state which sections of the var_defns file should be "sourced". The bash util helps us run a UW tool that translates YAML to bash and then exports it line-by-line.

This sort of thing will go away with a full integration of uwtools drivers (planned work in upcoming EPIC PI 13).

@@ -16,7 +31,9 @@
#-----------------------------------------------------------------------
#
. $USHdir/source_util_funcs.sh
source_config_for_task "" ${GLOBAL_VAR_DEFNS_FP}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command sources everything in the bash file that does not start with "task_", so the environment is now narrowed to the specific script with the new implementation.

@@ -3,97 +3,25 @@
#
#-----------------------------------------------------------------------
#
# This script generates grid and orography files in NetCDF format that
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This text block was not consistent with what this script actually does, nor what its corresponding ex-script does.

set +u
conda activate ${SRW_GRAPHICS_ENV}
set -u
fi
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This task uses a different environment to run graphics, so loads it here instead of in the load_modules_run_task.sh file.

line=$(echo "$line" | sed -E "s/='\[(.*)\]'/=(\1)/")
line=${line//,/}
line=${line//\"/}
line=${line/None/}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines handle the transition from YAML lists to bash lists, and "None" values being empty variables in bash.

@@ -468,7 +468,7 @@ workflow:
#
#-----------------------------------------------------------------------
#
WORKFLOW_ID: !nowtimestamp ''
WORKFLOW_ID: ""
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This thing isn't supported in uwtools, and isn't actually needed for SRW.

@@ -639,7 +639,7 @@ def generate_FV3LAM_wflow(
input_format="nml",
output_file=FV3_NML_STOCH_FP,
output_format="nml",
supplemental_configs=[settings],
update_config=get_nml_config(settings),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a change to the API in the new uwtools version.

@@ -1,5 +1,7 @@
#!/bin/bash

set +u
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some how this script wasn't subjected to this requirement.

else
exec "${jjob_fp}"
fi
source "${jjob_fp}"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running the script in the same shell provides access to modify the conda env.

@MichaelLueken MichaelLueken changed the title Transition the var_defns bash file to YAML. [develop] Transition the var_defns bash file to YAML. Jun 20, 2024
@christinaholtNOAA christinaholtNOAA marked this pull request as ready for review June 20, 2024 14:43
Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christinaholtNOAA -

The modification that you made to ush/setup.py to correct the behavior with the DATE_FIRST_CYCL variable allows both the AQM cold start WE2E test:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20240718123320                   COMPLETE            6325.57
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            6325.57

and the AQM warm start configuration:

       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
202311100000               make_grid                     1989729           SUCCEEDED                   0         1          30.0
202311100000               make_orog                     1989833           SUCCEEDED                   0         1         185.0
202311100000          make_sfc_climo                     1989939           SUCCEEDED                   0         1          97.0
202311100000           nexus_gfs_sfc                     1989730           SUCCEEDED                   0         1          30.0
202311100000       nexus_emission_00                     1989834           SUCCEEDED                   0         1         658.0
202311100000       nexus_emission_01                     1989835           SUCCEEDED                   0         1         640.0
202311100000       nexus_emission_02                     1989836           SUCCEEDED                   0         1         727.0
202311100000        nexus_post_split                     1990128           SUCCEEDED                   0         1         103.0
202311100000           fire_emission                     1989731           SUCCEEDED                   0         1          92.0
202311100000            point_source                     1989837           SUCCEEDED                   0         1         284.0
202311100000             aqm_ics_ext                     1990040           SUCCEEDED                   0         1         117.0
202311100000                aqm_lbcs                     1990129           SUCCEEDED                   0         1          83.0
202311100000           get_extrn_ics                     1989732           SUCCEEDED                   0         1          23.0
202311100000          get_extrn_lbcs                     1989733           SUCCEEDED                   0         1          23.0
202311100000         make_ics_mem000                     1990013           SUCCEEDED                   0         1         116.0
202311100000        make_lbcs_mem000                     1990014           SUCCEEDED                   0         1         278.0
202311100000         run_fcst_mem000                     1990154           SUCCEEDED                   0         1        2659.0
202311100000    run_post_mem000_f000                     1990315           SUCCEEDED                   0         1          47.0
202311100000    run_post_mem000_f001                     1990328           SUCCEEDED                   0         1          27.0
202311100000    run_post_mem000_f002                     1990329           SUCCEEDED                   0         1          43.0
202311100000    run_post_mem000_f003                     1990371           SUCCEEDED                   0         1          29.0
202311100000    run_post_mem000_f004                     1990372           SUCCEEDED                   0         1          44.0
202311100000    run_post_mem000_f005                     1990409           SUCCEEDED                   0         1          92.0
202311100000    run_post_mem000_f006                     1990410           SUCCEEDED                   0         1          92.0
202311100000    run_post_mem000_f007                     1990449           SUCCEEDED                   0         1          28.0
202311100000    run_post_mem000_f008                     1990450           SUCCEEDED                   0         1          26.0
202311100000    run_post_mem000_f009                     1990458           SUCCEEDED                   0         1          28.0
202311100000    run_post_mem000_f010                     1990459           SUCCEEDED                   0         1          25.0
202311100000    run_post_mem000_f011                     1990521           SUCCEEDED                   0         1          25.0
202311100000    run_post_mem000_f012                     1990532           SUCCEEDED                   0         1          27.0
202311100000    run_post_mem000_f013                     1990533           SUCCEEDED                   0         1          26.0
202311100000    run_post_mem000_f014                     1990606           SUCCEEDED                   0         1          44.0
202311100000    run_post_mem000_f015                     1990607           SUCCEEDED                   0         1          28.0
202311100000    run_post_mem000_f016                     1990625           SUCCEEDED                   0         1          27.0
202311100000    run_post_mem000_f017                     1990626           SUCCEEDED                   0         1          42.0
202311100000    run_post_mem000_f018                     1990666           SUCCEEDED                   0         1          33.0
202311100000    run_post_mem000_f019                     1990667           SUCCEEDED                   0         1          38.0
202311100000    run_post_mem000_f020                     1990730           SUCCEEDED                   0         1          27.0
202311100000    run_post_mem000_f021                     1990760           SUCCEEDED                   0         1          43.0
202311100000    run_post_mem000_f022                     1990761           SUCCEEDED                   0         1          43.0
202311100000    run_post_mem000_f023                     1990762           SUCCEEDED                   0         1          43.0
202311100000    run_post_mem000_f024                     1990763           SUCCEEDED                   0         1          38.0
================================================================================================================================
202311110000           nexus_gfs_sfc                     1989734           SUCCEEDED                   0         1          30.0
202311110000       nexus_emission_00                     1989838           SUCCEEDED                   0         1         663.0
202311110000       nexus_emission_01                     1989839           SUCCEEDED                   0         1         676.0
202311110000       nexus_emission_02                     1989840           SUCCEEDED                   0         1         707.0
202311110000        nexus_post_split                     1990041           SUCCEEDED                   0         1         120.0
202311110000           fire_emission                     1989735           SUCCEEDED                   0         1          92.0
202311110000            point_source                     1989841           SUCCEEDED                   0         1         289.0
202311110000                 aqm_ics                     1990764           SUCCEEDED                   0         1         107.0
202311110000                aqm_lbcs                     1990130           SUCCEEDED                   0         1          94.0
202311110000           get_extrn_ics                     1989736           SUCCEEDED                   0         1          23.0
202311110000          get_extrn_lbcs                     1989737           SUCCEEDED                   0         1          26.0
202311110000         make_ics_mem000                     1990015           SUCCEEDED                   0         1         116.0
202311110000        make_lbcs_mem000                     1990016           SUCCEEDED                   0         1         261.0
202311110000         run_fcst_mem000                     1990814           SUCCEEDED                   0         1        2789.0
202311110000    run_post_mem000_f000                     1990982           SUCCEEDED                   0         1          47.0
202311110000    run_post_mem000_f001                     1990984           SUCCEEDED                   0         1          47.0
202311110000    run_post_mem000_f002                     1991030           SUCCEEDED                   0         1          25.0
202311110000    run_post_mem000_f003                     1991031           SUCCEEDED                   0         1          25.0
202311110000    run_post_mem000_f004                     1991061           SUCCEEDED                   0         1          26.0
202311110000    run_post_mem000_f005                     1991062           SUCCEEDED                   0         1          26.0
202311110000    run_post_mem000_f006                     1991093           SUCCEEDED                   0         1          26.0
202311110000    run_post_mem000_f007                     1991113           SUCCEEDED                   0         1          26.0
202311110000    run_post_mem000_f008                     1991114           SUCCEEDED                   0         1          46.0
202311110000    run_post_mem000_f009                     1991146           SUCCEEDED                   0         1          28.0
202311110000    run_post_mem000_f010                     1991147           SUCCEEDED                   0         1          31.0
202311110000    run_post_mem000_f011                     1991179           SUCCEEDED                   0         1          36.0
202311110000    run_post_mem000_f012                     1991180           SUCCEEDED                   0         1          31.0
202311110000    run_post_mem000_f013                     1991219           SUCCEEDED                   0         1          27.0
202311110000    run_post_mem000_f014                     1991220           SUCCEEDED                   0         1          50.0
202311110000    run_post_mem000_f015                     1991235           SUCCEEDED                   0         1          47.0
202311110000    run_post_mem000_f016                     1991261           SUCCEEDED                   0         1          26.0
202311110000    run_post_mem000_f017                     1991262           SUCCEEDED                   0         1          51.0
202311110000    run_post_mem000_f018                     1991277           SUCCEEDED                   0         1          25.0
202311110000    run_post_mem000_f019                     1991305           SUCCEEDED                   0         1          27.0
202311110000    run_post_mem000_f020                     1991306           SUCCEEDED                   0         1          30.0
202311110000    run_post_mem000_f021                     1991321           SUCCEEDED                   0         1          24.0
202311110000    run_post_mem000_f022                     1991322           SUCCEEDED                   0         1          43.0
202311110000    run_post_mem000_f023                     1991340           SUCCEEDED                   0         1          46.0
202311110000    run_post_mem000_f024                     1991341           SUCCEEDED                   0         1          46.0

to successfully run through to completion.

I'm still confused as to why the transition from bash to YAML is causing a 10-digit string to be converted to a date/time format. Is this supposed to happen, or is this showing an issue with the implementation?

@christinaholtNOAA
Copy link
Collaborator Author

@MichaelLueken I was also confused and it took me a while to find it. I found that that some extra magic was happening behind the scenes when var_defns.sh was being written to disk using SRW's python_utils functions. The experiment definition exists in memory as a dict in setup.py and these date values exist as Python datetime objects.

On line 1505 in setup.py the dict was written to the bash file using cfg_to_shell_str. This function ultimately calls the python_utils/environment.py function date_to_str on all datetime objects in the experiment dict before creating the bash string. This is not a step uwtools get_yaml_config does -- it uses only the pyyaml package to decide how different types of objects are transformed. Since it's a datetime in memory, you get a standard ISO format when it's converted to YAML (or bash -- see the last snippet below). Here's a very short example from the Python repl:

>>> import yaml
>>> from datetime import datetime
>>> x = datetime.now()
>>> x
datetime.datetime(2024, 7, 19, 17, 23, 6, 779122)
>>> print(yaml.dump({"x": x}))
x: 2024-07-19 17:23:06.779122
>>> from python_utils.environment import date_to_str
>>> date_to_str(x)
'202407191723'

I think this example shows how much SRW is reliant on assumptions made explicit by SRW related to data types. If we can move away from that behavior and adopt well-documented, widely used type conversions offered by YAML and ensure these conversions are very explicit when they happen, we'll be in better shape overall (as opposed to burying them deep in our SRW-specific code). I think all of our type conversions will rely on the YAML rules, so even a bash output file with a datetime object will look the same. Here I used the uwtools API to dump out the same dict as bash into example.sh:

$ cat example.sh 
x='2024-07-19 17:34:14.087619'

I hope that helps, but if it's still foggy, let me know.

Just a forward-looking comment on how dates will be handled with a full uwtools integration: We give users the ability to modify dates however they like from the datetime object instead of assuming. An example taken from the MPAS App defines this explicitly so we know it's always treated as a datetime object both in memory, and in config files:

first_cycle: !!timestamp 2023-09-15T00:00:00

And then drivers will always expect a format like that. Where we need to use the NWP-favored YYYYMMDDHH formats, we are giving the user direct access via Jinja to format the datestring however you like. The run directory might be formulated for the forecast like this to get our standard 10-digit date string:

rundir: '{{ user.experiment_dir }}/{{ cycle.strftime("%Y%m%d%H") }}/forecast'

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christinaholtNOAA -

Thank you very much for the detailed explanation! Since all tests are successfully running now, and the date and time behavior was expected (and the fix is working), I will go ahead and approve these changes now.

Copy link
Collaborator

@mkavulich mkavulich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my concerns!

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Jul 22, 2024
@MichaelLueken
Copy link
Collaborator

@christinaholtNOAA -

While running the Jenkins tests, the Functional WorkflowTaskTests failed. The wrapper scripts in ush/wrapper are setting:

export GLOBAL_VAR_DEFNS_FP="${EXPTDIR}/var_defns.sh"

These will need to be updated to point to var_defns.yaml.

Additionally, the Functional UnitTests are failing on Hera GNU. The test_rap_lbcs_from_aws test is failing. All Functional UnitTests successfully pass on Hera Intel and other platforms.

When you have the chance, please look through these issue and correct them if possible.

@MichaelLueken
Copy link
Collaborator

@christinaholtNOAA -

I opened PR #4 in your fork to add in the necessary changes to the wrapper scripts to work with transitioning the var_defns bash file to YAML.

As for the failures of the retrieve_data unit tests, it looks like the failures are those that are attempting to pull RAP_obs from HPSS. All unit test failures have the following:

KeyError: 'No data stores have been defined for RAP_obs!'

These tests were added when RRFS features were being added to the SRW. Is it possible that the old RRFS data on HPSS has been removed now that the RRFSv1 implementation has been suspended? We will need to remove these unit tests moving forward if the data is no longer available to be pulled from HPSS.

[to_yaml] Update wrapper scripts to allow them to run using var_defns.yaml
@christinaholtNOAA
Copy link
Collaborator Author

@MichaelLueken Can you point me to the output of the failed unit tests for RAP Obs? I'm running those on Hera now, but they do take quite some time. Thanks so much for your engagement on this whole PR!

@christinaholtNOAA
Copy link
Collaborator Author

After looking at the code, I'm curious how this has been passing in PRs since I removed the RAP_obs section in the data_locations.yaml file months ago! I'll push a change that removes the tests.

@MichaelLueken
Copy link
Collaborator

@christinaholtNOAA - Thanks for removing the RAP_obs tests from test_retrieve_data.py!

Yeah, looking at the changes in PR #893, I do see the removal of RAP_obs from the data_locations.yml file. I have no idea where the tests have been attempting to pull from, since it appears to have been able to retrieve some data to allow the Jenkins unit tests to run up to this point. Wherever the test was pulling data from is no longer available, leading to the failure now.

@christinaholtNOAA
Copy link
Collaborator Author

@MichaelLueken Just checking in...are there any requests for changes I missed here?

@MichaelLueken
Copy link
Collaborator

@christinaholtNOAA - Everything is fine. Both Hercules and Orion were down for maintenance yesterday, so I didn't relaunch the Jenkins tests. Now that maintenance is complete, relaunching now.

@MichaelLueken
Copy link
Collaborator

@christinaholtNOAA -

The updates to the wrapper scripts and test_retrieve_data.py are successfully running. Unfortunately, there was another script that needed to be updated for the wrapper scripts to work on Gaea. I will be opening a PR to your to_yaml branch once I have corrected the issue (for Gaea, the wrapper scripts call load_modules_run_task.sh, which has been updated in this PR to require three arguments).

@MichaelLueken
Copy link
Collaborator

@christinaholtNOAA -

The final fix for the Functional WorkflowTaskTests to run successfully on Gaea has been completed and PR #5 has been opened in your fork. Once this PR has been merged, I will kick off the Gaea tests one last time, then we should be able to merge this work.

[to_yaml] Update .cicd/scripts/wrapper_srw_ftest.sh to add third argument to call to load_modules_run_task.sh
@christinaholtNOAA
Copy link
Collaborator Author

@MichaelLueken Thanks! I merged your PR to my branch.

@MichaelLueken
Copy link
Collaborator

The Jenkins tests successfully passed on Derecho, Gaea, Hera Intel, Hercules, and Orion.

The NOMADS WE2E test is currently failing in run_fcst on Hera GNU. I've relaunched the Hera GNU tests in hopes that the date will change, which should pull good data.

The Jet tests were aborted due to running longer than 8 hours. The coverage WE2E tests were manually launched yesterday, and the grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR test's run_fcst is still sitting in queue.

Once the Hera GNU and Jet tests are complete, I will merge this PR.

@MichaelLueken
Copy link
Collaborator

The Jet coverage WE2E tests have successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240725134749                                           COMPLETE              26.78
custom_ESGgrid_20240725134750                                      COMPLETE              64.00
custom_ESGgrid_Great_Lakes_snow_8km_20240725134751                 COMPLETE              30.28
custom_GFDLgrid_20240725134753                                     COMPLETE              47.22
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202407  COMPLETE              13.15
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE             103.50
get_from_HPSS_ics_RAP_lbcs_RAP_20240725134756                      COMPLETE              20.54
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240725134756  COMPLETE             632.83
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE             105.59
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE              42.54
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             965.82
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            2052.25

@MichaelLueken
Copy link
Collaborator

The Hera GNU tests have successfully passed through the Jenkins pipeline.

Merging PR now.

@MichaelLueken MichaelLueken merged commit c377164 into ufs-community:develop Jul 26, 2024
4 of 5 checks passed
natalie-perlin pushed a commit to natalie-perlin/ufs-srweather-app that referenced this pull request Aug 15, 2024
…1098)

Use YAML for the configuration language at run time.

---------

Co-authored-by: Michael Lueken <63728921+MichaelLueken@users.noreply.github.com>
Co-authored-by: Michael Kavulich <kavulich@ucar.edu>
Co-authored-by: michael.lueken <Michael.Lueken@noaa.gov>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants