adding explanations to the verification documentation
WillyChap committed Dec 29, 2024
1 parent b19ae87 commit b8e7ca4
Showing 10 changed files with 280 additions and 18 deletions.
94 changes: 91 additions & 3 deletions verification/README.md
@@ -1,6 +1,94 @@
# Forecast Verification Workflow

This subfolder contains scripts to facilitate the verification of newly produced forecasts. This is an evolving section of the repository, and contributions are welcome to enhance the pipeline and expand verification capabilities.

Forecast verification is compute-intensive and often requires parallel processing. Our strategy leverages many small queued jobs to handle the workload efficiently. This document outlines the steps required to verify a forecast against the ERA5 dataset and to compare your forecast system against a reference such as the IFS.

The process is driven by a **YAML configuration file** located in **./verification/** and named `verif_config.yml`. This file must be customized extensively before initiating any verification steps.

---

## Step 00 – Adjust the YAML Configuration

Most fields in the `ERA5` and `IFS` sections of the YAML file can remain as they are, but the following areas require your attention and adjustment:

1. **qsub Section**

- **`qsub_loc`** – Path to the directory for qsub scripts (typically `./verification/qsub/`).
- **`scripts_loc`** – Path to the directory containing verification scripts.
- **`project_code`** – Your project code (required for submitting jobs to the cluster).
- **`conda_env`** – Name of the conda environment used for running the scripts.

2. **forecastmodel Section**

- **`save_loc_rollout`** – Path to the directory where your generated forecasts are saved.
- **`verif_variables`** – List of variables you wish to verify (ensure these match your forecast output).
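
After editing the YAML, a quick sanity check can save a failed batch later. The sketch below is illustrative (not part of the repository's scripts) and assumes the config has already been parsed into a Python dict, e.g. with `yaml.safe_load(open('verif_config.yml'))`:

```python
# Hypothetical helper: flag required verif_config.yml fields that are missing.
REQUIRED = {
    "qsub": ["qsub_loc", "scripts_loc", "project_code", "conda_env"],
    "forecastmodel": ["save_loc_rollout", "verif_variables"],
}

def missing_fields(conf):
    """Return a list like ['qsub.project_code'] for every absent required key."""
    missing = []
    for section, keys in REQUIRED.items():
        for key in keys:
            if key not in (conf.get(section) or {}):
                missing.append(f"{section}.{key}")
    return missing

print(missing_fields({"qsub": {"qsub_loc": "./verification/qsub/"}}))
```

Run this once before generating any qsub scripts; an empty list means all required fields are present.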

---

## Step 01 – Generate and Run QSUB Scripts

Navigate to the **`./verification/verification/`** directory, where you will find four Jupyter notebooks named **`qsub_STEP*_*.ipynb`**. These notebooks generate the qsub scripts found in the **`./verification/qsub/`** directory.

### Key Scripts to Run:

- **`STEP_00`** – Gathers forecast data (required before proceeding).
- **`STEP_02`** – Generates RMSE and ACC metrics.

These scripts must be executed sequentially.

### Running QSUB Scripts:

1. After generating the qsub scripts via the notebooks, navigate to the **`./verification/qsub/`** directory.
2. Execute the following scripts via bash:
```bash
bash step00_gather_ForecastModel_all.sh
bash step02_RMSE_MF_all.sh
bash step02_ACC_MF_all.sh
```
3. **`step00_gather_ForecastModel_all.sh`** must complete before running the other scripts.
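
One way to confirm the gather jobs have drained before launching the step02 scripts is to poll the queue. The helper below is a sketch: it assumes the job name appears as the 4th whitespace-separated column of `qstat -u $USER` output, which varies by site, so adjust the column index to match yours.

```python
def count_matching_jobs(qstat_output, prefix="verif"):
    """Count queued/running PBS jobs whose name starts with `prefix`.

    Assumes the job name is the 4th column of `qstat -u $USER` output
    (a site-specific assumption -- check your cluster's qstat format).
    """
    count = 0
    for line in qstat_output.splitlines():
        cols = line.split()
        if len(cols) >= 4 and cols[3].startswith(prefix):
            count += 1
    return count

sample = "123.casper user casper verif_ZES_WX_001 -- 1 4 32gb 23:59 R 01:02"
print(count_matching_jobs(sample))  # 1
```

In practice you would feed it live output, e.g. `subprocess.run(["qstat", "-u", os.environ["USER"]], capture_output=True, text=True).stdout`, inside a polling loop.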

---

## Expected Results

Upon completion of each stage:

1. **After Forecast Gathering:**
- Forecasts will be gathered into individual NetCDF (`*.nc`) files in the location specified in the `qsub` section of the **YAML file**.

2. **After RMSE and ACC Computation:**
- RMSE and ACC NetCDF files will be saved in the directory defined by the **`save_loc_verif`** field under the `forecastmodel` section of the YAML file.
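
For orientation, the two metrics can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the exact code in the `./scripts` directory (which operates on xarray datasets); the cosine-latitude weighting shown is a common convention for global grids.

```python
import numpy as np

def weighted_rmse(forecast, truth, lat):
    """Latitude-weighted RMSE over a (lat, lon) field (sketch)."""
    w = np.cos(np.deg2rad(lat))
    w = w / w.mean()  # normalize weights to mean 1
    return float(np.sqrt(((forecast - truth) ** 2 * w[:, None]).mean()))

def acc(f_anom, o_anom):
    """Anomaly correlation coefficient between forecast and observed anomalies."""
    num = (f_anom * o_anom).sum()
    den = np.sqrt((f_anom ** 2).sum() * (o_anom ** 2).sum())
    return float(num / den)

lat = np.array([-60.0, 0.0, 60.0])
print(weighted_rmse(np.ones((3, 4)), np.zeros((3, 4)), lat))  # 1.0
```

An ACC of 1.0 means the forecast anomalies are perfectly correlated with the observed anomalies; values near zero indicate no skill beyond climatology.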

## Troubleshooting

Forecast verification can take several days, especially for multi-year data. If errors occur, consider the following:

1. **Directory Permissions & Existence**\
Ensure all directories specified in the YAML file exist and have appropriate write permissions. Create them manually if necessary.

```python
import os
os.makedirs(path_verif, exist_ok=True)
```

2. **Post-Gather Checks**

- After running the gather script, verify that all forecast files were created and are of the expected size.
- Delete any abnormally small files and rerun the gather script. Existing files are **not** overwritten, so the rerun only regenerates the missing files and completes much faster than the first pass.

3. **Monitoring Job Progress**

- Use cluster job monitoring tools to track progress and troubleshoot errors.
- For failed jobs, inspect the `.err` files in the qsub directory for detailed logs.
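
A quick way to catch truncated outputs is to scan the gather directory for suspiciously small files. The sketch below uses a size threshold that is purely illustrative; pick one based on the expected size of your gathered NetCDF files.

```python
import os

def undersized_files(gather_dir, min_bytes=1_000_000):
    """Return paths of gathered .nc files smaller than min_bytes (likely truncated).

    The 1 MB default is a placeholder threshold -- tune it to your data.
    """
    bad = []
    for name in sorted(os.listdir(gather_dir)):
        if name.endswith(".nc"):
            path = os.path.join(gather_dir, name)
            if os.path.getsize(path) < min_bytes:
                bad.append(path)
    return bad
```

Delete the returned paths, then rerun `step00_gather_ForecastModel_all.sh`; valid existing files are skipped.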

---

## Additional Notes

- This process heavily relies on **parallel computing environments** like NCAR's Casper/Derecho clusters. Ensure you are familiar with the cluster's queuing and submission systems (PBS/SLURM).
- The workflow is designed to be flexible. Users are encouraged to adapt scripts to suit their specific verification needs.

If additional clarification or sections are needed (e.g., explanation of the verification metrics or variable definitions), feel free to reach out or contribute directly to this repository.

26 changes: 25 additions & 1 deletion verification/qsub/README.md
@@ -1 +1,25 @@
# Location of qsub folder

## Step 01 – Generate and Run QSUB Scripts

Navigate to the **`./verification/verification/`** directory, where you will find four Jupyter notebooks named **`qsub_STEP*_*.ipynb`**. These notebooks generate the qsub scripts found in the **`./verification/qsub/`** directory.

### Key Scripts to Run:

- **`STEP_00`** – Gathers forecast data (required before proceeding).
- **`STEP_02`** – Generates RMSE and ACC metrics.

These scripts must be executed sequentially.

### Running QSUB Scripts:

1. After generating the qsub scripts via the notebooks, navigate to this **`./verification/qsub/`** directory.
2. Execute the following scripts via bash:
```bash
bash step00_gather_ForecastModel_all.sh
bash step02_RMSE_MF_all.sh
bash step02_ACC_MF_all.sh
```
3. **`step00_gather_ForecastModel_all.sh`** must complete before running the other scripts.
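
If you submit the step02 work through `qsub` yourself rather than the wrapper scripts, PBS job dependencies (`-W depend=afterok:<jobid>`) can enforce this ordering without watching the queue. A minimal command builder (the script and job IDs below are illustrative):

```python
def dependent_qsub_cmd(script, after_job_ids):
    """Build a qsub command that holds `script` until the listed PBS jobs
    finish successfully (PBS -W depend=afterok syntax)."""
    return ["qsub", "-W", "depend=afterok:" + ":".join(after_job_ids), script]

print(dependent_qsub_cmd("step02_RMSE_MF_all.sh", ["1234.casper"]))
# ['qsub', '-W', 'depend=afterok:1234.casper', 'step02_RMSE_MF_all.sh']
```

With `afterok`, the dependent job is released only if every listed job exits with status zero; see your site's PBS documentation for other dependency types.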

---
112 changes: 112 additions & 0 deletions verification/verification/README.md
@@ -0,0 +1,112 @@
## Hello!

Below we outline the notebooks used to generate the QSUB scripts. The main files that drive the calculations live in **`./scripts`**; adjustments can be made there.

---
## QSUB Jupyter Notebooks – Generating Job Scripts for Forecast Verification

This directory contains Jupyter notebooks designed to generate and submit qsub scripts for various stages of the forecast verification process. These notebooks facilitate job scheduling and resource allocation on HPC systems, streamlining the process of gathering, processing, and verifying forecast data.

---
### Notebooks Overview

The primary function of the notebooks in this folder is to automate the creation of bash scripts (`.sh`) that submit jobs to the PBS queueing system. This approach allows for efficient parallelization, ensuring multiple forecasts are processed concurrently.

**Notebook Naming Convention:**
- **`qsub_STEP00_*.ipynb`** – Responsible for gathering forecast model data.
- **`qsub_STEP02_*.ipynb`** – Generates RMSE and ACC qsub scripts for model verification.

---

### How to Use These Notebooks

1. **Setup & Prerequisites**
Ensure the following prerequisites are met before running the notebooks:
- **Configured YAML file** (`verif_config.yml`) with correct paths, project codes, and environment settings.
- **Conda environment** activated (defined in the YAML under `conda_env`).
- Appropriate access to the cluster and necessary permissions for submitting jobs.

**Example Activation:**
```bash
conda activate credit
```

2. **Navigating the Workflow**
- Start by opening the `qsub_STEP00_jobs.ipynb` notebook to generate scripts for gathering forecast data.
- Follow by executing the `qsub_STEP02_*` notebooks for computing RMSE and ACC after the gather phase completes.

3. **Running the Notebooks**
- Execute cells sequentially within the notebook.
- Each notebook will output `.sh` scripts into the `./verification/qsub/` directory.

4. **Submitting QSUB Jobs**
Once scripts are generated, submit them to the cluster queue:
```bash
bash step00_gather_ForecastModel_all.sh
bash step02_RMSE_MF_all.sh
bash step02_ACC_MF_all.sh
```

---

### Notebook Breakdown – `qsub_STEP00_jobs.ipynb`

**Purpose:**
Generates qsub scripts that gather forecast data from various sources and format it for further verification.

**Key Sections:**
- **Config Loading:** Loads the YAML configuration to set paths, environment, and project-specific parameters.
- **Script Generation Loop:** Iterates over forecast indices (`INDs`) to create individual qsub scripts for each chunk of data.
- **Output:**
- Scripts are saved in the `qsub_loc` directory specified in the YAML.
- Example script: `verif_ZES_WX_001.sh`

**Critical Code Example:**
```python
# Open one qsub script per chunk of forecast indices
f = open('{}verif_ZES_WX_{:03d}.sh'.format(conf['qsub']['qsub_loc'], i), 'w')

heads = '''#!/bin/bash -l
#PBS -N ZES_MF
#PBS -A {}
#PBS -l walltime=23:59:59
#PBS -l select=1:ncpus=4:mem=32GB
#PBS -q casper
#PBS -o verif_ZES_MF.log
#PBS -e verif_ZES_MF.err
conda activate credit
cd {}
python STEP03_ZES_ModelForecast.py {} {}
'''.format(conf['qsub']['project_code'], conf['qsub']['scripts_loc'],
           ind_start, ind_end)

f.write(heads)
f.close()
```
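
The script-generation loop needs the `(ind_start, ind_end)` pairs for each chunk. A typical way to derive them from the total number of initializations (a sketch, not the notebook's exact code):

```python
def chunk_indices(n_total, chunk_size):
    """Split [0, n_total) into (start, end) pairs, one qsub script per pair."""
    return [(s, min(s + chunk_size, n_total)) for s in range(0, n_total, chunk_size)]

# e.g. 365 initializations processed in chunks of 100
print(chunk_indices(365, 100))  # [(0, 100), (100, 200), (200, 300), (300, 365)]
```

Smaller chunks mean more, shorter jobs; pick a chunk size that keeps each job comfortably within the walltime limit.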

---

### Keys to Running These Notebooks

- **Ensure Sequential Execution:**
- `STEP00` scripts **must** be submitted and completed **before** proceeding to `STEP02` scripts.
- Failure to adhere to this order will result in missing forecast data during RMSE/ACC calculations.
- **Directory Existence:**
- The directories where NetCDF files are saved (`save_loc_verif`) must exist.
- Use `os.makedirs(path, exist_ok=True)` to create directories if needed.
- **Cluster Specifics:**
- These scripts are optimized for NCAR’s Casper/Derecho clusters. Adjust for other HPC environments if necessary.

---

### Troubleshooting

- **Job Failures:**
- Review `.err` files in the qsub directory. These contain logs of job failures and error messages.
- **Missing Files:**
- If forecast files appear incomplete, re-run the gather phase (`STEP00`) without fear of overwriting existing valid files.
- **Memory/CPU Issues:**
- Adjust resource allocation by modifying `ncpus` and `mem` in the qsub script templates within the notebooks.
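
If jobs die from memory or CPU limits, the resource request line in already-generated scripts can also be bumped programmatically instead of regenerating everything by hand. A small sketch (the new values are examples only):

```python
import re

pbs_line = "#PBS -l select=1:ncpus=4:mem=32GB"

# Double the CPUs and memory for the retry
updated = re.sub(r"ncpus=\d+", "ncpus=8", pbs_line)
updated = re.sub(r"mem=\d+GB", "mem=64GB", updated)
print(updated)  # #PBS -l select=1:ncpus=8:mem=64GB
```

Applied over each `.sh` file in the qsub directory, this avoids re-running the notebooks just to change a resource request.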

---

### Final Notes

This workflow provides a scalable and efficient method for verifying forecast data on HPC clusters. While designed for internal projects, contributions are encouraged to improve performance, add metrics, or adapt for other clusters.

If you find gaps or areas that require clarification, feel free to submit issues or pull requests to enhance the repository.
2 changes: 1 addition & 1 deletion verification/verification/qsub_STEP00_jobs.ipynb
@@ -73,7 +73,7 @@
},
{
"cell_type": "code",
"execution_count": 18,
"id": "182df2fc-3d68-449c-a3b3-cd1914a6aa76",
"metadata": {},
"outputs": [],
@@ -5,7 +5,7 @@
The script runs with config file `verif_config_1h.yml` and produces netCDF4 files
with one file per initialization on a given range:
```
python STEP00_gather_ForecastModel.py 0 365
```
where 0 and 365 are the first and the last initialization.
Expand Down Expand Up @@ -45,13 +45,17 @@
verif_ind_end = int(args['verif_ind_end'])

# ==================== #
model_name = 'forecastmodel'
# ==================== #

variables_levels = conf[model_name]['verif_variables']

base_dir = conf[model_name]['save_loc_rollout']
output_dir = conf[model_name]['save_loc_gather']

# Ensure directory exists before saving the file
os.makedirs(os.path.dirname(output_dir), exist_ok=True)

time_intervals = None

# Get list of NetCDF files
@@ -41,7 +41,7 @@
verif_ind_end = int(args['verif_ind_end'])

# ====================== #
model_name = 'forecastmodel'
lead_range = conf[model_name]['lead_range']
verif_lead_range = conf[model_name]['verif_lead_range']

@@ -155,6 +155,9 @@ def sp_avg(DS, wlat):
# Combine ACC results
ds_acc = xr.concat(acc_results, dim='days')

# Ensure directory exists before saving the file
os.makedirs(os.path.dirname(path_verif), exist_ok=True)

# Save
print('Save to {}'.format(path_verif))
ds_acc.to_netcdf(path_verif)
11 changes: 5 additions & 6 deletions verification/verification/scripts/STEP02_RMSE_ForecastModel.py
@@ -40,7 +40,7 @@
verif_ind_end = int(args['verif_ind_end'])

# ====================== #
model_name = 'forecastmodel'
lead_range = conf[model_name]['lead_range']
verif_lead_range = conf[model_name]['verif_lead_range']

@@ -82,6 +82,7 @@
'SP': None,
't2m': None
}
variables_levels = conf['ERA5_ours']['verif_variables']

# subset merged ERA5 and unify coord names
levels = ds_ERA5_merge['level'].values
@@ -131,11 +132,9 @@
# Combine verif results
ds_verif = xr.concat(verif_results, dim='days')

# Ensure directory exists before saving the file
os.makedirs(os.path.dirname(path_verif), exist_ok=True)

# Save the combined dataset
print('Save to {}'.format(path_verif))
ds_verif.to_netcdf(path_verif, mode='w')





@@ -46,7 +46,7 @@
verif_ind_end = int(args['verif_ind_end'])

# ====================== #
model_name = 'forecastmodel'
lead_range = conf[model_name]['lead_range']
verif_lead_range = conf[model_name]['verif_lead_range']

@@ -34,9 +34,8 @@
verif_ind_start = int(args['verif_ind_start'])
verif_ind_end = int(args['verif_ind_end'])
# ====================== #
model_name = 'forecastmodel'
lead_range = conf[model_name]['lead_range']

leads_exist = list(np.arange(lead_range[0], lead_range[-1]+lead_range[0], lead_range[0]))
leads_verif = list(np.arange(verif_lead_range[0], verif_lead_range[-1]+verif_lead_range[0], verif_lead_range[0]))