adding explanations to the verification documentation
WillyChap committed Dec 29, 2024
1 parent b19ae87 commit b8e7ca4
Showing 10 changed files with 280 additions and 18 deletions.
94 changes: 91 additions & 3 deletions verification/README.md
@@ -1,6 +1,94 @@
# Forecast Verification Workflow

This subfolder contains scripts to facilitate the verification of newly produced forecasts. This is an evolving section of the repository, and contributions are welcome to enhance the pipeline and expand verification capabilities.

Forecast verification is compute-intensive and often requires parallel processing. Our strategy leverages many small queued jobs to handle the workload efficiently. This document outlines the steps required to verify a forecast against the ERA5 dataset and to compare your forecast system against a reference such as the IFS.

The process is driven by a **YAML configuration file** located in **./verification/** and named `verif_config.yml`. This file must be customized extensively before initiating any verification steps.

---

## Step 00 – Adjust the YAML Configuration

Most fields in the `ERA5` and `IFS` sections of the YAML file can remain as they are, but the following areas require your attention and adjustment:

1. **qsub Section**

- **`qsub_loc`** – Path to the directory for qsub scripts (typically `./verification/qsub/`).
- **`scripts_loc`** – Path to the directory containing verification scripts.
- **`project_code`** – Your project code (required for submitting jobs to the cluster).
- **`conda_env`** – Name of the conda environment used for running the scripts.

2. **forecastmodel Section**

- **`save_loc_rollout`** – Path to the directory where your generated forecasts are saved.
- **`verif_variables`** – List of variables you wish to verify (ensure these match your forecast output).
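
After editing the YAML, a quick sanity check can save a failed batch later. The sketch below is illustrative (not part of the repository's scripts) and assumes the config has already been parsed into a Python dict, e.g. with `yaml.safe_load(open('verif_config.yml'))`:

```python
# Hypothetical helper: flag required verif_config.yml fields that are missing.
REQUIRED = {
    "qsub": ["qsub_loc", "scripts_loc", "project_code", "conda_env"],
    "forecastmodel": ["save_loc_rollout", "verif_variables"],
}

def missing_fields(conf):
    """Return a list like ['qsub.project_code'] for every absent required key."""
    missing = []
    for section, keys in REQUIRED.items():
        for key in keys:
            if key not in (conf.get(section) or {}):
                missing.append(f"{section}.{key}")
    return missing

print(missing_fields({"qsub": {"qsub_loc": "./verification/qsub/"}}))
```

Run this once before generating any qsub scripts; an empty list means all required fields are present.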

---

## Step 01 – Generate and Run QSUB Scripts

Navigate to the **`./verification/verification/`** directory, where you will find four Jupyter notebooks named **`qsub_STEP*_*.ipynb`**. These notebooks generate the qsub scripts found in the **`./verification/qsub/`** directory.

### Key Scripts to Run:

- **`STEP_00`** – Gathers forecast data (required before proceeding).
- **`STEP_02`** – Generates RMSE and ACC metrics.

These scripts must be executed sequentially.

### Running QSUB Scripts:

1. After generating the qsub scripts via the notebooks, navigate to the **`./verification/qsub/`** directory.
2. Execute the following scripts via bash:
```bash
bash step00_gather_ForecastModel_all.sh
bash step02_RMSE_MF_all.sh
bash step02_ACC_MF_all.sh
```
3. **`step00_gather_ForecastModel_all.sh`** must complete before running the other scripts.
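
One way to confirm the gather jobs have drained before launching the step02 scripts is to poll the queue. The helper below is a sketch: it assumes the job name appears as the 4th whitespace-separated column of `qstat -u $USER` output, which varies by site, so adjust the column index to match yours.

```python
def count_matching_jobs(qstat_output, prefix="verif"):
    """Count queued/running PBS jobs whose name starts with `prefix`.

    Assumes the job name is the 4th column of `qstat -u $USER` output
    (a site-specific assumption -- check your cluster's qstat format).
    """
    count = 0
    for line in qstat_output.splitlines():
        cols = line.split()
        if len(cols) >= 4 and cols[3].startswith(prefix):
            count += 1
    return count

sample = "123.casper user casper verif_ZES_WX_001 -- 1 4 32gb 23:59 R 01:02"
print(count_matching_jobs(sample))  # 1
```

In practice you would feed it live output, e.g. `subprocess.run(["qstat", "-u", os.environ["USER"]], capture_output=True, text=True).stdout`, inside a polling loop.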

---

## Expected Results

Upon completion of each stage:

1. **After Forecast Gathering:**
- Forecasts will be gathered into individual NetCDF (`*.nc`) files in the location specified in the `qsub` section of the **YAML file**.

2. **After RMSE and ACC Computation:**
- RMSE and ACC NetCDF files will be saved in the directory defined by the **`save_loc_verif`** field under the `forecastmodel` section of the YAML file.
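
For orientation, the two metrics can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the exact code in the `./scripts` directory (which operates on xarray datasets); the cosine-latitude weighting shown is a common convention for global grids.

```python
import numpy as np

def weighted_rmse(forecast, truth, lat):
    """Latitude-weighted RMSE over a (lat, lon) field (sketch)."""
    w = np.cos(np.deg2rad(lat))
    w = w / w.mean()  # normalize weights to mean 1
    return float(np.sqrt(((forecast - truth) ** 2 * w[:, None]).mean()))

def acc(f_anom, o_anom):
    """Anomaly correlation coefficient between forecast and observed anomalies."""
    num = (f_anom * o_anom).sum()
    den = np.sqrt((f_anom ** 2).sum() * (o_anom ** 2).sum())
    return float(num / den)

lat = np.array([-60.0, 0.0, 60.0])
print(weighted_rmse(np.ones((3, 4)), np.zeros((3, 4)), lat))  # 1.0
```

An ACC of 1.0 means the forecast anomalies are perfectly correlated with the observed anomalies; values near zero indicate no skill beyond climatology.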

## Troubleshooting

Forecast verification can take several days, especially for multi-year data. If errors occur, consider the following:

1. **Directory Permissions & Existence**\
Ensure all directories specified in the YAML file exist and have appropriate write permissions. Create them manually if necessary.

```python
import os
os.makedirs(path_verif, exist_ok=True)
```

2. **Post-Gather Checks**

- After running the gather script, verify that all forecast files were created and are of the expected size.
- Delete any abnormally small files and rerun the gather script. Existing files are **not** overwritten, so the rerun only regenerates the missing files and completes much faster than the first pass.

3. **Monitoring Job Progress**

- Use cluster job monitoring tools to track progress and troubleshoot errors.
- For failed jobs, inspect the `.err` files in the qsub directory for detailed logs.
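
A quick way to catch truncated outputs is to scan the gather directory for suspiciously small files. The sketch below uses a size threshold that is purely illustrative; pick one based on the expected size of your gathered NetCDF files.

```python
import os

def undersized_files(gather_dir, min_bytes=1_000_000):
    """Return paths of gathered .nc files smaller than min_bytes (likely truncated).

    The 1 MB default is a placeholder threshold -- tune it to your data.
    """
    bad = []
    for name in sorted(os.listdir(gather_dir)):
        if name.endswith(".nc"):
            path = os.path.join(gather_dir, name)
            if os.path.getsize(path) < min_bytes:
                bad.append(path)
    return bad
```

Delete the returned paths, then rerun `step00_gather_ForecastModel_all.sh`; valid existing files are skipped.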

---

## Additional Notes

- This process heavily relies on **parallel computing environments** like NCAR's Casper/Derecho clusters. Ensure you are familiar with the cluster's queuing and submission systems (PBS/SLURM).
- The workflow is designed to be flexible. Users are encouraged to adapt scripts to suit their specific verification needs.

If additional clarification or sections are needed (e.g., explanation of the verification metrics or variable definitions), feel free to reach out or contribute directly to this repository.

26 changes: 25 additions & 1 deletion verification/qsub/README.md
@@ -1 +1,25 @@
# Location of qsub folder

## Step 01 – Generate and Run QSUB Scripts

Navigate to the **`./verification/verification/`** directory, where you will find four Jupyter notebooks named **`qsub_STEP*_*.ipynb`**. These notebooks generate the qsub scripts found in the **`./verification/qsub/`** directory.

### Key Scripts to Run:

- **`STEP_00`** – Gathers forecast data (required before proceeding).
- **`STEP_02`** – Generates RMSE and ACC metrics.

These scripts must be executed sequentially.

### Running QSUB Scripts:

1. After generating the qsub scripts via the notebooks, navigate to this **`./verification/qsub/`** directory.
2. Execute the following scripts via bash:
```bash
bash step00_gather_ForecastModel_all.sh
bash step02_RMSE_MF_all.sh
bash step02_ACC_MF_all.sh
```
3. **`step00_gather_ForecastModel_all.sh`** must complete before running the other scripts.
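
If you submit the step02 work through `qsub` yourself rather than the wrapper scripts, PBS job dependencies (`-W depend=afterok:<jobid>`) can enforce this ordering without watching the queue. A minimal command builder (the script and job IDs below are illustrative):

```python
def dependent_qsub_cmd(script, after_job_ids):
    """Build a qsub command that holds `script` until the listed PBS jobs
    finish successfully (PBS -W depend=afterok syntax)."""
    return ["qsub", "-W", "depend=afterok:" + ":".join(after_job_ids), script]

print(dependent_qsub_cmd("step02_RMSE_MF_all.sh", ["1234.casper"]))
# ['qsub', '-W', 'depend=afterok:1234.casper', 'step02_RMSE_MF_all.sh']
```

With `afterok`, the dependent job is released only if every listed job exits with status zero; see your site's PBS documentation for other dependency types.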

---
112 changes: 112 additions & 0 deletions verification/verification/README.md
@@ -0,0 +1,112 @@
## Hello!

Below we outline the notebooks used to generate the QSUB scripts. The main files that drive the calculations live in **`./scripts`**; adjustments can be made there.

---
## QSUB Jupyter Notebooks – Generating Job Scripts for Forecast Verification

This directory contains Jupyter notebooks designed to generate and submit qsub scripts for various stages of the forecast verification process. These notebooks facilitate job scheduling and resource allocation on HPC systems, streamlining the process of gathering, processing, and verifying forecast data.

---
### Notebooks Overview

The primary function of the notebooks in this folder is to automate the creation of bash scripts (`.sh`) that submit jobs to the PBS queueing system. This approach allows for efficient parallelization, ensuring multiple forecasts are processed concurrently.

**Notebook Naming Convention:**
- **`qsub_STEP00_*.ipynb`** – Responsible for gathering forecast model data.
- **`qsub_STEP02_*.ipynb`** – Generates RMSE and ACC qsub scripts for model verification.

---

### How to Use These Notebooks

1. **Setup & Prerequisites**
Ensure the following prerequisites are met before running the notebooks:
- **Configured YAML file** (`verif_config.yml`) with correct paths, project codes, and environment settings.
- **Conda environment** activated (defined in the YAML under `conda_env`).
- Appropriate access to the cluster and necessary permissions for submitting jobs.

**Example Activation:**
```bash
conda activate credit
```

2. **Navigating the Workflow**
- Start by opening the `qsub_STEP00_jobs.ipynb` notebook to generate scripts for gathering forecast data.
- Follow by executing the `qsub_STEP02_*` notebooks for computing RMSE and ACC after the gather phase completes.

3. **Running the Notebooks**
- Execute cells sequentially within the notebook.
- Each notebook will output `.sh` scripts into the `./verification/qsub/` directory.

4. **Submitting QSUB Jobs**
Once scripts are generated, submit them to the cluster queue:
```bash
bash step00_gather_ForecastModel_all.sh
bash step02_RMSE_MF_all.sh
bash step02_ACC_MF_all.sh
```

---

### Notebook Breakdown – `qsub_STEP00_jobs.ipynb`

**Purpose:**
Generates qsub scripts that gather forecast data from various sources and format it for further verification.

**Key Sections:**
- **Config Loading:** Loads the YAML configuration to set paths, environment, and project-specific parameters.
- **Script Generation Loop:** Iterates over forecast indices (`INDs`) to create individual qsub scripts for each chunk of data.
- **Output:**
- Scripts are saved in the `qsub_loc` directory specified in the YAML.
- Example script: `verif_ZES_WX_001.sh`

**Critical Code Example:**
```python
# Open one qsub script per chunk of forecast indices
f = open('{}verif_ZES_WX_{:03d}.sh'.format(conf['qsub']['qsub_loc'], i), 'w')

heads = '''#!/bin/bash -l
#PBS -N ZES_MF
#PBS -A {}
#PBS -l walltime=23:59:59
#PBS -l select=1:ncpus=4:mem=32GB
#PBS -q casper
#PBS -o verif_ZES_MF.log
#PBS -e verif_ZES_MF.err
conda activate credit
cd {}
python STEP03_ZES_ModelForecast.py {} {}
'''.format(conf['qsub']['project_code'], conf['qsub']['scripts_loc'],
           ind_start, ind_end)

f.write(heads)
f.close()
```
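
The script-generation loop needs the `(ind_start, ind_end)` pairs for each chunk. A typical way to derive them from the total number of initializations (a sketch, not the notebook's exact code):

```python
def chunk_indices(n_total, chunk_size):
    """Split [0, n_total) into (start, end) pairs, one qsub script per pair."""
    return [(s, min(s + chunk_size, n_total)) for s in range(0, n_total, chunk_size)]

# e.g. 365 initializations processed in chunks of 100
print(chunk_indices(365, 100))  # [(0, 100), (100, 200), (200, 300), (300, 365)]
```

Smaller chunks mean more, shorter jobs; pick a chunk size that keeps each job comfortably within the walltime limit.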

---

### Keys to Running These Notebooks

- **Ensure Sequential Execution:**
- `STEP00` scripts **must** be submitted and completed **before** proceeding to `STEP02` scripts.
- Failure to adhere to this order will result in missing forecast data during RMSE/ACC calculations.
- **Directory Existence:**
- The directories where NetCDF files are saved (`save_loc_verif`) must exist.
- Use `os.makedirs(path, exist_ok=True)` to create directories if needed.
- **Cluster Specifics:**
- These scripts are optimized for NCAR’s Casper/Derecho clusters. Adjust for other HPC environments if necessary.

---

### Troubleshooting

- **Job Failures:**
- Review `.err` files in the qsub directory. These contain logs of job failures and error messages.
- **Missing Files:**
- If forecast files appear incomplete, re-run the gather phase (`STEP00`) without fear of overwriting existing valid files.
- **Memory/CPU Issues:**
- Adjust resource allocation by modifying `ncpus` and `mem` in the qsub script templates within the notebooks.
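
If jobs die from memory or CPU limits, the resource request line in already-generated scripts can also be bumped programmatically instead of regenerating everything by hand. A small sketch (the new values are examples only):

```python
import re

pbs_line = "#PBS -l select=1:ncpus=4:mem=32GB"

# Double the CPUs and memory for the retry
updated = re.sub(r"ncpus=\d+", "ncpus=8", pbs_line)
updated = re.sub(r"mem=\d+GB", "mem=64GB", updated)
print(updated)  # #PBS -l select=1:ncpus=8:mem=64GB
```

Applied over each `.sh` file in the qsub directory, this avoids re-running the notebooks just to change a resource request.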

---

### Final Notes

This workflow provides a scalable and efficient method for verifying forecast data on HPC clusters. While designed for internal projects, contributions are encouraged to improve performance, add metrics, or adapt for other clusters.

If you find gaps or areas that require clarification, feel free to submit issues or pull requests to enhance the repository.
2 changes: 1 addition & 1 deletion verification/verification/qsub_STEP00_jobs.ipynb
@@ -73,7 +73,7 @@
},
{
"cell_type": "code",
"execution_count": 18,
"id": "182df2fc-3d68-449c-a3b3-cd1914a6aa76",
"metadata": {},
"outputs": [],
@@ -5,7 +5,7 @@
The script runs with config file `verif_config_1h.yml` and produces netCDF4 files
with one file per initialization on a given range:
```
python STEP00_gather_ForecastModel.py 0 365
```
where 0 and 365 are the first and the last initialization.
Expand Down Expand Up @@ -45,13 +45,17 @@
verif_ind_end = int(args['verif_ind_end'])

# ==================== #
model_name = 'forecastmodel'
# ==================== #

variables_levels = conf[model_name]['verif_variables']

base_dir = conf[model_name]['save_loc_rollout']
output_dir = conf[model_name]['save_loc_gather']

# Ensure directory exists before saving the file
os.makedirs(os.path.dirname(output_dir), exist_ok=True)

time_intervals = None

# Get list of NetCDF files
@@ -41,7 +41,7 @@
verif_ind_end = int(args['verif_ind_end'])

# ====================== #
model_name = 'forecastmodel'
lead_range = conf[model_name]['lead_range']
verif_lead_range = conf[model_name]['verif_lead_range']

@@ -155,6 +155,9 @@ def sp_avg(DS, wlat):
# Combine ACC results
ds_acc = xr.concat(acc_results, dim='days')

# Ensure directory exists before saving the file
os.makedirs(os.path.dirname(path_verif), exist_ok=True)

# Save
print('Save to {}'.format(path_verif))
ds_acc.to_netcdf(path_verif)
11 changes: 5 additions & 6 deletions verification/verification/scripts/STEP02_RMSE_ForecastModel.py
@@ -40,7 +40,7 @@
verif_ind_end = int(args['verif_ind_end'])

# ====================== #
model_name = 'forecastmodel'
lead_range = conf[model_name]['lead_range']
verif_lead_range = conf[model_name]['verif_lead_range']

@@ -82,6 +82,7 @@
'SP': None,
't2m': None
}
variables_levels = conf['ERA5_ours']['verif_variables']

# subset merged ERA5 and unify coord names
levels = ds_ERA5_merge['level'].values
@@ -131,11 +132,9 @@
# Combine verif results
ds_verif = xr.concat(verif_results, dim='days')

# Ensure directory exists before saving the file
os.makedirs(os.path.dirname(path_verif), exist_ok=True)

# Save the combined dataset
print('Save to {}'.format(path_verif))
ds_verif.to_netcdf(path_verif, mode='w')





@@ -46,7 +46,7 @@
verif_ind_end = int(args['verif_ind_end'])

# ====================== #
model_name = 'forecastmodel'
lead_range = conf[model_name]['lead_range']
verif_lead_range = conf[model_name]['verif_lead_range']

@@ -34,9 +34,8 @@
verif_ind_start = int(args['verif_ind_start'])
verif_ind_end = int(args['verif_ind_end'])
# ====================== #
model_name = 'forecastmodel'
lead_range = conf[model_name]['lead_range']

leads_exist = list(np.arange(lead_range[0], lead_range[-1]+lead_range[0], lead_range[0]))
leads_verif = list(np.arange(verif_lead_range[0], verif_lead_range[-1]+verif_lead_range[0], verif_lead_range[0]))