merge diyepw-scripts into diyepw, using click to enable cli functionality; fixes #51
thurber committed Jul 13, 2021
1 parent 99c8459 commit ce48083
Showing 7 changed files with 488 additions and 17 deletions.
124 changes: 112 additions & 12 deletions README.md
@@ -1,18 +1,17 @@
# DIYEPW
DIYEPW is a tool developed by Pacific Northwest National Laboratory that allows the quick and easy
# `diyepw`
`diyepw` is a tool developed by Pacific Northwest National Laboratory that allows the quick and easy
generation of a set of EPW files for a given set of WMOs and years. It is provided as both a set
of scripts (https://github.com/IMMM-SFA/diyepw-scripts) and as a Python package (https://github.com/IMMM-SFA/diyepw).
This allows DIYEPW to be used as a command-line tool, or as a package to incorporate EPW file
generation into a custom script.
of scripts and as a Python package. This allows `diyepw` to be used as a command-line tool, or as a package to
incorporate EPW file generation into a custom script.

# Getting Started
The DIYEPW Python package can be easily installed using PIP:
The `diyepw` Python package can be easily installed with `pip`:

```
pip install diyepw
```

One you've installed the package, you can access any of the DIYEPW functions or classes by importing the package
Once you've installed the package, you can access any of the `diyepw` functions or classes by importing the package
into your own Python scripts:

```
@@ -27,7 +26,7 @@ diyepw.create_amy_epw_files_for_years_and_wmos(
)
```

# Using DIYEPW to generate AMY EPW files
# Using `diyepw` to generate AMY EPW files
This package is a tool for the generation of AMY (actual meteorological year) EPW files, which is done
by injecting AMY data into TMY (typical meteorological year) EPW files. The generated EPW files
have the following fields replaced with observed data:
@@ -38,14 +37,14 @@ have the following fields replaced with observed data:
1. wind direction
1. wind speed

Because observed weather data commonly contains gaps, DIYEPW will attempt to fill in any such gaps to ensure that in
Because observed weather data commonly contains gaps, `diyepw` will attempt to fill in any such gaps to ensure that in
every hourly record a value is present for each of the variables shown above. To do so, it will use one
of two strategies to interpolate or impute values for any missing fields in the data set:

#### Interpolation: Handling for small gaps
Small gaps (by default, up to 6 consecutive hours of missing data for a field) are handled by linear
interpolation so that, for example, if the dry bulb temperature has a gap with neighboring observed values like
(20, X, X, X, X, 25), DIYEPW will replace the missing values to give (20, 21, 22, 23, 24, 25).
(20, X, X, X, X, 25), `diyepw` will replace the missing values to give (20, 21, 22, 23, 24, 25).
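
The small-gap strategy can be sketched with pandas. This is only an illustration of linear interpolation as described above, not `diyepw`'s actual implementation:

```python
import numpy as np
import pandas as pd

# The dry bulb example from above: a small gap (20, X, X, X, X, 25)
temps = pd.Series([20.0, np.nan, np.nan, np.nan, np.nan, 25.0])

# Linear interpolation fills the gap with evenly spaced values
filled = temps.interpolate(method='linear')
print(filled.tolist())  # [20.0, 21.0, 22.0, 23.0, 24.0, 25.0]
```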

#### Imputation: Handling for large gaps
Large gaps (by default up to 48 consecutive hours of missing data for a field) are filled using an imputation strategy
@@ -61,9 +60,9 @@ missing values that can be imputed, can be changed from their defaults. The func
`max_records_to_interpolate` and `max_records_to_impute`, which likewise override the defaults of 6 and 48.

## Package Functions
All of the functionality of the DIYEPW project is available as a set of functions that underlie the scripts
All of the functionality of the `diyepw` project is available as a set of functions that underlie the scripts
described above. The functions offer much more granular access to the capabilities of the project, and allow
DIYEPW capabilites to be incorporated into other software projects.
`diyepw` capabilities to be incorporated into other software projects.

The functions provided by the package are as follows:

@@ -86,6 +85,107 @@ The classes provided by the package are as follows:
For more detailed documentation of all parameters, return types, and behaviors of the above functions and classes,
please refer to the in-code documentation that heads each function's definition in the package.

## Scripts
This section describes the scripts available as part of this project. The scripts will be available in the terminal or
virtual environment after running `pip install diyepw`. The scripts are located in the `diyepw/scripts/` directory.
Every script has a manual page that can be accessed by passing the `--help` option to the script. For example:

```
analyze_noaa_data --help
```

### Workflow 1: AMY EPW generation based on years and WMO indices
This workflow uses only a single script, `create_amy_epw_files_for_years_and_wmos.py`, and
generates AMY EPW files for a set of years and WMO indices. It accomplishes this by combining
TMY (typical meteorological year) EPW files with AMY (actual meteorological year) data. The
TMY EPW file for a given WMO is downloaded by the software as needed from energyplus.net. The
AMY data comes from NOAA ISD Lite files that are likewise downloaded as needed, from
ncdc.noaa.gov.

This script can be called like this:

```
create_amy_epw_files_for_years_and_wmos --years=2010-2015 --wmo-indices=723403,722780 --output-path .
```

The options `--years` and `--wmo-indices` are required; you will be prompted for them if they are not provided as
arguments. A number of other optional settings are available as well. All options, their effects, and the values
they accept can be seen by calling this script with the `--help` option:

```
create_amy_epw_files_for_years_and_wmos --help
```

### Workflow 2: AMY EPW generation based on existing ISD Lite files
This workflow is very similar to Workflow 1, but instead of downloading NOAA's ISD Lite files
as needed, it reads in a set of ISD Lite files provided by the user and generates one AMY EPW
file corresponding to each.

This workflow involves two steps:

#### 1. analyze_noaa_data

The script `analyze_noaa_data.py` will check a set of ISD Lite files against a set of requirements,
and generate a CSV file listing the ISD Lite files that are suitable for conversion to EPW. The
script is called like this:

```
analyze_noaa_data --inputs=/path/to/your/inputs/directory --output-path .
```

The script will look for any file within the directory passed to `--inputs`, searching recursively through all
subdirectories. The files must be named like
"999999-88888-2020.gz", where the first number is a WMO index and the final number is the
year; the middle number (a WBAN identifier) is ignored. The easiest way to get files that are suitable for use
with this script is to download them from NOAA's catalog at
https://www1.ncdc.noaa.gov/pub/data/noaa/isd-lite/.
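
As an illustration, the WMO index and year can be recovered from such a file name with a short regular expression. The helper below is hypothetical (not part of `diyepw`):

```python
import re

# ISD Lite file names look like "<WMO index>-<WBAN>-<year>[.extension]"
ISD_LITE_NAME = re.compile(r'^(\d+)-(\d+)-(\d{4})(?:\.\w+)*$')

def parse_isd_lite_name(filename: str):
    """Return (wmo_index, year) from an ISD Lite file name; the WBAN is ignored."""
    match = ISD_LITE_NAME.match(filename)
    if match is None:
        raise ValueError(f"Not an ISD Lite file name: {filename}")
    wmo_index, _wban, year = match.groups()
    return wmo_index, int(year)

print(parse_isd_lite_name("999999-88888-2020.gz"))  # ('999999', 2020)
```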

The ".gz" (gzip compressed) format of the ISD Lite files is the format provided by NOAA,
but is not required. You may also provide ISD Lite files in CSV (.csv) format, or in a
different compression format like ZIP (.zip). The file extension is used to determine the
file's format, and must match the file's actual format. Pass the `--help` option
(`analyze_noaa_data --help`) for more information on which compression formats are supported.
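
To illustrate why the extension must match the format, here is a sketch using pandas, which infers the compression from the file extension. The 12-column layout and the reader shown are assumptions for illustration, not `diyepw`'s actual code:

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a one-row, gzip-compressed sample in the whitespace-delimited
# ISD Lite style (12 fields per row is assumed here)
sample = b"2020 01 01 00  -50  -83 10207 230  46 -9999 -9999 -9999\n"
path = os.path.join(tempfile.mkdtemp(), "722780-23183-2020.gz")
with gzip.open(path, "wb") as f:
    f.write(sample)

# pandas picks gzip decompression from the '.gz' extension; a mismatched
# extension would cause the read to fail or produce garbage
df = pd.read_csv(path, sep=r"\s+", header=None, compression="infer")
print(df.shape)  # (1, 12)
```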

The script primarily checks that the ISD Lite files conform to the following limits:

1. Total number of rows missing
1. Maximum number of consecutive rows missing

and will produce the following files (as applicable) under the specified `--output-path`:

1. `missing_total_entries_high.csv`: A list of files where the total number of rows missing exceeds a threshold.
By default, the threshold rules out files in which more than 700 (out of 8760 total) entries are missing
entirely, but a custom value can be set with the `--max-missing-rows` option:

```
analyze_noaa_data --max-missing-rows=700
```
1. `missing_consec_entries_high.csv`: A list of files where the maximum consecutive number of rows missing exceeds
a threshold. The threshold is currently set to a maximum of 48 consecutive empty rows, but a custom value can
be set with the `--max-consecutive-missing-rows` option:
```
analyze_noaa_data --max-consecutive-missing-rows=48
```
1. `files_to_convert.csv`: A list of the files that are deemed to be usable because they are neither missing too many
total nor too many consecutive rows. This file determines which EPWs will be generated by the next script, and
it can be freely edited before running that script.
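
The two checks described above can be sketched as follows. This is an illustration only, not `diyepw`'s actual implementation:

```python
import numpy as np
import pandas as pd

def missing_row_stats(series: pd.Series):
    """Return (total missing rows, longest run of consecutive missing rows)."""
    missing = series.isna()
    total_missing = int(missing.sum())
    # Label each run of identical values, then measure the longest missing run
    runs = (missing != missing.shift()).cumsum()
    longest_gap = int(missing.groupby(runs).sum().max()) if total_missing else 0
    return total_missing, longest_gap

# Three rows missing in total; the longest consecutive gap is two rows
print(missing_row_stats(pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])))  # (3, 2)
```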
#### 2. create_amy_epw_files

The script `create_amy_epw_files.py` reads the `files_to_convert.csv` file generated in the previous step and, for
each ISD Lite file listed, generates an AMY EPW file. It can be called like this:

```
create_amy_epw_files --max-records-to-interpolate=6 --max-records-to-impute=48
```

Both `--max-records-to-interpolate` and `--max-records-to-impute` are optional and can be used to override the
default size of the gaps that can be filled in observed data using the two strategies described in more
detail at the top of this document.

## Reading in TMY3 files and writing EPW files
Functions for reading TMY3 files and writing EPW files within this script were adapted from the
[LAF.py script](https://github.com/SSESLab/laf/blob/master/LAF.py) by Carlo Bianchi at the Site-Specific
2 changes: 1 addition & 1 deletion diyepw/__init__.py
@@ -1,4 +1,4 @@
__version__ = '1.0.5'
__version__ = '1.1.0'
from .meteorology import Meteorology
from .create_amy_epw_files_for_years_and_wmos import create_amy_epw_files_for_years_and_wmos
from .analyze_noaa_isd_lite_files import analyze_noaa_isd_lite_files
2 changes: 1 addition & 1 deletion diyepw/analyze_noaa_isd_lite_file.py
@@ -2,7 +2,7 @@

def analyze_noaa_isd_lite_file(
    file: str,
    compression:str='infer'
    compression: str='infer'
):
"""
Performs an analysis of a single NOAA ISD Lite file, determining whether it is suitable for conversion into an AMY
102 changes: 102 additions & 0 deletions diyepw/scripts/analyze_noaa_data.py
@@ -0,0 +1,102 @@
import click
import diyepw
from glob import iglob
import os
import pandas as pd


@click.command()
@click.option(
    '--max-missing-rows',
    default=700,
    show_default=True,
    type=int,
    help='ISD files with more than this number of missing rows will be excluded from the output'
)
@click.option(
    '--max-consecutive-missing-rows',
    default=48,
    show_default=True,
    type=int,
    help='ISD files with more than this number of consecutive missing rows will be excluded from the output'
)
@click.option(
    '-o', '--output-path',
    default='.',
    type=click.Path(
        file_okay=False,
        dir_okay=True,
        writable=True,
        resolve_path=True,
    ),
    help="""The path to which output and error files should be written."""
)
@click.argument(
    'input_path',
    default='.',
    type=click.Path(
        file_okay=False,
        dir_okay=True,
        readable=True,
        resolve_path=True,
    ),
)
def analyze_noaa_data(
    max_missing_rows,
    max_consecutive_missing_rows,
    output_path,
    input_path,
):
    """Perform an analysis of a set of NOAA ISD Lite files, determining which are suitable for conversion to
    AMY EPW files. Any ISD Lite files in INPUT_PATH or any of its subdirectories will be processed. The files
    must be named according to the format '<WMO Index>-<WBAN>-<Year>' and must end with '.gz', '.csv', or '.zip'."""

    # Make a directory to store results if it doesn't already exist.
    if not os.path.exists(output_path):
        os.makedirs(output_path)

    # Recursively search for all files under the passed path, excluding directories
    input_files = [file for file in iglob(input_path + '/**/*', recursive=True) if not os.path.isdir(file)]

    try:
        analysis_results = diyepw.analyze_noaa_isd_lite_files(
            input_files,
            max_missing_rows=max_missing_rows,
            max_consecutive_missing_rows=max_consecutive_missing_rows,
        )
    except Exception:
        click.echo("Unable to read input files, aborting...")
        raise click.Abort()

    # Write the dataframes to CSVs for the output files.
    num_files_with_too_many_rows_missing = len(analysis_results['too_many_total_rows_missing'])
    if num_files_with_too_many_rows_missing > 0:
        path = os.path.join(output_path, 'missing_total_entries_high.csv')
        path = os.path.abspath(path)  # Change to absolute path for readability
        click.echo(f"""{num_files_with_too_many_rows_missing}
records excluded because they were missing more than {max_missing_rows}
rows. Information about these files will be written to {path}.""")
        pd.DataFrame(analysis_results['too_many_total_rows_missing']).to_csv(path, index=False)

    num_files_with_too_many_consec_rows_missing = len(analysis_results['too_many_consecutive_rows_missing'])
    if num_files_with_too_many_consec_rows_missing > 0:
        path = os.path.join(output_path, 'missing_consec_entries_high.csv')
        path = os.path.abspath(path)  # Change to absolute path for readability
        click.echo(f"""{num_files_with_too_many_consec_rows_missing}
records excluded because they were missing more than {max_consecutive_missing_rows}
consecutive rows. Information about these files will be written to {path}.""")
        pd.DataFrame(analysis_results['too_many_consecutive_rows_missing']).to_csv(path, index=False)

    num_good_files = len(analysis_results['good'])
    if num_good_files > 0:
        path = os.path.join(output_path, 'files_to_convert.csv')
        path = os.path.abspath(path)  # Change to absolute path for readability
        click.echo(f"""{num_good_files} records are complete enough to be processed.
Information about these files will be written to {path}.""")
        pd.DataFrame(analysis_results['good']).to_csv(path, index=False)

    click.echo('Done! {count} files processed.'.format(count=sum([
        num_good_files,
        num_files_with_too_many_consec_rows_missing,
        num_files_with_too_many_rows_missing
    ])))
