diff --git a/README.md b/README.md index e9056ef..6575c53 100644 --- a/README.md +++ b/README.md @@ -1,18 +1,17 @@ -# DIYEPW -DIYEPW is a tool developed by Pacific Northwest National Laboratory that allows the quick and easy +# `diyepw` +`diyepw` is a tool developed by Pacific Northwest National Laboratory that allows the quick and easy generation of a set of EPW files for a given set of WMOs and years. It is provided as both a set -of scripts (https://github.com/IMMM-SFA/diyepw-scripts) and as a Python package (https://github.com/IMMM-SFA/diyepw). -This allows DIYEPW to be used as a command-line tool, or as a package to incorporate EPW file -generation into a custom script. +of scripts and as a Python package. This allows `diyepw` to be used as a command-line tool, or as a package to +incorporate EPW file generation into a custom script. # Getting Started -The DIYEPW Python package can be easily installed using PIP: +The `diyepw` Python package can be easily installed using pip: ``` pip install diyepw ``` -One you've installed the package, you can access any of the DIYEPW functions or classes by importing the package +Once you've installed the package, you can access any of the `diyepw` functions or classes by importing the package into your own Python scripts: ``` @@ -27,7 +26,7 @@ diyepw.create_amy_epw_files_for_years_and_wmos( ) ``` -# Using DIYEPW to generate AMY EPW files +# Using `diyepw` to generate AMY EPW files This package is a tool for the generation of AMY (actual meteorological year) EPW files, which is done by injecting AMY data into TMY (typical meteorological year) EPW files. The generated EPW files have the following fields replaced with observed data: @@ -38,14 +37,14 @@ have the following fields replaced with observed data: 1. wind direction 1.
wind speed -Because observed weather data commonly contains gaps, DIYEPW will attempt to fill in any such gaps to ensure that in +Because observed weather data commonly contains gaps, `diyepw` will attempt to fill in any such gaps to ensure that in each record a value is present for all of the hourly timestamps for the variables shown above. To do so, it will use one of two strategies to impute or interpolate values for any missing fields in the data set: #### Interpolation: Handling for small gaps Small gaps (by default up to 6 consecutive hours of consecutive missing data for a field), are handled by linear interpolation, so that for example if the dry bulb temperature has a gap with neighboring observed values like -(20, X, X, X, X, 25), DIYEPW will replace the missing values to give (20, 21, 22, 23, 24, 25). +(20, X, X, X, X, 25), `diyepw` will replace the missing values to give (20, 21, 22, 23, 24, 25). #### Imputation: Handling for large gaps Large gaps (by default up to 48 consecutive hours of missing data for a field) are filled using an imputation strategy @@ -61,9 +60,9 @@ missing values that can be imputed, can be changed from their defaults. The func `max_records_to_interpolate` and `max_records_to_impute`, which likewise override the defaults of 6 and 48. ## Package Functions -All of the functionality of the DIYEPW project is available as a set of functions that underlie the scripts +All of the functionality of the `diyepw` project is available as a set of functions that underlie the scripts described above. The functions offer much more granular access to the capabilities of the project, and allow -DIYEPW capabilites to be incorporated into other software projects. +`diyepw` capabilities to be incorporated into other software projects.
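The small-gap interpolation described above can be sketched with `pandas`, whose `interpolate()` method the scripts' own help text references. This is a hypothetical standalone illustration of the strategy, not the package's actual implementation:

```python
import pandas as pd

# The dry bulb temperature example from above: a 4-hour gap
# between observed values of 20 and 25.
temps = pd.Series([20.0, None, None, None, None, 25.0])

# Small gaps (up to max_records_to_interpolate consecutive missing
# values, 6 by default) are filled by linear interpolation.
filled = temps.interpolate(method="linear", limit=6)

print(filled.tolist())  # [20.0, 21.0, 22.0, 23.0, 24.0, 25.0]
```

Gaps longer than the interpolation limit are instead handled by the imputation strategy described above.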
The functions provided by the package are as follows: @@ -86,6 +85,107 @@ The classes provided by the package are as follows: For more detailed documentation of all parameters, return types, and behaviors of the above functions and classes, please refer to the in-code documentation that heads each function's definition in the package. +## Scripts +This section describes the scripts available as part of this project. The scripts will be available in the terminal or +virtual environment after running `pip install diyepw`. The scripts are located in the `diyepw/scripts/` directory. +Every script has a manual page that can be accessed by passing the "--help" option to the script. For example: + +``` +analyze_noaa_data --help +``` + +### Workflow 1: AMY EPW generation based on years and WMO indices +This workflow uses only a single script, `create_amy_epw_files_for_years_and_wmos.py`, and +generates AMY EPW files for a set of years and WMO indices. It accomplishes this by combining +TMY (typical meteorological year) EPW files with AMY (actual meteorological year) data. The +TMY EPW file for a given WMO is downloaded by the software as needed from energyplus.net. The +AMY data comes from NOAA ISD Lite files that are likewise downloaded as needed, from +ncdc.noaa.gov. + +This script can be called like this: + +``` +create_amy_epw_files_for_years_and_wmos --years=2010-2015 --wmo-indices=723403,722780 --output-path . +``` + +The options `--years` and `--wmo-indices` are required. You will be prompted for them if not provided in the arguments. +A number of other optional settings are also available.
All available options, their effects, and the values +they accept can be seen by calling this script with the `--help` option: + +``` +create_amy_epw_files_for_years_and_wmos --help +``` + +### Workflow 2: AMY EPW generation based on existing ISD Lite files +This workflow is very similar to Workflow 1, but instead of downloading NOAA's ISD Lite files +as needed, it reads in a set of ISD Lite files provided by the user and generates one AMY EPW +file corresponding to each. + +This workflow involves two steps: + +#### 1. analyze_noaa_data + +The script analyze_noaa_data.py will check a set of ISD Lite files against a set of requirements, +and generate a CSV file listing the ISD Lite files that are suitable for conversion to EPW. The +script is called like this: + +``` +analyze_noaa_data /path/to/your/inputs/directory --output-path . +``` + +The script will look for any file within the given input directory, searching recursively +through any subdirectories. The files must be named like +"999999-88888-2020.gz", where the first number is a WMO index and the final number is the +year - the middle number is ignored. The easiest way to get files that are suitable for use +with this script is to download them from NOAA's catalog at +https://www1.ncdc.noaa.gov/pub/data/noaa/isd-lite/. + +The ".gz" (gzip compressed) format of the ISD Lite files is the format provided by NOAA, +but is not required. You may also provide ISD Lite files in CSV (.csv) format, or in a +different compression format like ZIP (.zip). The file extension is used to determine what +format the file is and must match the file's format. Pass the `--help` option +(`analyze_noaa_data --help`) for more information on which compression formats are supported. + +The script primarily checks that the ISD Lite files conform to the following limits: + +1. Total number of rows missing +1.
Maximum number of consecutive rows missing + +and will produce the following files (as applicable) under the specified `--output-path`: + +1. `missing_total_entries_high.csv`: A list of files where the total number of rows missing exceeds a threshold. + The threshold is set to rule out files where more than 700 (out of 8760 total) entries are missing entirely + by default, but a custom value can be set with the --max-missing-rows option: + + ``` + analyze_noaa_data --max-missing-rows=700 + ``` + +1. `missing_consec_entries_high.csv`: A list of files where the maximum consecutive number of rows missing exceeds + a threshold. The threshold is currently set to a maximum of 48 consecutive empty rows, but a custom value can + be set with the --max-consecutive-missing-rows option: + + ``` + analyze_noaa_data --max-consecutive-missing-rows=48 + ``` + +1. `files_to_convert.csv`: A list of the files that are deemed to be usable because they are missing neither too many + total rows nor too many consecutive rows. This file determines which EPWs will be generated by the next script, and + it can be freely edited before running that script. + +#### 2. create_amy_epw_files + +The script create_amy_epw_files.py reads the files_to_convert.csv file generated in the previous step, and for each +ISD Lite file listed, generates an AMY EPW file. It can be called like this: + +``` +create_amy_epw_files --max-records-to-interpolate=6 --max-records-to-impute=48 +``` + +Both `--max-records-to-interpolate` and `--max-records-to-impute` are optional and can be used to override the +default size of the gaps that can be filled in observed data using the two strategies, which are described in more +detail at the top of this document.
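The ISD Lite filename convention used throughout these workflows ("999999-88888-2020.gz", with the middle station number ignored) can be parsed with a short sketch like the following; `parse_isd_lite_filename` is a hypothetical helper shown for illustration, not part of the `diyepw` API:

```python
import os

def parse_isd_lite_filename(path):
    # Hypothetical helper: extract the WMO index and year from an
    # ISD Lite filename like "999999-88888-2020.gz". The middle
    # (station) number is ignored, as described above.
    stem = os.path.basename(path).split(".")[0]  # strip .gz/.csv/.zip
    wmo_index, _station, year = stem.split("-")
    return int(wmo_index), int(year)

print(parse_isd_lite_filename("inputs/2020/999999-88888-2020.gz"))  # (999999, 2020)
```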
+ ## Reading in TMY3 files and writing EPW files Functions for reading TMY3 files and writing EPW files within this script were adapted from the [LAF.py script](https://github.com/SSESLab/laf/blob/master/LAF.py) by Carlo Bianchi at the Site-Specific diff --git a/diyepw/__init__.py b/diyepw/__init__.py index 36e373a..d5d6ecf 100644 --- a/diyepw/__init__.py +++ b/diyepw/__init__.py @@ -1,4 +1,4 @@ -__version__ = '1.0.5' +__version__ = '1.1.0' from .meteorology import Meteorology from .create_amy_epw_files_for_years_and_wmos import create_amy_epw_files_for_years_and_wmos from .analyze_noaa_isd_lite_files import analyze_noaa_isd_lite_files diff --git a/diyepw/analyze_noaa_isd_lite_file.py b/diyepw/analyze_noaa_isd_lite_file.py index 40e2cca..0a42824 100644 --- a/diyepw/analyze_noaa_isd_lite_file.py +++ b/diyepw/analyze_noaa_isd_lite_file.py @@ -2,7 +2,7 @@ def analyze_noaa_isd_lite_file( file: str, - compression:str='infer' + compression: str='infer' ): """ Performs an analysis of a single NOAA ISD Lite file, determining whether it is suitable for conversion into an AMY diff --git a/diyepw/scripts/analyze_noaa_data.py b/diyepw/scripts/analyze_noaa_data.py new file mode 100644 index 0000000..e28adeb --- /dev/null +++ b/diyepw/scripts/analyze_noaa_data.py @@ -0,0 +1,102 @@ +import click +import diyepw +from glob import iglob +import os +import pandas as pd + + +@click.command() +@click.option( + '--max-missing-rows', + default=700, + show_default=True, + type=int, + help='ISD files with more than this number of missing rows will be excluded from the output' +) +@click.option( + '--max-consecutive-missing-rows', + default=48, + show_default=True, + type=int, + help='ISD files with more than this number of consecutive missing rows will be excluded from the output' +) +@click.option( + '-o', '--output-path', + default='.', + type=click.Path( + file_okay=False, + dir_okay=True, + writable=True, + resolve_path=True, + ), + help="""The path to which output and error files 
should be written.""" ) +@click.argument( + 'input_path', + default='.', + type=click.Path( + file_okay=False, + dir_okay=True, + readable=True, + resolve_path=True, + ), +) +def analyze_noaa_data( + max_missing_rows, + max_consecutive_missing_rows, + output_path, + input_path, +): + """Perform an analysis of a set of NOAA ISD Lite files, determining which are suitable for conversion to + AMY EPW files. Any ISD Lite files in INPUT_PATH or any of its subdirectories will be processed. The files + must be named according to the format '<wmo>-<station>-<year>' and must end with '.gz', '.csv', or '.zip'.""" + + # Make a directory to store results if it doesn't already exist. + if not os.path.exists(output_path): + os.makedirs(output_path) + + # Recursively search for all files under the passed path, excluding directories + input_files = [file for file in iglob(input_path + '/**/*', recursive=True) if not os.path.isdir(file)] + + try: + analysis_results = diyepw.analyze_noaa_isd_lite_files( + input_files, + max_missing_rows=max_missing_rows, + max_consecutive_missing_rows=max_consecutive_missing_rows, + ) + except Exception: + click.echo("Unable to read input files, aborting...") + raise click.Abort() + + # Write the dataframes to CSVs for the output files. + num_files_with_too_many_rows_missing = len(analysis_results['too_many_total_rows_missing']) + if num_files_with_too_many_rows_missing > 0: + path = os.path.join(output_path, 'missing_total_entries_high.csv') + path = os.path.abspath(path) # Change to absolute path for readability + click.echo(f"""{num_files_with_too_many_rows_missing} + records excluded because they were missing more than {max_missing_rows} + rows.
Information about these files will be written to {path}.""") + pd.DataFrame(analysis_results['too_many_total_rows_missing']).to_csv(path, index=False) + + num_files_with_too_many_consec_rows_missing = len(analysis_results['too_many_consecutive_rows_missing']) + if num_files_with_too_many_consec_rows_missing > 0: + path = os.path.join(output_path, 'missing_consec_entries_high.csv') + path = os.path.abspath(path) # Change to absolute path for readability + click.echo(f"""{num_files_with_too_many_consec_rows_missing} + records excluded because they were missing more than {max_consecutive_missing_rows} + consecutive rows. Information about these files will be written to {path}.""") + pd.DataFrame(analysis_results['too_many_consecutive_rows_missing']).to_csv(path, index=False) + + num_good_files = len(analysis_results['good']) + if num_good_files > 0: + path = os.path.join(output_path, 'files_to_convert.csv') + path = os.path.abspath(path) # Change to absolute path for readability + click.echo(f"""{num_good_files} records are complete enough to be processed. + Information about these files will be written to {path}.""") + pd.DataFrame(analysis_results['good']).to_csv(path, index=False) + + click.echo('Done! {count} files processed.'.format(count=sum([ + num_good_files, + num_files_with_too_many_consec_rows_missing, + num_files_with_too_many_rows_missing + ]))) diff --git a/diyepw/scripts/create_amy_epw_files.py b/diyepw/scripts/create_amy_epw_files.py new file mode 100644 index 0000000..9bffbc5 --- /dev/null +++ b/diyepw/scripts/create_amy_epw_files.py @@ -0,0 +1,121 @@ +import click +import diyepw +import os +import pandas as pd + + +@click.command() +@click.option( + '--max-records-to-interpolate', + default=6, + show_default=True, + type=int, + help="""The maximum number of consecutive records to interpolate. See the documentation of the + pandas.DataFrame.interpolate() method's "limit" argument for more details. 
Basically, + if a sequence of fields up to the length defined by this argument is missing, those + missing values will be interpolated linearly using the values of the fields immediately + preceding and following the missing field(s). If a sequence of fields is longer than this + limit, then those fields' values will be imputed instead (see --max-records-to-impute). + """ ) +@click.option( + '--max-records-to-impute', + default=48, + show_default=True, + type=int, + help="""The maximum number of records to impute. For groups of missing records larger than the + limit set by --max-records-to-interpolate but up to --max-records-to-impute, we replace the + missing values using the average of the value two weeks prior and the value two weeks after + the missing value. If there are more consecutive missing records than this limit, then the + file will not be processed, and will be added to the error file.""" ) +@click.option( + '-o', '--output-path', + default='.', + type=click.Path( + file_okay=False, + dir_okay=True, + writable=True, + resolve_path=True, + ), + help="""The path to which output and error files should be written.""" ) +@click.argument( + 'path_to_station_list', + default='./files_to_convert.csv', + type=click.Path( + exists=True, + file_okay=True, + dir_okay=False, + readable=True, + resolve_path=True, + ), +) +def create_amy_epw_files( + max_records_to_interpolate, + max_records_to_impute, + output_path, + path_to_station_list, +): + """Generate EPW files based on the PATH_TO_STATION_LIST as generated by analyze_noaa_data.py, which must be called + prior to this script. The generated files will be written to the designated --output-path. A list of any files + that could not be generated due to validation or other errors will be written to errors.csv.""" + + # Set path to outputs produced by this script.
+ if not os.path.exists(output_path): + os.makedirs(output_path) + + # Set path to the files where errors should be written + errors_path = os.path.join(output_path, 'errors.csv') + + # Ensure that the errors file is truncated + with open(errors_path, 'w'): + pass + + # Read in list of AMY files that should be used to create EPW files. + amy_file_list = pd.read_csv(path_to_station_list) + amy_file_list = amy_file_list[amy_file_list.columns[0]] + + # Initialize the DataFrame to hold paths of AMY files that could not be converted to an EPW. + errors = pd.DataFrame(columns=['file', 'error']) + + num_files = len(amy_file_list) + for idx, amy_file_path in enumerate(amy_file_list, start=1): + # The NOAA ISD Lite AMY files are stored in directories named the same as the year they describe, so we + # use that directory name to get the year + amy_file_dir = os.path.dirname(amy_file_path) + year = int(amy_file_dir.split(os.path.sep)[-1]) + next_year = year + 1 + + # To get the WMO, we have to parse it out of the filename: it's the portion prior to the first hyphen + wmo_index = int(os.path.basename(amy_file_path).split('-')[0]) + + # Our NOAA ISD Lite input files are organized under inputs/NOAA_ISD_Lite_Raw/ in directories named after their + # years, and the files are named identically (<wmo>-<###>-<year>.gz), so we can get the path to the subsequent + # year's file by switching directories and swapping the year in the file name. + s = os.path.sep + amy_subsequent_year_file_path = amy_file_path.replace(s + str(year) + s, s + str(next_year) + s)\ + .replace(f'-{year}.gz', f'-{next_year}.gz') + try: + amy_epw_file_path = diyepw.create_amy_epw_file( + wmo_index=wmo_index, + year=year, + max_records_to_impute=max_records_to_impute, + max_records_to_interpolate=max_records_to_interpolate, + amy_epw_dir=output_path, + amy_files=(amy_file_path, amy_subsequent_year_file_path), + allow_downloads=True, + ) + + click.echo(f"Success!
{os.path.basename(amy_file_path)} => {os.path.basename(amy_epw_file_path)} ({idx} / {num_files})") + except Exception as e: + errors = pd.concat([errors, pd.DataFrame([{"file": amy_file_path, "error": str(e)}])], ignore_index=True) + click.echo(f"\n*** Error! {amy_file_path} could not be processed, see {errors_path} for details ({idx} / {num_files})\n") + + click.echo("\nDone!") + + if not errors.empty: + click.echo(f"{len(errors)} files encountered errors - see {errors_path} for more information") + errors.to_csv(errors_path, mode='w', index=False) + + click.echo(f"{num_files - len(errors)} files successfully processed. EPWs were written to {output_path}.") diff --git a/diyepw/scripts/create_amy_epw_files_for_years_and_wmos.py b/diyepw/scripts/create_amy_epw_files_for_years_and_wmos.py new file mode 100644 index 0000000..ba9d71f --- /dev/null +++ b/diyepw/scripts/create_amy_epw_files_for_years_and_wmos.py @@ -0,0 +1,137 @@ +import click +import datetime +import diyepw +import os +from typing import List + + +@click.command() +@click.option( + '-y', '--years', + type=str, + prompt='Which years? (e.g. 2000,2003,2006 or 2000-2005)', + help="""The years for which to generate AMY EPW files. This is a comma-separated list that can + include individual years (--years=2000,2003,2006), a range (--years=2000-2005), or + a combination of both (--years=2000,2003-2005,2007)""" ) +@click.option( + '-w', '--wmo-indices', + type=str, + prompt='Which WMOs? (e.g. 724940,724300)', + help="""The WMO indices (weather station IDs) for which to generate AMY EPW files. This is a + comma-separated list (--wmo-indices=724940,724300). Note that currently only WMO + indices beginning with 7 (North America) are supported.""", ) +@click.option( + '--max-records-to-interpolate', + default=6, + type=int, + show_default=True, + help="""The maximum number of consecutive records to interpolate. See the documentation of the + pandas.DataFrame.interpolate() method's "limit" argument for more details.
Basically, + if a sequence of fields up to the length defined by this argument is missing, those + missing values will be interpolated linearly using the values of the fields immediately + preceding and following the missing field(s). If a sequence of fields is longer than this + limit, then those fields' values will be imputed instead (see --max-records-to-impute). + """ ) +@click.option( + '--max-records-to-impute', + default=48, + show_default=True, + type=int, + help="""The maximum number of records to impute. For groups of missing records larger than the + limit set by --max-records-to-interpolate but up to --max-records-to-impute, we replace the + missing values using the average of the value two weeks prior and the value two weeks after + the missing value. If there are more consecutive missing records than this limit, then the + file will not be processed, and will be added to the error file.""" ) +@click.option( + '--max-missing-amy-rows', + default=700, + show_default=True, + type=int, + help="""The AMY files corresponding to each requested WMO/year combination will be checked against + this maximum - any file that is missing more than this number of total rows will not be + generated. Instead, an entry will + be added to the error file.""" ) +@click.option( + '-o', '--output-path', + default='.', + type=click.Path( + file_okay=False, + dir_okay=True, + writable=True, + resolve_path=True, + ), + help="""The path to which output and error files should be written.""" ) +def create_amy_epw_files_for_years_and_wmos( + years, + wmo_indices, + max_records_to_interpolate, + max_records_to_impute, + max_missing_amy_rows, + output_path, +): + """Generate AMY EPW files for a set of years and WMO indices. The generated files will be written to + the designated --output-path.
A list of any files that could not be generated due to validation or other errors + will be written to epw_validation_errors.csv and errors.csv.""" + + # Set path to the directory we'll write created AMY EPW files to. + if not os.path.exists(output_path): + os.makedirs(output_path) + + # Set path to the files where errors should be written + epw_file_violations_path = os.path.join(output_path, 'epw_validation_errors.csv') + errors_path = os.path.join(output_path, 'errors.csv') + + diyepw.create_amy_epw_files_for_years_and_wmos( + years=get_years_list(years), + wmo_indices=get_wmo_indices_list(wmo_indices), + max_records_to_impute=max_records_to_impute, + max_records_to_interpolate=max_records_to_interpolate, + max_missing_amy_rows=max_missing_amy_rows, + amy_epw_dir=output_path, + allow_downloads=True, + ) + + +def get_years_list(years_str: str) -> List[int]: + """ + Transform the years argument string, which can be formatted like "2000, 2001, 2005-2010" (individual years or + ranges, comma separated, not necessarily sorted, with optional spaces), into a sorted list of integers + """ + # Transform the years argument from a string like "2000,2005-2010" to a sorted list + years_list = [] + years_str = years_str.replace(" ", "") # Ignore any spaces + for year_arg_part in years_str.split(","): # We'll process each comma-separated entry in the list of years + if "-" in year_arg_part: # If there is a hyphen, then it's a range like "2000-2010" + start_year, end_year = year_arg_part.split("-") + years_list += range(int(start_year), int(end_year) + 1) + else: # If there is no hyphen, it's just a single year + years_list.append(int(year_arg_part)) + years_list.sort() + + # Validate that the years are between 1900 and the present + this_year = datetime.datetime.now().year + if min(years_list) < 1900 or max(years_list) > this_year: + raise ValueError(f"Years must be in the range 1900-{this_year}") + + return years_list + + +def get_wmo_indices_list(wmo_indices_str: str) -> List[int]: + """ +
Transforms the wmo-indices argument, which should be a comma-separated list of WMO indices, into a sorted list of + integers + :param wmo_indices_str: A comma-separated list of WMO indices, e.g. "724940,724300" + :return: A sorted list of the WMO indices as integers + """ + wmo_indices_str = wmo_indices_str.replace(" ", "") # Ignore any spaces + + # Split on ",", convert each value to an integer, and sort + wmo_indices_list = sorted(int(wmo_index) for wmo_index in wmo_indices_str.split(",")) + + return wmo_indices_list diff --git a/setup.py b/setup.py index d0dc85f..75cb63b 100644 --- a/setup.py +++ b/setup.py @@ -26,15 +26,26 @@ def readme(): license='BSD 2-Clause', python_requires='~=3.7', install_requires=[ - 'xarray~=0.16.2', + 'click~=8.0.1', 'numpy~=1.19.2', 'pvlib~=0.8.1', + 'xarray~=0.16.2', ], extras_require={ 'dev': [ + 'build~=0.5.1', + 'twine~=3.4.1', 'recommonmark~=0.7.1', + 'setuptools~=57.0.0', 'sphinx~=3.5.1', - 'sphinx-rtd-theme~=0.5.1' + 'sphinx-rtd-theme~=0.5.1', ] - } + }, + entry_points={ + 'console_scripts': [ + 'analyze_noaa_data = diyepw.scripts.analyze_noaa_data:analyze_noaa_data', + 'create_amy_epw_files = diyepw.scripts.create_amy_epw_files:create_amy_epw_files', + 'create_amy_epw_files_for_years_and_wmos = diyepw.scripts.create_amy_epw_files_for_years_and_wmos:create_amy_epw_files_for_years_and_wmos', + ], + }, )
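The two-week averaging imputation described in the `--max-records-to-impute` help text can be sketched as follows. This is a hypothetical standalone illustration of the strategy, not the package's internal code:

```python
import pandas as pd

HOURS_IN_TWO_WEEKS = 14 * 24  # 336 hourly records

def impute_large_gaps(series):
    """Replace each missing hourly value with the average of the values
    two weeks before and two weeks after it, where both are present."""
    result = series.copy()
    for i in series[series.isna()].index:
        before = series.get(i - HOURS_IN_TWO_WEEKS)
        after = series.get(i + HOURS_IN_TWO_WEEKS)
        if pd.notna(before) and pd.notna(after):
            result.loc[i] = (before + after) / 2
    return result

# A year's worth of hourly values with one missing record at hour 500:
temps = pd.Series(range(8760), dtype=float)
temps[500] = float("nan")
filled = impute_large_gaps(temps)
print(filled[500])  # 500.0, the mean of hours 164 and 836
```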