Redesign NWM Client Subpackage #138

Merged · 102 commits · Oct 27, 2021
Conversation

@jarq6c (Collaborator) commented Oct 1, 2021

This is a significant refactor and redesign of nwm_client. It implements all existing functionality and adds compatibility with Google Cloud Platform and generic HTTP servers like NOMADS. The design is so different that I've added it to the repository as a new subpackage, nwm_client_new. The plan is to transition from the old nwm_client to this one after some further testing "in the field". The new package adopts a more component-based design spread across five modules:

NWMClient: Top level interface responsible for corralling the other four components.
NWMFileCatalog: Interfaces to GCP and HTTP servers used to discover files based on simple queries.
FileDownloader: Asynchronous file downloader.
NWMFileProcessor: Processes raw NetCDF files to datasets and dataframes.
ParquetCache: Implements a parquet version of HDFCache to store processed dataframes.
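To make the intended composition concrete, here is a minimal, dependency-free sketch of how the five components fit together. All class and method names below are illustrative stand-ins, not the actual nwm_client_new API:

```python
from dataclasses import dataclass, field

class FileCatalog:
    """Discovers files for a model cycle (stand-in for NWMFileCatalog)."""
    def list_blobs(self, configuration: str, reference_time: str) -> list[str]:
        return [f"{configuration}/{reference_time}/f{hour:03d}.nc" for hour in range(3)]

class Downloader:
    """Retrieves remote files to local paths (stand-in for FileDownloader)."""
    def get(self, urls: list[str]) -> list[str]:
        return [f"/tmp/{url.replace('/', '_')}" for url in urls]

class Processor:
    """Turns raw NetCDF files into tabular records (stand-in for NWMFileProcessor)."""
    def to_records(self, paths: list[str]) -> list[dict]:
        return [{"path": p, "value": i} for i, p in enumerate(paths)]

@dataclass
class Cache:
    """Caches processed output keyed by request (stand-in for ParquetCache)."""
    store: dict = field(default_factory=dict)
    def get_or_compute(self, key: str, compute):
        if key not in self.store:
            self.store[key] = compute()
        return self.store[key]

@dataclass
class Client:
    """Top-level interface corralling the other components (stand-in for NWMFileClient)."""
    catalog: FileCatalog = field(default_factory=FileCatalog)
    downloader: Downloader = field(default_factory=Downloader)
    processor: Processor = field(default_factory=Processor)
    cache: Cache = field(default_factory=Cache)

    def get(self, configuration: str, reference_time: str) -> list[dict]:
        key = f"{configuration}-{reference_time}"
        def compute():
            urls = self.catalog.list_blobs(configuration, reference_time)
            paths = self.downloader.get(urls)
            return self.processor.to_records(paths)
        return self.cache.get_or_compute(key, compute)

client = Client()
records = client.get("analysis_assim", "20210930T00Z")
```

The point of the sketch is the layering: the client owns the other four components, and the cache wraps the discover/download/process pipeline so repeat requests skip it entirely.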

As the tool is based on dask, I expect it to scale better than the existing client tool. However, per dask's best practices, the tool assumes most requests will fit in memory and therefore defaults to retrieving pandas.DataFrame with a simple switch to retrieve dask.dataframe.DataFrame for larger-than-memory datasets.

Closes #127

Example usage

# Import the file client
from hydrotools.nwm_client.NWMClient import NWMFileClient

# Setup the client, defaults to Google Cloud Platform
#  Note: Retrieval defaults to only those locations that correspond to USGS gage locations,
#  according to the NWM RouteLink files. You can specify a different mapping (also called a
#  crosswalk) to retrieve a custom subset
client = NWMFileClient(
#    location_metadata_mapping=my_custom_crosswalk
)

# Get a pandas.DataFrame
#  This method includes file retrieval, processing, and caching to parquet
df = client.get(
    configuration="analysis_assim",
    reference_times=["20210930T00Z"]
)

# Alternatively, you can get a dask.dataframe.DataFrame by
#  turning off the dask compute option.
#  In both cases, a hydrotools canonical dataframe is returned
ddf = client.get(
    configuration="analysis_assim",
    reference_times=["20210930T00Z"],
    compute=False
)

# Get data from NOMADS
# Import the HTTPFileCatalog for use with generic web servers
# This has been successfully tested against NOMADS and a simple python3 -m http.server
from hydrotools.nwm_client.NWMFileCatalog import HTTPFileCatalog

# Setup the catalog
#  Note: If you have a custom CA Bundle, you would also pass in 
#  a custom ssl.SSLContext
NOMADS_catalog = HTTPFileCatalog(
    server="https://nomads.ncep.noaa.gov/pub/data/nccf/com/nwm/prod/",
#    ssl_context=my_custom_ssl_context
)

# Setup the client as before, but this time specify a catalog
NOMADS_client = NWMFileClient(
    catalog=NOMADS_catalog,
#    ssl_context=my_custom_ssl_context
)

# Get a pandas.DataFrame from NOMADS
df2 = NOMADS_client.get(
    configuration="analysis_assim",
    reference_times=["20210930T01Z"]
)

Inviting @aaraney @hellkite500 @christophertubbs to review.

Testing

  1. Each component has at least minimal test coverage for all of its functionality. I did not duplicate every test in the original nwm_client because many of those tests were redundant.

Notes

  • The primary interface takes a list of reference times, which are processed serially. This is purposeful, since each reference time may require hundreds of files to be downloaded and processed. The maximum number of simultaneous connections is also limited to 10 by default. GCP in particular did not test well as the number of simultaneous downloads approached 20. YMMV.
  • The actual processing of each individual model cycle is performed in parallel by dask.
  • To my knowledge, no current analyses involve all of CONUS across multiple cycles. This use case may require special attention. This tool would certainly make such an analysis easier, but as-is the primary interface assumes users will typically want a subset.
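The connection-limiting behavior described above can be sketched with a stdlib-only pattern. The semaphore value mirrors the stated default of 10, but the URLs and the fetch body are placeholders, not the actual FileDownloader implementation:

```python
import asyncio

async def fetch(url: str, limit: asyncio.Semaphore) -> str:
    # Acquire the semaphore before opening a connection, so at most
    # `max_connections` downloads are in flight at any one time.
    async with limit:
        await asyncio.sleep(0)  # placeholder for the actual HTTP request
        return f"downloaded {url}"

async def download_all(urls: list[str], max_connections: int = 10) -> list[str]:
    limit = asyncio.Semaphore(max_connections)
    return await asyncio.gather(*(fetch(u, limit) for u in urls))

# 25 files, but never more than 10 simultaneous "connections"
results = asyncio.run(download_all([f"file_{i}.nc" for i in range(25)]))
```

A semaphore is a simple way to cap concurrency without batching: slow downloads don't hold up an entire batch, yet the server never sees more than the configured number of open connections.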

Todos

  • Right now the processor defaults to dask partitions that contain 2.4 million values. Based on back-of-the-envelope calculations, this should result in partitions just under 100 MB in memory, per dask's best practices. You can explicitly specify a partition size when you repartition, but this is computationally expensive. I left a TODO to make the number of partitions a parameter selectable by the caller; further research may be required to find a good partition size.
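For reference, the back-of-the-envelope arithmetic works out as follows. The ~40 bytes-per-value figure is an assumption covering mixed dtypes and index overhead, not a measured number:

```python
# Assumed average memory footprint per dataframe value (mixed dtypes,
# categorical/string overhead, index); a rough estimate, not a measurement.
BYTES_PER_VALUE = 40

values_per_partition = 2_400_000
partition_mb = values_per_partition * BYTES_PER_VALUE / 1_000_000

# 2.4 million values * ~40 bytes ≈ 96 MB, just under dask's
# recommended ~100 MB partition size.
print(f"{partition_mb:.0f} MB")
```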

Checklist

  • PR has an informative and human-readable title
  • PR is well outlined and documented. See #12 for an example
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (see CONTRIBUTING.md)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output) using numpy docstring formatting
  • Placeholder code is flagged / future todos are captured in comments
  • Reviewers requested with the Reviewers tool ➡️

@jarq6c (Collaborator, Author) commented Oct 14, 2021

@aaraney @hellkite500 Still working through your feedback, but it seems an interested member of the public uploaded the RouteLink files to a public data repository here: https://www.hydroshare.org/resource/7ce5f87bc1904d0c8f297389be5fa169/

@jarq6c (Collaborator, Author) commented Oct 20, 2021

@aaraney OK, I think I've addressed the major concerns. I'd like to hold off on the typing issues until I have a better grasp on pydantic and string handling. Once I'm better informed, we may discover opportunities for some light redesign. I think your typing concerns are valid, but I don't want to jump in without a plan.

@jarq6c (Collaborator, Author) commented Oct 27, 2021

Passed all tests. I'm going to merge this in and relegate further updates to separate PRs. This one is already too big.

@jarq6c jarq6c merged commit 14a70cf into NOAA-OWP:main Oct 27, 2021
@jarq6c jarq6c deleted the nwm-refactor branch November 8, 2021 20:57
Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues:

Top level metapackage does not pull in nwm_client[gcp]
Limit nwm_client.gcp memory usage