
Fetch GEFS from AWS #712

Open · wants to merge 4 commits into master

Conversation

@lboeman (Member) commented Aug 14, 2021

  • Closes #680: download GEFS from AWS instead of NOMADS.
  • I am familiar with the contributing guidelines.
  • Tests added.
  • Updates entries to docs/source/api.rst for API changes.
  • Adds descriptions to appropriate "what's new" file in docs/source/whatsnew for all changes. Includes link to the GitHub Issue with :issue:`num` or this Pull Request with :pull:`num`. Includes contributor name and/or GitHub username (link with :ghuser:`user`).
  • New code is fully documented. Includes numpydoc compliant docstrings, examples, and comments where necessary.
  • Maintainer: Appropriate GitHub Labels and Milestone are assigned to the Pull Request and linked Issue.

This moves fetching of GEFS to AWS. It could probably use some reorganization, but it works as is. It is quite a bit slower and moves ~10x the data over the network, because we need to fetch the full GEFS files and then slice within the lat/lon domain of interest instead of having NOMADS do that slicing for us.

I took a look at #696 and https://www.weather.gov/media/notification/pdf2/pns20-85ncep_web_access.pdf, and it appears that any of the CLI commands that hit NOMADS will contribute to our hitting this rate limit. So moving at least one of the models off of NOMADS will help alleviate some of the strain, at the cost of bandwidth/speed. We may need to address #696 separately to handle size-0 responses, but I'm not sure of a good way to throttle the async requests down to a maximum of 60 requests per minute across processes.
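For the single-process case, one option is a small rolling-window limiter that every fetch coroutine awaits before issuing its request. This is only a sketch of the idea, not code from this PR: RateLimiter and fetch are illustrative names, and coordinating the 60-per-minute budget across multiple processes would still require something external (a shared lock file, a small broker, etc.).

```python
import asyncio
import time


class RateLimiter:
    """Allow at most max_calls acquisitions per rolling period (single process only)."""

    def __init__(self, max_calls=60, period=60.0):
        self.max_calls = max_calls
        self.period = period
        self._calls = []  # monotonic timestamps of recent acquisitions
        self._lock = asyncio.Lock()

    async def acquire(self):
        async with self._lock:
            now = time.monotonic()
            self._calls = [t for t in self._calls if now - t < self.period]
            if len(self._calls) >= self.max_calls:
                # sleep until the oldest acquisition falls outside the window
                await asyncio.sleep(self.period - (now - self._calls[0]))
                now = time.monotonic()
                self._calls = [t for t in self._calls if now - t < self.period]
            self._calls.append(now)


async def fetch(session, limiter, url):
    # every request waits for a slot before hitting the server
    await limiter.acquire()
    async with session.get(url) as resp:
        return await resp.read()
```

A single RateLimiter instance would be created inside the running event loop and shared by all fetch tasks in that process.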

@lboeman changed the title from "Gefs throttling" to "Fetch GEFS from AWS" on Aug 16, 2021
@wholmgren (Member)

Documentation at top of module will need to be updated. I've not yet tested it locally since I don't have wgrib2 on my mac and don't want to fight the compiler. I'll try on golem later today.

@@ -57,6 +57,12 @@
CHECK_URL = 'https://nomads.ncep.noaa.gov/pub/data/nccf/com/{}/prod'
BASE_URL = 'https://nomads.ncep.noaa.gov/cgi-bin/'

GEFS_BASE_URL = 'https://noaa-gefs-pds.s3.amazonaws.com'

# When querying aws for directories, start-after is used to paginate.
Review comment (Member):

this would be helpful in the function documentation too
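For example, assuming the listing uses S3's ListObjectsV2 query parameters (the exact parameters in this PR may differ), the docstring could point at something like the sketch below; S3 returns at most 1,000 entries per response, which is why start-after pagination is needed. list_gefs_dirs is an illustrative name, not the function in this diff.

```python
import aiohttp

GEFS_BASE_URL = 'https://noaa-gefs-pds.s3.amazonaws.com'


async def list_gefs_dirs(session, start_after=''):
    # ListObjectsV2 on the public bucket: delimiter='/' collapses the day
    # directories into <CommonPrefixes> entries. At most 1,000 entries come
    # back per response, so start-after is passed to request the next page.
    params = {'list-type': '2', 'prefix': 'gefs.', 'delimiter': '/'}
    if start_after:
        params['start-after'] = start_after
    async with session.get(GEFS_BASE_URL, params=params) as resp:
        # XML containing <Prefix>gefs.YYYYMMDD/</Prefix> entries
        return await resp.text()
```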

) as r:
return await r.text()
listing = await _get(session)
all_dirs = re.findall("gefs\\.([0-9]{8})", listing)
Review comment (Member):

same as r"gefs\.([0-9]{8})"? I think r-strings are much easier to read for regex
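For reference, the two spellings compile to the same pattern; a quick check (the XML fragment is just an illustrative sample):

```python
import re

listing = '<CommonPrefixes><Prefix>gefs.20210101/</Prefix></CommonPrefixes>'
assert (re.findall("gefs\\.([0-9]{8})", listing)
        == re.findall(r"gefs\.([0-9]{8})", listing)
        == ['20210101'])
```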

Review comment (Member):

comment like

# xml contains many entries formatted similar to 
# <CommonPrefixes><Prefix>gefs.20210101/</Prefix>
# regex creates a list like ['20210101', '20210102'...]

return await r.text()
listing = await _get(session)
all_dirs = re.findall("gefs\\.([0-9]{8})", listing)
if len(all_dirs) < 1000:
Review comment (Member):

aws starts paginating at 1k items?

else:
return all_dirs + await get_available_gefs_dirs(
session,
'gefs.'+all_dirs[-1]
Review comment (Member):

I'm a proponent of calling kwargs as kwargs, so start_after='gefs.'+all_dirs[-1]

@@ -730,3 +790,11 @@ def check_wgrib2():
if shutil.which('wgrib2') is None:
logger.error('wgrib2 was not found in PATH and is required')
sys.exit(1)


def domain_args():
Review comment (Member):

private function

@williamhobbs

I'm not sure if this is helpful, but I tried running solararbiter fetchnwp with gefs and I get "Connection timeout to host..." errors after about 15-20 .grib2 files download. I can run the command again and get an additional 15-20 files before the same errors come up.

I can provide full verbose output, if that helps.
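Not a fix from this thread, but one generic mitigation worth trying is to cap concurrent connections and set an explicit timeout when the aiohttp session is created; the numbers below are placeholders, not tuned values.

```python
import aiohttp


async def make_session():
    # fewer simultaneous connections plus an explicit per-request timeout tend
    # to reduce "Connection timeout to host" errors when pulling many large files
    connector = aiohttp.TCPConnector(limit=4)
    timeout = aiohttp.ClientTimeout(total=600)
    return aiohttp.ClientSession(connector=connector, timeout=timeout)
```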

Successfully merging this pull request may close: download GEFS from AWS instead of NOMADS (#680)

3 participants