
Fetch GEFS from AWS #712

Open · wants to merge 4 commits into master

Conversation

@lboeman (Member) commented Aug 14, 2021

  • Closes #680: download GEFS from AWS instead of NOMADS.
  • I am familiar with the contributing guidelines.
  • Tests added.
  • Updates entries to docs/source/api.rst for API changes.
  • Adds descriptions to appropriate "what's new" file in docs/source/whatsnew for all changes. Includes link to the GitHub Issue with :issue:`num` or this Pull Request with :pull:`num`. Includes contributor name and/or GitHub username (link with :ghuser:`user`).
  • New code is fully documented. Includes numpydoc compliant docstrings, examples, and comments where necessary.
  • Maintainer: Appropriate GitHub Labels and Milestone are assigned to the Pull Request and linked Issue.

This moves fetching of GEFS to AWS. It could probably use some reorganization, but it works as is. It is quite a bit slower and moves ~10x the data over the network, because we need to fetch the full GEFS files and then slice within the lat/lon domain of interest instead of having NOMADS do that slicing for us.

I took a look at #696 and https://www.weather.gov/media/notification/pdf2/pns20-85ncep_web_access.pdf, and it appears that any of the CLI commands that hit NOMADS will contribute to our hitting this rate limit. So moving at least one of the models off of NOMADS will help alleviate some of the strain, at the cost of bandwidth/speed. We may need to address #696 separately to handle size-0 responses, but I'm not sure of a good way to throttle the async requests down to a maximum of 60 requests per minute across processes.
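For the single-process case, one option is a small rolling-window limiter that every fetch coroutine awaits before issuing its request. This is only a sketch of the idea, not code from this PR: RateLimiter and fetch are illustrative names, and coordinating the 60-per-minute budget across multiple processes would still require something external (a shared lock file, a small broker, etc.).

```python
import asyncio
import time


class RateLimiter:
    """Allow at most max_calls acquisitions per rolling period (single process only)."""

    def __init__(self, max_calls=60, period=60.0):
        self.max_calls = max_calls
        self.period = period
        self._calls = []  # monotonic timestamps of recent acquisitions
        self._lock = asyncio.Lock()

    async def acquire(self):
        async with self._lock:
            now = time.monotonic()
            self._calls = [t for t in self._calls if now - t < self.period]
            if len(self._calls) >= self.max_calls:
                # sleep until the oldest acquisition falls outside the window
                await asyncio.sleep(self.period - (now - self._calls[0]))
                now = time.monotonic()
                self._calls = [t for t in self._calls if now - t < self.period]
            self._calls.append(now)


async def fetch(session, limiter, url):
    # every request waits for a slot before hitting the server
    await limiter.acquire()
    async with session.get(url) as resp:
        return await resp.read()
```

A single RateLimiter instance would be created inside the running event loop and shared by all fetch tasks in that process.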

@lboeman changed the title from "Gefs throttling" to "Fetch GEFS from AWS" on Aug 16, 2021
@wholmgren (Member)

Documentation at top of module will need to be updated. I've not yet tested it locally since I don't have wgrib2 on my mac and don't want to fight the compiler. I'll try on golem later today.

@@ -57,6 +57,12 @@
CHECK_URL = 'https://nomads.ncep.noaa.gov/pub/data/nccf/com/{}/prod'
BASE_URL = 'https://nomads.ncep.noaa.gov/cgi-bin/'

GEFS_BASE_URL = 'https://noaa-gefs-pds.s3.amazonaws.com'

# When querying aws for directories, start-after is used to paginate.
Review comment (Member):

this would be helpful in the function documentation too
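For example, assuming the listing uses S3's ListObjectsV2 query parameters (the exact parameters in this PR may differ), the docstring could point at something like the sketch below; S3 returns at most 1,000 entries per response, which is why start-after pagination is needed. list_gefs_dirs is an illustrative name, not the function in this diff.

```python
import aiohttp

GEFS_BASE_URL = 'https://noaa-gefs-pds.s3.amazonaws.com'


async def list_gefs_dirs(session, start_after=''):
    # ListObjectsV2 on the public bucket: delimiter='/' collapses the day
    # directories into <CommonPrefixes> entries. At most 1,000 entries come
    # back per response, so start-after is passed to request the next page.
    params = {'list-type': '2', 'prefix': 'gefs.', 'delimiter': '/'}
    if start_after:
        params['start-after'] = start_after
    async with session.get(GEFS_BASE_URL, params=params) as resp:
        # XML containing <Prefix>gefs.YYYYMMDD/</Prefix> entries
        return await resp.text()
```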

) as r:
return await r.text()
listing = await _get(session)
all_dirs = re.findall("gefs\\.([0-9]{8})", listing)
Review comment (Member):

same as r"gefs\.([0-9]{8})"? I think r-strings are much easier to read for regex
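For reference, the two spellings compile to the same pattern; a quick check (the XML fragment is just an illustrative sample):

```python
import re

listing = '<CommonPrefixes><Prefix>gefs.20210101/</Prefix></CommonPrefixes>'
assert (re.findall("gefs\\.([0-9]{8})", listing)
        == re.findall(r"gefs\.([0-9]{8})", listing)
        == ['20210101'])
```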

Review comment (Member):

comment like

# xml contains many entries formatted similar to 
# <CommonPrefixes><Prefix>gefs.20210101/</Prefix>
# regex creates a list like ['20210101', '20210102'...]

return await r.text()
listing = await _get(session)
all_dirs = re.findall("gefs\\.([0-9]{8})", listing)
if len(all_dirs) < 1000:
Review comment (Member):

aws starts paginating at 1k items?

else:
return all_dirs + await get_available_gefs_dirs(
session,
'gefs.'+all_dirs[-1]
Review comment (Member):

I'm a proponent of calling kwargs as kwargs, so start_after='gefs.'+all_dirs[-1]

@@ -730,3 +790,11 @@ def check_wgrib2():
if shutil.which('wgrib2') is None:
logger.error('wgrib2 was not found in PATH and is required')
sys.exit(1)


def domain_args():
Review comment (Member):

private function

@williamhobbs

I'm not sure if this is helpful, but I tried running solararbiter fetchnwp with gefs and I get "Connection timeout to host..." errors after about 15-20 .grib2 files download. I can run the command again and get an additional 15-20 files before the same errors come up.

I can provide full verbose output, if that helps.
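Not a fix from this thread, but one generic mitigation worth trying is to cap concurrent connections and set an explicit timeout when the aiohttp session is created; the numbers below are placeholders, not tuned values.

```python
import aiohttp


async def make_session():
    # fewer simultaneous connections plus an explicit per-request timeout tend
    # to reduce "Connection timeout to host" errors when pulling many large files
    connector = aiohttp.TCPConnector(limit=4)
    timeout = aiohttp.ClientTimeout(total=600)
    return aiohttp.ClientSession(connector=connector, timeout=timeout)
```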

Successfully merging this pull request may close: download GEFS from AWS instead of NOMADS (#680)

3 participants