feat: define separate collections for each region x product x forecast hour set #3
Conversation
```python
Standard forecast cycles allow 0-18
Extended forecast cycles allow 0-48
"""
valid = 0 <= forecast_hour <= self.max_forecast_hour
```
I don't know if I would have thought to validate this to begin with, but since this exists, should it validate that the actual hour is part of the set of possible hours, which I think is (00, 06, 12, 18)?
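If we did want the stricter check, here is a minimal sketch of what it could look like, assuming hypothetical names for the allowed cycle hours (the PR's actual attributes and classes may differ):

```python
# Hypothetical sketch: check both the forecast-hour range and, for extended
# (0-48) cycles, that the reference time's hour is an allowed init hour.
ALLOWED_EXTENDED_CYCLE_HOURS = {0, 6, 12, 18}  # assumption, not from this PR


def forecast_hour_is_valid(
    forecast_hour: int, cycle_hour: int, max_forecast_hour: int
) -> bool:
    if not 0 <= forecast_hour <= max_forecast_hour:
        return False
    # extended forecast cycles are only run at a subset of hours
    if max_forecast_hour > 18 and cycle_hour not in ALLOWED_EXTENDED_CYCLE_HOURS:
        return False
    return True
```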
Thanks @hrodmn, this looks pretty good to me, but I would wait for @sharkinsspatial to take a look as he has more experience with writing stactools packages. As far as the HRRR metadata itself, I am curious how we are going to handle the distinction between "products" -- that is, what is called …

In short, I wonder if we should work back to what the items should look like (or whether we need to be concerned about that at all if we're going to use chunk manifests -- though even then we will need to figure out how to catalog those chunk manifests) from what we need to represent data cubes for each variable, and how each variable is represented within an xarray dataset or data tree.
👍 @hrodmn @abarciauskas-bgse I will try to review this afternoon.
Thanks @abarciauskas-bgse for the review!

Yeah, I think I was influenced by the PC catalog entry. We should be using "product" to describe … Here is a sample of the dataframe that `.inventory()` returns:

```python
from datetime import datetime

from herbie import Herbie

md = Herbie(
    datetime(year=2024, month=5, day=30, hour=6),
    model="hrrr",
    priority=["azure"],
    product="sfc",
    fxx=6,
).inventory().to_markdown()
print(md)
```
We could tuck these attributes directly into the STAC metadata to make it possible to browse all of the available layers directly in the STAC.

In the short term, would it be simplest to structure the STAC metadata in a way that makes it easy to use existing tools (like …) to load the data into xarray?
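For reference, here is a minimal sketch of how a single HRRR GRIB2 product is commonly opened with xarray via the cfgrib engine; the file name and filter keys below are placeholders, not values from this package:

```python
import xarray as xr

# Placeholder path to a downloaded HRRR surface file; filter_by_keys narrows
# the GRIB messages to a single hypercube, which cfgrib requires.
ds = xr.open_dataset(
    "hrrr.t06z.wrfsfcf06.grib2",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "surface", "stepType": "instant"}},
)
print(ds)
```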
@hrodmn Very timely comment ☝️ as I'm thinking a bit more in-depth about this as I review your PR (sorry, got sucked into another task this morning). I think your observation "would it be simplest to structure the STAC metadata in a way that makes it easy to use existing tools to load the data into xarray?" is the key here.

In my mind, the purpose of our creating STAC metadata for these files is to advertise information about the internal structure of the GRIB2 files to applications that don't have direct access to drivers for opening these files. As a thought example, let's review the use case of a web application that will allow users to visualize HRRR data. For clarification in this discussion, I'll use the more common HRRR terminology for …
In my mind, this type of workflow, where a "dumb" application needs metadata about the internal structure of the GRIB2 file, is why we need STAC metadata. I think it is unlikely in the short-term future that we would be using STAC metadata to broker direct access to GRIB2 files, and most use cases will be using an additional layer of access such as …
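To make that "dumb application" workflow concrete, here is a rough sketch of using the `.grib.idx` sidecar to request only the bytes for one GRIB message; the URL and the variable/level strings are placeholders:

```python
import requests

grib_url = "https://example.com/hrrr.t06z.wrfsfcf06.grib2"  # placeholder URL

# Each .idx line looks like "13:1234567:d=2024053006:TMP:2 m above ground:6 hour fcst:"
idx_lines = requests.get(grib_url + ".idx").text.splitlines()
offsets = [int(line.split(":")[1]) for line in idx_lines]

# Find the 2 m temperature message and fetch just its byte range.
for i, line in enumerate(idx_lines):
    if ":TMP:2 m above ground:" in line:
        start = offsets[i]
        end = offsets[i + 1] - 1 if i + 1 < len(offsets) else ""
        message_bytes = requests.get(grib_url, headers={"Range": f"bytes={start}-{end}"}).content
        break
```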
So what is the optimal way to model our STAC structures to support this type of interaction? I like the initial approach we have here, but I think some structural restrictions with STAC Item and Extension interactions might require us to make some alterations (I might be incorrect in my thought process here so I'm going to reach out to some others for feedback on this). I'd propose the following.
I like the initial approach @hrodmn used here of representing the products as …
Thanks @sharkinsspatial, it is really helpful to get the broader perspective on how the …
I think we can apply the datacube extension to specific assets within an item - @sharkinsspatial can you elaborate on why we might be restricted by the STAC API spec?
My intuition with the collection/item/asset breakdown was to make it simple to find all of the HRRR data that is relevant to a particular time/place with minimal search parameters required for the STAC API. If that approach complicates the downstream process for making specific queries to individual assets, I think breaking it out into separate collections by product could make sense.
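To make that trade-off concrete, here is a hedged sketch of the kind of search the single-collection layout was meant to keep simple; the catalog URL and collection id are placeholders:

```python
from pystac_client import Client

# Placeholder STAC API endpoint and collection id
catalog = Client.open("https://example.com/stac")
search = catalog.search(
    collections=["noaa-hrrr"],
    datetime="2024-05-30T06:00:00Z",
    bbox=[-105.5, 39.5, -104.5, 40.5],
)
items = list(search.items())
```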
One additional wrinkle I discovered today is that the variables within a single GRIB file vary by …
Split into collections by region/product/forecast hour set
Closing in favor of #6
Here is the first PR for the `stactools.noaa_hrrr` package. I opted to pursue the "many collections" approach as suggested by @sharkinsspatial. With this approach we will have a separate collection for each region x product x forecast hour set. This makes it possible to define the `datacube` metadata at the collection level rather than burying all of that information in the assets. To make it possible to know which coordinates on which dimensions should have data, I added a custom property to the `datacube:variable` metadata: `dimension_domains`. After using it to support some downstream applications we can reconsider and/or try to upstream it into the datacube extension.
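A rough sketch of the idea, with an illustrative variable entry (the variable name, dimensions, and values below are made up for this example rather than copied from the generated collections):

```python
# Illustrative only: a datacube variable entry carrying the custom
# dimension_domains property alongside the standard datacube fields.
variables = {
    "t2m": {
        "type": "data",
        "dimensions": ["y", "x", "forecast_hour"],
        "unit": "K",
        # which coordinates on each dimension actually have data for this variable
        "dimension_domains": {"forecast_hour": [0, 1]},
    }
}
```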
The basic functionality for creating JSON STAC metadata for the collection and items is working.

I opted to store the individual `GRIB` files for a given forecast hour + reference_time as assets within a single item. Each item has two assets: the `.grib` file and the `.grib.idx` file.

The package uses `Herbie` to retrieve the metadata and validate that the data exists in cloud storage. This should be helpful when we tackle #2 and associated tasks because there are existing methods for loading the HRRR data into xarray. This requires a `conda` environment with some system dependencies, but it's not too hard to get it configured.

There are some details about availability by region that were not immediately apparent (e.g. Alaska only generates forecasts every three hours) but I think I've got it all sorted out now.
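For what it's worth, Herbie already provides one of those methods; a hedged sketch (the search string is illustrative, and the exact method signature may vary between Herbie versions):

```python
from datetime import datetime

from herbie import Herbie

H = Herbie(
    datetime(2024, 5, 30, 6),
    model="hrrr",
    product="sfc",
    fxx=6,
)
# Load just the 2 m temperature field into an xarray Dataset; the search
# string filters GRIB messages via the .idx file before downloading.
ds = H.xarray(":TMP:2 m above ground:")
```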
The CLI works like this:
Where:

- `region` is either `'conus'` or `'alaska'`
- `product` is either `'sfc'`, `'prs'`, `'nat'`, or `'subh'`
- `forecast_hour_set` is either `'fh00-01'` or `'fh02-48'` for `'sfc'`, `'prs'`, and `'nat'`, or `fh00` or `fh01-18` for `subh`
- `cloud_provider` can be one of `aws`, `azure`, or `google`. The `hrefs` for the assets will point to the files in whichever cloud provider you choose (thanks `Herbie`!).

You can build a docker container with everything installed like this:
You can open a shell in the docker container with this command:
You can launch a jupyterlab server (in a docker container) with this command:
^ that is helpful for experimenting with `Herbie` without a local installation!