
feat: define separate collections for each region x product x forecast hour set #3

Closed
wants to merge 28 commits

Conversation

Collaborator

@hrodmn hrodmn commented May 30, 2024

Here is the first PR for stactools.noaa_hrrr. I opted to pursue the "many collections" approach suggested by @sharkinsspatial. With this approach we can define the datacube metadata at the collection level rather than burying all of that information in the assets. To make it possible to know which coordinates on which dimensions should have data, I added a custom property to the datacube variable metadata: dimension_domains. After using it to support some downstream applications we can reconsider and/or try to upstream it into the datacube extension.
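
As a rough illustration (not a final schema), a variable entry in the collection-level datacube metadata might carry the custom property like this; the variable name, dimensions, and values below are hypothetical placeholders:

# Hypothetical sketch of a datacube variable entry with the custom
# dimension_domains property; every value here is illustrative only.
cube_variable_sketch = {
    "TMP": {
        "type": "data",
        "description": "Temperature",
        "unit": "K",
        "dimensions": ["x", "y", "forecast_hour", "level"],
        # custom property: which coordinates on each dimension actually have data
        "dimension_domains": {
            "forecast_hour": [0, 1],
            "level": ["surface", "2 m above ground"],
        },
    }
}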

The basic functionality for creating JSON STAC metadata for the collection and items is working. I opted to store the individual GRIB files for a given forecast hour + reference_time as assets within a single item. Each item has two assets: the .grib file and the .grib.idx file.

The package uses Herbie to retrieve the metadata and validate that the data exists in cloud storage. This should be helpful when we tackle #2 and associated tasks because there are existing methods for loading the HRRR data into xarray. This requires a conda environment with some system dependencies, but it's not too hard to get it configured.

There are some details about availability by region that were not immediately apparent (e.g. Alaska only generates forecasts every three hours) but I think I've got it all sorted out now.

The CLI works like this:

stac noaahrrr create-collection {region} {product} {forecast_hour_set} {cloud_provider} /tmp/collection.json

Where:

  • region is either 'conus' or 'alaska'
  • product is either 'sfc', 'prs', 'nat', or 'subh'
  • forecast_hour_set is either 'fh00-01' or 'fh02-48' for 'sfc', 'prs', and 'nat', or 'fh00' or 'fh01-18' for 'subh'
  • cloud_provider can be one of aws, azure, or google. The hrefs for the assets will point to the files in whichever cloud provider you choose (thanks Herbie!).
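
For example, using the options listed above, a CONUS surface-product collection for the first forecast hour set with asset hrefs pointing at AWS would be created with:

stac noaahrrr create-collection conus sfc fh00-01 aws /tmp/collection.json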

You can build a docker container with everything installed like this:

docker/build

You can open a shell in the docker container with this command:

docker/console

You can launch a jupyterlab server (in a docker container) with this command:

docker/jupyter

^ that is helpful for experimenting with Herbie without a local installation!

@hrodmn hrodmn self-assigned this May 30, 2024

@hrodmn hrodmn changed the title Package setup feat: define STAC metadata structure May 30, 2024
@hrodmn hrodmn marked this pull request as ready for review May 30, 2024 20:33
Standard forecast cycles allow 0-18
Extended forecast cycles allow 0-48
"""
valid = 0 <= forecast_hour <= self.max_forecast_hour
Contributor


I don't know if I would have thought to validate this to begin with, but since this exists, should it validate that the actual hour is part of the set of possible hours, which I think is (00, 06, 12, 18)?

Contributor

abarciauskas-bgse commented May 31, 2024

Thanks @hrodmn this looks pretty good to me but I would wait for @sharkinsspatial to take a look as he has more experience with writing stactools packages.

As far as the HRRR metadata itself, I am curious how we are going to handle the distinction between "products" -- that is, what is called an output variable in the MS documentation and what is called a product in the HRRR documentation -- and sub-file variables. For example, here is a list of variables within the wrfsfc product. Specifically I am concerned because I think different products may have the same variables. For example both wrfsubhf and wrfsfc have the REFD variable... but they also both have it multiple times???

In short, I wonder if we should work back to what the items should look like (or if we need to be concerned about it at all if we're going to use chunk manifests, but even then we will need to figure out how to catalog those chunk manifests) from what we need to represent data cubes for each variable, and how each variable is represented within an xarray dataset or data tree.

@sharkinsspatial

👍 @hrodmn @abarciauskas-bgse I will try to review this afternoon.

Collaborator Author

hrodmn commented Jun 3, 2024

Thanks @abarciauskas-bgse for the review!

As far as the HRRR metadata itself, I am curious how we are going to handle the distinction between "products" -- that is, what is called an output variable in the MS documentation and what is called a product in the HRRR documentation -- and sub-file variables. For example, here is a list of variables within the wrfsfc product. Specifically I am concerned because I think different products may have the same variables. For example both wrfsubhf and wrfsfc have the REFD variable... but they also both have it multiple times???

Yeah, I think I was influenced by the PC catalog entry. We should be using "product" to describe sfc, prs, subh, nat. Within each of those there are many variables (e.g. REFC, REFD, GUST, etc), and there are potentially multiple entries for each variable along the level dimension.

Here is a sample of the dataframe that Herbie makes to describe the inventory of a single .grib2:

from datetime import datetime

from herbie import Herbie

md = Herbie(
    datetime(year=2024, month=5, day=30, hour=6),
    model="hrrr",
    priority=["azure"],
    product="sfc",
    fxx=6,
).inventory().to_markdown()
print(md)
|    | grib_message | start_byte | end_byte | range | reference_time | valid_time | variable | level | forecast_time | search_this |
|---:|---:|---:|---:|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 351531 | 0-351531 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | REFC | entire atmosphere | 6 hour fcst | :REFC:entire atmosphere:6 hour fcst |
| 1 | 2 | 351532 | 503492 | 351532-503492 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | RETOP | cloud top | 6 hour fcst | :RETOP:cloud top:6 hour fcst |
| 2 | 3 | 503493 | 848065 | 503493-848065 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | var discipline=0 center=7 local_table=1 parmcat=16 parm=201 | entire atmosphere | 6 hour fcst | :var discipline=0 center=7 local_table=1 parmcat=16 parm=201:entire atmosphere:6 hour fcst |
| 3 | 4 | 848066 | 1080163 | 848066-1080163 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | VIL | entire atmosphere | 6 hour fcst | :VIL:entire atmosphere:6 hour fcst |
| 4 | 5 | 1080164 | 2517725 | 1080164-2517725 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | VIS | surface | 6 hour fcst | :VIS:surface:6 hour fcst |
| 5 | 6 | 2517726 | 2701574 | 2517726-2701574 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | REFD | 1000 m above ground | 6 hour fcst | :REFD:1000 m above ground:6 hour fcst |
| 6 | 7 | 2701575 | 2849755 | 2701575-2849755 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | REFD | 4000 m above ground | 6 hour fcst | :REFD:4000 m above ground:6 hour fcst |
| 7 | 8 | 2849756 | 2990743 | 2849756-2990743 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | REFD | 263 K level | 6 hour fcst | :REFD:263 K level:6 hour fcst |
| 8 | 9 | 2990744 | 4301543 | 2990744-4301543 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | GUST | surface | 6 hour fcst | :GUST:surface:6 hour fcst |
| 9 | 10 | 4301544 | 5096280 | 4301544-5096280 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | UGRD | 250 mb | 6 hour fcst | :UGRD:250 mb:6 hour fcst |
| 10 | 11 | 5096281 | 5849896 | 5096281-5849896 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | VGRD | 250 mb | 6 hour fcst | :VGRD:250 mb:6 hour fcst |
| 11 | 12 | 5849897 | 6631288 | 5849897-6631288 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | UGRD | 300 mb | 6 hour fcst | :UGRD:300 mb:6 hour fcst |
| 12 | 13 | 6631289 | 7386950 | 6631289-7386950 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | VGRD | 300 mb | 6 hour fcst | :VGRD:300 mb:6 hour fcst |

We could tuck these attributes directly into the STAC metadata to make it possible to browse all of the available layers directly in the STAC.

In short, I wonder if we should work back to what the items should look like (or if we need to be concerned about it at all if we're going to use chunk manifests, but even then we will need to figure out how to catalog those chunk manifests) from what we need to represent data cubes for each variable, and how each variable is represented within an xarray dataset or data tree.

In the short term, would it be simplest to structure the STAC metadata in a way that makes it easy to use existing tools (like Herbie) to load the data into xarray? This should shorten the path to our downstream use cases since the hard work of selectively reading data and organizing dimensions and variables is already handled.
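
For example (a rough sketch, not a settled design), if an item carried the reference time, product, and forecast hour as properties, a client could hand them straight to Herbie and let it handle the xarray loading; the property names used here are hypothetical:

from datetime import datetime

from herbie import Herbie

# hypothetical STAC item properties; the actual property names are TBD
item_properties = {
    "hrrr:reference_time": "2024-05-30T06:00:00",
    "hrrr:product": "sfc",
    "hrrr:forecast_hour": 6,
}

H = Herbie(
    datetime.fromisoformat(item_properties["hrrr:reference_time"]),
    model="hrrr",
    product=item_properties["hrrr:product"],
    fxx=item_properties["hrrr:forecast_hour"],
    priority=["azure"],
)

# Herbie handles the selective byte-range reads and cfgrib decoding
ds = H.xarray(":TMP:2 m above ground")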


sharkinsspatial commented Jun 3, 2024

@hrodmn Very timely comment ☝️ as I'm thinking a bit more in-depth about this as I review your PR (sorry, got sucked into another task this morning). I think your observation "would it be simplest to structure the STAC metadata in a way that makes it easy to use existing tools to load the data into xarray?" is the key here.

In my mind, the purpose of our creating STAC metadata for these files is to advertise information about the internal structure of the GRIB2 files to applications that don't have direct access to drivers for opening these files. As a thought exercise, let's review the use case of a web application that will allow users to visualize HRRR data. For clarity in this discussion, I'll use the more common HRRR terminology for Product and Variable rather than what is used in the MS documentation page.

  1. When the application loads it queries the HRRR STAC Collections endpoint to obtain information that allows the user to select from a list of available product->variable->level and horizon (FH) options.
  2. Using a datetime, product, and horizon query, the application can then get a list of appropriate GRIB2 granules (the core file and its accompanying IDX file). There is the question of which of these results is most appropriate for the user, but for simplification here, let's assume that the application will always order results by the latest HRRR cycle run and use the most recently generated forecast.
  3. For simple tiling visualization, the application uses the information to send a request to the tiler with the asset url, variable and level.
  4. Internally, the tiler opens the asset using cfgrib, uses xarray variable selection logic to select the appropriate variable and level and returns the tile for the requested region.

In my mind, this type of workflow where a "dumb" application needs metadata information for the internal GRIB2 file is why we need STAC metadata. I think it is unlikely in the short term future that we would be using STAC metadata to broker direct access to GRIB2 files and most use cases will be using an additional layer of access such as xarray with the cfgrib driver.
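
As a rough sketch of step 2 with pystac-client, assuming a hypothetical STAC API endpoint, collection id, and asset keys (none of these exist yet):

from pystac_client import Client

# hypothetical endpoint and collection id, for illustration only
catalog = Client.open("https://example.com/stac")
search = catalog.search(
    collections=["noaa-hrrr-conus-sfc-fh00-01"],
    datetime="2024-05-30T00:00:00Z/2024-05-30T06:00:00Z",
)

# order by item datetime and take the most recently generated forecast
items = sorted(search.items(), key=lambda item: item.datetime, reverse=True)
latest = items[0]
grib_href = latest.assets["grib"].href  # hypothetical asset key
idx_href = latest.assets["index"].href  # hypothetical asset key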


sharkinsspatial commented Jun 3, 2024

So what is the optimal way to model our STAC structures to support this type of interaction? I like the initial approach we have here, but I think some structural restrictions with STAC Item and Extension interactions might require us to make some alterations (I might be incorrect in my thought process here so I'm going to reach out to some others for feedback on this). I'd propose the following.

  1. Treat each HRRR product as a separate collection. So we would have hrrr-wrfprsf, hrrr-wrfnatf, hrrr-wrfsfcf, hrrr-wrfsubhf (I'll explain further below why I think this may be necessary).
  2. Include a collection level datacube extension with the GRIB2 variables available for that product. This datacube extension should include an additional dimension object representing the options for variable level.
  3. Include a custom property at the collection level hrrr:horizons to represent the available horizon options.

I like the initial approach @hrodmn used here of representing the products as assets of the same logical item, but I think that we may be restricted by the interaction of the datacube extension with the STAC API specification. We could theoretically include the datacube extension information at the asset level of the collection document, but I think this might be a misuse of the intent for collection assets. I think it might make more sense from a discoverability and layout perspective to treat each HRRR product as its own collection with its own distinct set of variables, each with their own distinct levels.
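
As a rough sketch of that layout (every field value here is hypothetical, not a proposed schema), a per-product collection might carry something like:

# Hypothetical sketch of per-product collection fields with a level
# dimension and an hrrr:horizons property; values are illustrative only.
collection_sketch = {
    "id": "hrrr-wrfsfcf",
    "hrrr:horizons": list(range(0, 49)),  # available forecast horizons (FH)
    "cube:dimensions": {
        "level": {
            "type": "level",
            "values": ["surface", "2 m above ground", "250 mb"],
        },
    },
    "cube:variables": {
        "REFD": {"type": "data", "dimensions": ["level"]},
        "GUST": {"type": "data", "dimensions": ["level"]},
    },
}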

Collaborator Author

hrodmn commented Jun 4, 2024

Thanks @sharkinsspatial, it is really helpful to get the broader perspective on how the datacube extension fits into the STAC API spec.

I like the initial approach @hrodmn used here of representing the products as assets of the same logical item, but I think that we may be restricted by the interaction of the datacube extension with the STAC API specification.

I think we can apply the datacube extension to specific assets within an item - @sharkinsspatial can you elaborate on why we might be restricted by the STAC API spec?

I think it might make more sense from a discoverability and layout perspective to treat each HRRR product as its own collection with its own distinct set of variables, each with their own distinct levels.

My intent with the collection/item/asset breakdown was to make it simple to find all of the HRRR data that is relevant to a particular time/place with minimal search parameters in the STAC API. If that approach complicates the downstream process of making specific queries for individual assets, I think breaking it out into separate collections by product could make sense.

Include a collection level datacube extension with the GRIB2 variables available for that product. This datacube extension should include an additional dimension object representing the options for variable level.

One additional wrinkle I discovered today is that the variables within a single GRIB file vary by FH, i.e. there are 170 variables in the sfc product for FH0 but there are 173 variables for FH2+. How would we handle that with collection-level datacube metadata? Would we use the set of common variables across all items or include all possible variables?

@hrodmn hrodmn marked this pull request as draft June 5, 2024 11:14
@hrodmn hrodmn marked this pull request as ready for review June 11, 2024 18:18
@hrodmn hrodmn changed the title feat: define STAC metadata structure feat: define separate collections for each region x product x forecast hour set Jun 13, 2024
@hrodmn hrodmn marked this pull request as draft June 17, 2024 20:22
@hrodmn hrodmn removed the request for review from sharkinsspatial June 17, 2024 20:22
Collaborator Author

hrodmn commented Jun 18, 2024

Closing in favor of #6

@hrodmn hrodmn closed this Jun 18, 2024