
feat: define separate collections for each region x product x forecast hour set #3

Closed
wants to merge 28 commits

Conversation

Collaborator

@hrodmn hrodmn commented May 30, 2024

Here is the first PR for stactools.noaa_hrrr. I opted to pursue the "many collections" approach suggested by @sharkinsspatial. With this approach we can define the datacube metadata at the collection level rather than burying all of that information in the assets. To make it possible to know which coordinates on which dimensions should have data, I added a custom property to the datacube variable metadata: dimension_domains. After using it to support some downstream applications we can reconsider and/or try to upstream it into the datacube extension.
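
As a rough illustration (not a final schema), a variable entry in the collection-level datacube metadata might carry the custom property like this; the variable name, dimensions, and values below are hypothetical placeholders:

# Hypothetical sketch of a datacube variable entry with the custom
# dimension_domains property; every value here is illustrative only.
cube_variable_sketch = {
    "TMP": {
        "type": "data",
        "description": "Temperature",
        "unit": "K",
        "dimensions": ["x", "y", "forecast_hour", "level"],
        # custom property: which coordinates on each dimension actually have data
        "dimension_domains": {
            "forecast_hour": [0, 1],
            "level": ["surface", "2 m above ground"],
        },
    }
}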

The basic functionality for creating JSON STAC metadata for the collection and items is working. I opted to store the individual GRIB files for a given forecast hour + reference_time as assets within a single item. Each item has two assets: the .grib file and the .grib.idx file.

The package uses Herbie to retrieve the metadata and validate that the data exists in cloud storage. This should be helpful when we tackle #2 and associated tasks because there are existing methods for loading the HRRR data into xarray. This requires a conda environment with some system dependencies, but it's not too hard to get it configured.

There are some details about availability by region that were not immediately apparent (e.g. Alaska only generates forecasts every three hours) but I think I've got it all sorted out now.

The CLI works like this:

stac noaahrrr create-collection {region} {product} {forecast_hour_set} {cloud_provider} /tmp/collection.json

Where:

  • region is either 'conus' or 'alaska'
  • product is either 'sfc', 'prs', 'nat', or 'subh'
  • forecast_hour_set is either 'fh00-01' or 'fh02-48' for 'sfc', 'prs', and 'nat', or 'fh00' or 'fh01-18' for 'subh'
  • cloud_provider can be one of aws, azure, or google. The hrefs for the assets will point to the files in whichever cloud provider you choose (thanks Herbie!).
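
For example, using the options listed above, a CONUS surface-product collection for the first forecast hour set with asset hrefs pointing at AWS would be created with:

stac noaahrrr create-collection conus sfc fh00-01 aws /tmp/collection.json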

You can build a docker container with everything installed like this:

docker/build

You can open a shell in the docker container with this command:

docker/console

You can launch a jupyterlab server (in a docker container) with this command:

docker/jupyter

^ that is helpful for experimenting with Herbie without a local installation!

@hrodmn hrodmn self-assigned this May 30, 2024

@hrodmn hrodmn changed the title Package setup feat: define STAC metadata structure May 30, 2024
@hrodmn hrodmn marked this pull request as ready for review May 30, 2024 20:33
Standard forecast cycles allow 0-18
Extended forecast cycles allow 0-48
"""
valid = 0 <= forecast_hour <= self.max_forecast_hour
Contributor


I don't know if I would have thought to validate this to begin with, but since this exists, should it validate that the actual hour is part of the set of possible hours, which I think is (00, 06, 12, 18)?

Contributor

abarciauskas-bgse commented May 31, 2024

Thanks @hrodmn this looks pretty good to me but I would wait for @sharkinsspatial to take a look as he has more experience with writing stactools packages.

As far as the HRRR metadata itself, I am curious how we are going to handle the distinction between "products" -- that is, what is called an output variable in the MS documentation and what is called a product in the HRRR documentation -- and sub-file variables. For example, here is a list of variables within the wrfsfc product. Specifically I am concerned because I think different products may have the same variables. For example both wrfsubhf and wrfsfc have the REFD variable... but they also both have it multiple times???

In short, I wonder if we should work back to what the items should look like (or if we need to be concerned about it at all if we're going to use chunk manifests, but even then we will need to figure out how to catalog those chunk manifests) from what we need to represent data cubes for each variable, and how each variable is represented within an xarray dataset or data tree.

@sharkinsspatial

👍 @hrodmn @abarciauskas-bgse I will try to review this afternoon.

Collaborator Author

hrodmn commented Jun 3, 2024

Thanks @abarciauskas-bgse for the review!

As far as the HRRR metadata itself, I am curious how we are going to handle the distinction between "products" -- that is, what is called an output variable in the MS documentation and what is called a product in the HRRR documentation -- and sub-file variables. For example, here is a list of variables within the wrfsfc product. Specifically I am concerned because I think different products may have the same variables. For example both wrfsubhf and wrfsfc have the REFD variable... but they also both have it multiple times???

Yeah, I think I was influenced by the PC catalog entry. We should be using "product" to describe sfc, prs, subh, nat. Within each of those there are many variables (e.g. REFC, REFD, GUST, etc), and there are potentially multiple entries for each variable along the level dimension.

Here is a sample of the dataframe that Herbie makes to describe the inventory of a single .grib2:

from datetime import datetime

from herbie import Herbie

md = Herbie(
    datetime(year=2024, month=5, day=30, hour=6),
    model="hrrr",
    priority=["azure"],
    product="sfc",
    fxx=6,
).inventory().to_markdown()
print(md)
|    | grib_message | start_byte | end_byte | range | reference_time | valid_time | variable | level | forecast_time | search_this |
|---:|---:|---:|---:|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 351531 | 0-351531 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | REFC | entire atmosphere | 6 hour fcst | :REFC:entire atmosphere:6 hour fcst |
| 1 | 2 | 351532 | 503492 | 351532-503492 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | RETOP | cloud top | 6 hour fcst | :RETOP:cloud top:6 hour fcst |
| 2 | 3 | 503493 | 848065 | 503493-848065 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | var discipline=0 center=7 local_table=1 parmcat=16 parm=201 | entire atmosphere | 6 hour fcst | :var discipline=0 center=7 local_table=1 parmcat=16 parm=201:entire atmosphere:6 hour fcst |
| 3 | 4 | 848066 | 1080163 | 848066-1080163 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | VIL | entire atmosphere | 6 hour fcst | :VIL:entire atmosphere:6 hour fcst |
| 4 | 5 | 1080164 | 2517725 | 1080164-2517725 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | VIS | surface | 6 hour fcst | :VIS:surface:6 hour fcst |
| 5 | 6 | 2517726 | 2701574 | 2517726-2701574 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | REFD | 1000 m above ground | 6 hour fcst | :REFD:1000 m above ground:6 hour fcst |
| 6 | 7 | 2701575 | 2849755 | 2701575-2849755 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | REFD | 4000 m above ground | 6 hour fcst | :REFD:4000 m above ground:6 hour fcst |
| 7 | 8 | 2849756 | 2990743 | 2849756-2990743 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | REFD | 263 K level | 6 hour fcst | :REFD:263 K level:6 hour fcst |
| 8 | 9 | 2990744 | 4301543 | 2990744-4301543 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | GUST | surface | 6 hour fcst | :GUST:surface:6 hour fcst |
| 9 | 10 | 4301544 | 5096280 | 4301544-5096280 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | UGRD | 250 mb | 6 hour fcst | :UGRD:250 mb:6 hour fcst |
| 10 | 11 | 5096281 | 5849896 | 5096281-5849896 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | VGRD | 250 mb | 6 hour fcst | :VGRD:250 mb:6 hour fcst |
| 11 | 12 | 5849897 | 6631288 | 5849897-6631288 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | UGRD | 300 mb | 6 hour fcst | :UGRD:300 mb:6 hour fcst |
| 12 | 13 | 6631289 | 7386950 | 6631289-7386950 | 2024-05-30 06:00:00 | 2024-05-30 12:00:00 | VGRD | 300 mb | 6 hour fcst | :VGRD:300 mb:6 hour fcst |

We could tuck these attributes directly into the STAC metadata to make it possible to browse all of the available layers directly in the STAC.

In short, I wonder if we should work back to what the items should look like (or if we need to be concerned about it at all if we're going to use chunk manifests, but even then we will need to figure out how to catalog those chunk manifests) from what we need to represent data cubes for each variable, and how each variable is represented within an xarray dataset or data tree.

In the short term, would it be simplest to structure the STAC metadata in a way that makes it easy to use existing tools (like Herbie) to load the data into xarray? This should shorten the path to our downstream use cases since the hard work of selectively reading data and organizing dimensions and variables is already handled.
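
For example (a rough sketch, not a settled design), if an item carried the reference time, product, and forecast hour as properties, a client could hand them straight to Herbie and let it handle the xarray loading; the property names used here are hypothetical:

from datetime import datetime

from herbie import Herbie

# hypothetical STAC item properties; the actual property names are TBD
item_properties = {
    "hrrr:reference_time": "2024-05-30T06:00:00",
    "hrrr:product": "sfc",
    "hrrr:forecast_hour": 6,
}

H = Herbie(
    datetime.fromisoformat(item_properties["hrrr:reference_time"]),
    model="hrrr",
    product=item_properties["hrrr:product"],
    fxx=item_properties["hrrr:forecast_hour"],
    priority=["azure"],
)

# Herbie handles the selective byte-range reads and cfgrib decoding
ds = H.xarray(":TMP:2 m above ground")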


sharkinsspatial commented Jun 3, 2024

@hrodmn Very timely comment ☝️ as I'm thinking a bit more in-depth about this as I review your PR (sorry, got sucked into another task this morning). I think your observation "would it be simplest to structure the STAC metadata in a way that makes it easy to use existing tools to load the data into xarray?" is the key here.

In my mind, the purpose of our creating STAC metadata for these files is to advertise information about the internal structure of the GRIB2 files to applications that don't have direct access to drivers for opening these files. As a thought exercise, let's review the use case of a web application that will allow users to visualize HRRR data. For clarity in this discussion, I'll use the more common HRRR terminology for Product and Variable rather than what is used in the MS documentation page.

  1. When the application loads it queries the HRRR STAC Collections endpoint to obtain information that allows the user to select from a list of available product->variable->level and horizon (FH) options.
  2. Using a datetime, product, and horizon query, the application can then get a list of appropriate GRIB2 granules (the core file and its accompanying IDX file). There is the question of which of these results is most appropriate for the user, but for simplification here, let's assume that the application will always order results by the latest HRRR cycle run and use the most recently generated forecast.
  3. For simple tiling visualization, the application uses the information to send a request to the tiler with the asset url, variable and level.
  4. Internally, the tiler opens the asset using cfgrib, uses xarray variable selection logic to select the appropriate variable and level and returns the tile for the requested region.

In my mind, this type of workflow where a "dumb" application needs metadata information for the internal GRIB2 file is why we need STAC metadata. I think it is unlikely in the short term future that we would be using STAC metadata to broker direct access to GRIB2 files and most use cases will be using an additional layer of access such as xarray with the cfgrib driver.
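
As a rough sketch of step 2 with pystac-client, assuming a hypothetical STAC API endpoint, collection id, and asset keys (none of these exist yet):

from pystac_client import Client

# hypothetical endpoint and collection id, for illustration only
catalog = Client.open("https://example.com/stac")
search = catalog.search(
    collections=["noaa-hrrr-conus-sfc-fh00-01"],
    datetime="2024-05-30T00:00:00Z/2024-05-30T06:00:00Z",
)

# order by item datetime and take the most recently generated forecast
items = sorted(search.items(), key=lambda item: item.datetime, reverse=True)
latest = items[0]
grib_href = latest.assets["grib"].href  # hypothetical asset key
idx_href = latest.assets["index"].href  # hypothetical asset key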


sharkinsspatial commented Jun 3, 2024

So what is the optimal way to model our STAC structures to support this type of interaction? I like the initial approach we have here, but I think some structural restrictions with STAC Item and Extension interactions might require us to make some alterations (I might be incorrect in my thought process here so I'm going to reach out to some others for feedback on this). I'd propose the following.

  1. Treat each HRRR product as a separate collection. So we would have hrrr-wrfprsf, hrrr-wrfnatf, hrrr-wrfsfcf, hrrr-wrfsubhf (I'll explain further below why I think this may be necessary).
  2. Include a collection level datacube extension with the GRIB2 variables available for that product. This datacube extension should include an additional dimension object representing the options for variable level.
  3. Include a custom property at the collection level hrrr:horizons to represent the available horizon options.

I like the initial approach @hrodmn used here of representing the products as assets of the same logical item, but I think that we may be restricted by the interaction of the datacube extension with the STAC API specification. We could theoretically include the datacube extension information at the asset level of the collection document, but I think this might be a misuse of the intent for collection assets. I think it might make more sense from a discoverability and layout perspective to treat each HRRR product as its own collection with its own distinct set of variables, each with their own distinct levels.
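
As a rough sketch of that layout (every field value here is hypothetical, not a proposed schema), a per-product collection might carry something like:

# Hypothetical sketch of per-product collection fields with a level
# dimension and an hrrr:horizons property; values are illustrative only.
collection_sketch = {
    "id": "hrrr-wrfsfcf",
    "hrrr:horizons": list(range(0, 49)),  # available forecast horizons (FH)
    "cube:dimensions": {
        "level": {
            "type": "level",
            "values": ["surface", "2 m above ground", "250 mb"],
        },
    },
    "cube:variables": {
        "REFD": {"type": "data", "dimensions": ["level"]},
        "GUST": {"type": "data", "dimensions": ["level"]},
    },
}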

Collaborator Author

hrodmn commented Jun 4, 2024

Thanks @sharkinsspatial, it is really helpful to get the broader perspective on how the datacube extension fits into the STAC API spec.

I like the initial approach @hrodmn used here of representing the products as assets of the same logical item, but I think that we may be restricted by the interaction of the datacube extension with the STAC API specification.

I think we can apply the datacube extension to specific assets within an item - @sharkinsspatial can you elaborate on why we might be restricted by the STAC API spec?

I think it might make more sense from a discoverability and layout perspective to treat each HRRR product as its own collection with its own distinct set of variables, each with their own distinct levels.

My intent with the collection/item/asset breakdown was to make it simple to find all of the HRRR data that is relevant to a particular time/place with minimal search parameters in the STAC API. If that approach complicates the downstream process of making specific queries for individual assets, I think breaking it out into separate collections by product could make sense.

Include a collection level datacube extension with the GRIB2 variables available for that product. This datacube extension should include an additional dimension object representing the options for variable level.

One additional wrinkle I discovered today is that the variables within a single GRIB file vary by FH, i.e. there are 170 variables in the sfc product for FH0 but there are 173 variables for FH2+. How would we handle that with collection-level datacube metadata? Would we use the set of common variables across all items or include all possible variables?

@hrodmn hrodmn marked this pull request as draft June 5, 2024 11:14
@hrodmn hrodmn marked this pull request as ready for review June 11, 2024 18:18
@hrodmn hrodmn changed the title feat: define STAC metadata structure feat: define separate collections for each region x product x forecast hour set Jun 13, 2024
@hrodmn hrodmn marked this pull request as draft June 17, 2024 20:22
@hrodmn hrodmn removed the request for review from sharkinsspatial June 17, 2024 20:22
Collaborator Author

hrodmn commented Jun 18, 2024

Closing in favor of #6

@hrodmn hrodmn closed this Jun 18, 2024