This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

STAC catalog sprint: to-do items #1

Open
2 of 6 tasks
Tracked by #33
cisaacstern opened this issue Jul 26, 2021 · 13 comments

Comments

@cisaacstern
Member

cisaacstern commented Jul 26, 2021

  • Decide on mapping between storage target layout and STAC structure
    • Current target layout is {prefix}/pangeo-forge/{feedstock_name}/{recipe_name}.{dataset_type}
    • Clearly the pangeo-forge level should be represented by a STAC Catalog
    • It seems that current practice may already be deviating somewhat from the target layout structure as defined above. Specifically, storage targets tend to have an intermediate layer between pangeo-forge and {feedstock_name} corresponding to the project or, we might say, "collection" to which the recipe belongs. For example, in the case of the swot-adac project, each feedstock (i.e. the PR repo corresponding to a particular model output) is stored with a target path in the style of pangeo-forge/swot-adac/{feedstock_name}/{recipe_name}.zarr
    • This intermediate organization layer lends itself to representation by a STAC Collection (or, perhaps a sub-catalog of the primary catalog, but I have yet to confirm whether STAC Catalogs can wrap other Catalogs). Generally speaking, this "sub-catalog/collection" layer identifier has not been represented within the PR itself, but rather has just been "understood" by those working on the PRs, or in certain cases flagged using a GitHub label. We should figure out how we want to encode this information into incoming PRs. This topic is revisited in the next to-do item.
    • Finally, we'll need to decide on an approach (at least provisionally) for how to represent built recipes (i.e. zarr stores) in STAC. Having considered the thread at https://discourse.pangeo.io/t/stac-and-earth-systems-datasets/1472/14 at some length, and also discussed synchronously with @cedadev and as part of a breakout session at the ESIP Summer Meeting, I believe following the CEDA approach of representing Zarr stores as STAC Items does make sense for our use case, pending further experimentation to confirm that all of our relevant metadata (datetime ranges, e.g.) can be captured by an Item representation. My primary motivator here is a sense that maintaining the STAC model's full hierarchical "depth" will give us the flexibility to convey listings in a more intelligible fashion than would be possible otherwise.
  • Determine how & when (in the recipe workflow) we want to build STAC objects
    • I believe we'll ultimately end up wanting to include a build_stac method or similar within the XarrayZarrRecipe class, and that this method is something which we'll want to call at recipe build time (after finalize_target), so that it can draw upon additional metadata fields (either pre-existing, or which we will add for this purpose) in the recipe's meta.yaml file. While certain cataloging information can be extracted from the zarr metadata directly via xarray (e.g., spatiotemporal extent, variable names, etc.), other key metadata for building expressive STAC objects will need to be fed from outside the dataset (i.e., long-form description, provider url, license, etc.). Much of this is already captured in meta.yaml, so this is a natural place to pull it from.
    • For extracting metadata from xarray and building pystac-backed STAC objects, https://github.com/TomAugspurger/xstac is a good baseline, but may need extension for our specific use case(s)
  • Clarify how (technically and aesthetically) we want to run a STAC Browser instance alongside https://pangeo-forge.org
  • Consider staged rollout of where/how STAC objects are stored
  • Agree upon a recommended loading syntax
  • Customizing the STAC Browser instance
    • This may include CSS to match the Pangeo stylistic palette as well as templating adjustments if we want to include custom fields in the listings, such as Binder links to working notebook examples

cc @rabernat, @sharkinsspatial, @TomAugspurger so you're aware of current progress on this
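
To make the layout question in the first to-do item concrete, here is a minimal sketch that splits a storage target path into the components named above, and reports whether the informal intermediate "project" layer is present. The function and field names, the bucket, and the feedstock and recipe names are hypothetical; only the layout pattern itself comes from this issue.

```python
from typing import NamedTuple, Optional


class TargetPath(NamedTuple):
    prefix: str
    project: Optional[str]  # informal intermediate layer, e.g. "swot-adac"
    feedstock_name: str
    recipe_name: str
    dataset_type: str


def parse_target_path(path: str, root: str = "pangeo-forge") -> TargetPath:
    """Split a storage target path into its layout components.

    Accepts both the documented layout
    ({prefix}/pangeo-forge/{feedstock_name}/{recipe_name}.{dataset_type})
    and the variant with one extra layer after the root.
    """
    parts = path.strip("/").split("/")
    if root not in parts:
        raise ValueError(f"{root!r} not found in {path!r}")
    i = parts.index(root)
    prefix = "/".join(parts[:i])
    rest = parts[i + 1:]
    if len(rest) == 2:
        project = None
        feedstock, leaf = rest
    elif len(rest) == 3:
        project, feedstock, leaf = rest
    else:
        raise ValueError(f"unexpected number of path segments in {path!r}")
    recipe_name, sep, dataset_type = leaf.rpartition(".")
    if not sep:
        raise ValueError(f"no dataset type suffix in {leaf!r}")
    return TargetPath(prefix, project, feedstock, recipe_name, dataset_type)


# Hypothetical example path using the swot-adac style described above.
parsed = parse_target_path(
    "gs://my-bucket/pangeo-forge/swot-adac/my-feedstock/my-recipe.zarr"
)
```

A parser like this could be one way to audit existing targets for paths that carry the extra layer before deciding whether to change the paths or the spec.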

@TomAugspurger

I thought I left a comment here, but apparently not. The gist of it was about the item:

For extracting metadata from xarray and building pystac-backed STAC objects, https://github.com/TomAugspurger/xstac is a good baseline, but may need extension for our specific use case(s)

The datacube STAC extension applies to both Collections and Items. I think that xstac could be updated pretty easily to support items too. Let me know if you're interested in working on that.
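
For orientation, a STAC Item carrying datacube fields for a Zarr store might look roughly like the dictionary built below. This is a hand-written sketch using only the standard library, not xstac output; the helper name, the asset key, the `application/vnd+zarr` media type, and all example values (href, extents, variable names) are assumptions for illustration.

```python
import json

DATACUBE_SCHEMA = "https://stac-extensions.github.io/datacube/v2.0.0/schema.json"


def zarr_item(item_id, href, bbox, start, end, dimensions, variables):
    """Build a STAC Item dict describing one Zarr store."""
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "stac_extensions": [DATACUBE_SCHEMA],
        "id": item_id,
        "bbox": list(bbox),
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        "properties": {
            # STAC allows a null datetime when a start/end range is given.
            "datetime": None,
            "start_datetime": start,
            "end_datetime": end,
            "cube:dimensions": dimensions,
            "cube:variables": variables,
        },
        "links": [],
        "assets": {
            # Asset key and media type are assumptions, not settled here.
            "zarr": {"href": href, "type": "application/vnd+zarr", "roles": ["data"]},
        },
    }


# All values below are made up for illustration.
item = zarr_item(
    item_id="example-dataset",
    href="gs://my-bucket/pangeo-forge/my-feedstock/my-recipe.zarr",
    bbox=(-10.0, 30.0, 5.0, 45.0),
    start="2010-01-01T00:00:00Z",
    end="2010-12-31T00:00:00Z",
    dimensions={
        "time": {"type": "temporal", "extent": ["2010-01-01T00:00:00Z", "2010-12-31T00:00:00Z"]},
        "x": {"type": "spatial", "axis": "x", "extent": [-10.0, 5.0]},
        "y": {"type": "spatial", "axis": "y", "extent": [30.0, 45.0]},
    },
    variables={"temp": {"type": "data", "dimensions": ["time", "y", "x"]}},
)
print(json.dumps(item, indent=2))
```

Whether a datetime range plus these cube fields covers all the metadata we care about is exactly the open question about the Item representation.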

@cisaacstern
Member Author

Let me know if you're interested in working on that.

Thanks, Tom. I definitely am. Once I get a little closer to figuring out what functionality we need, I'll follow up directly on xstac with an Issue or draft PR for discussion.

@rabernat

This is a great list. I'd love to try to help however I can. A few comments.

  • It seems that current practice may already be deviating somewhat from the target layout structure as defined above.

We should fix this! We are still very early in this project. We don't need to worry about preserving bad decisions to maintain some vague backwards compatibility. We should get consistent by either changing the file paths or changing the spec. pangeo-forge/swot-adac/{feedstock_name}/{recipe_name}.zarr is out of spec with ADR 3. There is one outstanding question in ADR 3, which we may now be ready to revisit: do we want to allow recipe_name to include "sub-directories"?

  • I have yet to confirm whether STAC Catalogs can wrap other Catalogs

Yes they can. You can have as many layers of nesting as you want, by linking to other catalogs from catalogs. See STAC catalog overview.

  • I believe following the CEDA approach of representing Zarr stores as STAC Items does make sense for our use case,

👍 from me

  • A database with an API endpoint makes the most sense middle-term, but on the very short term, a simpler solution which allows us to hammer-out other points on this list is to store prototype catalogs on GitHub.

👍 from me

  • The pystac-based loading linked above currently appears to be the best way to open zarr datasets from STAC

Can pystac load Zarr STAC items?

@cisaacstern
Member Author

Yes they can. You can have as many layers of nesting as you want, by linking to other catalogs from catalogs. See STAC catalog overview.

Yeah! Figured this out somewhere along the way too. My current thought (formal write-up to follow, which can tie in your ADR notes above) ... is that our mapping should be:

| Layer | Object Name | STAC Type | Example |
| --- | --- | --- | --- |
| top | pangeo-forge-catalog.json | Catalog | pangeo-forge-catalog.json |
| middle-high | {{ feedstock_name }}-catalog.json | Catalog | swot-adac-catalog.json |
| middle-low | {{ collection_name }}-collection.json | Collection | gigatl-collection.json |
| bottom | {{ dataset_unique_identifier }}.json | Item | region01-surf-fma.json |

This is the pattern I've followed in https://github.com/pangeo-forge/pangeo-forge-catalog/tree/dev/stac (which I just pushed, for discussion purposes).
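
As a rough illustration of how the four layers connect, the sketch below builds the three container objects as plain dictionaries and wires them together with `child` and `item` links, using the example file names from the mapping above. The helper names, descriptions, and the placeholder license and extent values are assumptions; the files actually pushed to the dev branch may differ.

```python
STAC_VERSION = "1.0.0"


def make_catalog(cat_id, description):
    """Minimal STAC Catalog dict."""
    return {
        "type": "Catalog",
        "stac_version": STAC_VERSION,
        "id": cat_id,
        "description": description,
        "links": [],
    }


def make_collection(coll_id, description):
    """Minimal STAC Collection dict with placeholder license and extent."""
    return {
        "type": "Collection",
        "stac_version": STAC_VERSION,
        "id": coll_id,
        "description": description,
        "license": "various",  # placeholder, not the real license
        "extent": {
            "spatial": {"bbox": [[-180, -90, 180, 90]]},  # placeholder
            "temporal": {"interval": [[None, None]]},  # placeholder
        },
        "links": [],
    }


def add_link(obj, rel, href, media_type="application/json"):
    """Append a relative link from one STAC object to another."""
    obj["links"].append({"rel": rel, "href": href, "type": media_type})


root = make_catalog("pangeo-forge", "Root catalog (top layer)")
swot = make_catalog("swot-adac", "Sub-catalog (middle-high layer)")
gigatl = make_collection("gigatl", "Collection (middle-low layer)")

# Catalogs nest by linking: Catalog -> Catalog -> Collection -> Item.
add_link(root, "child", "./swot-adac-catalog.json")
add_link(swot, "child", "./gigatl-collection.json")
add_link(gigatl, "item", "./region01-surf-fma.json", "application/geo+json")
```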

Can pystac load Zarr STAC items?

It requires a few lines (or wrapping them in a function), but yeah! Check this out: https://nbviewer.jupyter.org/github/cisaacstern/stac-notebooks/blob/gigatl-reg01-surf-fma/example_notebook.ipynb (where opener is defined here).
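
For readers who do not open the notebook, the general shape of such a wrapper might be something like the following. This is not the `opener` from the linked notebook; it is a separate sketch that works on an Item's dictionary form, and the media type check and the `.zarr` suffix fallback are assumptions about how the Zarr asset could be recognized.

```python
ZARR_MEDIA_TYPES = {"application/vnd+zarr"}  # assumed media type for Zarr assets


def find_zarr_href(item: dict) -> str:
    """Return the href of the first asset that looks like a Zarr store."""
    for asset in item.get("assets", {}).values():
        href = asset.get("href", "")
        if asset.get("type") in ZARR_MEDIA_TYPES or href.rstrip("/").endswith(".zarr"):
            return href
    raise KeyError("no Zarr asset found on this Item")


def open_item(item: dict, **kwargs):
    """Open the Item's Zarr asset lazily with xarray.

    Requires xarray, zarr, fsspec, and the fsspec backend for the
    storage protocol in the href (e.g. gcsfs for gs://).
    """
    import fsspec
    import xarray as xr

    return xr.open_zarr(fsspec.get_mapper(find_zarr_href(item)), **kwargs)
```

With a pystac Item in hand, `open_item(item.to_dict())` would be one way to call it.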

@cisaacstern
Member Author

  • Clarify how (technically and aesthetically) we want to run a STAC Browser instance

A prototyped version of this was completed by pangeo-forge/pangeo-forge-vue-website#6 and should be usable at https://pangeo-forge.org/catalog once Netlify rebuilds. As noted in my last comment on that PR, this code will likely need to be refactored once STAC Browser components become installable directly from npm.

  • Consider staged rollout of where/how STAC objects are stored

With Ryan's thumbs up on using GitHub to start, I'm going to consider that the agreed-upon approach for the time being. This is where I'm directing our STAC Browser to the Pangeo Forge root catalog: https://github.com/pangeo-forge/pangeo-forge-vue-website/blob/main/src/main.js#L14. This can be changed over to a database-backed API whenever we see fit.

  • Customizing the STAC Browser instance

I've recorded a number of specific points for consideration on this topic here: pangeo-forge/pangeo-forge-vue-website#6 (comment).

Ok! More to follow tomorrow. That was a big push and I think I'm ready for a break. 😅

@rabernat

That was a big push and I think I'm ready for a break

Well deserved! 🏆 for pushing this difficult and uncertain task forward with minimal guidance. 👏 👏 👏

@cisaacstern
Member Author

Following pangeo-forge/pangeo-forge-vue-website#7, prototype catalog is now up at https://pangeo-forge.org/catalog#/.

@sharkinsspatial

This is great @cisaacstern 🎊 . Apologies for the radio silence while you have been building this out; I've been out of the office (STAC is the one area of this project where I actually have some experience and might be able to contribute :]). I'll try to address points in the order described in your initial comment.

  • I think the hierarchical structure you have described for static STAC catalogs here is spot on. I would suggest, however, that we try to align the JSON file naming with the STAC Best Practices document. Though this generic naming convention is not dictated by the spec, it is the convention used by most large catalogs. This change would result in

    | Layer | Object Name | STAC Type | Example |
    | --- | --- | --- | --- |
    | top | catalog.json | Catalog | /pangeo-forge/catalog.json |
    | middle-high | /pangeo-forge/{{ feedstock_name }}/catalog.json | Catalog | /pangeo-forge/swot-adac/catalog.json |
    | middle-low | /pangeo-forge/{{ collection_name }}/collection.json | Collection | /pangeo-forge/gigatl/collection.json |
    | bottom | {{ dataset_unique_identifier }}.json | Item | region01-surf-fma.json |

  • We are in a somewhat unique position because we are creating a catalog referencing subcatalogs for data hosted on a variety of cloud storage providers (there is a legacy issue discussing this case in stac-spec that I will try to locate). There are open questions about whether static STAC records should be stored inline with their corresponding data or in a unique storage location. I feel the most flexible solution would be for bakeries to provide a storage target specifically for STAC records. With this in place, bakery managers could manage catalogs for their own holdings while the pangeo-forge product could manage an overarching confederated catalog which references all of the bakery catalogs. The one complicating factor here will be the need for additional logic around PySTAC's StacIO to support inferring the correct read and write protocols for different cloud providers.

  • I suggest the building of STAC records should be a separate CI workflow rather than being included directly as a method in pangeo-forge-recipes. With this approach we still have direct access to the meta.yaml as part of the CI context and we can support the cross-organization record storage described above. One step would be responsible for submitting a flow to the bakery in order to build the STAC catalog and a second step would create the reference link for this catalog in the central pangeo-forge catalog.

  • I support the use of GitHub for initial STAC record storage prototyping, but we should attempt to stand up centralized object storage for STAC records for pangeo-forge and the bakeries to facilitate PySTAC I/O usage.
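
On the point about inferring read and write protocols, the core of that logic could be as small as the sketch below, which derives the storage protocol from an href and delegates reading to fsspec. The function names are hypothetical, and how this would plug into PySTAC's StacIO is left open; this only shows the inference step.

```python
from urllib.parse import urlparse


def infer_protocol(href: str) -> str:
    """Guess the storage protocol of a STAC href from its URL scheme."""
    scheme = urlparse(href).scheme.lower()
    # An empty scheme is a local path; a one-letter "scheme" is most
    # likely a Windows drive letter, so treat both as local files.
    return scheme if len(scheme) > 1 else "file"


def read_text(href: str) -> str:
    """Read a STAC JSON document from any fsspec-supported location.

    Requires fsspec plus the backend for the inferred protocol
    (e.g. s3fs for s3://, gcsfs for gs://).
    """
    import fsspec

    with fsspec.open(href, "r") as f:
        return f.read()
```

A confederated root catalog could then link to bakery catalogs on different providers, with each href resolved by whichever backend its scheme implies.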

@cisaacstern
Member Author

Thanks for this incredibly helpful perspective, @sharkinsspatial. There's a lot to dig into, but one small point of clarification to start. Option A below is the Collection naming scheme as proposed in your table above. Is it indeed a STAC Best Practice to not store a Collection object within a subdirectory of its enclosing Catalog? Option B seems more intuitive to me, but of course just want to do whatever is considered mainstream within the ecosystem.

| Option | Object Name | STAC Type | Example |
| --- | --- | --- | --- |
| A | /pangeo-forge/{{ collection_name }}/collection.json | Collection | /pangeo-forge/gigatl/collection.json |
| B | /pangeo-forge/{{ enclosing_catalog_name }}/{{ collection_name }}/collection.json | Collection | /pangeo-forge/swot_adac/gigatl/collection.json |

@TomAugspurger

I may be off-base, but radiantearth/stac-api-spec#159 might be related to the catalog / collections layout discussion.

@sharkinsspatial

@cisaacstern Apologies, that is a typo in my comment. Your nested collection structure is the correct name.
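
With the nested structure (Option B) confirmed, the generic file naming can be captured in a few path helpers. This is a sketch with hypothetical function names; placing the Item file directly beside its collection.json is also an assumption, since the thread only fixes the Item's file name.

```python
from posixpath import join


def catalog_path(root, *catalog_names):
    """Path of a catalog.json at the root or at any nesting depth."""
    return join(root, *catalog_names, "catalog.json")


def collection_path(root, enclosing_catalog, collection_name):
    """Path of a collection.json nested under its enclosing catalog."""
    return join(root, enclosing_catalog, collection_name, "collection.json")


def item_path(root, enclosing_catalog, collection_name, item_id):
    """Path of an Item file, placed next to its collection.json (assumption)."""
    return join(root, enclosing_catalog, collection_name, f"{item_id}.json")


paths = [
    catalog_path("/pangeo-forge"),
    catalog_path("/pangeo-forge", "swot_adac"),
    collection_path("/pangeo-forge", "swot_adac", "gigatl"),
    item_path("/pangeo-forge", "swot_adac", "gigatl", "region01-surf-fma"),
]
```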

@cisaacstern
Member Author

Thanks to Tom for stac-utils/xstac#11 (comment) which will be of great help in generating STAC Items.

@cisaacstern
Member Author

  • Determine how & when (in the recipe workflow) we want to build STAC objects

I believe the best way to build this is as a standalone GitHub Action, to be called following completion of https://github.com/pangeo-forge/feedstock-creation-action here: staged-recipes/create-feedstock.yaml

A standalone Action repo should make local testing with https://github.com/nektos/act easier and means we can maintain/update/etc. xstac JSON templates without having to commit them to pangeo-forge/staged-recipes (NB: keeping the staged-recipes commit history clean is important, as mentioned in pangeo-forge/staged-recipes#80 (review)).

Here is the WIP repo for this Action: https://github.com/pangeo-forge/stac-creation-action. Updates to follow shortly.
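
Wherever the build step ends up running, it has to combine the two metadata sources discussed earlier in this thread: fields supplied from outside the dataset (description, provider URL, license) and fields derived from the dataset itself (e.g. temporal extent). The sketch below shows that merge on already-parsed dictionaries. The meta.yaml key names, the function name, and the example values are all hypothetical, and this is not the code in the WIP Action repo.

```python
# Hypothetical meta.yaml keys; the real file may name these differently.
REQUIRED_META_KEYS = ("description", "license", "provider_name", "provider_url")


def build_item_properties(meta: dict, derived: dict) -> dict:
    """Merge external metadata with dataset-derived metadata.

    `meta` stands in for a parsed meta.yaml; `derived` stands in for
    values extracted from the Zarr store (for example via xarray).
    """
    missing = [key for key in REQUIRED_META_KEYS if key not in meta]
    if missing:
        raise ValueError(f"meta.yaml is missing: {', '.join(missing)}")
    return {
        "description": meta["description"],
        "license": meta["license"],
        "providers": [
            {"name": meta["provider_name"], "url": meta["provider_url"], "roles": ["producer"]}
        ],
        "datetime": None,  # null is allowed when a start/end range is given
        "start_datetime": derived["start_datetime"],
        "end_datetime": derived["end_datetime"],
    }


# Made-up example values.
properties = build_item_properties(
    meta={
        "description": "Example model output",
        "license": "CC-BY-4.0",
        "provider_name": "Example Lab",
        "provider_url": "https://example.org",
    },
    derived={
        "start_datetime": "2010-01-01T00:00:00Z",
        "end_datetime": "2010-12-31T00:00:00Z",
    },
)
```

Failing early on missing meta.yaml fields would surface incomplete recipe metadata at PR time rather than after a catalog entry has been published.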
