Adding support for Zarr datasets #70

charlesbluca · 2020-12-21T22:03:25Z

As part of Pangeo's general integration of STAC, we currently have a STAC Catalog roughly mirroring Pangeo's Intake catalogs, as well as support for rendering Zarr metadata with STAC Browser. Another major step forward with this integration would be adding support to load Zarr datasets through Intake-STAC.

What steps need to be taken to make this happen? At the moment, Zarr datasets are represented in STAC as Collections with a single asset: a link to the consolidated metadata file of the Zarr dataset, with a role of zarr-consolidated-metadata. Here is an example:

{
  "stac_version": "1.0.0-beta.2",
  "stac_extensions": [
    "collection-assets"
  ],
  "id": "sea_surface_height",
  "title": "sea-surface altimetry data from The Copernicus Marine Environment",
  "description": "",
  "keywords": [],
  "extent": {
    "spatial": {
      "bbox": [
        []
      ]
    },
    "temporal": {
      "interval": [
        []
      ]
    }
  },
  ...
  "assets": {
    "zmetadata": {
      "href": "https://storage.googleapis.com/pangeo-cmems-duacs/.zmetadata",
      "description": "Consolidated metadata file for Zarr store",
      "type": "application/json",
      "roles": [
        "metadata",
        "zarr-consolidated-metadata"
      ]
    }
  }
}
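As a rough illustration of what a client would have to do with an asset like this, the sketch below derives the Zarr store root from the `zmetadata` href. The helper name `zarr_store_url` is hypothetical and not part of intake-stac; it assumes the asset always points at a `.zmetadata` file directly under the store root.

```python
def zarr_store_url(asset: dict) -> str:
    """Return the Zarr store root for a consolidated-metadata asset.

    Hypothetical helper: assumes the href ends in "/.zmetadata" and that
    the asset carries the "zarr-consolidated-metadata" role.
    """
    if "zarr-consolidated-metadata" not in asset.get("roles", []):
        raise ValueError("asset is not a consolidated Zarr metadata file")
    suffix = "/.zmetadata"
    href = asset["href"]
    if not href.endswith(suffix):
        raise ValueError(f"expected href ending in {suffix!r}, got {href!r}")
    # Strip the metadata file name to get the store root.
    return href[: -len(suffix)]


# The asset from the Collection above.
asset = {
    "href": "https://storage.googleapis.com/pangeo-cmems-duacs/.zmetadata",
    "type": "application/json",
    "roles": ["metadata", "zarr-consolidated-metadata"],
}
store = zarr_store_url(asset)
print(store)

# Opening the store needs xarray, zarr, fsspec and network access (not run here):
#   import fsspec
#   import xarray as xr
#   ds = xr.open_zarr(fsspec.get_mapper(store), consolidated=True)
```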

Some random obstacles that come to mind:

How can storage options (such as requester pays status) be specified for an individual dataset?
Should we only be focusing on consolidated Zarr datasets, or generalize Zarr representation in STAC to encompass non-consolidated datasets as well?
Spatial/temporal extent can probably be decided by looking at the Zarr metadata, but if this is impossible should it just default to the widest possible ranges?


scottyhq · 2020-12-22T00:13:58Z

Thanks for opening this, @charlesbluca! We definitely want to get this functionality into intake-stac. Some specific responses and other thoughts below.

First, how is intake-stac currently structured? We open the entire catalog (or collection), but each asset is mapped to an intake driver (many are specified in the intake-xarray library). Your example asset has "type": "application/json", which currently maps to being opened as a text file. I understand the metadata is JSON, but would it be better to use 'application/zarr': 'zarr'?

intake-stac/intake_stac/catalog.py

Line 36 in 3b0b181

'application/json': 'textfiles',

I think functionality will need to be coordinated with @martindurant's open PR in intake-xarray to use xr.open_dataset(engine='zarr')
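To make the mapping concrete, here is a minimal sketch of a media-type-to-driver lookup. Only the `application/json` entry is taken from the catalog.py line quoted above; the `application/zarr` entry is the proposal from this comment, and the `driver_for` helper and its fallback are assumptions for illustration rather than intake-stac's actual code.

```python
# Media type -> intake driver name.
DRIVERS = {
    "application/json": "textfiles",  # quoted from catalog.py at 3b0b181
    "application/zarr": "zarr",       # proposed here, not an existing entry
}


def driver_for(asset: dict, default: str = "textfiles") -> str:
    """Pick a driver name from the asset's 'type' field.

    The fallback to `default` is an assumption made for this sketch.
    """
    return DRIVERS.get(asset.get("type"), default)


# The example asset is currently treated as plain text ...
print(driver_for({"type": "application/json"}))
# ... whereas the proposed media type would select the Zarr driver.
print(driver_for({"type": "application/zarr"}))
```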

> How can storage options (such as requester pays status) be specified for an individual dataset?

Not sure this needs to be in the metadata, but this would be specified as a global fsspec/s3fs setting in your code.

> Should we only be focusing on consolidated Zarr datasets, or generalize Zarr representation in STAC to encompass non-consolidated datasets as well?

In your current cataloging efforts, is there any Collection or Item that points to a simple asset with a .zarr 'href'? I think we'd want to come up with a solution that works for both.

> Spatial/temporal extent can probably be decided by looking at the Zarr metadata, but if this is impossible should it just default to the widest possible ranges?

My understanding is that any STAC Item needs a bbox and a datetime property, but a Collection (your example) does not. I'm not up to speed on the STAC Zarr discussion, so hopefully @rabernat or @matthewhanson can clarify, but the widest possible ranges seem reasonable to me.

charlesbluca · 2020-12-22T18:08:22Z

Thanks for these clarifications!

The current efforts don't have any .zarr href assets - just for clarity, those are essentially equivalent to an href to the directory containing the Zarr group, right?

I'm interested in the arguments provided within the type for some of the drivers; for example:

intake-stac/intake_stac/catalog.py

Line 25 in 3b0b181

'image/vnd.stac.geotiff; cloud-optimized=true': 'rasterio',

Are things like cloud-optimized=true just there for user clarity, or could they be used to control the behavior of the driver itself? My thought was that we could use .zarr style hrefs as the standard, and have something like 'application/zarr; consolidated=*' used to decide whether the driver will handle the Zarr group as consolidated or not (which in most cases is just the difference between .zmetadata or .zgroup appended to the href).
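A small sketch of how such a parameter could steer the driver: split the media type into its base type and parameters, then choose the metadata key from the `consolidated` parameter. The `application/zarr` media type and its `consolidated` parameter are the proposal from this comment, not an established convention, and treating a missing parameter as non-consolidated is an assumption of this sketch.

```python
def parse_media_type(media_type: str):
    """Split 'type/subtype; key=value; ...' into (base, params)."""
    base, *params = [part.strip() for part in media_type.split(";")]
    options = {}
    for param in params:
        if not param:
            continue  # tolerate a trailing ';'
        key, _, value = param.partition("=")
        options[key.strip().lower()] = value.strip().strip('"')
    return base.lower(), options


def zarr_metadata_key(media_type: str) -> str:
    """Return the metadata file to read for a (hypothetical) Zarr media type."""
    base, options = parse_media_type(media_type)
    if base != "application/zarr":
        raise ValueError(f"not a Zarr media type: {media_type!r}")
    # Assumption: a missing 'consolidated' parameter means non-consolidated.
    consolidated = options.get("consolidated", "false").lower() == "true"
    return ".zmetadata" if consolidated else ".zgroup"


print(parse_media_type("image/vnd.stac.geotiff; cloud-optimized=true"))
print(zarr_metadata_key("application/zarr; consolidated=true"))
print(zarr_metadata_key("application/zarr; consolidated=false"))
```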

As for the extent, I was basing the current use of the widest possible ranges on some notebooks @rabernat put together, so I think I'll go ahead with that method unless there are any problems later on.

I'll certainly check out the open PR to intake-xarray.

scottyhq · 2020-12-22T18:29:49Z

> The current efforts don't have any .zarr href assets - just for clarity, those are essentially equivalent to an href to the directory containing the Zarr group, right?

Yes. Just thinking of how we might get a PR started for this with a couple simple test files, something with consolidated metadata and another without. Note also, if the STAC Asset omits 'type' we currently make a guess based on URL suffix:

intake-stac/intake_stac/catalog.py

Lines 446 to 450 in 3b0b181

    
if entry_type in ['', 'null', None]:
    suffix = os.path.splitext(entry['href'])[-1]
    if suffix in ['.nc', '.h5', '.hdf']:
        entry['type'] = 'application/netcdf'
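For comparison, here is a standalone sketch of how the same suffix guess could be extended to `.zarr` hrefs. The three netCDF/HDF suffixes come from the snippet above; the `.zarr` entry, the `application/zarr` media type, the `guess_type` helper, and the example.com URLs are all hypothetical. A trailing slash is stripped first, because `os.path.splitext` returns an empty suffix for a path that ends in `/`.

```python
import os
from urllib.parse import urlparse

SUFFIX_TYPES = {
    ".nc": "application/netcdf",   # from the snippet above
    ".h5": "application/netcdf",   # from the snippet above
    ".hdf": "application/netcdf",  # from the snippet above
    ".zarr": "application/zarr",   # hypothetical addition
}


def guess_type(entry: dict):
    """Return the asset's media type, guessing from the href suffix if unset."""
    entry_type = entry.get("type")
    if entry_type in ["", "null", None]:
        # Zarr stores are directories, so the href may end with '/'.
        path = urlparse(entry["href"]).path.rstrip("/")
        suffix = os.path.splitext(path)[-1]
        entry_type = SUFFIX_TYPES.get(suffix, entry_type)
    return entry_type


print(guess_type({"href": "https://example.com/data/file.nc"}))
print(guess_type({"href": "https://example.com/data/store.zarr/"}))
```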

> have something like 'application/zarr; consolidated=*' used to decide whether the driver will handle the Zarr group as consolidated or not (which in most cases is just the difference between .zmetadata or .zgroup appended to the href).

This seems like an interesting idea! I don't know much about media types, but there is a lengthy discussion on allowable formatting for COG mimetypes in radiantearth/stac-spec#251 (and linked issues).

Would be keen to hear @andersy005's and @wildintellect's thoughts here as well!

martindurant · 2020-12-22T18:34:24Z

If you do end up using the content type to discriminate, please link with intake/intake#494, which is a similar idea but not very fleshed out.

charlesbluca · 2020-12-22T20:19:16Z

Thanks for sharing the discussion on mimetypes going on within STAC spec - it would probably be good to get some perspective from the general STAC maintainers on what they consider allowable in terms of specifying consolidated versus non-consolidated Zarr (or if it even needs to be done if it can be guessed based on URL).

@martindurant, do you think it would be worthwhile to make progress on the mimetype handling within Intake and use that to handle the functionality of Intake-STAC's initialization of a StacEntry? It seems like they are both accomplishing similar tasks.

martindurant · 2020-12-22T20:33:24Z

> do you think it would be worthwhile to make progress on the mimetype handling within Intake and use that to handle the functionality of Intake-STAC's

Maybe? It wasn't too clear how to continue there, since MIME is not a very good spec, but where we do have a good spec, as is being discussed here, it would make sense to use a registry for dispatch. It doesn't make much difference whether this registry lives in intake or here until a broader set of types might be added.

scottyhq added the discussion and enhancement (New feature or request) labels Jan 19, 2021

TomAugspurger mentioned this issue Jun 21, 2021

Load collection-level assets into xarray #90

Merged

scottyhq closed this as completed in #90 Oct 20, 2021
