Adding support for Zarr datasets #70

Closed
charlesbluca opened this issue Dec 21, 2020 · 6 comments · Fixed by #90
Labels
discussion enhancement New feature or request

Comments

@charlesbluca

As part of Pangeo's general integration of STAC, we currently have a STAC Catalog roughly mirroring Pangeo's Intake catalogs, as well as support for rendering Zarr metadata with STAC Browser. Another major step forward with this integration would be adding support to load Zarr datasets through Intake-STAC.

What steps need to be taken to make this happen? At the moment, Zarr datasets are represented in STAC as Collections with a single asset: a link to the Zarr dataset's consolidated metadata file, with the role zarr-consolidated-metadata. An example:

{
  "stac_version": "1.0.0-beta.2",
  "stac_extensions": [
    "collection-assets"
  ],
  "id": "sea_surface_height",
  "title": "sea-surface altimetry data from The Copernicus Marine Environment",
  "description": "",
  "keywords": [],
  "extent": {
    "spatial": {
      "bbox": [
        []
      ]
    },
    "temporal": {
      "interval": [
        []
      ]
    }
  },
  ...
  "assets": {
    "zmetadata": {
      "href": "https://storage.googleapis.com/pangeo-cmems-duacs/.zmetadata",
      "description": "Consolidated metadata file for Zarr store",
      "type": "application/json",
      "roles": [
        "metadata",
        "zarr-consolidated-metadata"
      ]
    }
  }
}

Some random obstacles that come to mind:

  • How can storage options (such as requester pays status) be specified for an individual dataset?
  • Should we only be focusing on consolidated Zarr datasets, or generalize Zarr representation in STAC to encompass non-consolidated datasets as well?
  • Spatial/temporal extent can probably be decided by looking at the Zarr metadata, but if this is impossible should it just default to the widest possible ranges?
@scottyhq
Collaborator

scottyhq commented Dec 22, 2020

Thanks for opening this, @charlesbluca! We definitely want to get this functionality into intake-stac. Some specific responses and other thoughts are below.

First, how is intake-STAC currently structured? We open the entire catalog (or collection), but each asset is mapped to an intake driver (many of them specified in the intake-xarray library). Your example asset has "type": "application/json", which currently maps to being opened as a plain text file. I understand the metadata is JSON, but would it be better to use 'application/zarr': 'zarr'?

'application/json': 'textfiles',
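
As a rough sketch of what that mapping looks like with a Zarr entry added: the table below is a hypothetical subset written for illustration, not intake-stac's actual code. Only the 'application/json' and geotiff entries appear in this thread; the 'application/zarr' entry is the proposal, and the helper name driver_for_asset is made up.

```python
# Hypothetical subset of a media-type -> intake driver table.
# 'application/zarr' is the proposed addition, not an existing entry.
DRIVERS = {
    "application/json": "textfiles",
    "image/vnd.stac.geotiff; cloud-optimized=true": "rasterio",
    "application/zarr": "zarr",  # proposed in this issue
}


def driver_for_asset(asset):
    """Return the driver name for a STAC asset dict, or None if unmapped."""
    return DRIVERS.get(asset.get("type"))


# The asset from the example Collection above.
zmetadata_asset = {
    "href": "https://storage.googleapis.com/pangeo-cmems-duacs/.zmetadata",
    "type": "application/json",
}

# With the current type, the asset is treated as a plain text file.
print(driver_for_asset(zmetadata_asset))  # textfiles

# Relabelling the asset would route it to the zarr driver instead.
relabelled = dict(zmetadata_asset, type="application/zarr")
print(driver_for_asset(relabelled))  # zarr
```

Under this assumption, the only change needed on the catalog side would be the asset's "type" value.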

I think functionality will need to be coordinated with @martindurant's open PR in intake-xarray to use xr.open_dataset(engine='zarr')

How can storage options (such as requester pays status) be specified for an individual dataset?

Not sure this needs to be in the metadata, but this would be specified as a global fsspec/s3fs setting in your code.

Should we only be focusing on consolidated Zarr datasets, or generalize Zarr representation in STAC to encompass non-consolidated datasets as well?

In your current cataloging efforts, is there any Collection or Item that points to a simple asset with a .zarr 'href'? I think we'd want to come up with a solution that works for both.

Spatial/temporal extent can probably be decided by looking at the Zarr metadata, but if this is impossible should it just default to the widest possible ranges?

My understanding is that any STAC Item needs a bbox and a datetime property, but a Collection (your example) does not. I'm not up to speed on the STAC Zarr discussion, so hopefully @rabernat or @matthewhanson can clarify, but the widest possible ranges seem reasonable to me.

@charlesbluca
Author

Thanks for these clarifications!

The current efforts don't have any .zarr href assets - just for clarity, those are essentially equivalent to an href to the directory containing the Zarr group, right?

I'm interested in the arguments provided within the type for some of the drivers; for example:

'image/vnd.stac.geotiff; cloud-optimized=true': 'rasterio',

Are things like cloud-optimized=true just there for user clarity, or could they be used to control the behavior of the driver itself? My thought was that we could use .zarr style hrefs as the standard, and have something like 'application/zarr; consolidated=*' used to decide whether the driver will handle the Zarr group as consolidated or not (which in most cases is just the difference between .zmetadata or .zgroup appended to the href).
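
To make the idea concrete, here is a minimal sketch of how a media type parameter could steer the driver. Everything in it is an assumption for illustration: 'application/zarr' is not a registered media type, the function names are hypothetical, and treating a missing consolidated parameter as "not consolidated" is just one possible default.

```python
def parse_media_type(media_type):
    """Split 'type/subtype; key=value' into (base_type, params_dict)."""
    base, *raw_params = [part.strip() for part in media_type.split(";")]
    params = {}
    for raw in raw_params:
        if not raw:
            continue  # tolerate a trailing ';'
        key, _, value = raw.partition("=")
        params[key.strip().lower()] = value.strip().strip('"')
    return base.lower(), params


def zarr_metadata_href(href, media_type):
    """Append .zmetadata or .zgroup to a .zarr-style href (hypothetical rule)."""
    base, params = parse_media_type(media_type)
    if base != "application/zarr":
        raise ValueError(f"not a Zarr media type: {media_type!r}")
    # Assumption: an absent parameter means the store is not consolidated.
    consolidated = params.get("consolidated", "false").lower() == "true"
    suffix = ".zmetadata" if consolidated else ".zgroup"
    return href.rstrip("/") + "/" + suffix


print(parse_media_type("image/vnd.stac.geotiff; cloud-optimized=true"))
print(zarr_metadata_href(
    "https://storage.googleapis.com/pangeo-cmems-duacs",
    "application/zarr; consolidated=true",
))
```

The same parsing would work for the existing cloud-optimized=true parameter, so the driver lookup could key on the base type and pass the parameters along as options.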

As for the extent, I was basing the current use of the widest possible ranges on some notebooks @rabernat put together, so I think I'll go ahead with that method unless there are any problems later on.

I'll certainly check out the open PR to intake-xarray.

@scottyhq
Collaborator

The current efforts don't have any .zarr href assets - just for clarity, those are essentially equivalent to an href to the directory containing the Zarr group, right?

Yes. Just thinking of how we might get a PR started for this with a couple simple test files, something with consolidated metadata and another without. Note also, if the STAC Asset omits 'type' we currently make a guess based on URL suffix:

if entry_type in ['', 'null', None]:
    suffix = os.path.splitext(entry['href'])[-1]
    if suffix in ['.nc', '.h5', '.hdf']:
        entry['type'] = 'application/netcdf'
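
One way this guess could be extended to Zarr, as a sketch only: the function name, the 'application/zarr' media type, and the rule that a .zmetadata or .zgroup file name implies Zarr are all assumptions, not current intake-stac behavior. Note that os.path.splitext returns an empty suffix for names like .zmetadata, because a leading dot is not treated as an extension separator, so the file name has to be checked separately.

```python
import posixpath
from urllib.parse import urlparse

NETCDF_SUFFIXES = {".nc", ".h5", ".hdf"}
ZARR_METADATA_NAMES = {".zmetadata", ".zgroup"}


def guess_asset_type(entry):
    """Fill in entry['type'] from the href when the asset omits it."""
    if entry.get("type") not in ("", "null", None):
        return entry  # an explicit type always wins

    # Use only the URL path so query strings do not confuse the guess.
    path = urlparse(entry["href"]).path.rstrip("/")
    name = posixpath.basename(path)
    suffix = posixpath.splitext(name)[-1]

    if suffix in NETCDF_SUFFIXES:
        entry["type"] = "application/netcdf"
    elif suffix == ".zarr" or name in ZARR_METADATA_NAMES:
        entry["type"] = "application/zarr"  # hypothetical media type
    return entry


consolidated = guess_asset_type(
    {"href": "https://storage.googleapis.com/pangeo-cmems-duacs/.zmetadata"}
)
print(consolidated["type"])  # application/zarr
```

Both the consolidated and the non-consolidated test files would then resolve to the same driver without needing a 'type' in the catalog.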

have something like 'application/zarr; consolidated=*' used to decide whether the driver will handle the Zarr group as consolidated or not (which in most cases is just the difference between .zmetadata or .zgroup appended to the href).

This seems like an interesting idea! I don't know much about media types, but there is a lengthy discussion on allowable formatting for COG mimetypes in radiantearth/stac-spec#251 (and linked issues).

Would be keen to hear @andersy005 and @wildintellect 's thoughts here as well!

@martindurant
Member

If you do end up using the content type to discriminate, please link with intake/intake#494, which is a similar idea but not very fleshed out.

@charlesbluca
Author

Thanks for sharing the discussion on mimetypes going on within STAC spec - it would probably be good to get some perspective from the general STAC maintainers on what they consider allowable in terms of specifying consolidated versus non-consolidated Zarr (or if it even needs to be done if it can be guessed based on URL).

@martindurant, do you think it would be worthwhile to make progress on the mimetype handling within Intake and use that to handle the functionality of Intake-STAC's initialization of a StacEntry? It seems like they are both accomplishing similar tasks.

@martindurant
Member

do you think it would be worthwhile to make progress on the mimetype handling within Intake and use that to handle the functionality of Intake-STAC's

Maybe? It wasn't too clear how to continue there, since MIME is not a very good spec, but where we do have a good spec, as is being discussed here, it would make sense to use a registry for dispatch. It doesn't make much difference whether this registry lives in intake or here until a broader set of types might be added.
