0002-use-meta-yaml-to-track-feedstock-metadata.md

File metadata and controls

228 lines (168 loc) · 8.03 KB

1. Use meta.yaml to track feedstock metadata

Date: 2021-03-27

Status

Proposed

Context

We need a way to keep track of metadata associated with each feedstock repo.

Goals

  • The metadata should describe everything we need to know about the recipes that is not contained in the recipe python code itself.
  • It should facilitate generation of a STAC collection for the data that comes out of the recipe.
  • It should tell the bakery everything it needs to know to run the recipe.

Inspiration

Decision

We will track metadata in a meta.yaml file, which lives at the top level of the feedstock/ directory within the generated feedstock repo.

The format and contents of this file are specified as follows.

Top Level Data

id: noaa-oisst
version: "1.0" 
title: "NOAA Optimum Interpolated SST"
description: "Analysis-ready Zarr datasets derived from NOAA OISST NetCDF"
pangeo_forge_recipes_version: "0.1"
pangeo_forge_metadata_spec_version: "2"
| name | description | valid characters |
| --- | --- | --- |
| id | top-level ID for the feedstock. must be unique and must match feedstock repo name | lower case ascii alpha-numeric, no spaces, only dashes allowed |
| version | feedstock version | {MAJOR}.{MINOR} |
| description | - | Unicode |
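The constraints in the table above can be checked mechanically. The following is a minimal sketch, assuming meta.yaml has already been parsed into a plain dict by a YAML library. The function name `check_top_level`, the two regular expressions, and the `repo_name` argument are illustrative assumptions, not part of this specification.

```python
import re

# Assumed reading of "lower case ascii alpha-numeric, no spaces, only dashes allowed".
ID_PATTERN = re.compile(r"[a-z0-9-]+")
# Assumed reading of {MAJOR}.{MINOR}: two non-negative integers separated by a dot.
VERSION_PATTERN = re.compile(r"[0-9]+\.[0-9]+")


def check_top_level(meta, repo_name):
    """Return a list of problems found in the top-level fields (empty if none)."""
    problems = []
    feedstock_id = str(meta.get("id", ""))
    if not ID_PATTERN.fullmatch(feedstock_id):
        problems.append("id must be lower-case ASCII alphanumerics and dashes")
    if feedstock_id != repo_name:
        problems.append("id must match the feedstock repo name")
    if not VERSION_PATTERN.fullmatch(str(meta.get("version", ""))):
        problems.append("version must look like MAJOR.MINOR")
    return problems


# Hypothetical parsed meta.yaml, reduced to the fields being checked.
meta = {"id": "noaa-oisst", "version": "1.0"}
print(check_top_level(meta, repo_name="noaa-oisst"))
```

Returning a list of problems rather than raising on the first one lets a reviewer see every issue in a submitted meta.yaml at once.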

Version

We use a two-element variant of semantic versioning: MAJOR.MINOR. The version starts at 1.0.

Changes are minor if they are backwards-compatible with user code that relies on the previous version of the dataset. Updates that concatenate new elements (e.g. extending the temporal range) are minor version updates. When a new minor version is released, bakeries will overwrite the previous minor version of the feedstock's datasets.

Changes are major if they have the potential to break existing user code. The bakery should always retain the latest major version.

The following table provides examples for each category:

| Change | Minor | Major |
| --- | --- | --- |
| Updating metadata (e.g. to correct units, descriptions, etc.) | ☑️ | |
| Adding new variables or timesteps | ☑️ | |
| Changing existing variable names | | ☑️ |
| Changing preprocessing functions (that alter existing data) | | ☑️ |
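To make the MAJOR.MINOR rule concrete, here is a small sketch that classifies the step between two version strings. The helper names `parse_version` and `classify_bump` are hypothetical. What a bakery does with the result (overwrite on a minor bump, as described above) is left to the caller.

```python
def parse_version(version):
    """Split a 'MAJOR.MINOR' string into a tuple of two integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def classify_bump(previous, new):
    """Return 'minor' or 'major' for a version increase, else raise ValueError."""
    prev = parse_version(previous)
    nxt = parse_version(new)
    if nxt <= prev:
        raise ValueError(f"{new} is not newer than {previous}")
    # Same MAJOR component: a backwards-compatible update.
    return "minor" if nxt[0] == prev[0] else "major"


print(classify_bump("1.0", "1.1"))
print(classify_bump("1.9", "1.10"))
print(classify_bump("1.3", "2.0"))
```

Comparing integer tuples rather than strings or floats matters here: as a float, 1.10 equals 1.1, which is also why the version is quoted in the YAML example.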

recipes section

The recipes section lists the recipes contained in the repo. The recipes are defined in Python code in a .py file (a Python module), so we need a way to point to Python objects from meta.yaml. For the simple case of one recipe, it looks like this:

recipes:
  - id: avhrr-only
    object: "recipe:recipe"

To find the code for this recipe, the parsing library would have to do something like

from recipe import recipe
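A parsing library would more likely resolve the "module:object" string dynamically than hard-code the import. The sketch below uses only the standard library. The function name `load_object` is an assumption, and the demo resolves a standard-library object because no recipe module is available here.

```python
import importlib


def load_object(spec):
    """Resolve a 'module:object' string to the Python object it names."""
    module_name, _, attr_path = spec.partition(":")
    if not module_name or not attr_path:
        raise ValueError(f"expected 'module:object', got {spec!r}")
    obj = importlib.import_module(module_name)
    # Walk dotted attribute paths such as 'module:Class.attribute'.
    for attr in attr_path.split("."):
        obj = getattr(obj, attr)
    return obj


# Stand-in for load_object("recipe:recipe"), using a stdlib module.
dumps = load_object("json:dumps")
print(dumps({"id": "avhrr-only"}))
```

For `"recipe:recipe"` to resolve, the directory containing recipe.py would need to be importable, for example by being on `sys.path`.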

If the repo provides multiple recipes in the same module, it might look like this:

recipes:
  - id: avhrr-only
    object: "recipe:avhrr_only"
  - id: amsr-avhrr
    object: "recipe:amsr_avhrr"

(We could also have multiple distinct modules.)

Some feedstocks may want to provide many recipes in one .py file. In this case, it is not feasible to enumerate every one explicitly in meta.yaml. Instead, we can point at a python dict directly using dict_object. The keys of the dict will become the recipe ids, and the values the recipe objects. For example:

recipes:
  dict_object: "recipe:all_recipes"

Rules for this section:

  1. id must be unique
  2. id must use a restricted character set compatible with S3
  3. object uses entrypoints syntax to specify a module and object name to import. The object must be an instance of a Recipe object.
  4. Each entry in the recipes section must have either id and object or dict_object.
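The rules above could be enforced with a check along these lines. This is a sketch under stated assumptions: the recipes section is already parsed into Python lists and dicts, the S3-compatible character set is taken to be lower-case alphanumerics and dashes, and the name `check_recipes` is hypothetical. Rule 3's requirement that the object is a Recipe instance is not checked, since that needs the recipe to be imported.

```python
import re

# Assumed "restricted character set compatible with S3".
RECIPE_ID_PATTERN = re.compile(r"[a-z0-9-]+")


def check_recipes(recipes):
    """Return a list of rule violations for a parsed recipes section."""
    if isinstance(recipes, dict):
        # dict_object form: a single pointer to a dict of recipes.
        if set(recipes) != {"dict_object"}:
            return ["mapping form must contain only dict_object"]
        return []
    problems = []
    seen = set()
    for index, entry in enumerate(recipes):
        if "id" not in entry or "object" not in entry:
            problems.append(f"entry {index}: needs both id and object")
            continue
        recipe_id = entry["id"]
        if recipe_id in seen:
            problems.append(f"entry {index}: duplicate id {recipe_id!r}")
        seen.add(recipe_id)
        if not RECIPE_ID_PATTERN.fullmatch(recipe_id):
            problems.append(f"entry {index}: id {recipe_id!r} uses disallowed characters")
        if entry["object"].count(":") != 1:
            problems.append(f"entry {index}: object must look like 'module:name'")
    return problems


recipes = [
    {"id": "avhrr-only", "object": "recipe:avhrr_only"},
    {"id": "amsr-avhrr", "object": "recipe:amsr_avhrr"},
]
print(check_recipes(recipes))
print(check_recipes({"dict_object": "recipe:all_recipes"}))
```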

provenance section

The provenance section describes where the data came from.

providers section

This section mirrors the STAC providers object.

provenance:
  providers:
    - name: "NOAA NCEI"
      description: "National Oceanic and Atmospheric Administration National Centers for Environmental Information"
      roles:
        - producer
        - licensor
      url: https://www.ncdc.noaa.gov/oisst

The bakeries will add additional provenance (e.g. "processor") when they generate data.
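As an illustration of that last step, a bakery might append itself to the providers list with the STAC `processor` role. The sketch below assumes the provenance section is already a parsed dict. The function name `add_processor` and the bakery name are hypothetical.

```python
import copy


def add_processor(provenance, bakery_name):
    """Return a copy of the provenance with the bakery added as a processor."""
    updated = copy.deepcopy(provenance)  # leave the feedstock's metadata untouched
    updated.setdefault("providers", []).append(
        {"name": bakery_name, "roles": ["processor"]}
    )
    return updated


provenance = {
    "providers": [
        {
            "name": "NOAA NCEI",
            "roles": ["producer", "licensor"],
            "url": "https://www.ncdc.noaa.gov/oisst",
        }
    ]
}
baked = add_processor(provenance, "pangeo-aws-west-1")
print([provider["roles"] for provider in baked["providers"]])
```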

license section

This section mirrors the STAC license section:

license(s) as a SPDX License identifier. Alternatively, use proprietary (see below) if the license is not on the SPDX license list or various if multiple licenses apply. In all cases links to the license texts SHOULD be added, see the license link relation type. If no link to a license is included and the license field is set to proprietary, the Collection is private, and consumers have not been granted any explicit right to use the data.

However, we don't want to use STAC "links", so we need to specialize a bit. Instead, we will add an optional url key for proprietary licenses.

Examples

provenance:
  ...
  license: "CC-BY-NC-4.0"

provenance:
  ...
  license: "proprietary"
  url: https://www.ncdc.noaa.gov/oisst/license.txt  # not a real link

Presumably other entries will be added to this before it becomes a STAC catalog (processor, host, etc.).

maintainers section

Who created the recipe. A list with at least one entry.

maintainers:
  - name: "Ryan Abernathey"
    orcid: "0000-0001-5999-4917"
    github: rabernat

Only github is required.

bakeries section

Tells how to bake the recipe.

Resource specifications should conform to the Dask worker spec.

bakeries:
  - id: "pangeo-aws-west-1"  # must come from a valid list of bakeries
    resources:  # mapped to worker settings somehow
      memory: "10 GB"  # this is optional, should have a default of 4GB
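The comments in the example imply two behaviors: the bakery id is checked against a known list, and memory falls back to a default when omitted. A minimal sketch of both follows, with a hypothetical `KNOWN_BAKERIES` set and function name. How the resulting values map onto actual worker settings is not specified here.

```python
# Hypothetical registry of valid bakeries; the real list lives elsewhere.
KNOWN_BAKERIES = {"pangeo-aws-west-1"}
DEFAULT_MEMORY = "4 GB"


def resolve_bakery(entry):
    """Validate a bakery entry and fill in default resources."""
    if entry["id"] not in KNOWN_BAKERIES:
        raise ValueError(f"unknown bakery: {entry['id']!r}")
    # Copy so the parsed meta.yaml structure is not mutated.
    resources = dict(entry.get("resources") or {})
    resources.setdefault("memory", DEFAULT_MEMORY)
    return {"id": entry["id"], "resources": resources}


print(resolve_bakery({"id": "pangeo-aws-west-1", "resources": {"memory": "10 GB"}}))
print(resolve_bakery({"id": "pangeo-aws-west-1"}))
```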

Recipe related files

The files used by the Recipes outlined in the meta.yaml file must be adjacent or in subdirectories adjacent to the meta.yaml file when the PR to staged-recipes is raised.

For example, both of the examples below are valid structures that we expect:

Directly Adjacent

a
└── pr
    └── contents
        └── pr-feedstock
            ├── meta.yaml # metadata file is directly adjacent to the recipe related files
            ├── recipe.py
            ├── recipe_2.py
            └── sidecar_file.txt

Adjacent Subdirectories

a
└── pr
    └── contents
        ├── meta.yaml # metadata file is adjacent to subdirectories containing recipe related files
        ├── recipe_1
        │   ├── recipe.py
        │   └── utils
        │       └── utils.py
        └── recipe_2.py

The feedstock-creation-action will use the parent directory of meta.yaml to copy all Recipe related files into the generated Feedstock repository.
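The copy step can be pictured as a recursive listing rooted at the parent directory of meta.yaml. The following sketch is not the action's actual implementation. The function name `recipe_files` is hypothetical, and the demo builds a throwaway tree mirroring the "Adjacent Subdirectories" layout.

```python
import tempfile
from pathlib import Path


def recipe_files(meta_path):
    """List every file under the directory that contains meta.yaml."""
    root = Path(meta_path).parent
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    contents = Path(tmp)
    (contents / "recipe_1" / "utils").mkdir(parents=True)
    for name in ["meta.yaml", "recipe_2.py", "recipe_1/recipe.py", "recipe_1/utils/utils.py"]:
        (contents / name).write_text("")
    listing = recipe_files(contents / "meta.yaml")

print(listing)
```

Because the listing is relative to the parent of meta.yaml, both layouts shown above work without the recipe author declaring individual files.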

Consequences

This format specification will be used by several different parts of the system.

  • Users writing new recipes will need to write their meta.yaml, ideally starting from a nice template or example. More documentation is needed.
  • The feedstock github actions will need to parse meta.yaml in order to dispatch recipes to bakeries. Some of the metadata may need to be injected directly into the dataset attrs.
  • The catalog database will need to parse meta.yaml in order to fill in catalog metadata for the recipes added by the feedstock.