stac-extensions · duckontheweb · May 12, 2021 · Aug 10, 2020 · Nov 13, 2020 · Nov 13, 2020
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -16,4 +16,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
-[Unreleased]: <https://github.com/stac-extensions/template/compare/v1.0.0...HEAD>
+## [v0.1.0] - 2021-04-29
+
+Initial independent release.
+
+[Unreleased]: <https://github.com/stac-extensions/ml-aoi/compare/v0.1.0...HEAD>
+[v0.1.0]: <https://github.com/stac-extensions/ml-aoi/tree/v0.1.0>
diff --git a/README.md b/README.md
@@ -1,53 +1,103 @@
-# Template Extension Specification
+# STAC ML AOI Extension
 
-- **Title:** Template
-- **Identifier:** <https://stac-extensions.github.io/template/v1.0.0/schema.json>
-- **Field Name Prefix:** template
+- **Title:** ML AOI
+- **Identifier:** <https://stac-extensions.github.io/ml-aoi/v0.1.0/schema.json>
+- **Field Name Prefix:** ml-aoi
 - **Scope:** Item, Collection
 - **Extension [Maturity Classification](https://github.com/radiantearth/stac-spec/tree/master/extensions/README.md#extension-maturity):** Proposal
-- **Owner**: @your-gh-handles @person2
+- **Owner**: @echeipesh
 
-This document explains the Template Extension to the [SpatioTemporal Asset Catalog](https://github.com/radiantearth/stac-spec) (STAC) specification.
-This is the place to add a short introduction.
+This document explains the ML AOI Extension to the [SpatioTemporal Asset Catalog](https://github.com/radiantearth/stac-spec) (STAC) specification.
 
-- Examples:
-  - [Item example](examples/item.json): Shows the basic usage of the extension in a STAC Item
-  - [Collection example](examples/collection.json): Shows the basic usage of the extension in a STAC Collection
-- [JSON Schema](json-schema/schema.json)
-- [Changelog](./CHANGELOG.md)
+An Item and Collection extension to provide labeled training data for machine learning models.
+This extension relies on, but is distinct from, the existing `label` extension.
+STAC Items using the `label` extension link label assets with the source imagery for which they are valid, often as a result of human labelling effort.
+By contrast, STAC Items using the `ml-aoi` extension link label assets with the raster items on which a specific machine learning model is being trained.
+
+In addition to linking labels with feature items, the `ml-aoi` extension addresses some of the common configurations for ML workflows.
+This extension is intended to make the model training process reproducible and to provide model provenance once the model is trained.
 
 ## Item Properties and Collection Fields
 
 | Field Name           | Type                      | Description |
 | -------------------- | ------------------------- | ----------- |
-| template:new_field   | string                    | **REQUIRED**. Describe the required field... |
-| template:xyz         | [XYZ Object](#xyz-object) | Describe the field... |
-| template:another_one | \[number]                 | Describe the field... |
+| `ml-aoi:split`       | string                    | Assigns item to one of `train`, `test`, or `validate` sets |
 
 ### Additional Field Information
 
-#### template:new_field
+#### ml-aoi:split
+
+This field is optional. If not provided, it is expected that the split property will be added later, before the items are consumed.
+
+#### bbox and geometry
+
+- Multiple `ml-aoi` Items may reference the same label and image item by scoping the `bbox` and `geometry` fields. TODO: Better describe scoping
+   of overlap between raster and label items?
+- The `bbox` fields of `ml-aoi` Items may overlap when the Items belong to different `ml-aoi:split` sets.
+- The `geometry` fields of `ml-aoi` Items in the same Collection should never overlap.
+
+## Links
+
+An `ml-aoi` Item must link to both label and raster STAC Items valid for its area of interest.
+These Link objects should set the `rel` field to `derived_from` for both label and feature items.
+
+An `ml-aoi` Item should contain enough metadata to make it consumable without the need to follow the label and feature links. In
+reality this may not be practical because the use case may not be fully known at the time the Item is generated. Therefore it is critical that
+the source label and feature items are linked, to give the future consumer the option to collect additional metadata from them.
+
+| Field Name    | Type   | Name | Description                 |
+| ------------- | ------ | ---- | --------------------------- |
+| `ml-aoi:role` | string | Role | `label` or `feature`        |
+
+### Labels
+
+An `ml-aoi` Item must link to exactly one STAC Item that uses the `label` extension.
+Label links should provide the `ml-aoi:role` field set to `label`.
+
+### Features
 
-This is a much more detailed description of the field `template:new_field`...
+An `ml-aoi` Item must link to at least one raster STAC item.
+Feature links should provide the `ml-aoi:role` field set to `feature`.
 
-### XYZ Object
+Linked feature STAC Items may use the `eo` extension, but that is not required.
+It is up to the consumer of `ml-aoi` Items to decide how to use the linked feature rasters.
 
-This is the introduction for the purpose and the content of the XYZ Object...
+## Assets
 
-| Field Name  | Type   | Description |
-| ----------- | ------ | ----------- |
-| x           | number | **REQUIRED**. Describe the required field... |
-| y           | number | **REQUIRED**. Describe the required field... |
-| z           | number | **REQUIRED**. Describe the required field... |
+An `ml-aoi` Item should directly include assets for label and feature rasters.
 
-## Relation types
+| Field Name                 | Type    | Name              | Description                                                      |
+| -------------------------- | ------- | ----------------- | ---------------------------------------------------------------- |
+| `ml-aoi:role`              | string  | Role              | `label` or `feature`                                             |
+| `ml-aoi:reference-grid`    | boolean | Reference Grid    | This raster provides the reference pixel grid for model training |
+| `ml-aoi:resampling-method` | string  | Resampling Method | Resampling method for non-reference-grid feature rasters         |
 
-The following types should be used as applicable `rel` types in the
-[Link Object](https://github.com/radiantearth/stac-spec/tree/master/item-spec/item-spec.md#link-object).
+The resampling method should be one of the values [supported by gdalwarp](https://gdal.org/programs/gdalwarp.html#cmdoption-gdalwarp-r).
 
-| Type                | Description |
-| ------------------- | ----------- |
-| fancy-rel-type      | This link points to a fancy resource. |
+### Labels
+
+Assets for the label item can be copied directly from the label item with their asset names preserved.
+Label assets should provide the `ml-aoi:role` field set to `label`.
+
+### Features
+
+Assets for the raster item can be copied directly from the raster item with their asset names preserved.
+Feature assets should provide the `ml-aoi:role` field set to `feature`.
+
+When multiple raster features are included, their resolutions and pixel grids are not likely to align.
+One raster may specify the `ml-aoi:reference-grid` field set to `true` to indicate that all other features
+should be resampled to match its pixel grid during model training.
+Other raster assets should be resampled to the reference pixel grid.
+
+## Collection
+
+All `ml-aoi` Items should belong to a Collection that designates a specific model training input.
+There is a one-to-one mapping between a single `ml-aoi` Collection and a machine-learning model.
+
+### Collection fields
+
+The consumer of an `ml-aoi` catalog needs to understand the available label classes and features without crawling the full catalog.
+When member Items include multiple feature rasters, it is possible that not all of them will overlap every AOI.
 
 ## Contributing
 
@@ -79,3 +129,13 @@ If the tests reveal formatting problems with the examples, you can fix them with
 ```bash
 npm run format-examples
 ```
+
+## Design Decisions
+
+Central choices and the rationale behind them are outlined in ADR format:
+
+| ID   | ADR |
+|------|-----|
+| 0002 | [Use Case](docs/0002-use-case-definition.md) |
+| 0003 | [Test/Train/Validation Split](docs/0003-test-train-validation-split.md) |
+| 0004 | [Sourcing Multiple Label Items](docs/0004-multiple-label-items.md) |
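A minimal sketch of how a consumer might read the asset fields described in the README's Assets section. The asset keys, hrefs, and helper names are hypothetical. Treating "at most one reference grid" as an error condition and falling back to `near` (gdalwarp's default) when no resampling method is given are assumptions, not requirements stated by the extension.

```python
from typing import Dict, Optional

# Hypothetical assets of a single ml-aoi Item; keys and hrefs are illustrative only.
assets = {
    "labels": {"href": "./labels.geojson", "ml-aoi:role": "label"},
    "sentinel2": {
        "href": "./s2.tif",
        "ml-aoi:role": "feature",
        "ml-aoi:reference-grid": True,
    },
    "dem": {
        "href": "./dem.tif",
        "ml-aoi:role": "feature",
        "ml-aoi:resampling-method": "bilinear",
    },
}


def reference_grid_key(assets: Dict[str, dict]) -> Optional[str]:
    """Return the key of the feature asset flagged as the reference grid, if any."""
    flagged = [
        key
        for key, asset in assets.items()
        if asset.get("ml-aoi:role") == "feature"
        and asset.get("ml-aoi:reference-grid") is True
    ]
    # Assumption: more than one reference grid is ambiguous, so reject it.
    if len(flagged) > 1:
        raise ValueError(f"multiple reference grids flagged: {flagged}")
    return flagged[0] if flagged else None


def resampling_plan(assets: Dict[str, dict]) -> Dict[str, str]:
    """Map each non-reference feature asset key to its resampling method."""
    reference = reference_grid_key(assets)
    plan = {}
    for key, asset in assets.items():
        if asset.get("ml-aoi:role") != "feature" or key == reference:
            continue
        # Assumption: fall back to gdalwarp's default method when none is given.
        plan[key] = asset.get("ml-aoi:resampling-method", "near")
    return plan
```

With the sample assets, `reference_grid_key(assets)` picks `sentinel2`, and `resampling_plan(assets)` lists only `dem`, since label assets and the reference raster are never resampled.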
diff --git a/docs/0001-record-architecture-decisions.md b/docs/0001-record-architecture-decisions.md
@@ -0,0 +1,19 @@
+# 1. Record architecture decisions
+
+Date: 2020-08-08
+
+## Status
+
+Accepted
+
+## Context
+
+We need to record the architectural decisions made on this project.
+
+## Decision
+
+We will use Architecture Decision Records, as [described by Michael Nygard](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions).
+
+## Consequences
+
+See Michael Nygard's article, linked above. For a lightweight ADR toolset, see Nat Pryce's [adr-tools](https://github.com/npryce/adr-tools).
diff --git a/docs/0002-use-case-definition.md b/docs/0002-use-case-definition.md
@@ -0,0 +1,42 @@
+# 2. Use case definition
+
+Date: 2020-08-10
+
+## Status
+
+Accepted
+
+## Context
+
+We define the initial use case for the `ml-aoi` spec, which exposes assumptions and reasoning for specific layout choices:
+providing a training data source for the Raster Vision model training process.
+
+`ml-aoi` STAC Items represent a reified relation between feature rasters and ground-truth labels in a machine learning training dataset.
+Each `ml-aoi` Item roughly corresponds to a "scene" or a training example.
+
+### Justification for new extension
+
+Current known STAC extensions are not suitable for this purpose. The closest match is the STAC `label` extension.
+The `label` extension provides a way to define either vector or raster labels over an area.
+However, it does not provide a mechanism to link those labels with feature images;
+links with `rel` type `source` point to imagery from which labels were derived.
+Sometimes this imagery will be used as feature input for model training, but not always.
+The concepts of source label imagery and input feature imagery are semantically distinct.
+For instance, it is possible to apply a single source of ground-truth building labels to train a model on either Landsat or Sentinel-2 scenes.
+
+### Catalog Lifetime
+
+An `ml-aoi` Item links to both a raster STAC Item and a label STAC Item.
+In this relationship, the source raster and label items are static and long-lived, being used by several `ml-aoi` catalogs.
+By contrast, an `ml-aoi` catalog is somewhat ephemeral; it captures the training set in order to provide model reproducibility and provenance.
+There can be any number of `ml-aoi` catalogs linking to the same raster and label items, while varying the selection, the training/testing/validation split,
+and the class configuration.
+
+## Decision
+
+We will adopt the use and development of the `ml-aoi` extension in future machine-learning projects.
+
+## Consequences
+
+We will no longer attempt to use the `label` extension as the sole source of training data for ML models.
+We will continue development of tools to both produce and consume `ml-aoi` extension catalogs.
diff --git a/docs/0003-test-train-validation-split.md b/docs/0003-test-train-validation-split.md
@@ -0,0 +1,61 @@
+# 3. Test-train-validation split
+
+Date: 2020-08-10
+
+## Status
+
+Accepted
+
+## Context
+
+During model training it is important to have a consistent split between training, testing and validation data.
+
+- Training subset is used to tune model weights.
+- Test subset is used to monitor training progress and for hyper-parameter tuning.
+- Validation subset is used to judge overall model performance.
+
+Best practices dictate that it is critical that these datasets do not overlap.
+Which items are selected for each split will affect model performance, so the selection should be captured in the `ml-aoi` catalog.
+
+In the context of a STAC catalog there are multiple ways to express the data split.
+This ADR explores available options and their consequences.
+
+### Split by Collection
+
+The split could be expressed by generating a separate collection for each set. This is a flexible approach.
+However, the grouping of these collections into one cohesive training set would have to be done by convention, for instance by a prefix on the collection `id`.
+Additionally, these collections could not be easily visualized together.
+Most (all?) existing STAC viewers are focused on browsing or viewing one collection at a time.
+
+Additionally, the convention for associating the training, testing, and validation sets with each other would have to be propagated into downstream tooling.
+Further, it would be easy to include a single item in both the training and testing sets without realizing it.
+This is not a good choice for these reasons.
+
+### Split by Link property
+
+The top-most `ml-aoi` collection has to link to each item or child catalog.
+These links could have an additional property that designates the split.
+This approach keeps all the items within the same collection, which is good.
+
+However, when ingested into a STAC API this link property is often lost and is not easily queried.
+Thus the split set membership would not be visible through a STAC API, which is bad.
+This is not a good choice for that reason.
+
+### Split by Item property
+
+Each item could have an extension-specific property (e.g. `ml-aoi:split`) that designates set membership.
+This approach addresses the shortcomings of the previous methods.
+
+This property can be easily searched for after the item is ingested into a STAC API.
+Following this method, it is not possible to include a single item in multiple sets.
+The Collection can still be viewed by tools that do not understand the `ml-aoi` extension.
+
+## Decision
+
+The test, train, validation split should be handled by the `ml-aoi:split` Item property.
+Keeping all items, regardless of role, grouped in a single collection provides the best integration with other STAC tools.
+The expected use case is visual inspection of items on a single map, with role membership used to color the footprint polygons.
+
+## Consequences
+
+Future `ml-aoi` catalogs should include the `ml-aoi:split` property.
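The "split by Item property" decision in the ADR above can be sketched as a small grouping step on the consumer side. The item dictionaries, the `group_by_split` helper, and the `unassigned` bucket for items without the property are all hypothetical; the allowed values `train`, `test`, and `validate` come from the README field table.

```python
from collections import defaultdict
from typing import Dict, List

ALLOWED_SPLITS = {"train", "test", "validate"}


def group_by_split(items: List[dict]) -> Dict[str, List[str]]:
    """Group item ids by their `ml-aoi:split` property.

    Each item carries at most one value for the property, so an item id
    can only land in one group.
    """
    groups = defaultdict(list)
    for item in items:
        split = item.get("properties", {}).get("ml-aoi:split")
        if split is None:
            # The field is optional; park such items until a split is assigned.
            groups["unassigned"].append(item["id"])
        elif split in ALLOWED_SPLITS:
            groups[split].append(item["id"])
        else:
            raise ValueError(f"item {item['id']}: unknown split {split!r}")
    return dict(groups)


# Hypothetical, heavily trimmed items for illustration.
items = [
    {"id": "aoi-1", "properties": {"ml-aoi:split": "train"}},
    {"id": "aoi-2", "properties": {"ml-aoi:split": "test"}},
    {"id": "aoi-3", "properties": {"ml-aoi:split": "train"}},
    {"id": "aoi-4", "properties": {}},
]
```

Because the split is a plain Item property, the same grouping could equally be done by a property filter in a STAC API search, which is the queryability argument the ADR makes.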
diff --git a/docs/0004-multiple-label-items.md b/docs/0004-multiple-label-items.md
@@ -0,0 +1,52 @@
+# 4. Multiple label items
+
+Date: 2020-08-11
+
+## Status
+
+Proposed
+
+## Context
+
+Should each `ml-aoi` Item be able to bring in multiple labels?
+This would be a useful feature for training multi-class classifiers.
+One can imagine having a label STAC Item for buildings and a separate STAC Item for fields.
+A STAC Item's `links` field is an array, so many label items could be linked from a single `ml-aoi` STAC Item.
+
+### Limiting to single label link
+
+Limiting to a single label link is appealing, however, because the label item metadata could be copied over to the `ml-aoi` Item.
+This would remove the need to follow the link for the label item during processing.
+In practice this would make each `ml-aoi` Item also a `label` Item, allowing for its re-use by tooling that understands `label`.
+
+If a multi-class label dataset were required, there would have to be a mechanical pre-processing step of combining
+existing labels into a single STAC `label` Item. This could mean either a union of GeoJSON FeatureCollections per item or
+a configuration of a more complex STAC `label` Item that links to multiple label assets.
+
+### Allowing multiple labels
+
+The main appeal of consuming multi-label `ml-aoi` items is that it would allow referencing multiple label sources,
+some of which could be external, without the need for pre-processing, thus minimizing data duplication.
+
+If multiple labels were allowed, the pre-processing step above would be pushed onto the `ml-aoi` consumer.
+The consumer would need appropriate metadata in order to decipher the label structure.
+This would require either crawling the full catalog or some kind of meta-label structure that combines the metadata
+from all the included labels into a single structure that could be interpreted by the consumer.
+
+## Decision
+
+`ml-aoi` Items should be limited to linking to only a single label item.
+Requiring the consumer to interpret multiple label items pushes unreasonable complexity onto the user.
+Additionally, combining labels likely requires a series of processing and validation steps.
+Each one of those would likely require judgment calls and exceptions.
+For instance, when combining building and field label datasets, the user should check that no building and field polygons overlap.
+
+It is not realistic to expect all possible requirements of that process to be expressed by a simple metadata structure.
+Therefore it is better to explicitly require the label combination as a separate process done by the user.
+The resulting label catalog can capture the design and iteration required for that process anyway.
+
+## Consequences
+
+`ml-aoi` Items can copy all `label` extension properties from the `label` Item.
+In effect, `ml-aoi` Items extend `label` Items by adding links to feature imagery.
+This formulation lines up with the original problem statement for the `ml-aoi` extension.
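The single-label decision above, together with the README's link rules (exactly one `label` link, at least one `feature` link, `rel` set to `derived_from`), could be checked with a sketch like the following. The `check_links` helper, the item contents, and the hrefs are hypothetical and not part of the extension or its JSON Schema.

```python
from typing import List


def check_links(item: dict) -> List[str]:
    """Return a list of problems found in an ml-aoi Item's label/feature links."""
    links = item.get("links", [])
    label_links = [link for link in links if link.get("ml-aoi:role") == "label"]
    feature_links = [link for link in links if link.get("ml-aoi:role") == "feature"]

    problems = []
    if len(label_links) != 1:
        problems.append(f"expected exactly 1 label link, found {len(label_links)}")
    if not feature_links:
        problems.append("expected at least 1 feature link, found 0")
    for link in label_links + feature_links:
        if link.get("rel") != "derived_from":
            problems.append(f"link {link.get('href')} should use rel 'derived_from'")
    return problems


# Hypothetical Item with one label link and one feature link.
good_item = {
    "id": "aoi-1",
    "links": [
        {"rel": "derived_from", "href": "./labels/item.json", "ml-aoi:role": "label"},
        {"rel": "derived_from", "href": "./s2/item.json", "ml-aoi:role": "feature"},
        {"rel": "collection", "href": "./collection.json"},
    ],
}
```

Links without an `ml-aoi:role` field, such as the `collection` link, are ignored by the check; an empty result means the Item satisfies the three rules.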