Merge azavea/stac-ml-aoi-extension #1

Merged
merged 15 commits into from
May 12, 2021
Changes from 13 commits
7 changes: 6 additions & 1 deletion CHANGELOG.md
@@ -16,4 +16,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

[Unreleased]: <https://github.com/stac-extensions/template/compare/v1.0.0...HEAD>
## [v0.1.0] - 2021-04-29

Initial independent release.

[Unreleased]: <https://github.com/stac-extensions/ml-aoi/compare/v0.1.0...HEAD>
[v0.1.0]: <https://github.com/stac-extensions/ml-aoi/tree/v0.1.0>
120 changes: 90 additions & 30 deletions README.md
@@ -1,53 +1,103 @@
# Template Extension Specification
# STAC ML AOI Extension

- **Title:** Template
- **Identifier:** <https://stac-extensions.github.io/template/v1.0.0/schema.json>
- **Field Name Prefix:** template
- **Title:** ML AOI
- **Identifier:** <https://stac-extensions.github.io/ml-aoi/v0.1.0/schema.json>
- **Field Name Prefix:** ml-aoi
- **Scope:** Item, Collection
- **Extension [Maturity Classification](https://github.com/radiantearth/stac-spec/tree/master/extensions/README.md#extension-maturity):** Proposal
- **Owner**: @your-gh-handles @person2
- **Owner**: @echeipesh

This document explains the Template Extension to the [SpatioTemporal Asset Catalog](https://github.com/radiantearth/stac-spec) (STAC) specification.
This is the place to add a short introduction.
This document explains the ML AOI Extension to the [SpatioTemporal Asset Catalog](https://github.com/radiantearth/stac-spec) (STAC) specification.

- Examples:
- [Item example](examples/item.json): Shows the basic usage of the extension in a STAC Item
- [Collection example](examples/collection.json): Shows the basic usage of the extension in a STAC Collection
- [JSON Schema](json-schema/schema.json)
- [Changelog](./CHANGELOG.md)
An Item and Collection extension to provide labeled training data for machine learning models.
This extension relies on, but is distinct from, the existing `label` extension.
STAC Items using the `label` extension link label assets with the source imagery for which they are valid, often as a result of human labelling effort.
By contrast, STAC Items using the `ml-aoi` extension link label assets with the raster items on which a specific machine learning model is being trained.

In addition to linking labels with feature items, the `ml-aoi` extension addresses some of the common configurations for ML workflows.
This extension is intended to make the model training process reproducible and to provide model provenance once the model is trained.

## Item Properties and Collection Fields

| Field Name | Type | Description |
| -------------------- | ------------------------- | ----------- |
| template:new_field | string | **REQUIRED**. Describe the required field... |
| template:xyz | [XYZ Object](#xyz-object) | Describe the field... |
| template:another_one | \[number] | Describe the field... |
| `ml-aoi:split` | string | Assigns item to one of `train`, `test`, or `validate` sets |

### Additional Field Information

#### template:new_field
#### ml-aoi:split

This field is optional. If it is not provided, it is expected that the split property will be added later, before the items are consumed.

#### bbox and geometry

- Multiple `ml-aoi` Items may reference the same label and image items by scoping the `bbox` and `geometry` fields. TODO: Better describe scoping
of overlap between raster and label items?
- The `bbox` fields of `ml-aoi` Items may overlap when the Items belong to different `ml-aoi:split` sets.
- `ml-aoi` Items in the same Collection should never have overlapping `geometry` fields.
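
The overlap rules above could be checked mechanically before a catalog is published. The following is a minimal, stdlib-only sketch that compares `bbox` values only; a real check on `geometry` would need a geometry library. The helper names (`bboxes_overlap`, `overlapping_pairs`) and the sample items are hypothetical and not part of this extension.

```python
from itertools import combinations


def bboxes_overlap(a, b):
    """Return True when two 2D bboxes [west, south, east, north] share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlapping_pairs(items):
    """Yield (id, id, same_split) for every pair of items whose bboxes overlap."""
    for left, right in combinations(items, 2):
        if bboxes_overlap(left["bbox"], right["bbox"]):
            same_split = (
                left["properties"].get("ml-aoi:split")
                == right["properties"].get("ml-aoi:split")
            )
            yield left["id"], right["id"], same_split


# Hypothetical items, trimmed to the fields this check reads.
items = [
    {"id": "aoi-1", "bbox": [0, 0, 2, 2], "properties": {"ml-aoi:split": "train"}},
    {"id": "aoi-2", "bbox": [1, 1, 3, 3], "properties": {"ml-aoi:split": "test"}},
    {"id": "aoi-3", "bbox": [5, 5, 6, 6], "properties": {"ml-aoi:split": "train"}},
]

pairs = list(overlapping_pairs(items))
```

A pair reported with `same_split` set to `False` is allowed by the rules above as far as `bbox` is concerned. Any reported pair is only a candidate for a proper `geometry` overlap test, since overlapping bounding boxes do not imply overlapping geometries.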

## Links

An `ml-aoi` Item must link to both the label and raster STAC Items valid for its area of interest.
These Link objects should set the `rel` field to `derived_from` for both label and feature items.

An `ml-aoi` Item should contain enough metadata to make it consumable without the need to follow the label and feature item links. In
reality this may not be practical, because the use case may not be fully known at the time the Item is generated. Therefore it is critical that
the source label and feature items are linked, to give the future consumer the option to collect additional metadata from them.

| Field Name | Type | Name | Description |
| ------------- | ------ | ---- | --------------------------- |
| `ml-aoi:role` | string | Role | `label` or `feature` |

### Labels

An `ml-aoi` Item must link to exactly one STAC item that is using `label` extension.
Label links should provide `ml-aoi:role` field set to `label` value.

### Features

This is a much more detailed description of the field `template:new_field`...
An `ml-aoi` Item must link to at least one raster STAC item.
Feature links should provide `ml-aoi:role` field set to `feature` value.

### XYZ Object
Linked feature STAC Items may use the `eo` extension, but that is not required.
It is up to the consumer of `ml-aoi` Items to decide how to use the linked feature rasters.
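
As an illustration of the link rules in this section, the sketch below checks a STAC Item, represented as a plain Python dict, for exactly one `label` link and at least one `feature` link. The function name `check_links`, the `href` values, and the problem messages are hypothetical; only `rel`, `derived_from`, and `ml-aoi:role` come from the text above.

```python
def check_links(item):
    """Return a list of problems with the ml-aoi links of a STAC Item dict."""
    # Only links with rel=derived_from are considered, as described above.
    roles = [
        link.get("ml-aoi:role")
        for link in item.get("links", [])
        if link.get("rel") == "derived_from"
    ]
    problems = []
    if roles.count("label") != 1:
        problems.append("expected exactly one label link")
    if roles.count("feature") < 1:
        problems.append("expected at least one feature link")
    return problems


# Hypothetical Item, trimmed to the fields this check reads.
item = {
    "id": "aoi-1",
    "links": [
        {"rel": "derived_from", "href": "./labels/item.json", "ml-aoi:role": "label"},
        {"rel": "derived_from", "href": "./rasters/item.json", "ml-aoi:role": "feature"},
    ],
}

problems = check_links(item)
```

An empty `problems` list means the Item satisfies both link rules. The check does not follow the links, so it cannot confirm that the label target actually uses the `label` extension.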

This is the introduction for the purpose and the content of the XYZ Object...
## Assets

| Field Name | Type | Description |
| ----------- | ------ | ----------- |
| x | number | **REQUIRED**. Describe the required field... |
| y | number | **REQUIRED**. Describe the required field... |
| z | number | **REQUIRED**. Describe the required field... |
An `ml-aoi` Item should directly include assets for label and feature rasters.

## Relation types
| Field Name | Type | Name | Description |
| -------------------------- | ------ | ----------------- | -------------------------------------------- |
| `ml-aoi:role` | string | Role | `label` or `feature` |
| `ml-aoi:reference-grid` | bool | Reference Grid | This raster provides reference pixel grid for model training |
| `ml-aoi:resampling-method` | string | Resampling Method | Resampling method for non-reference-grid feature rasters |

The following types should be used as applicable `rel` types in the
[Link Object](https://github.com/radiantearth/stac-spec/tree/master/item-spec/item-spec.md#link-object).
The resampling method should be one of the values [supported by gdalwarp](https://gdal.org/programs/gdalwarp.html#cmdoption-gdalwarp-r).

| Type | Description |
| ------------------- | ----------- |
| fancy-rel-type | This link points to a fancy resource. |
### Labels

Assets for the label item can be copied directly from the label item with their asset name preserved.
Label assets should provide `ml-aoi:role` field set to `label` value.

### Features

Assets for the raster item can be copied directly from the raster item with their asset name preserved.
Feature assets should provide `ml-aoi:role` field set to `feature` value.

When multiple raster features are included, their resolutions and pixel grids are not likely to align.
One raster may specify the `ml-aoi:reference-grid` field set to `true` to indicate that all other features
should be resampled to match its pixel grid during model training.
Other raster assets should be resampled to the reference pixel grid.
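
A consumer might turn these asset fields into a resampling plan along the following lines. This is a sketch under stated assumptions: the function name `resampling_plan` and the asset names are hypothetical, requiring exactly one reference-grid asset is a choice made for the example, and falling back to `near` (the gdalwarp default) when `ml-aoi:resampling-method` is absent is also an assumption rather than something this extension specifies.

```python
def resampling_plan(assets):
    """Return (reference asset name, {other feature asset name: resampling method})."""
    features = {
        name: asset
        for name, asset in assets.items()
        if asset.get("ml-aoi:role") == "feature"
    }
    reference = [
        name
        for name, asset in features.items()
        if asset.get("ml-aoi:reference-grid") is True
    ]
    # Assumption for this sketch: exactly one feature asset marks the reference grid.
    if len(reference) != 1:
        raise ValueError("expected exactly one reference-grid feature asset")
    plan = {
        name: asset.get("ml-aoi:resampling-method", "near")  # assumed default
        for name, asset in features.items()
        if name != reference[0]
    }
    return reference[0], plan


# Hypothetical assets, trimmed to the ml-aoi fields.
assets = {
    "labels": {"ml-aoi:role": "label"},
    "red": {"ml-aoi:role": "feature", "ml-aoi:reference-grid": True},
    "dem": {"ml-aoi:role": "feature", "ml-aoi:resampling-method": "bilinear"},
    "sar": {"ml-aoi:role": "feature"},
}

reference, plan = resampling_plan(assets)
```

Label assets are ignored here because the resampling fields are described for feature rasters only.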

## Collection

All `ml-aoi` Items should belong to a Collection that designates a specific model training input.
There is a one-to-one mapping between a single `ml-aoi` Collection and a machine-learning model.

### Collection fields

The consumer of an `ml-aoi` catalog needs to understand the available label classes and features without crawling the full catalog.
When member Items include multiple feature rasters it is possible that not all of them will overlap every AOI.

## Contributing

@@ -79,3 +129,13 @@ If the tests reveal formatting problems with the examples, you can fix them with
```bash
npm run format-examples
```

## Design Decisions

Central choices and the rationale behind them are outlined in the ADR format:

| ID | ADR |
|------|-----|
| 0002 | [Use Case](docs/0002-use-case-definition.md) |
| 0003 | [Test/Train/Validation Split](docs/0003-test-train-validation-split.md) |
| 0004 | [Sourcing Multiple Label Items](docs/0004-multiple-label-items.md) |
19 changes: 19 additions & 0 deletions docs/0001-record-architecture-decisions.md
@@ -0,0 +1,19 @@
# 1. Record architecture decisions

Date: 2020-08-08

## Status

Accepted

## Context

We need to record the architectural decisions made on this project.

## Decision

We will use Architecture Decision Records, as [described by Michael Nygard](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions).

## Consequences

See Michael Nygard's article, linked above. For a lightweight ADR toolset, see Nat Pryce's [adr-tools](https://github.com/npryce/adr-tools).
42 changes: 42 additions & 0 deletions docs/0002-use-case-definition.md
@@ -0,0 +1,42 @@
# 2. Use case definition

Date: 2020-08-10

## Status

Accepted

## Context

We define the initial use case for the `ml-aoi` spec, which exposes the assumptions and reasoning behind specific layout choices:
providing a training data source for the Raster Vision model training process.

`ml-aoi` STAC Items represent a reified relation between feature rasters and ground-truth labels in a machine learning training dataset.
Each `ml-aoi` Item roughly corresponds to a "scene" or a training example.

### Justification for new extension

Currently known STAC extensions are not suitable for this purpose. The closest match is the STAC `label` extension.
The `label` extension provides a way to define either vector or raster labels over an area.
However, it does not provide a mechanism to link those labels with feature images;
links with `rel` type `source` point to the imagery from which the labels were derived.
Sometimes this imagery will be used as feature input for model training, but not always.
The concepts of source label imagery and input feature imagery are semantically distinct.
For instance, it is possible to apply a single source of ground-truth building labels to train a model on either Landsat or Sentinel-2 scenes.

### Catalog Lifetime

An `ml-aoi` Item links to both a raster STAC Item and a label STAC Item.
In this relationship the source raster and label items are static and long-lived, and may be used by several `ml-aoi` catalogs.
By contrast, an `ml-aoi` catalog is somewhat ephemeral: it captures the training set in order to provide model reproducibility and provenance.
There can be any number of `ml-aoi` catalogs linking to the same raster and label items, while varying the selection, the training/testing/validation split,
and the class configuration.

## Decision

We will adopt the use and development of `ml-aoi` extension in future machine-learning projects.

## Consequences

We will no longer attempt to use the `label` extension as the sole source of training data for ML models.
We will continue development of tools to both produce and consume `ml-aoi` extension catalogs.
61 changes: 61 additions & 0 deletions docs/0003-test-train-validation-split.md
@@ -0,0 +1,61 @@
# 3. Test-train-validation split

Date: 2020-08-10

## Status

Accepted

## Context

During model training it is important to have a consistent split between training, testing and validation data.

- The training subset is used to tune model weights.
- The test subset is used to monitor training progress and for hyper-parameter tuning.
- The validation subset is used to judge overall model performance.

Best practices dictate that these datasets must not overlap.
Which items are selected for each set will affect model performance, so the split should be captured in the `ml-aoi` catalog.

In context of a STAC catalog there are multiple ways to express the data split.
This ADR explores available options and their consequences.

### Split by Collection

The split could be expressed by generating a separate collection for each set. This is a flexible approach.
However, the grouping of these collections into one cohesive training set would have to be done by convention, for instance by a prefix on the collection `id`.
Additionally, these collections could not be easily visualized together.
Most (all?) existing STAC viewers are focused on browsing or viewing one collection at a time.

The convention for associating the training, testing and validation sets would also have to be propagated into downstream tooling.
Further, it would be easy to include a single item in both the training and testing sets without realizing it.
This is not a good choice for these reasons.

### Split by Link property

The top-most `ml-aoi` collection has to link to each item or child catalog.
These links could have an additional property that designates the split.
This approach keeps all the items within the same collection, which is good.

However, when ingested into a STAC API this link property is often lost and is not easily queried.
Thus the split set membership would not be visible through the STAC API, which is bad.
This is not a good choice for that reason.

### Split by Item property

Each item could have an extension-specific property (e.g. `ml-aoi:split`) that designates set membership.
This approach addresses the shortcomings of the previous methods.

This property can be easily searched for after the item is ingested into a STAC API.
With this method it is not possible to include a single item in multiple sets.
The collection can still be viewed by tools that do not understand the `ml-aoi` extension.

## Decision

The test, train, validation split should be handled by the `ml-aoi:split` Item property.
Keeping all the items, regardless of their role, grouped in a single collection provides the best integration with other STAC tools.
The expected use case is visual inspection of items on a single map, with set membership used to color the footprint polygons.

## Consequences

Future `ml-aoi` catalogs should include `ml-aoi:split` property.
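
To illustrate how a consumer might use the Item property chosen here, the sketch below groups Item ids by their `ml-aoi:split` value. The function name `group_by_split` and the sample items are hypothetical. Items without the property are grouped under `None`, on the assumption that the split may be assigned later.

```python
from collections import defaultdict

# Allowed values as listed in the README field table.
VALID_SPLITS = {"train", "test", "validate"}


def group_by_split(items):
    """Group STAC Item ids by their ml-aoi:split property."""
    groups = defaultdict(list)
    for item in items:
        split = item.get("properties", {}).get("ml-aoi:split")
        if split is not None and split not in VALID_SPLITS:
            raise ValueError(f"{item['id']}: unknown split {split!r}")
        groups[split].append(item["id"])
    return dict(groups)


# Hypothetical items, trimmed to the fields this sketch reads.
items = [
    {"id": "aoi-1", "properties": {"ml-aoi:split": "train"}},
    {"id": "aoi-2", "properties": {"ml-aoi:split": "test"}},
    {"id": "aoi-3", "properties": {"ml-aoi:split": "train"}},
    {"id": "aoi-4", "properties": {}},
]

groups = group_by_split(items)
```

Because each Item carries a single value for the property, an Item can appear in at most one group, which is the guarantee this decision relies on.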
52 changes: 52 additions & 0 deletions docs/0004-multiple-label-items.md
@@ -0,0 +1,52 @@
# 4. Multiple label items

Date: 2020-08-11

## Status

Proposed

## Context

Should each `ml-aoi` Item be able to bring in multiple labels?
This would be a useful feature for training multi-class classifiers.
One can imagine having one label STAC Item for buildings and a separate STAC Item for fields.
The Links object of a STAC Item is an array, so many label items could be linked to from a single `ml-aoi` STAC Item.

### Limiting to single label link

Limiting each Item to a single label link is appealing, however, because the label item metadata could be copied over to the `ml-aoi` Item.
This would remove the need to follow the link for the label item during processing.
In practice this would make each `ml-aoi` Item also a `label` Item, allowing for its re-use by tooling that understands `label`.

If a multi-class label dataset is required, there would have to be a mechanical pre-processing step of combining
existing labels into a single STAC `label` item. This could mean either union of GeoJSON FeatureCollections per item or
a configuration of a more complex STAC `label` Item that links to multiple label assets.
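
The first option, a union of GeoJSON FeatureCollections, could look roughly like the sketch below. The function name `union_feature_collections` and the `class` property used to record each feature's origin are hypothetical choices made for illustration. The sketch only concatenates features and does not check for overlapping polygons between classes.

```python
def union_feature_collections(collections):
    """Concatenate features from several GeoJSON FeatureCollections into one.

    `collections` maps a class name to a FeatureCollection dict. Each output
    feature gets a "class" property (a name assumed for this sketch).
    """
    features = []
    for name, collection in collections.items():
        for feature in collection["features"]:
            merged = dict(feature)
            # GeoJSON allows "properties" to be null, so fall back to an empty dict.
            merged["properties"] = {**(feature.get("properties") or {}), "class": name}
            features.append(merged)
    return {"type": "FeatureCollection", "features": features}


# Hypothetical label sources with one feature each.
buildings = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [0, 0]},
            "properties": {"height": 3},
        }
    ],
}
fields = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [1, 1]},
            "properties": None,
        }
    ],
}

merged = union_feature_collections({"building": buildings, "field": fields})
classes = [feature["properties"]["class"] for feature in merged["features"]]
```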

### Allowing multiple labels

The main appeal of consuming multi-label `ml-aoi` items is that it would allow referencing multiple label sources,
some of which could be external, without the need for pre-processing and thus minimizing data duplication.

If multiple labels were allowed, the pre-processing step above would be pushed into the `ml-aoi` consumer.
The consumer would need appropriate metadata in order to decipher the label structure.
This would require either crawling the full catalog or some kind of meta-label structure that combines the metadata
from all the included labels into a single structure that could be interpreted by the consumer.

## Decision

`ml-aoi` Items should be limited to linking to only a single label item.
Requiring the consumer to interpret multiple label items pushes unreasonable complexity onto the user.
Additionally, combining labels likely requires a series of processing and validation steps.
Each one of those would likely require judgment calls and exceptions.
For instance, when combining building and field label datasets the user should check that no building and field polygons overlap.

It is not realistic to expect all possible requirements of that process to be expressed by a simple metadata structure.
Therefore it is better to explicitly require the label combination as a separate process done by the user.
The resulting label catalog can capture the design and iteration required for that process anyway.

## Consequences

`ml-aoi` Items can copy all `label` extension properties from the `label` Item.
In effect, an `ml-aoi` Item extends a `label` Item by adding links to feature imagery.
This formulation lines up with original problem statement for `ml-aoi` extension.
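
Copying the `label` extension properties, as described above, could be sketched as follows. The function name `build_ml_aoi_properties` and the sample label Item are hypothetical. The sketch assumes that `label` extension properties are exactly those whose names start with the `label:` prefix.

```python
def build_ml_aoi_properties(label_item, split=None):
    """Copy label:* properties from a label Item and optionally add ml-aoi:split."""
    properties = {
        key: value
        for key, value in label_item["properties"].items()
        if key.startswith("label:")
    }
    if split is not None:
        properties["ml-aoi:split"] = split
    return properties


# Hypothetical label Item, trimmed to its properties.
label_item = {
    "id": "buildings-1",
    "properties": {
        "datetime": "2020-08-11T00:00:00Z",
        "label:type": "vector",
        "label:description": "Building footprints",
        "label:properties": ["class"],
    },
}

props = build_ml_aoi_properties(label_item, split="train")
```

Properties outside the `label:` prefix, such as `datetime`, are left out here; a real tool would still need to set the required core Item properties itself.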